Zero-Downtime Deployments
Every time an application is redeployed on Porter (through a Github action, configuration change, etc), a new set of application instances will replace the old application instances. While the update process will attempt to prevent downtime, there are some additional configuration settings that can be set to prevent any downtime while re-deploying:
Health Checks
Health checks are endpoints that indicate if an application is healthy and ready to receive traffic. When health checks are enabled, traffic won’t switch from the old application instance until the health check indicates that the application is ready. On health check endpoints, the application should return 200
status when the application should receive traffic, and a 500
-level error otherwise. Health checks can be configured in the Advanced tab:
As indicated in the screenshot above, it is recommended that you expose two endpoints on your application: /livez
and /readyz
. If the /livez
endpoint fails, the application will be restarted, while if the /readyz
endpoint fails, the application will stop receiving traffic. These endpoints can sometimes be combined into a single /healthz
endpoint.
The failure threshold and repeat interval control exactly when a health check endpoint is considered as “failed”. The repeat interval refers to the number of seconds between successive requests to the endpoint, while the failure threshold refers to the number of times that endpoint can return a 500
-level response before it’s considered “failed”. Thus, the maximum amount of time the endpoint can return 500
-level responses before the application is failing is failure_threshold * repeat_interval.
Graceful Shutdown
While applications are being re-deployed, the old application instances will receive a termination signal in the form of SIGTERM
. The applications are then given a Termination Grace Period, which is the number of seconds after SIGTERM
is sent before the application will be forcefully killed. In this time, the application should stop receiving traffic, close any existing connections, and exit gracefully. The termination grace period can also be configured from the Advanced tab:
Note that the application will continue to receive traffic until it exits, unless you have configured the readiness probe to fail before this time. Thus, the recommended graceful shutdown sequence is:
- As soon as
SIGTERM
is received, return a500
-level response code on the readiness probe endpoint (see above for more information). - Wait for failure_threshold * repeat_interval seconds. For example, if the failure threshold is 3 and the repeat interval is 5 (the default values), wait 15 seconds. After this amount of time, you are indicating that the application should stop receiving traffic.
- After the wait period, close the server to prevent additional connections, and drain all existing connections before the grace period ends. Exit gracefully after connections are drained.