We recently switched Twingate’s GKE load balancer to Google’s new Container-native load balancer. The premise was good: the LB talks directly to pods and saves an extra network hop (with the classic LB, traffic goes from the LB to a GKE node, which then routes it to the pod based on iptables rules configured by kube-proxy), so it should perform better and support more features, and in general we’d rather be on the side Google actively maintains than on legacy tech.
However, immediately after making the switch, we started noticing short bursts of 502 errors whenever we’d deploy a new release of our services to the cluster. We tracked it down to the following behavior described in the Container-native load balancing through Ingress docs:
502 errors and rejected connections can also be caused by a container that doesn’t handle SIGTERM.
If a container doesn’t explicitly handle SIGTERM, it immediately terminates and stops handling requests. The load balancer continues to send incoming traffic to the terminated container, leading to errors.
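In other words, “handling SIGTERM” means catching the signal and draining in-flight work instead of exiting on the spot. A minimal sketch in plain Python (a generic illustration, not our service - gunicorn handles this part for us):

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Flag the shutdown instead of exiting immediately,
    # so in-flight work can finish first.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    time.sleep(1)  # stand-in for serving requests

print("drained, exiting cleanly")
```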
Why do we get 502s on pod restarts? #
The legacy load balancer relied on Kubernetes’s kube-proxy to do the routing. kube-proxy configures iptables rules on every node in the cluster that describe how to distribute traffic to pods. When the load balancer receives a request, it sends it to a random node in the cluster, which then routes it to the pod (which might be on a different node). kube-proxy is aware of pod states, and when a pod changes state to Terminating it immediately updates the routing information.
With Container-native load balancing, traffic is routed directly to pods. This eliminates the extra network hop, but at the cost that the load balancer is not aware of pod state and relies on health checks to know when a pod is terminating.
We were getting these 502 bursts because once we deployed a new version, the old pods were terminated, and upon receiving SIGTERM they would stop processing new requests. The load balancer, however, would keep sending them requests until the health check failed (after 10s in our case) and removed them from rotation.
To solve this, we need a way to gracefully terminate our pods: some kind of toggle that tells the pod to start failing its health check while it keeps processing other requests normally, for long enough that the load balancer marks the pod as down and stops sending traffic its way.
To understand how to do this, let’s first take a step back and look at Kubernetes’s process for terminating pods…
What’s the termination process for a Kubernetes Pod? #
1. Pod is set to “Terminating” state
The pod is removed from the endpoints list of all Services, and kube-proxy updates the routing rules on all nodes so that it no longer receives traffic.
2. preStop Hook is called
The preStop hook is a command that is executed in the pod’s containers.
3. SIGTERM signal is sent to pod
Kubernetes sends a SIGTERM to the containers in the pod to let them know they need to shut down soon.
4. Kubernetes waits for containers to gracefully terminate
Kubernetes waits for a specified time, called the termination grace period, for the containers to gracefully terminate. By default, this period is 30 seconds, but it can be customized by setting the terminationGracePeriodSeconds
value as part of the pod spec:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: busybox
```
5. SIGKILL signal is sent to pod and it’s removed
If containers are still running after the grace period, they are sent a SIGKILL signal and forcibly removed. Kubernetes then cleans up the Pod object from its object store.
Gracefully Terminating Django (gunicorn) #
Gunicorn has its own notion of a graceful timeout: when it receives a SIGTERM, it gives workers a grace period (30s by default) to finish the request they’re currently processing and then exit (see the config sketch after the list below). In our case we need gunicorn to keep serving requests for some time before shutting the workers down:
- When the pod is terminating, toggle the health check view (we’re using /health) to start failing
- Wait for 25 seconds (we set the LB to health check every 5s and consider a pod down after 2 consecutive failures, so 25s gives it enough time to fail)
- Send SIGTERM to gunicorn
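As mentioned above, gunicorn’s graceful timeout is configurable. A minimal sketch of a gunicorn config file (the values here are illustrative assumptions, not our exact settings):

```python
# gunicorn.conf.py -- illustrative values, not our production config
workers = 4            # number of worker processes (assumed)
graceful_timeout = 30  # seconds workers get to finish in-flight requests after SIGTERM (gunicorn's default)
timeout = 30           # hard timeout for a single request (gunicorn's default)
```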
The simplest way to signal Django to start failing the health check is with a file, /tmp/shutdown: if the file exists, we should start failing the health check. (We can’t use a variable and/or an HTTP call because gunicorn runs multiple workers, and doing multiprocess memory-sharing magic is too complex.)
So the detailed graceful shutdown process is as follows:
1. Kubernetes sets the pod to the “Terminating” state
2. Kubernetes calls the preStop hook
   2.1. Create a /tmp/shutdown file
   2.2. Sleep for 25s - enough time for the load balancer to refresh
3. Kubernetes sends SIGTERM to the container and gunicorn shuts down its workers
Our preStop hook is pretty simple (note that our LB is configured to health check every 5s and remove a target if it fails twice, so we need to sleep for at least 10s to make sure the pod is removed; these settings may differ on your system…):
```yaml
lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        - echo "shutting down - $(date +%s)" >> /tmp/shutdown && sleep 25
```
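For completeness, here is a sketch of how health check timing like ours might be declared on GKE with a BackendConfig (the resource name and exact values are assumptions for illustration):

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: app-backendconfig     # hypothetical name
spec:
  healthCheck:
    checkIntervalSec: 5       # probe every 5 seconds
    unhealthyThreshold: 2     # mark the pod unhealthy after 2 consecutive failures
    type: HTTP
    requestPath: /health      # the Django view shown below
```

One thing to keep in mind: the preStop hook runs before SIGTERM and its runtime counts against terminationGracePeriodSeconds, so it’s worth making sure the grace period comfortably covers the sleep plus gunicorn’s graceful timeout.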
Our Django healthcheck view:
```python
import os

from django.http import HttpResponse

SHUTDOWN_FILE = "/tmp/shutdown"  # nosec

def is_shutting_down() -> bool:
    return os.path.exists(SHUTDOWN_FILE)

# internal_only_view is our own decorator (import omitted)
@internal_only_view
def health_check(_request):
    if is_shutting_down():
        return HttpResponse("Shutting Down...", status=503)
    # ... some extra healthcheck logic ...
    return HttpResponse("OK")
```
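The view just needs to be routed at the path the load balancer probes; a minimal sketch (the module layout is an assumption):

```python
# urls.py -- illustrative wiring; module layout is an assumption
from django.urls import path

from . import views

urlpatterns = [
    path("health", views.health_check),  # the path the LB health check targets
]
```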
References #
- External Application Load Balancer overview - Google classic load balancer vs. the new container-native ones
- Kubernetes preStop Hook
- Container-native load balancing