This article examines probes in Kubernetes and provides some helpful examples along the way. A correct probe definition can increase pod availability and resilience!
A Kubernetes Liveness Probe: What Is It?
Based on a given test, the liveness probe makes sure that the application inside a container is alive and working.
⚙️ Liveness probes
Liveness probes are used by the kubelet to determine when to restart a container. Applications that crash or enter broken states are detected and, in many cases, can be fixed by restarting them.
A successful liveness probe results in no action being taken and no logs being kept. If it fails, the event is recorded and the container is killed by the kubelet in accordance with the restartPolicy settings.
A liveness probe should be used when a pod seems to be running but the application is not working properly, for example during a deadlock. The pod is technically operational, but it is useless because it cannot handle traffic.
Liveness probes are not required when the application is designed to crash the container on failure, since the kubelet will check the restartPolicy and restart the container automatically if it is set to Always or OnFailure. NGINX, for example, launches rapidly and shuts down if it encounters a problem that prevents it from serving pages. You do not need a liveness probe in this case.
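As a minimal sketch of relying on the restart policy alone (the pod name and image tag here are placeholders, not taken from the article):
apiVersion: v1
kind: Pod
metadata:
  name: crash-fast            # hypothetical name
spec:
  restartPolicy: Always       # kubelet restarts the container whenever it exits
  containers:
    - name: nginx
      image: nginx:1.25       # assumed tag; the app exits on fatal errors instead of hanging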
There are common adjustable fields for every type of probe:
- initialDelaySeconds: probes start running initialDelaySeconds after the container is started (default: 0)
- periodSeconds: how often the probe should run (default: 10)
- timeoutSeconds: probe timeout (default: 1)
- successThreshold: required number of successful probes to mark the container healthy/ready (default: 1)
- failureThreshold: when a probe fails, it will be retried failureThreshold times before the container is deemed unhealthy/not ready (default: 3)
The periodSeconds field in each of the examples below tells the kubelet to run a liveness probe every 5 seconds. The initialDelaySeconds field instructs the kubelet to delay the first probe for 5 seconds. The timeoutSeconds option (time to wait for the reply), successThreshold (number of successful probe executions needed to mark the container healthy), and failureThreshold (number of failed probe executions needed to mark the container unhealthy) can also be customized, if desired.
All of the different liveness probes described below can use these five parameters.
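Putting it together, a liveness probe with all five fields spelled out might look like the fragment below (a sketch: the /health endpoint and port are assumptions, and the 5-second values match the figures used in this section):
livenessProbe:
  httpGet:
    path: /health          # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 5   # wait 5 s before the first probe
  periodSeconds: 5         # probe every 5 s
  timeoutSeconds: 1        # wait up to 1 s for a reply (default)
  successThreshold: 1      # one success marks the container healthy (default)
  failureThreshold: 3      # three consecutive failures trigger a restart (default)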
What other Kubernetes probes are available?
Although liveness probes are the main focus of this article, you should be aware that Kubernetes also supports the following other types of probes:
⚙️ Startup probes
The kubelet uses startup probes to determine when a container application has started. When enabled, a startup probe disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with application startup.
Startup probes are especially helpful for slow-starting containers, since they prevent the kubelet from killing a container via a failing liveness probe before it has even finished starting. If a liveness probe is used on the same endpoint, set the startup probe's failureThreshold higher to allow for lengthy startup periods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-api-deployment
spec:
  selector:
    matchLabels:
      app: test-api
  template:
    metadata:
      labels:
        app: test-api
    spec:
      containers:
        - name: test-api
          image: myrepo/test-api:0.1
          startupProbe:
            httpGet:
              path: /health/startup
              port: 80
            failureThreshold: 30
            periodSeconds: 10
When a Pod starts and the probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of a liveness probe means restarting the container; in case of a readiness probe, the Pod will be marked Unready. The default is 3 and the minimum value is 1.
Some startup probe math: why is it important?
- 0 - 10 s: the container has been spun up, but the kubelet doesn't do anything while waiting for the initialDelaySeconds to pass
- 10 - 20 s: the first probe request is sent but no response comes back, because the app hasn't stood up its APIs yet; this is either a failure due to the 2-second timeout or an immediate TCP connection error
- 20 - 30 s: the app is up but has only started fetching credentials, configuration and so on, so the response to the probe request is a 5xx
- 30 - 210 s: the kubelet keeps probing, but no success response arrives and the failureThreshold limit is being reached. In this case, as per the deployment configuration for the startup probe, the pod will be restarted after roughly 212 seconds.
🖼️ Pic source: Wojciech Sierakowski (HMH Engineering)
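As a rough rule of thumb, the kubelet gives up after about initialDelaySeconds + failureThreshold * periodSeconds. A startup probe that would produce a timeline like the one above might look like the fragment below; every value here is an assumption inferred from the timeline, not taken from an actual manifest:
startupProbe:
  httpGet:
    path: /health/startup   # hypothetical endpoint, as in the example above
    port: 80
  initialDelaySeconds: 10   # the 0 - 10 s waiting window
  timeoutSeconds: 2         # the 2-second timeout mentioned in the timeline
  periodSeconds: 10         # one probe every 10 s
  failureThreshold: 20      # 10 + 20 * 10 = 210 s, which lines up with the roughly 212 s above
By contrast, the test-api-deployment example earlier (failureThreshold: 30, periodSeconds: 10) gives the application up to 30 * 10 = 300 seconds to finish starting.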
It might be a little excessive to wait more than 3 minutes for the app to launch locally with faked dependencies!
🎯 It might also be better to shorten this interval if you are absolutely certain that, for example, reading secrets and credentials and establishing connections with DBs and other data sources shouldn't take that long. Leaving the window that wide slows down your deployments.
It is also worth figuring out whether you even need more nodes; you don't want to waste money on resources you don't need. Take a look at kubectl top nodes to see whether you actually need to scale the nodes.
🚧 If the probe fails, the event is recorded and the container is killed by the kubelet in accordance with the restartPolicy settings.
When a container gets restarted, you usually want to check the logs to see why the application became unhealthy. You can do this with the following command:
kubectl logs <pod-name> --previous
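The probe failure itself is also recorded as an event on the pod, so it is worth checking the Events section as well (assuming the pod object still exists):
kubectl describe pod <pod-name>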
⚙️ Readiness probes
Readiness probes keep track of the application's availability. If the probe fails, no traffic will be forwarded to the pod. They are employed when an application requires configuration before it is usable. Additionally, an application may become congested with traffic and cause the probe to fail, which stops further traffic from being routed to it and allows it to recover. If the probe fails, the endpoints controller removes the pod from the Service endpoints.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-api-deployment
spec:
  selector:
    matchLabels:
      app: test-api
  template:
    metadata:
      labels:
        app: test-api
    spec:
      containers:
        - name: test-api
          image: myrepo/test-api:0.1
          readinessProbe:
            httpGet:
              path: /ready
              port: 80
            successThreshold: 3
If the readiness probe fails but the liveness probe succeeds, the kubelet concludes that the container is not yet ready to receive network traffic, but is still making progress toward that state.
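The two probes are often combined on the same container. A minimal sketch of that (the /ready and /health paths and the thresholds are illustrative assumptions):
containers:
  - name: test-api
    image: myrepo/test-api:0.1
    readinessProbe:
      httpGet:
        path: /ready            # hypothetical readiness endpoint
        port: 80
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health           # hypothetical liveness endpoint
        port: 80
      periodSeconds: 5
      failureThreshold: 3       # only a persistent failure triggers a restart
A failing /ready only removes the pod from the Service endpoints, while a persistent failure of /health gets the container restarted.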
The operation of Kubernetes probes
Probes are controlled by the kubelet, the primary "node agent" that runs on each node.
🖼️ Pic source: Andrew Lock (Datadog). SVG is here
The application needs to support one of the following handlers in order to use a K8S probe effectively:
- ExecAction handler: executes a command inside the container. If the command returns a status code of 0, the diagnostic is successful.
- TCPSocketAction handler: tries to establish a TCP connection to the pod's IP address on a particular port. If the port is found to be open, the diagnostic is successful.
- HTTPGetAction handler: sends an HTTP GET request to the pod's IP address, on a particular port and a predetermined path. If the response code falls between 200 and 399, the diagnostic is successful.
Before version 1.24, Kubernetes did not support gRPC health checks natively. This left gRPC developers with the following three approaches when deploying to Kubernetes:
🖼️ Pic source: Ahmet Alp Balkan (Twitter, ex-Google)
As of Kubernetes version 1.24, a gRPC handler can be configured for the kubelet to use for application liveness checks, provided your application implements the gRPC Health Checking Protocol. To configure checks that use gRPC, you must enable the GRPCContainerProbe feature gate.
When the kubelet runs a probe on a container, the result is Success, Failure, or Unknown, depending on whether the diagnostic succeeded, failed, or could not be completed for some other reason.
So, how eagerly should you track the pulse?
Before defining a probe, you should examine the system behavior and typical startup times of the pod and its containers so that you can choose appropriate thresholds. The probe settings should also be revisited as the infrastructure or application changes. For instance, configuring a pod to use more system resources can affect the values that need to be set for its probes.
Handlers in action: some examples
ExecAction handler: how can it be useful in practice?
🎯 It allows you to run commands inside a container to check whether it is alive. With this option, you can examine several aspects of the container's operation, such as the existence of files, their contents, and anything else accessible at the command level.
ExecAction is executed in the pod's shell context and is deemed failed if the command returns any exit code other than 0 (zero).
The example below demonstrates how to use the exec command with the cat command to see if a file exists at the path /usr/share/liveness/html/index.html.
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
    - name: liveness
      image: registry.k8s.io/liveness:0.1
      ports:
        - containerPort: 8080
      livenessProbe:
        exec:
          command:
            - cat
            - /usr/share/liveness/html/index.html
        initialDelaySeconds: 5
        periodSeconds: 5
🚧 If the file does not exist, the liveness probe fails and the container is restarted.
TCPSocketAction handler: how can it be useful in practice?
In this use case, the liveness probe uses the TCP handler to determine whether port 8080 is open and accepting connections. With this configuration, the kubelet will try to open a TCP socket to your container on the designated port.
apiVersion: v1
kind: Pod
metadata:
  name: liveness
  labels:
    app: liveness-tcp
spec:
  containers:
    - name: liveness
      image: registry.k8s.io/liveness:0.1
      ports:
        - containerPort: 8080
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
🚧 If the connection cannot be established, the liveness probe fails and the container is restarted.
HTTPGetAction handler: how can it be useful in practice?
This case demonstrates the HTTP handler, which sends an HTTP GET request to the /health path on port 8080. A response code greater than or equal to 200 and less than 400 indicates that the probe was successful.
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
    - name: liveness
      image: registry.k8s.io/liveness:0.1
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
          httpHeaders:
            - name: Custom-Header
              value: ItsAlive
        initialDelaySeconds: 5
        periodSeconds: 5
🚧 The probe fails and the container is restarted if a code outside this range is received. Any custom headers you want to send can be defined using the httpHeaders option.
gRPC handler: how can it be useful in practice?
gRPC protocol is on its way to becoming the lingua franca for communication between cloud-native microservices. If you are deploying gRPC applications to Kubernetes today, you may be wondering about the best way to configure health checks.
This example demonstrates how to check port 2379 responsiveness using the gRPC health checking protocol. A port must be specified in order to use a gRPC probe. If the health endpoint is registered on a non-default service, you must also specify the service name (see the variant after the example below).
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-grpc
spec:
  containers:
    - name: liveness
      image: registry.k8s.io/liveness:0.1
      ports:
        - containerPort: 2379
      livenessProbe:
        grpc:
          port: 2379
        initialDelaySeconds: 5
        periodSeconds: 5
🚧 The container will restart if the gRPC endpoint is dead and the liveness probe fails.
Since there are no error codes for gRPC built-in probes, all errors are regarded as probe failures.
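If the health endpoint is registered under a non-default service name, the probe can reference it via the service field; a minimal sketch (the name my-grpc-service is hypothetical and must match whatever is registered with the gRPC health server):
livenessProbe:
  grpc:
    port: 2379
    service: my-grpc-service   # hypothetical non-default service name
  initialDelaySeconds: 5
  periodSeconds: 5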
Using liveness probes in the wrong way can lead to disaster
Please remember that the container will be restarted if the liveness probe fails. Unlike a readiness probe, it is not conventional to examine dependencies in a liveness probe.
A liveness probe should be used to determine whether the container itself has stopped responding.
A drawback of liveness probes is that they may not actually verify the service's responsiveness. For instance, if a service runs two web servers, one for service routes and the other for status routes such as readiness and liveness probes or metrics gathering, the service may be slow or unreachable while the liveness probe route still responds without any issues. To be effective, the liveness probe must exercise the service in a way comparable to how dependent services use it.
As with the readiness probe, it's crucial to take into account dynamics that change over time. A slight increase in response time, possibly brought on by a brief rise in load, could force the container to restart if the liveness-probe timeout is too short. The restart might put even more strain on the other pods supporting the service, leading to a further cascade of liveness probe failures and worsening the service's overall availability.
🖼️ Pic source: Wojciech Sierakowski (HMH Engineering)
These cascading failures can be prevented by setting liveness-probe timeouts on the order of client timeouts and employing a forgiving failureThreshold count.
Liveness probes can also be caught out by container startup latency that varies over time (see the math above). Changes in resource allocation, network topology, or simply rising load as your service grows could all contribute to this.
If the initialDelaySeconds option is insufficient and a container is restarted as a result of a Kubernetes node failure or a liveness probe failure, the application may never start, or may start partially before being repeatedly killed and restarted. The initialDelaySeconds option should therefore be greater than the container's maximum initialization time.
Some notable suggestions are (see the sketch after this list):
- Keep dependencies out of liveness probes. Liveness probes should be inexpensive and have consistent response times.
- Set liveness-probe timeouts conservatively, so that system dynamics can change temporarily or permanently without causing an excessive number of liveness probe failures. Consider setting client timeouts and liveness-probe timeouts to the same value.
- Set the initialDelaySeconds option conservatively, so that containers can be restarted reliably even if startup dynamics vary over time.
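As a rough sketch of what these suggestions translate to in a manifest (every number here is an illustrative assumption, not a recommendation for your workload):
livenessProbe:
  httpGet:
    path: /health            # dependency-free endpoint inside the app itself
    port: 8080
  initialDelaySeconds: 30    # conservatively above the slowest observed startup
  timeoutSeconds: 10         # on the order of the client timeout (assumed 10 s here)
  periodSeconds: 10
  failureThreshold: 6        # forgiving: about a minute of failures before a restart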
The inevitable summary
By automatically restarting a container after a particular test fails, the proper integration of liveness probes with readiness and startup probes can increase pod resilience and availability. Understanding the application is necessary in order to choose the appropriate options for them.
The author is thankful to Guy Menachem from Komodor for inspiration! Stable applications in the clouds to you all, folks!
More to read:
- Traefik docs
- Kubernetes API reference
- Guy's post