Roman Belshevitz for Otomato

Posted on Nov 28, 2022

Liveness Probes: Feel the Pulse of the App

#devops #kubernetes #api

This article will provide some helpful examples as the author examines probes in Kubernetes. A correct probe definition can increase pod availability and resilience!

A Kubernetes Liveness Probe: What Is It?

Based on a given test, the Liveness probe makes sure that an application inside a container is active and working.

⚙️ Liveness probes

They are used by the kubelet to determine when to restart a container. Applications that crash or enter broken states are detected and, in many cases, can be rectified by restarting them.

When a liveness probe succeeds, no action is taken and nothing is logged. If it fails, the event is recorded, and the container is killed by the kubelet in accordance with the restartPolicy settings.

A liveness probe should be used when a pod might appear to be running while the application inside it is not working properly, for example during a deadlock. The pod might be operational, but it is useless since it cannot handle traffic.

🖼️ Pic source: K21Academy

Liveness probes are not required when the application is configured to crash the container on failure, since the kubelet checks the restartPolicy and restarts the container automatically if it is set to Always or OnFailure. The NGINX application, for example, launches rapidly and shuts down if it encounters a problem that prevents it from serving pages. You do not need a liveness probe in this instance.

There are common adjustable fields for every type of probe:

initialDelaySeconds: Seconds to wait after the container has started before probes begin (default: 0)
periodSeconds: How often the probe runs, in seconds (default: 10)
timeoutSeconds: Seconds after which the probe times out (default: 1)
successThreshold: Consecutive successes required to mark the container healthy/ready after a failure (default: 1)
failureThreshold: Consecutive failures after which the container is deemed unhealthy/not ready (default: 3)

In each of the liveness probe examples below, the periodSeconds field tells the kubelet to run the liveness probe every 5 seconds, and the initialDelaySeconds field tells the kubelet to wait 5 seconds before the first probe.

The timeoutSeconds option (time to wait for the reply), successThreshold (number of successful probe executions needed to mark the container healthy), and failureThreshold (number of failed probe executions needed to mark the container unhealthy) can also be customized, if desired.

All probe types can use these five parameters.
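
As a minimal sketch, this is what the five fields look like when written out explicitly with their default values on an HTTP liveness probe. The /healthz path and port 8080 are placeholder assumptions, not values taken from a real application.

```yaml
# Fragment of a container spec; path and port are hypothetical placeholders.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080              # assumed container port
  initialDelaySeconds: 0    # default: start probing right away
  periodSeconds: 10         # default: probe every 10 seconds
  timeoutSeconds: 1         # default: wait up to 1 second for a reply
  successThreshold: 1       # default; must be 1 for liveness and startup probes
  failureThreshold: 3       # default: act after 3 consecutive failures
```

Leaving a field out has the same effect as setting it to the default shown here.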

What other Kubernetes probes are available?

Although the use of Liveness probes will be the main emphasis of this article, you should be aware that Kubernetes also supports the following other types of probes:

⚙️ Startup probes

The kubelet uses startup probes to determine when a container application has started. When a startup probe is configured, liveness and readiness checks are disabled until it succeeds, so those probes don't interfere with the application startup.

Startup probes are especially helpful for slow-starting containers, since they prevent the kubelet from killing a container on a failed liveness probe before it has even started. If the startup probe uses the same endpoint as the liveness probe, set a higher failureThreshold on the startup probe to allow for lengthy startup periods.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-api-deployment
spec:
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: test-api
  template:
    metadata:
      labels:
        app: test-api
    spec:
      containers:
      - name: test-api
        image: myrepo/test-api:0.1
        startupProbe:
          httpGet:
            path: /health/startup
            port: 80
          failureThreshold: 30
          periodSeconds: 10

When a probe fails, Kubernetes tries failureThreshold times before giving up. For a liveness probe, giving up means restarting the container. For a readiness probe, the Pod is marked Unready. failureThreshold defaults to 3, and the minimum value is 1.
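
A minimal sketch of a startup probe and a liveness probe sharing one endpoint could look like the fragment below. The /health path is a hypothetical placeholder; port 80 and the 30 x 10 second budget are reused from the Deployment above.

```yaml
# Fragment of a container spec; the /health path is an assumption.
startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30   # generous: 30 attempts ...
  periodSeconds: 10      # ... 10 s apart, so up to about 300 s to start
livenessProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 3    # stricter: takes over only after the startup probe succeeds
  periodSeconds: 10
```

The idea is that the slow start is absorbed by the startup probe, so the liveness probe can stay strict and react quickly once the application is running.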

Some startup probe math: why is it important?

0 - 10 s: the container has been spun up, but the kubelet does nothing while it waits for initialDelaySeconds to pass
10 - 20 s: the first probe request is sent, but no response comes back because the app hasn't stood up its APIs yet; this is a failure due to either the 2-second timeout or an immediate TCP connection error
20 - 30 s: the app is up but has only started fetching credentials, configuration and so on, so the response to the probe request is 5xx
30 - 210 s: the kubelet keeps probing, but no success response arrives and the limit set by failureThreshold is reached. In this case, as per the startup probe configuration behind this timeline, the container will be restarted after roughly 212 seconds.
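
The timeline above is consistent with a startup probe along the following lines. The 10-second initial delay and the 2-second timeout come from the timeline itself. The failureThreshold of 20 is an inferred assumption that matches the figure of roughly 212 seconds, and it is not the value used in the Deployment shown earlier. The path and port are reused from that Deployment.

```yaml
# Sketch only: failureThreshold is inferred, not taken from the source.
startupProbe:
  httpGet:
    path: /health/startup
    port: 80
  initialDelaySeconds: 10   # 0-10 s: no probing yet
  periodSeconds: 10         # one attempt every 10 s afterwards
  timeoutSeconds: 2         # each attempt may wait up to 2 s
  failureThreshold: 20      # assumed value
  # Rough upper-bound estimate before the container is killed:
  #   initialDelaySeconds + failureThreshold * periodSeconds + timeoutSeconds
  #   = 10 + 20 * 10 + 2 = 212 s
```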

🖼️ Pic source: Wojciech Sierakowski (HMH Engineering)

It might be a little excessive to wait more than 3 minutes for the app to launch locally with faked dependencies!

🎯 It might also be better to shorten this interval if you are absolutely certain that, for example, reading secrets and credentials and establishing connections with DBs and other data sources shouldn't take so long. Otherwise, the overly long wait will slow down deployments.

Maybe it’s important to figure out if you even need more nodes. You don’t want to waste your money on resources you don’t need. Take a look at kubectl top nodes to see if you even need to scale the nodes.

🚧 If a liveness or startup probe fails, the event is recorded, and the container is killed by the kubelet in accordance with the restartPolicy settings.

When a container gets restarted, you usually want to check the logs to see why the application became unhealthy. You can do this with the following command:

kubectl logs <pod-name> --previous

⚙️ Readiness probes

Readiness probes keep track of the application's availability. If a readiness probe fails, no traffic is forwarded to the pod: the endpoints controller removes the pod from the Service endpoints. These probes are used when an application needs some setup before it is usable. Additionally, an application may become overloaded with traffic and cause the probe to fail, which stops further traffic from being routed to it and allows it to recover.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-api-deployment
spec:
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: test-api
  template:
    metadata:
      labels:
        app: test-api
    spec:
      containers:
      - name: test-api
        image: myrepo/test-api:0.1
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          successThreshold: 3
If the readiness probe fails but the liveness probe succeeds, the kubelet concludes that the container is not yet ready to receive network traffic but is making progress in that direction.
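
To illustrate the difference, here is a minimal sketch of a container that defines both probes. The /ready path and port 80 come from the Deployment above. The /health/live path and the period values are hypothetical.

```yaml
# Fragment of a pod spec; the liveness path and the periods are assumptions.
containers:
- name: test-api
  image: myrepo/test-api:0.1
  readinessProbe:          # on failure: pod is removed from Service endpoints
    httpGet:
      path: /ready
      port: 80
    periodSeconds: 5
  livenessProbe:           # on failure: container is restarted
    httpGet:
      path: /health/live   # hypothetical endpoint
      port: 80
    periodSeconds: 10
    failureThreshold: 3
```

With this setup, a container that is still warming up only loses traffic. It is restarted only if the liveness endpoint itself stops answering.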

The operation of Kubernetes probes

The kubelet controls the probes. The main "node agent" that executes on each node is the kubelet.

🖼️ Pic source: Andrew Lock (Datadog). SVG is here

The application needs to support one of the following handlers in order to use a K8S probe effectively:

ExecAction handler: executes a command inside the container. If the command returns a status code of 0, the diagnostic is successful.
TCPSocketAction handler: tries to establish a TCP connection to the pod's IP address on a particular port. If the port is open, the diagnostic is successful.
HTTPGetAction handler: sends an HTTP GET request to the pod's IP address on a particular port and path. If the response code is between 200 and 399, the diagnostic is successful.

Before version 1.24 Kubernetes did not support gRPC health checks natively. This left the gRPC developers with the following three approaches when they deploy to Kubernetes:

🖼️ Pic source: Ahmet Alp Balkan (Twitter, ex-Google)

As of Kubernetes version 1.24, a gRPC handler can be used by the kubelet for application liveness checks if your application implements the gRPC Health Checking Protocol. gRPC probes are controlled by the GRPCContainerProbe feature gate, which is enabled by default starting with version 1.24; if it is disabled on your cluster, you must enable it to configure gRPC checks.

When the kubelet probes a container, the result is Success, Failure, or Unknown, depending on whether the diagnostic passed, failed, or could not be completed for some other reason.

So, how rushy to track the pulse?

You should examine the system behavior and typical startup times of the pod and its containers before defining a probe so that you can choose appropriate thresholds. Additionally, the probe options should be adjusted as the infrastructure or application changes. For instance, reconfiguring a pod to use more system resources can affect the values that need to be configured for the probes.

Handlers in action: some examples

`ExecAction` handler: how can it be useful in practice?

🎯 It allows you to run commands inside a container to check whether the container is alive. With this option, you can examine several aspects of the container's operation, such as the existence of files, their contents, and anything else that can be checked at the command level.

The command is executed directly inside the container (not through a shell, unless you invoke one explicitly), and the probe is deemed failed if the command returns any exit code other than 0 (zero).

The example below uses an exec probe with the cat command to check whether a file exists at the path /usr/share/liveness/html/index.html.

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/liveness:0.1
    ports:
    - containerPort: 8080
    livenessProbe:
      exec:
        command:
        - cat
        - /usr/share/liveness/html/index.html
      initialDelaySeconds: 5
      periodSeconds: 5

🚧 If the file does not exist, the liveness probe fails and the container is restarted.
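
Checking a file's contents rather than its mere existence could be sketched like this. The file /tmp/health-status and the "ok" marker are hypothetical, and the fragment assumes the image ships sh and grep.

```yaml
# Fragment of a container spec; file path and marker string are assumptions.
livenessProbe:
  exec:
    command:
    - sh
    - -c
    # grep -q exits 0 only if the pattern is found; any other exit code fails the probe
    - grep -q "ok" /tmp/health-status
  initialDelaySeconds: 5
  periodSeconds: 5
```

The shell is invoked explicitly here so that a full command line can be passed as a single string.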

`TCPSocketAction` handler: how can it be useful in practice?

In this use case, the liveness probe uses the TCP handler to determine whether port 8080 is open and accepting connections. With this configuration, the kubelet will try to open a socket to your container on the designated port.

apiVersion: v1
kind: Pod
metadata:
  name: liveness
  labels:
    app: liveness-tcp
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/liveness:0.1
    ports:
    - containerPort: 8080
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5

🚧 If the socket cannot be opened, the liveness probe fails and the container is restarted.

`HTTPGetAction` handler: how can it be useful in practice?

This case demonstrates the HTTP handler, which sends an HTTP GET request to the /health path on port 8080. A response code of at least 200 and below 400 indicates that the probe was successful.

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/liveness:0.1
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: ItsAlive
      initialDelaySeconds: 5
      periodSeconds: 5

🚧 If a code outside this range is received, the probe fails and the container is restarted. Any custom headers you want to send can be defined using the httpHeaders option.

gRPC handler: how can it be useful in practice?

gRPC protocol is on its way to becoming the lingua franca for communication between cloud-native microservices. If you are deploying gRPC applications to Kubernetes today, you may be wondering about the best way to configure health checks.

This example demonstrates how to check port 2379 responsiveness using the gRPC health checking protocol. A port must be specified in order to use a gRPC probe. You must also specify the service if the health endpoint is set up on a non-default service.

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-grpc
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/liveness:0.1
    ports:
    - containerPort: 2379
    livenessProbe:
      grpc:
        port: 2379
      initialDelaySeconds: 5
      periodSeconds: 5

🚧 If the gRPC health check fails, the liveness probe fails and the container is restarted.

Since there are no error codes for gRPC built-in probes, all errors are regarded as probe failures.
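
When the health endpoint is registered under a specific service name, the probe can carry that name in its service field. A minimal sketch follows, in which my-health-service is a hypothetical name that your gRPC health server would need to recognize.

```yaml
# Fragment of a container spec; the service name is an assumption.
livenessProbe:
  grpc:
    port: 2379
    service: my-health-service   # hypothetical name sent in the health check request
  initialDelaySeconds: 5
  periodSeconds: 5
```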

Using liveness probes in the wrong way can lead to disaster

Please remember that the container will be restarted if the liveness probe fails. Unlike with a readiness probe, checking dependencies in a liveness probe is not recommended.

To determine whether the container itself has stopped responding, a liveness probe should be utilized.

A liveness probe may not actually verify the service's responsiveness. For instance, if a service runs two web servers, one for service routes and the other for status routes (such as readiness and liveness probes or metrics gathering), the service may be slow or inaccessible while the liveness probe route responds without any issues. To be effective, the liveness probe must exercise the service in a way comparable to how dependent services use it.
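
One way to approach this, sketched under assumptions, is to point the liveness probe at the port that serves real traffic rather than at a separate status server. The container name, both port numbers, and the port names below are hypothetical.

```yaml
# Fragment of a pod spec; names and port numbers are assumptions.
containers:
- name: api
  image: myrepo/test-api:0.1
  ports:
  - name: http          # port that serves real traffic
    containerPort: 8080
  - name: status        # separate status/metrics server
    containerPort: 9090
  livenessProbe:
    httpGet:
      path: /health
      port: http        # named port: probe the server that handles real requests
    periodSeconds: 5
```

The trade-off is that the probe now competes with client traffic, so its timeout needs to be set with load spikes in mind.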

Like the readiness probe, it's crucial to take into account dynamics that change over time. A slight increase in response time, possibly brought on by a brief rise in load, could force the container to restart if the liveness-probe timeout is too short. The restart might put even more strain on the other pods supporting the service, leading to a further cascade of liveness probe failures and worsening the service's overall availability.

🖼️ Pic source: Wojciech Sierakowski (HMH Engineering)

These cascade failures can be prevented by configuring liveness probe timeouts on the order of client timeouts and employing a forgiving failureThreshold count.

Liveness probes may have a small issue with the container startup latency varying over time (see above about the math). Changes in resource allocation, network topology changes, or just rising load as your service grows could all contribute to this.

If the initialDelaySeconds value is too small and a container is restarted (because of a Kubernetes node failure or a liveness probe failure), the application may never start, or may start partially before being repeatedly killed and restarted. The initialDelaySeconds value should be greater than the container's maximum initialization time.

Some notable suggestions are:

Keep dependencies out of liveness probes. Liveness probes should be inexpensive and have consistent response times.
Set liveness probe timeouts conservatively, so that system dynamics can change temporarily or permanently without causing an excessive number of liveness probe failures. Consider setting client timeouts and liveness probe timeouts to the same value.
Set the initialDelaySeconds option conservatively, so that containers can be restarted reliably even if startup dynamics vary over time.
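
Put together, these suggestions might translate into a liveness probe like the sketch below. All numbers are illustrative assumptions that you would replace with values measured for your own application.

```yaml
# Fragment of a container spec; every value here is an illustrative assumption.
livenessProbe:
  httpGet:
    path: /health           # cheap endpoint that does not call dependencies
    port: 8080
  initialDelaySeconds: 60   # assumed: comfortably above the slowest observed startup
  periodSeconds: 10
  timeoutSeconds: 5         # assumed: same order as the client-side request timeout
  failureThreshold: 5       # tolerate brief load spikes before restarting
```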

The inevitable summary

When properly integrated with readiness and startup probes, liveness probes can increase pod resilience and availability by triggering an automatic restart of a container once a particular test fails. You need to understand the application in order to choose appropriate options for them.

The author is thankful to Guy Menachem from Komodor for inspiration! Stable applications in the clouds to you all, folks!
