Maintaining highly available applications in a Kubernetes cluster can be hard, especially when nodes go down for maintenance or fail. The application's pods get terminated or rescheduled onto other nodes, which can cause downtime or data loss.
To keep applications available 24x7 even during such disruptions, Kubernetes introduced a feature called the Pod Disruption Budget. But before diving into it, let's look at the different disruptions a system can face.
Disruption, in general, means breaking a process; in Kubernetes terms, a pod disruption is the termination of a pod running on a cluster node because the node fails, is upgraded, or is destroyed.
There are two types of disruptions:
- Involuntary Disruptions
- Voluntary Disruptions
Involuntary disruptions are unavoidable and occur mainly due to hardware or software errors. Some examples are:
- Hardware failure of the node.
- The cluster admin deletes the node accidentally.
- Kernel-related problem.
- A cloud provider or hypervisor failure makes the VM disappear.
- The node disappears due to a cluster network partition.
- Not enough resources are left on the node, and pods get evicted.
Voluntary disruptions are initiated by the application owner or the cluster administrator.
An application owner can do the following:
- Deleting the deployment or the controller managing the pods.
- Updating the deployment's pod template, which causes a restart.
- Accidentally deleting a pod.
A cluster administrator can do the following:
- Draining a node for an upgrade or maintenance.
- Draining a node to scale the cluster down.
- Removing a pod from a node to schedule another pod on it.
These are some of the disruptions that take applications down and cause user-facing downtime. Let's see how to deal with them.
To deal with involuntary disruptions, you can:
- Make sure your pod requests the resources it needs, and no more.
- Run enough replicas of your application to make it highly available (HA).
- Even in HA setups, use anti-affinity to spread pods across cluster nodes, or across zones if you run a multi-zone cluster.
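As a sketch, spreading replicas across nodes with pod anti-affinity could look like the following snippet inside a pod template. The `app: nginx` label is an assumption; using `topologyKey: topology.kubernetes.io/zone` instead would spread pods across zones.

```yaml
# Sketch: prefer scheduling replicas on different nodes.
# Assumes the pods carry the label app: nginx.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: nginx
        topologyKey: kubernetes.io/hostname
```

Using the `preferred` (soft) rule rather than the `required` (hard) rule lets the scheduler still place pods when there are more replicas than nodes.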
Voluntary disruptions, on the other hand, are mainly caused by cluster admin actions such as draining a node for maintenance or scaling down a cluster. To keep applications available through these, you can use a Pod Disruption Budget (PDB).
A Pod Disruption Budget (PDB) is an object, created by the application owner, that defines the minimum number of application replicas that must keep running during voluntary disruptions (such as a node upgrade or maintenance) to keep the application highly available.
Let's take an example: say you are running a deployment with 6 replicas and have created a PDB requiring 4 replicas to always be running during voluntary disruptions. The eviction API will then allow at most 2 pods to be disrupted at a time.
Key features of a PDB:
- The application owner creates the PDBs.
- It lets the operations team manage the cluster while the application stays available.
- It provides an interface between the cluster admin and the application owner so they can work together smoothly.
- The eviction API respects it.
- It defines the application's availability requirements.
- It works with Deployment, ReplicaSet, ReplicationController, and StatefulSet objects.
There are three main fields in a PDB:
- `.spec.selector` denotes the set of pods to which the PDB applies. It must match the application controller's label selector.
- `.spec.minAvailable` denotes the number of pods that must still be available after an eviction. It can be an absolute number or a percentage.
- `.spec.maxUnavailable` denotes the number of pods that can be unavailable after an eviction. It can be an absolute number or a percentage.
You can specify either minAvailable or maxUnavailable in a single Pod Disruption Budget, not both.
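For illustration, here is a hypothetical budget expressed as a percentage rather than an absolute count (the name is an assumption, and the selector assumes pods labeled `app: nginx`):

```yaml
# Hypothetical PDB: keep at least 50% of the selected pods
# available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-percent
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: nginx
```

With a percentage, Kubernetes rounds in the safe direction, so the effective absolute number tracks the replica count as you scale the deployment.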
Let's explore PDBs in more detail by creating a deployment and a PDB on a multi-node cluster and draining one node.
- You can deploy a local multi-node cluster with Kind or use a managed Kubernetes service. I have used an EKS cluster for this demo.
kubectl get nodes
- PDB is a namespace-scoped resource and belongs to the api group policy with version v1.
kubectl api-resources | grep pdb
- Below is the nginx-deployment YAML configuration with 8 replicas and app: nginx as the label selector.
```yaml
# nginx-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
  labels:
    app: nginx
spec:
  replicas: 8
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        ports:
        - containerPort: 80
```
kubectl apply -f nginx-deploy.yaml
kubectl get deploy
- Verify that the pods are scheduled on both nodes.
- Create a PDB for the above deployment with the minAvailable field set to 5 and app:nginx as the label selector.
```yaml
# pdb1.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-minavail
spec:
  minAvailable: 5
  selector:
    matchLabels:
      app: nginx
```
kubectl apply -f pdb1.yaml
kubectl get pdb
kubectl describe pdb pdb-minavail
- Now, let's drain the node and see the PDB at work.
kubectl drain <node_name> --ignore-daemonsets --delete-emptydir-data
It tries to evict the pods and retries until all are evicted. My node is now marked SchedulingDisabled, and all the pods running on it have been drained.
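Under the hood, `kubectl drain` does not delete pods directly; it submits an Eviction request for each pod, and it is this request that the PDB gates. A sketch of the request body (the pod name is a placeholder):

```yaml
# Sketch of the Eviction that kubectl drain submits for each pod.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: nginx-deploy-abc123   # placeholder pod name
  namespace: default
```

If honoring the eviction would violate the PDB, the API server rejects it, and drain retries.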
- Let’s check the pod status and verify whether they have been rescheduled.
kubectl get pods -o wide
Here, 4 pods were running on each node; when I drained the node, its pods were evicted and rescheduling began. One pod was rescheduled on the other node, while three went into the Pending state because the node had insufficient resources left.
Still, the node drain completed successfully, because the PDB requirement of 5 running application pods was fulfilled. The eviction API respected the PDB during this voluntary disruption.
- Now uncordon the node to make the remaining pods schedulable again.
kubectl uncordon <node_name>
- Now, increase the existing PDB's minAvailable field to 6 and drain the node again to see what happens.
```yaml
# pdb1.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-minavail
spec:
  minAvailable: 6
  selector:
    matchLabels:
      app: nginx
```
kubectl apply -f pdb1.yaml
kubectl get pdb
- Once again, drain the cluster node.
kubectl drain <node_name> --ignore-daemonsets --delete-emptydir-data
You will observe that the drain does not complete: the eviction API keeps retrying to evict pods until they can be rescheduled on another node, and throws an error: cannot evict pod as it would violate the pod's disruption budget.
But why did this happen? The PDB requires a minimum of 6 available pods, while the other node can only schedule 5 pods given its resource capacity. As mentioned, the eviction API gives priority to the PDB, so to keep at least 6 pods available it will not fully drain the node and leaves the remaining pods running on it.
Although the node is marked SchedulingDisabled, it is not drained.
- Delete the PDB and uncordon the node.
kubectl delete pdb pdb-minavail
kubectl uncordon <node_name>
- Now, create another pdb resource with maxUnavailable set to 3.
```yaml
# pdb2.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-maxunavail
spec:
  maxUnavailable: 3
  selector:
    matchLabels:
      app: nginx
```
kubectl apply -f pdb2.yaml
kubectl describe pdb pdb-maxunavail
- Follow the same steps of draining a node to see the PDB in action.
kubectl drain <node_name> --ignore-daemonsets --delete-emptydir-data
- The drain completes, but the rest of the pods remain unschedulable (Pending). Uncordon the node now.
Reduce the maxUnavailable field to 2, which requires 6 pods to be running at all times. If you drain the node now, the scenario from use case 2 repeats: the drain will not complete, and the pods will not be fully evicted, because the eviction API gives precedence to the PDB.
Now, what if I set maxUnavailable to 0? This is equivalent to setting minAvailable to 100%: it ensures that none of your pods is disrupted when voluntary disruptions occur.
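As a sketch, such a zero-disruption budget would look like this (the name is an assumption); note that `kubectl drain` will then never complete for the selected pods:

```yaml
# Hypothetical PDB: forbid all voluntary disruptions for these pods.
# Equivalent in effect to minAvailable: 100%.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-zero-disruption
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
```

Use this sparingly; it forces the cluster admin to delete or relax the PDB before any node maintenance can proceed.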
There are certain cases where a PDB cannot help:
- It doesn't work for involuntary disruptions.
- Even for voluntary disruptions, it does not apply when pods or deployments are deleted directly.
- Two PDBs cannot work together on the same set of pods.
- A PDB cannot make a single-pod (single-replica) deployment highly available.
PDBs are useful when we want applications to stay available even during cluster maintenance and upgrades.