How automatic repair works in Azure Kubernetes Service

#azure #kubernetes #aks #repair

AKS continuously monitors the health of worker nodes and automatically repairs them when they become unhealthy. The Azure virtual machine (VM) platform, in turn, performs maintenance on VMs that are experiencing issues.

Together, AKS and Azure VMs work to minimize service disruptions for clusters.

In this article, we'll learn how the automatic repair function works for both Windows and Linux nodes.

How AKS checks for unhealthy nodes

AKS uses the following rules to determine whether a node is unhealthy:

The node reports NotReady status on consecutive checks.
The node reports no status within 10 minutes.
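The two rules can be sketched as a small shell function. This is only an illustration, not how AKS implements its checks: the function name, both inputs, and the threshold of two consecutive checks are assumptions made for this example.

```shell
# Sketch of the two detection rules; not AKS's actual implementation.
# $1: number of consecutive checks that reported NotReady (hypothetical input)
# $2: seconds since the node last reported any status (hypothetical input)
STATUS_TIMEOUT_SECONDS=600    # 10 minutes, taken from the second rule
MIN_CONSECUTIVE_NOT_READY=2   # assumption: "consecutive" means at least two

# Returns 0 (true) when either rule marks the node as unhealthy.
node_is_unhealthy() {
  [ "$1" -ge "$MIN_CONSECUTIVE_NOT_READY" ] || [ "$2" -ge "$STATUS_TIMEOUT_SECONDS" ]
}

# Usage:
#   if node_is_unhealthy 3 30; then echo "unhealthy"; fi
```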

We can also check the health state of our nodes manually with kubectl.

kubectl get nodes
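On larger clusters it can help to filter that output down to the nodes that are not ready. A minimal sketch, assuming the default column order of kubectl get nodes (NAME, STATUS, ROLES, AGE, VERSION); the helper name not_ready_nodes is made up for this example.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
# Assumes the default column order: NAME STATUS ROLES AGE VERSION.
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not printed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (needs access to a cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```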

How automatic repair works

If AKS finds a node that stays unhealthy for 10 minutes, it takes the following actions:

Reboot the node.
If the reboot is not successful, reimage the node.
If the reimage is not successful, redeploy the node.

If none of these actions succeeds, AKS engineers investigate alternative remedies.

If AKS finds multiple unhealthy nodes during a health check, it repairs each node individually before starting another repair.
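The escalation order (reboot, then reimage, then redeploy) can be pictured as a loop that stops at the first action that succeeds. The sketch below is purely illustrative: repair_node and the three stand-in functions are hypothetical and do not call any AKS API.

```shell
# Try each repair action in order and stop at the first one that succeeds.
# The action names are passed as arguments; they are placeholders here,
# not real AKS operations.
repair_node() {
  for action in "$@"; do
    if "$action"; then
      echo "repaired by: $action"
      return 0
    fi
  done
  echo "auto-repair failed: escalate to engineers"
  return 1
}

# Hypothetical stand-ins that simulate a failed reboot and a successful reimage.
reboot_node()   { return 1; }
reimage_node()  { return 0; }
redeploy_node() { return 0; }

repair_node reboot_node reimage_node redeploy_node
# prints "repaired by: reimage_node"
```

Because the loop returns as soon as one action works, the later and more disruptive actions only run when the earlier ones have failed.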

Node Autodrain

Scheduled Events can occur on the underlying virtual machines in any of our node pools. For spot node pools, scheduled events may additionally cause a preempt node event for the node.

Certain events, such as preempt, cause AKS to attempt a cordon and drain of the affected nodes, which allows for a graceful rescheduling of any affected workload on that node.
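For comparison, this is roughly what a manual cordon and drain looks like with kubectl. It is a sketch of the manual equivalent, not of what AKS runs internally; the wrapper name cordon_and_drain and the KUBECTL override variable are assumptions added so the commands can be dry-run.

```shell
# Command used to talk to the cluster; override it (for example KUBECTL=echo)
# to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# Mark a node unschedulable, then evict its pods so they get rescheduled.
# kubectl drain also cordons the node; the explicit cordon mirrors the
# "cordon and drain" wording used in this article.
cordon_and_drain() {
  "$KUBECTL" cordon "$1" &&
    "$KUBECTL" drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Usage (needs access to a cluster; the node name is a placeholder):
#   cordon_and_drain <node-name>
```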

When this happens on a spot node, we might notice that the node receives a taint with the key "remediator.aks.microsoft.com/unschedulable", because the node is marked with "kubernetes.azure.com/scalesetpriority: spot".
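To see which nodes currently carry that taint, we can list the taint keys per node and filter them. A minimal sketch, assuming kubectl's custom-columns output prints one line per node with the node name followed by its comma-separated taint keys; the helper name drain_tainted_nodes is made up for this example.

```shell
# Print the names of nodes that carry the auto-drain taint.
# Reads "NAME TAINT-KEYS" lines on stdin, as produced by the kubectl
# command in the usage note below.
drain_tainted_nodes() {
  awk 'index($2, "remediator.aks.microsoft.com/unschedulable") { print $1 }'
}

# Usage (needs access to a cluster):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' \
#     | drain_tainted_nodes
```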

The following table shows these events and the actions they cause for AKS.

Event	Description	Action
Freeze	The VM is going to pause for a few seconds. CPU and network connectivity may be suspended, but there is no impact on memory or open files.	No action
Reboot	The VM is going to be rebooted. The non-persistent memory is lost.	No action
Redeploy	The VM is going to be redeployed. The ephemeral disks are lost.	Cordon and drain
Preempt	The spot VM is being deleted. The ephemeral disks are lost.	Cordon and drain
Terminate	The VM is going to be deleted.	Cordon and drain
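The table can also be read as a simple lookup from event type to action. The function below only restates the table; the name drain_action and the handling of unknown events are assumptions for this example.

```shell
# Map a scheduled event type to the action listed in the table above.
# Unknown event names print a message and return a non-zero status.
drain_action() {
  case "$1" in
    Freeze|Reboot)              echo "No action" ;;
    Redeploy|Preempt|Terminate) echo "Cordon and drain" ;;
    *)                          echo "Unknown event"; return 1 ;;
  esac
}

# Usage:
#   drain_action Preempt
```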

Limitations

In many cases, AKS can determine that a node is unhealthy and attempt to repair it, but in some cases AKS either can't repair the issue or can't detect that there is one. For example, AKS can't detect issues if a node's status is not reported because of an error in the network configuration, or if the node failed to initially register as a healthy node.

Thanks for reading my article to the end. I hope you learned something new today. If you enjoyed this article, please share it with your friends, and if you have suggestions or thoughts to share with me, please write them in the comment box.

Above blog is submitted as part of 'Devtron Blogathon 2022' - https://devtron.ai/
Check out Devtron's GitHub repo - https://github.com/devtron-labs/devtron/ and give a ⭐ to show your love & support.
Follow Devtron on LinkedIn - https://www.linkedin.com/company/devtron-labs/ and Twitter - https://twitter.com/DevtronL/, to keep yourself updated on this Open Source project.
