Over the last five years, we have set up and managed 7+ Kubernetes clusters, ranging in size from 5 to 1,500 nodes.
Here are the 10 most common mistakes we have encountered while managing them:
1. Low or No CPU requests and limits:
Missing or too-low CPU requests let the scheduler overcommit nodes. In times of high demand the node's CPUs are fully utilized, each workload is only guaranteed roughly what it requested, and anything beyond its limit is throttled, causing increased application latency, timeouts, and so on.
2. Overcommitting Memory requests and limits:
This can get you into even more trouble than CPU. Reaching the CPU limit only leads to throttling, but reaching the memory limit results in an OOMKill, and overcommitted memory requests mean the node itself can run out of memory under load and start evicting pods.
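As a minimal sketch covering both points above (the numbers are placeholders, not a recommendation for your workload), this is what a resources block inside a container spec could look like; setting the memory request equal to the memory limit keeps memory from being overcommitted on the node:

```yaml
resources:
  requests:
    cpu: 250m        # what the scheduler reserves for this container on a node
    memory: 256Mi
  limits:
    cpu: 500m        # usage above this is throttled
    memory: 256Mi    # usage above this gets the container OOMKilled
```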
3. Not setting liveness and readiness probes:
How would your service get restarted when it hits an unrecoverable error? How would a load balancer know that a specific pod can start handling traffic, or handle more of it?
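A minimal sketch of both probes, assuming a hypothetical container that exposes /healthz and /ready endpoints on port 8080 (names, image, and timings are placeholders):

```yaml
containers:
  - name: api                     # hypothetical container
    image: registry.example.com/api:1.4.2
    livenessProbe:                # failing this restarts the container
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:               # failing this removes the pod from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```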
4. Using a LoadBalancer for every Service:
If you expose a Kubernetes Service as type: LoadBalancer, its (vendor-specific) controller will provision an external load balancer, and that gets expensive as you create many of them.
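A common alternative is to run a single ingress controller behind one load balancer and route to plain ClusterIP Services through Ingress rules. A sketch, assuming an nginx ingress controller is installed and a hypothetical Service named web:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx          # assumes an nginx ingress controller in the cluster
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web          # a regular ClusterIP Service, no cloud LB per Service
                port:
                  number: 80
```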
5. Non-Kubernetes-aware cluster autoscaling:
Imagine a new pod needs to be scheduled, but all of the available CPU is already requested, so the pod is stuck in the Pending state. A cloud autoscaler that only watches average CPU usage (not requests) sees no pressure, won't scale out (won't add a new node), and the pod never gets scheduled.
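The Kubernetes Cluster Autoscaler avoids this by reacting to unschedulable (Pending) pods instead of node CPU usage. Purely as an illustration, an excerpt of its container args on AWS might look like the sketch below; the node-group name and bounds are placeholders, and you should check the project docs for the flags that apply to your setup:

```yaml
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:10:my-node-group      # min:max:node-group name (placeholder)
  - --expander=least-waste          # pick the node group that wastes the least capacity
  - --balance-similar-node-groups
```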
6. Not using the power of IAM/RBAC:
Don't use IAM users with permanent secrets for machines and applications; generate temporary credentials with roles and service accounts instead.
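A minimal sketch of scoping an application with a ServiceAccount and RBAC instead of long-lived credentials (names and namespace are hypothetical; cloud-side access would additionally use roles tied to the service account rather than static keys):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: prod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: config-reader
  namespace: prod
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]          # read-only, nothing more
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-sa-config-reader
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: prod
roleRef:
  kind: Role
  name: config-reader
  apiGroup: rbac.authorization.k8s.io
```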
7. No self anti-affinity for pods:
You can't expect the Kubernetes scheduler to spread your pods for you. If you don't want all replicas of a Deployment to land on the same node, you have to define anti-affinity explicitly.
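A sketch of a self anti-affinity that keeps replicas of a hypothetical app: api Deployment off the same node; the preferredDuringSchedulingIgnoredDuringExecution variant is the softer option if you can tolerate occasional co-location:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: api                          # must match the pod template's own labels
        topologyKey: kubernetes.io/hostname   # i.e. at most one replica per node
```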
8. No PodDisruptionBudget:
You run production workloads on Kubernetes, and your nodes and cluster have to be upgraded or decommissioned from time to time. A PodDisruptionBudget (PDB) is, in effect, an API for service guarantees between cluster administrators and cluster users: it tells Kubernetes how many replicas must stay available during voluntary disruptions such as node drains.
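A minimal PDB, assuming a hypothetical Deployment labelled app: api with several replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # voluntary disruptions (e.g. node drains) must leave at least 2 pods running
  selector:
    matchLabels:
      app: api
```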
9. Multiple tenants or environments in a shared cluster:
Kubernetes namespaces don't provide any strong isolation.
People seem to expect that if they put non-prod workloads in one namespace and prod in another, one workload will never affect the other. But namespaces still share the same nodes, kernel, and control plane, so a noisy neighbour in non-prod can still degrade prod.
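Namespaces do at least let you cap the blast radius. A sketch of a ResourceQuota for a hypothetical team-a namespace (the numbers are placeholders, and a quota limits resource consumption but is still not real isolation):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"      # total CPU requests allowed in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
```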
10. Using latest tag:
The latest tag makes it difficult to track which version of the image is actually running and makes rollbacks hard, because the tag is mutable and can point to a different image over time.
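Pin images to an explicit version tag, or to a digest if you want full immutability; the image name and tag below are placeholders:

```yaml
containers:
  - name: api
    image: registry.example.com/api:1.4.2             # explicit, traceable version
    # or pin by digest for complete immutability:
    # image: registry.example.com/api@sha256:<digest>
```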
Businesses tend to see Kubernetes as a solution to every problem. It is not a silver bullet. If you are not careful, you can end up with a lot of complexity, stress, and a slow control plane. I hope this helps you avoid the most common pitfalls.
Thanks for reading this.
If you have an idea and want to build your product around it, schedule a call with me.
If you want to learn more about DevOps and Backend space, follow me.
If you want to connect, reach out to me on Twitter and LinkedIn.