No exaggeration, unfortunately. As a disclaimer, I will add that this is a really stupid mistake and shows my lack of experience managing auto-scaling deployments. However, it all started with a question with no answer and I feel obliged to share my learnings to help others avoid similar pitfalls.
What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs 1x n1-standard-96 (96 vCPUs) VM, or 6x n1-standard-16 (16 vCPUs) VMs?
I asked this question multiple times in the Kubernetes community. No one suggested an answer. If you are unsure about the answer, then there is something for you to learn from my experience (or, if you are impatient, skip to Answer). Here it goes:
Premise
I woke up in the middle of the night with a determination to reduce our infrastructure costs.
We are running a large Kubernetes cluster. "Large" is relative, of course. In our case that is 600 vCPUs during normal business hours. This number doubles during peak hours and drops to near 0 during some hours of the night.
The invoice for last month was USD 3,500.
This is already pretty darn good given the computing power that we get, and Google Kubernetes Engine (GKE) made cost management mostly easy:
- We use the least expensive data center (europe-west2 (London) is ≈15% more expensive than europe-west4 (Netherlands))
- We use different machine types for different deployments (memory heavy vs CPU heavy)
- We use Horizontal Pod Autoscaler (HPA) and Custom Metrics to scale deployments
- We use cluster autoscaler (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler) to scale node pools
- We use preemptible VMs
Using exclusively preemptible VMs is what allows us to keep the costs low. To illustrate the savings, in the case of the n1-standard-1 machine type hosted in europe-west4, the difference between a dedicated and a preemptible VM is USD 26.73/month vs USD 8.03/month. That is roughly 3.3x lower cost. Of course, preemptible VMs have their limitations that you need to familiarise yourself with and counteract, but that is a whole different topic.
With all of the above in place, it felt like we were doing all the right things to keep the costs low. However, I always had a nagging feeling that something was off.
Major red flag 🚩
About that nagging feeling:
Average CPU usage per Node was low (10%-20%). This didn't seem right.
My first thought was that I had misconfigured compute resources. What resources are required depends entirely on the program that you are running. Therefore, the best thing to do is to deploy your program without resource limits, observe how it behaves during idle, regular and peak loads, and set requested and limit resources based on the observed values.
I will illustrate my mistake through an example of a single deployment "admdesl".
In our case, resource requirements are sporadic:
NAME CPU(cores) MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc 3m 112Mi
admdesl-5fcfbb5544-mfsvf 3m 118Mi
admdesl-5fcfbb5544-nj49v 4m 107Mi
admdesl-5fcfbb5544-nkvk9 3m 103Mi
admdesl-5fcfbb5544-nxbrd 3m 117Mi
admdesl-5fcfbb5544-pb726 3m 98Mi
admdesl-5fcfbb5544-rhhgn 83m 119Mi
admdesl-5fcfbb5544-rhp76 2m 105Mi
admdesl-5fcfbb5544-scqgq 4m 117Mi
admdesl-5fcfbb5544-tn556 49m 101Mi
admdesl-5fcfbb5544-tngv4 2m 135Mi
admdesl-5fcfbb5544-vcmjm 22m 106Mi
admdesl-5fcfbb5544-w9dsv 180m 100Mi
admdesl-5fcfbb5544-whwtk 3m 103Mi
admdesl-5fcfbb5544-wjnnk 132m 110Mi
admdesl-5fcfbb5544-xrrvt 4m 124Mi
admdesl-5fcfbb5544-zhbqw 4m 112Mi
admdesl-5fcfbb5544-zs75s 144m 103Mi
Pods that average 5m are "idle": there is a task in the queue for them to process, but we are waiting for some (external) condition to clear before proceeding. In the case of this particular deployment, these pods will change between idle and active state multiple times every minute and spend 70%+ of their time in the idle state.
A minute later the same set of pods will look different:
NAME CPU(cores) MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc 152m 107Mi
admdesl-5fcfbb5544-mfsvf 49m 102Mi
admdesl-5fcfbb5544-nj49v 151m 116Mi
admdesl-5fcfbb5544-nkvk9 105m 100Mi
admdesl-5fcfbb5544-nxbrd 160m 119Mi
admdesl-5fcfbb5544-pb726 6m 103Mi
admdesl-5fcfbb5544-rhhgn 20m 109Mi
admdesl-5fcfbb5544-rhp76 110m 103Mi
admdesl-5fcfbb5544-scqgq 13m 120Mi
admdesl-5fcfbb5544-tn556 131m 115Mi
admdesl-5fcfbb5544-tngv4 52m 113Mi
admdesl-5fcfbb5544-vcmjm 102m 104Mi
admdesl-5fcfbb5544-w9dsv 18m 125Mi
admdesl-5fcfbb5544-whwtk 173m 122Mi
admdesl-5fcfbb5544-wjnnk 31m 110Mi
admdesl-5fcfbb5544-xrrvt 91m 126Mi
admdesl-5fcfbb5544-zhbqw 49m 107Mi
admdesl-5fcfbb5544-zs75s 87m 148Mi
Looking at the above, I thought that it made sense to have a configuration such as:
resources:
  requests:
    memory: '150Mi'
    cpu: '20m'
  limits:
    memory: '250Mi'
    cpu: '200m'
This translates to:
- idle pods don't consume more than 20m
- active (healthy) pods peak at 200m
However, when I used this configuration, it made deployments hectic.
admdesl-78fc6f5fc9-xftgr 0/1 Terminating 3 21m
admdesl-78fc6f5fc9-xgbcq 0/1 Init:CreateContainerError 0 10m
admdesl-78fc6f5fc9-xhfmh 0/1 Init:CreateContainerError 1 9m44s
admdesl-78fc6f5fc9-xjf4r 0/1 Init:CreateContainerError 0 10m
admdesl-78fc6f5fc9-xkcfw 0/1 Terminating 0 20m
admdesl-78fc6f5fc9-xksc9 0/1 Init:0/1 0 10m
admdesl-78fc6f5fc9-xktzq 1/1 Running 0 10m
admdesl-78fc6f5fc9-xkwmw 0/1 Init:CreateContainerError 0 9m43s
admdesl-78fc6f5fc9-xm8pt 0/1 Init:0/1 0 10m
admdesl-78fc6f5fc9-xmhpn 0/1 CreateContainerError 0 8m56s
admdesl-78fc6f5fc9-xn25n 0/1 Init:0/1 0 9m6s
admdesl-78fc6f5fc9-xnv4c 0/1 Terminating 0 20m
admdesl-78fc6f5fc9-xp8tf 0/1 Init:0/1 0 10m
admdesl-78fc6f5fc9-xpc2h 0/1 Init:0/1 0 10m
admdesl-78fc6f5fc9-xpdhr 0/1 Terminating 0 131m
admdesl-78fc6f5fc9-xqflf 0/1 CreateContainerError 0 10m
admdesl-78fc6f5fc9-xrqjv 1/1 Running 0 10m
admdesl-78fc6f5fc9-xrrwx 0/1 Terminating 0 21m
admdesl-78fc6f5fc9-xs79k 0/1 Terminating 0 21m
This would happen whenever a Node was added to or removed from the cluster (which happens often due to auto-scaling).
As such, I kept increasing requested pod resources until I ended up with the following configuration for this deployment:
resources:
  requests:
    memory: '150Mi'
    cpu: '100m'
  limits:
    memory: '250Mi'
    cpu: '500m'
With this configuration the cluster was running smoothly, but it meant that even idle Pods were pre-allocated more CPU time than they needed. This is the reason why the average CPU usage per Node was low. However, I didn't know what the solution was (reducing requested resources resulted in a hectic cluster state and outages), and as such I rolled with a variation of generous resource allocation for all the deployments.
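To put a rough number on this, here is a small sketch that takes the CPU column from the first snapshot of pods above and compares the average usage with the 100m request. The helper name is made up for this illustration, and a single snapshot is of course a small sample.

```python
# CPU column (millicores) copied from the first snapshot of admdesl pods above.
SNAPSHOT_MILLICORES = [3, 3, 4, 3, 3, 3, 83, 2, 4, 49, 2, 22, 180, 3, 132, 4, 4, 144]

# CPU request from the configuration above.
CPU_REQUEST_MILLICORES = 100


def requested_cpu_utilisation(usage_millicores, request_millicores):
    """Fraction of the requested CPU that the pods use on average (hypothetical helper)."""
    average = sum(usage_millicores) / len(usage_millicores)
    return average / request_millicores


average_usage = sum(SNAPSHOT_MILLICORES) / len(SNAPSHOT_MILLICORES)
utilisation = requested_cpu_utilisation(SNAPSHOT_MILLICORES, CPU_REQUEST_MILLICORES)

print(f"average usage: {average_usage:.0f}m")
print(f"share of requested CPU in use: {utilisation:.0%}")
```

For this snapshot the pods use roughly a third of the CPU they request. The scheduler places pods based on requests, so the remainder stays reserved even while it sits idle, and node-level utilisation ends up lower still whenever nodes are not fully packed.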
Answer
Back to my question:
What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs 1x n1-standard-96 (96 vCPUs) VM, or 6x n1-standard-16 (16 vCPUs) VMs?
For starters, there is no price-per-vCPU difference between n1-standard-1 and n1-standard-96. Therefore, I reasoned that using a machine with fewer vCPUs is going to give me more granular control over the price.
The other consideration I had was how fast the cluster will auto-scale, i.e. if there is a sudden surge, how fast can the cluster autoscaler provision new nodes for the unscheduled pods. This was not a concern though: our resource requirements grow and shrink gradually.
And so I went with mostly 1 vCPU nodes, the consequences of which I have described in Premise.
In retrospect, it was an obvious mistake: distributing pods across nodes with a single vCPU does not allow efficient resource utilisation as individual deployments change between idle and active states. Put another way, the more vCPUs you have on the same machine, the more tightly you can pack pods, because when some of the pods go over their requested quota, there are readily available resources for them to take.
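The pooling effect can be illustrated with a back-of-the-envelope model. This is only a sketch under stated assumptions, not a description of how the scheduler or kubelet behaves: every pod is independently active 30% of the time, an active pod uses 150m (an assumed figure), an idle pod uses 5m, nodes are packed at an assumed 15 pods per vCPU, and system reservations are ignored. It computes the probability that, at a given instant, the pods on a node want more CPU than the node has.

```python
from math import comb

IDLE_MILLICORES = 5      # idle usage, as described earlier
ACTIVE_MILLICORES = 150  # assumed usage of an active pod
P_ACTIVE = 0.3           # pods spend 70%+ of their time idle
PODS_PER_VCPU = 15       # assumed packing density


def overload_probability(vcpus):
    """Probability that total pod demand exceeds the node's CPU at a given instant."""
    pods = vcpus * PODS_PER_VCPU
    capacity = vcpus * 1000  # millicores; ignores system reservations
    probability = 0.0
    # Sum the binomial probabilities of every "number of active pods" that overloads the node.
    for active in range(pods + 1):
        demand = active * ACTIVE_MILLICORES + (pods - active) * IDLE_MILLICORES
        if demand > capacity:
            probability += (
                comb(pods, active) * P_ACTIVE**active * (1 - P_ACTIVE) ** (pods - active)
            )
    return probability


for vcpus in (1, 16):
    print(f"{vcpus:>2} vCPU node: P(overload) = {overload_probability(vcpus):.6f}")
```

With these assumed numbers, the single vCPU node is oversubscribed about 13% of the time, while the 16 vCPU node, at the same pods-per-vCPU density, is oversubscribed several orders of magnitude less often. The exact figures depend entirely on the assumptions, but the direction of the effect is the point: bursts average out when more pods share a larger pool.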
What worked:
- I switched to 16 vCPU machines because they provide a balanced solution between fine resource control when auto-scaling the cluster and sufficient resources per machine to enable tight scheduling of pods that are going through idle/ active states.
- I used a resource configuration that requests only marginally more than the resources needed during an idle state, but has generous limits. This allows many pods to be scheduled on the same machine when the majority of the pods are idle, while still allowing resource-intensive bursts.
- I switched to the n2 machine type: n2 machines are more expensive, but they have a 2.8 GHz base frequency (compared with ~2.2 GHz available to n1-* machines). We are taking advantage of the higher clock frequency to process resource-intensive tasks as fast as possible and put pods into the earlier described idle state as quickly as possible.
Current average Node vCPU utilisation is up to 60%. This sounds about right. It will take some time to determine what the savings are. However, today we are already using less than half the vCPUs that we used at the same time yesterday.
Top comments (9)
It'll be interesting to hear what you saved.
One of the most important things I was told earlier in my career was that "idle resources are wasted resources".
It's even more true since we've moved to cloud platforms with wild auto-scaling like this.
Something else that I learned while trying to utilise resources effectively was to try to have a minimal number of daemon pods.
Having a larger number of smaller nodes also results in more resources being allocated to daemon pods, as they need to run on every node. We're now trying to reduce the number of nodes as much as other constraints allow us to and have seen fewer resources being spent on daemon pods.
What was the change in the invoice after this? Was it significant enough?
This can be useful for many projects
@gajus Can't wait to see the result :)
Also, running 100 (1 vCPU) nodes means 100 operating systems consuming CPU and memory, plus daemon pods consuming resources on every node. Correct me if I'm wrong.
Amazon EKS's default networking-related low pod limits likely mess with this. I can run 35 pods on a 2 vCPU node or 58 on a 4 vCPU node. (An 8 vCPU node also supports only 58 pods.) (That is on the t3* series; the m* ones are worse.)
Thank you so much for taking time to share it, @gajus ! Very useful; we're running production loads where we're heading towards a scale-question scenario and it's great to know this in advance!
Depending on the runtime, 1 vCPU might not be enough and could cause problems. For instance, not every garbage collector run is blocking, but with 1 vCPU they are.
Mostly you can optimize resources by writing software which uses less.