Kubernetes v1.27, released in April 2023, came with an exciting announcement: we can now resize pod CPU and memory requests and limits in place, without deleting the pod or even restarting its containers!
That was more than a year ago, and since then a lot of folks seem to think this feature is already generally available or about to become so.
But in reality it was released as an Alpha feature and has since failed to graduate to Beta due to a number of unresolved issues.
The latest status as of June 2024 is that it has been pushed back to v1.32:
Here's the link to that comment on GitHub.
So first of all - this isn't coming tomorrow. But we can still play with the feature and understand its advantages and shortcomings. Which is exactly what I'm planning to do in this post.
Get a Cluster with Alpha Features
k3d is irreplaceable when we want to quickly and cheaply test Kubernetes Alpha features. All we need to do is pass the correct feature gate to the correct control plane component.
Install k3d
If you still haven't done so - install k3d:
with curl and bash:
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
or with another method of your choice listed here
In our case the component is the API server and the feature gate is called InPlacePodVerticalScaling, as can be seen here.
I'm spinning up a single-node cluster with the following config:
```sh
cat <<'EOF' | k3d cluster create -c -
apiVersion: k3d.io/v1alpha3
kind: Simple
name: pod-resize
servers: 1
image: rancher/k3s:v1.30.2-k3s2
options:
  k3d:
    disableLoadbalancer: true
  k3s:
    extraArgs: # the feature gate is passed here
      - arg: --kube-apiserver-arg=feature-gates=InPlacePodVerticalScaling=true
        nodeFilters:
          - server:*
EOF
```
The Happy Path - Updating the CPU
Now let's create a pod with one container defining resource requests and limits.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stress
spec:
  containers:
  - image: progrium/stress
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"]
    name: stress
    resources:
      requests:
        memory: 150M
        cpu: 100m
      limits:
        memory: 150M
        cpu: 100m
```
You can create the pod with:
kubectl apply -f https://raw.githubusercontent.com/perfectscale-io/inplace-pod-resize/main/guaranteed.yaml
I'm using progrium/stress
and setting it up for slow success by requesting a tenth of the CPU it needs and just enough memory.
stress --vm 1 --vm-bytes 128M --vm-hang 3
tells stress to spawn one worker that allocates 128MB of memory and then releases it every 3 seconds.
My pod is currently only allowed 150M of memory, so I expect it to run fine.
stress --cpu 1
, on the other hand, tells the container to use one whole CPU, while it's actually only allowed 0.1 CPU. So it'll surely get throttled.
The container starts just fine:
kubectl get pod
NAME READY STATUS RESTARTS AGE
stress 1/1 Running 0 7s
After a few minutes I can also check its resource consumption by running:
kubectl top pod stress
NAME CPU(cores) MEMORY(bytes)
stress 101m 131Mi
It's running happily, consuming 101m of CPU and 131Mi of memory. All within the limits.
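Note that kubectl top reports memory in Mi (binary, powers of 1024) while the manifest uses M (decimal, powers of 1000), so the numbers aren't directly comparable at a glance. A quick sketch of converting both to bytes (parse_quantity is a hypothetical helper, not part of any kubectl tooling, and it covers only the suffixes used in this post):

```python
# Convert Kubernetes resource quantities to bytes so decimal (M)
# and binary (Mi) units can be compared directly.
UNITS = {"K": 10**3, "M": 10**6, "G": 10**9,
         "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def parse_quantity(q: str) -> int:
    # Try longer suffixes first so "Mi" isn't mistaken for "M"
    for suffix in sorted(UNITS, key=len, reverse=True):
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * UNITS[suffix]
    return int(q)

limit = parse_quantity("150M")   # 150,000,000 bytes
usage = parse_quantity("131Mi")  # 137,363,456 bytes -- Mi units are bigger than M
print(usage < limit)             # True: still inside the limit
```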
Pod QoS Matters
Now let's try to increase our container's limits in-place to give it more resources and see what happens:
kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": { "limits": {"cpu":"300m","memory":"250M"}}}]}}'
Oops! That didn't work!
We're getting:
The Pod "stress" is invalid: metadata: Invalid value: "Guaranteed": Pod QoS is immutable
So what we now know is that while we can change the values of requests and limits, we can't change the pod's QoS class. I.e., the relationship between the requests and the limits has to stay the same.
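The QoS class is derived from the requests/limits relationship once, at pod creation. A simplified sketch of that derivation (the real rules also cover multi-container pods and defaulted limits) shows why our patch was rejected:

```python
# Simplified sketch of how Kubernetes derives a pod's QoS class
# from a single container's resources.
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

before = {"cpu": "100m", "memory": "150M"}
print(qos_class(before, before))  # Guaranteed

# Patching only the limits makes them diverge from the requests,
# which would flip the class to Burstable -- hence the rejection.
new_limits = {"cpu": "300m", "memory": "250M"}
print(qos_class(before, new_limits))  # Burstable
```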
Updating the Resources
Let's try to update both the requests and the limits while staying within the Guaranteed QoS:
kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu":"300m","memory": "250M"}, "limits": {"cpu":"300m","memory":"250M"}}}]}}'
pod/stress patched
If we now watch kubectl top pod stress
we will see how the container gradually gets the additional CPU time:
The CGroups Behind the Scenes
Now, being the curious cat that I am - I wanted to check how this works behind the scenes. I know there are cgroups involved in setting container resource restrictions but I like checking myself how stuff works.
The great thing about k3d is that it's very easy to get into your nodes with a simple docker exec.
docker exec -it k3d-pod-resize-server-0 sh
Now I want to find my container and identify the path to its cgroup definition.
Find the container ID using ctr
- the containerd command-line utility:
ctr c ls | grep stress
a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460 docker.io/progrium/stress:latest io.containerd.runc.v2
and then - find the cgroup information for my container:
ctr c info a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460 | grep cgroup
which will give me something like:
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"cgroupsPath": "/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460",
"type": "cgroup"
The important parts here are /sys/fs/cgroup
where all the cgroup definitions are found and the cgroupsPath
- where the specific constraints for this container are defined.
You'll notice there's a hierarchy there - first we have the pod...
directory and then the directory named after the container ID. Since this is a single-container pod, all the cgroup values are featured in the parent folder. So that's where we're going to look.
cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max
249999360
That's right - 250MB of memory in bytes!
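Well, almost - 250M is 250,000,000 bytes, not 249,999,360. As far as I can tell, the kernel rounds memory.max down to a whole number of pages. A small sketch of that rounding (assuming the typical 4096-byte page size) reproduces exactly the values we read from the cgroup files:

```python
PAGE_SIZE = 4096  # typical x86-64 page size (an assumption)

def cgroup_memory_max(limit_bytes: int, page_size: int = PAGE_SIZE) -> int:
    # Round the requested limit down to a whole number of pages,
    # as the kernel appears to do when writing memory.max
    return (limit_bytes // page_size) * page_size

print(cgroup_memory_max(250_000_000))  # 249999360 -- the memory.max we saw
print(cgroup_memory_max(150_000_000))  # 149999616 -- the 150M case later on
```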
cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/cpu.max
30000 100000
And that's correct too! According to the Red Hat documentation:
The first value is the allowed time quota in microseconds for which all processes collectively in a child group can run during one period. The second value specifies the length of the period.
During a single period, when processes in a control group collectively exhaust the time specified by this quota, they are throttled for the remainder of the period and not allowed to run until the next period.
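Putting that together, a CPU limit in millicores maps to the quota/period pair in cpu.max in a straightforward way. A sketch of the conversion (assuming the default 100ms CFS period) matches both values we'll see in this post:

```python
PERIOD_US = 100_000  # default CFS period in microseconds

def cpu_max(millicores: int, period_us: int = PERIOD_US) -> str:
    # A limit of N millicores means N/1000 of the period as quota
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"

print(cpu_max(300))  # "30000 100000" -- the value we just read
print(cpu_max(100))  # "10000 100000" -- what we'll see after scaling back down
```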
Impact on Scheduling
Another thing I wanted to try is update the requests to more than my node can give and check if the scheduler will try to reschedule my pod to another node because the current one doesn't have the needed capacity.
Let's check how many cpus my node has access to:
kubectl get node -ojsonpath="{ .items[].status.allocatable.cpu } cpus"
8 cpus
I got 8. So let's try to request 10 and see what happens:
kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu": "10"}, "limits": {"cpu":"10"}}}]}}'
pod/stress patched
Alas, while the requests got updated, nothing else happens. The pod doesn't get rescheduled or evicted. Why? No idea.. Had I tried creating it with a 10 CPU request from the beginning, it would have stayed Pending because there aren't any nodes large enough. So I would expect a pod with requests higher than any node can satisfy to get evicted. But maybe my thinking is flawed?
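One place worth looking is the pod status: as I read KEP-1287, an in-place resize the node can't satisfy is supposed to be reported via a pod-level status.resize field (with values like Proposed, InProgress, Deferred, Infeasible) rather than trigger an eviction. A sketch of checking that field on a pod already fetched as JSON (the dict below is made up for illustration):

```python
# Check the alpha-level status.resize field on a pod object.
# The pod dict here is a trimmed, hypothetical example.
def resize_status(pod: dict) -> str:
    return pod.get("status", {}).get("resize", "")

pod = {
    "metadata": {"name": "stress"},
    "status": {"resize": "Infeasible"},  # hypothetical value
}
if resize_status(pod) == "Infeasible":
    print("the node can't satisfy this resize; the pod keeps its old allocation")
```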
Negating Resources
Until now all worked fine because we were only adding resources. Everybody likes having more stuff, nobody likes when stuff is taken away from them.
Let's start by taking back the CPU time we granted in the previous section:
kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu":"100m"}, "limits": {"cpu":"100m"}}}]}}'
pod/stress patched
I'm bringing the CPU requests back to 100m. Quite expectedly, in a couple of seconds kubectl top
will show me that the pod's CPU consumption went down to 100m.
And the cgroup cpu.max
file will get updated as expected:
cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/cpu.max
10000 100000
But what if I try to reduce memory?
kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "150M"}, "limits": {"memory":"150M"}}}]}}'
pod/stress patched
Seems to work fine. Checking the cgroups I see the config has been updated:
cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max
149999616
And what if I need to free even more memory?
kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "100M"}, "limits": {"memory":"100M"}}}]}}'
pod/stress patched
Note that I'm reducing memory to 100M, which should cause my container to get OOMKilled. And the patch seems to work:
kubectl get pod stress -ojsonpath="{ .spec.containers[0].resources }"
{"limits":{"cpu":"100m","memory":"100M"},"requests":{"cpu":"100m","memory":"100M"}}
But I see that the pod continues running!
kubectl get pod
NAME READY STATUS RESTARTS AGE
stress 1/1 Running 0 21m
And checking the cgroup memory.max
file shows why:
cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max
149999616
The cgroup wasn't updated! Looks like something is getting in our way, protecting the container from getting less memory than it's already using. This makes sense as a precaution - taking away memory from a running process may lead to irreversible corruption - but it leaves the container limits holding an incorrect value, which will surely puzzle anyone trying to understand why the container isn't getting OOMKilled.
I would expect some validating admission hook to tell me that memory can't be reduced. Looks like a bug to me.
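Until that happens, the mismatch is at least detectable: the alpha feature adds a resources field to container statuses reporting what is actually configured on the node. Comparing it with the spec is one way to spot a resize the API accepted but the kubelet never applied. A sketch over a trimmed, made-up pod dict:

```python
# Compare desired limits (spec) with applied limits (status) to find
# resizes that were accepted but never took effect on the node.
# The pod dict below is a hypothetical, trimmed example.
def unapplied_limits(pod: dict) -> dict:
    wanted = {c["name"]: c["resources"]["limits"]
              for c in pod["spec"]["containers"]}
    applied = {c["name"]: c.get("resources", {}).get("limits", {})
               for c in pod["status"]["containerStatuses"]}
    return {name: {"wanted": limits, "applied": applied.get(name)}
            for name, limits in wanted.items()
            if applied.get(name) != limits}

pod = {
    "spec": {"containers": [
        {"name": "stress", "resources": {"limits": {"memory": "100M"}}}]},
    "status": {"containerStatuses": [
        {"name": "stress", "resources": {"limits": {"memory": "150M"}}}]},
}
print(unapplied_limits(pod))  # stress wants 100M but still has 150M applied
```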
Saving Hungry Pods
Ok, we found out that, memory being an incompressible resource, we can't really reduce it in-place to a value lower than what the container is already using.
But can we save an OOMing container by giving it more memory?
Let's try that with a similar pod but one that gets only 100M of memory from the get go (while trying to allocate 128):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hungry
spec:
  containers:
  - image: progrium/stress
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"]
    name: stress
    resources:
      requests:
        memory: 100M
      limits:
        memory: 100M
```
kubectl create -f https://raw.githubusercontent.com/perfectscale-io/inplace-pod-resize/main/hungry.yaml
Quite expectedly the container gets OOMKilled almost instantly:
kubectl get pod hungry
NAME READY STATUS RESTARTS AGE
hungry 0/1 OOMKilled 1 (5s ago) 8s
And it will continue restarting and getting OOMkilled until we update its memory limits. So let's save it from this misery by giving it the memory it needs:
kubectl patch pod hungry -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "200M"}, "limits": {"memory":"200M"}}}]}}'
pod/hungry patched
This seems to work fine:
kubectl get pod hungry -ojsonpath="{ .spec.containers[0].resources }"
{"limits":{"memory":"200M"},"requests":{"memory":"200M"}}%
But the pod continues getting killed:
kubectl get pod hungry
NAME READY STATUS RESTARTS AGE
hungry 0/1 OOMKilled 4 (33s ago) 60s
And if we check the cgroup memory.max
file, we'll see why:
cat /sys/fs/cgroup/kubepods/burstable/pod708b8195-0ca0-45e0-9f2b-015f679c98da/memory.max
99999744
Its memory limit never actually got updated!
Why? I wasn't able to find an answer for this one. Why disallow saving containers from getting killed by providing the memory they need? I'm not aware of any technical limitations that would prevent this, and I also didn't find anything in the KEP docs.
So it looks like the only way to fix the OOMKill is still by deleting the pod and creating a new one with more memory.
Summary
In-place pod resizing is a long-awaited feature. Still in Alpha since v1.27, it will hopefully make it to Beta by v1.32 - if the drawbacks and bugs get fixed.
And here are some of them I found:
- Memory can't be reduced lower than currently used. But there's no notification about that.
- Giving more resources than available on the node doesn't lead to pod eviction (true for both CPU and Memory)
- If a pod is getting OOMKilled - it's not possible to give it more memory to save it from getting killed.
Will these eventually get fixed? I certainly hope so. Will the feature make it to Beta by v1.32? Let's keep our fingers crossed.
Something in this post isn't clear or correct? Let me know in the comments.
Thanks for reading and may your pods keep running!