
Ant(on) Weiss

Posted on • Updated on • Originally published at perfectscale.io

We Can Resize Pods without Restarts! Or Can't We?

Kubernetes v1.27, released in April 2023, came with an exciting announcement - we can now resize pod CPU and memory requests and limits in-place! Without deleting the pod or even restarting the containers!

That was more than a year ago, and since then a lot of folks seem to assume the feature is already generally available or is due to become so tomorrow.

But the reality is that it was originally released as an Alpha feature and has since failed to graduate to Beta due to a number of unresolved issues.

The latest status, as of June 2024, is that it has been pushed back to v1.32:

(screenshot: the GitHub comment deferring the feature to v1.32)

Here's the link to that comment on GitHub.

So first of all - this isn't coming tomorrow. But we can still play with the feature and understand its advantages and shortcomings. Which is exactly what I'm planning to do in this post.

Get a Cluster with Alpha Features

k3d is irreplaceable when we want to quickly and cheaply test Kubernetes Alpha features. All we need to do is pass the correct feature gate to the correct control plane component.

Install k3d

If you still haven't done so - install k3d:
with curl and bash:

curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

or with another method of your choice listed here

In our case the component is the API server and the feature gate is called InPlacePodVerticalScaling, as can be seen here.

I'm spinning up a single-node cluster with the following config:

cat <<'EOF' | k3d cluster create -c -
apiVersion: k3d.io/v1alpha3
kind: Simple
name: pod-resize
servers: 1
image: rancher/k3s:v1.30.2-k3s2
options:
  k3d:
    disableLoadbalancer: true
  k3s:
    extraArgs: # the feature gate is passed here
      - arg: --kube-apiserver-arg=feature-gates=InPlacePodVerticalScaling=true
        nodeFilters:
          - server:*
EOF
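Once the cluster is up, it's worth verifying the flag actually reached the server process. One quick way to eyeball this - assuming k3d's default container naming for the config above - is to dump the k3s process command line on the node:

```shell
# Sketch only: check that our feature-gate argument made it into the k3s
# server's command line. Assumes the node container is named
# k3d-pod-resize-server-0 (k3d's default for the "pod-resize" cluster).
docker exec k3d-pod-resize-server-0 \
  sh -c 'tr "\0" " " < /proc/1/cmdline' | grep -o 'InPlacePodVerticalScaling=true'
```

If the grep prints a match, the API server was started with the gate enabled.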

The Happy Path - Updating the CPU

Now let's create a pod with one container defining resource requests and limits.

apiVersion: v1
kind: Pod
metadata:
  name: stress
spec:
  containers:
  - image: progrium/stress
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"]
    name: stress
    resources:
      requests:
        memory: 150M
        cpu: 100m
      limits:
        memory: 150M
        cpu: 100m

You can create the pod with:

kubectl apply -f https://raw.githubusercontent.com/perfectscale-io/inplace-pod-resize/main/guaranteed.yaml

I'm using progrium/stress and setting it up for slow success by requesting a tenth of the CPU it needs and just enough memory.

stress --vm 1 --vm-bytes 128M --vm-hang 3 - this tells stress to spawn one worker that allocates 128 MB of memory, holds it for 3 seconds, then releases it, in a loop.
My pod is currently only allowed 150M of memory, so I expect it to run fine.

Meanwhile, stress --cpu 1 tells the container to use one whole CPU, while it's actually only allowed to use 0.1 CPU. So it'll surely get throttled.

The container starts just fine:

kubectl get pod
NAME     READY   STATUS      RESTARTS   AGE
stress   1/1     Running     0          7s

After a few minutes I can also check its resource consumption by running:

kubectl top pod stress
NAME     CPU(cores)   MEMORY(bytes)
stress   101m         131Mi

It's running happily, consuming 101m of CPU and 131Mi of memory. All within the limits.

Pod QoS Matters

Now let's try to increase our container's limits in-place to give it more resources and see what happens:

kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": { "limits": {"cpu":"300m","memory":"250M"}}}]}}'

Oops! That didn't work!
We're getting:

The Pod "stress" is invalid: metadata: Invalid value: "Guaranteed": Pod QoS is immutable

So what we now know is that while we can change the values of limits and requests, we can't change the pod's QoS class. Our pod is Guaranteed (requests equal limits), and raising only the limits would have turned it into a Burstable pod. I.e., the relationship between the requests and the limits has to stay the same.
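A related knob the same feature introduces (which we're not using here) is the per-container resizePolicy field, declaring whether resizing a given resource requires a container restart. A hypothetical variant of our pod spec could look like this:

```yaml
# Sketch only: resizePolicy is part of the InPlacePodVerticalScaling feature.
# NotRequired (the default) applies the change in place;
# RestartContainer restarts the container when that resource is resized.
spec:
  containers:
  - name: stress
    image: progrium/stress
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: RestartContainer
```

RestartContainer exists for workloads (like the JVM with a fixed heap size) that can't pick up a new memory limit without restarting.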

Updating the Resources

Let's try to update both the requests and the limits while staying within the Guaranteed QoS:

kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu":"300m","memory": "250M"}, "limits": {"cpu":"300m","memory":"250M"}}}]}}'
pod/stress patched

If we now watch kubectl top pod stress we will see how the container gradually gets the additional CPU time:

(screenshot: kubectl top pod stress showing CPU consumption gradually climbing)
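Besides kubectl top, the Alpha API itself reports on the resize: the feature adds a pod-level status.resize field and a containerStatuses[].allocatedResources field. Assuming the feature gate is on, something like this should show them:

```shell
# Inspect the resize-related status fields the alpha feature adds.
# status.resize is only set while a resize is pending or in progress.
kubectl get pod stress -o jsonpath='{.status.resize}{"\n"}'
kubectl get pod stress \
  -o jsonpath='{.status.containerStatuses[0].allocatedResources}{"\n"}'
```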

The CGroups Behind the Scenes

Now, being the curious cat that I am - I wanted to check how this works behind the scenes. I know there are cgroups involved in setting container resource restrictions but I like checking myself how stuff works.
The great thing with k3d is it's very easy to get into your nodes with a simple docker exec.

docker exec -it k3d-pod-resize-server-0 sh

Now I want to find my container and identify the path to its cgroup definition.
Find the container ID using ctr - the containerd command-line utility:

ctr c ls | grep stress
a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460    docker.io/progrium/stress:latest                       io.containerd.runc.v2

and then - find the cgroup information for my container:

ctr c info a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460 | grep cgroup


which will give me something like:

"destination": "/sys/fs/cgroup",
                "type": "cgroup",
                "source": "cgroup",
            "cgroupsPath": "/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460",
                    "type": "cgroup"

The important parts here are /sys/fs/cgroup where all the cgroup definitions are found and the cgroupsPath - where the specific constraints for this container are defined.

You'll notice there's a hierarchy there - first we have the pod<uid> directory and then the directory named after the container ID. This being a single-container pod, all the cgroup values will also be reflected in the parent folder. So that's where we're going to look.

cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max

249999360

That's right - 250 Mb of memory in bytes!
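Almost exactly, anyway: the value is 249999360 rather than 250000000 because the kernel rounds memory.max down to a multiple of the page size (4096 bytes). The arithmetic is easy to check:

```shell
# Why a 250M (250000000 bytes) limit shows up in memory.max as 249999360:
# cgroup v2 rounds the value down to a multiple of the 4096-byte page size.
limit=250000000
page=4096
rounded=$(( limit / page * page ))
echo "$rounded"   # prints 249999360
```

(The same rounding explains the 149999616 we'll see later for a 150M limit.)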

cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/cpu.max

30000 100000

And that's correct too! According to the Red Hat documentation:

The first value is the allowed time quota in microseconds for which all processes collectively in a child group can run during one period. The second value specifies the length of the period.
During a single period, when processes in a control group collectively exhaust the time specified by this quota, they are throttled for the remainder of the period and not allowed to run until the next period.
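In other words, 30000 100000 means our container may run for 30ms out of every 100ms - exactly the 300m of CPU we set. Converting a cpu.max pair into Kubernetes millicores is a one-liner:

```shell
# Convert a cgroup v2 cpu.max pair (quota and period, in microseconds)
# into Kubernetes millicores: 30000/100000 of a CPU = 0.3 CPU = 300m.
quota=30000
period=100000
millicores=$(( quota * 1000 / period ))
echo "${millicores}m"   # prints 300m
```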

Impact on Scheduling

Another thing I wanted to try is updating the requests to more than my node can give, and checking whether the scheduler will try to reschedule my pod to another node because the current one doesn't have the needed capacity.

Let's check how many cpus my node has access to:

kubectl get node -ojsonpath="{ .items[].status.allocatable.cpu } cpus"
8 cpus

I got 8. So let's try to request 10 and see what happens:

kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu": "10"}, "limits": {"cpu":"10"}}}]}}'
pod/stress patched

Alas, while the requests got updated, nothing else happens. The pod doesn't get rescheduled or evicted. Why? No idea. Had I tried creating it with a 10-CPU request from the beginning, it would have stayed Pending because there aren't any nodes large enough. So I would expect a pod with requests higher than any node can satisfy to get evicted. But maybe my thinking is flawed?
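One possible explanation lies in the KEP itself: the alpha API gives the kubelet a say through the pod's status.resize field, which can be Proposed, InProgress, Deferred, or Infeasible. A resize the node can never satisfy is supposed to be marked Infeasible and simply left pending - not turned into an eviction. Worth checking:

```shell
# How did the kubelet classify the resize? (alpha status field)
# For a request exceeding node capacity, expect something like "Infeasible".
kubectl get pod stress -o jsonpath='{.status.resize}{"\n"}'
```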

Negating Resources

Until now all worked fine because we were only adding resources. Everybody likes having more stuff, nobody likes when stuff is taken away from them.

Let's start by taking back the CPU time we granted in the previous section:

kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu":"100m"}, "limits": {"cpu":"100m"}}}]}}'
pod/stress patched

I'm bringing the CPU requests back to 100m. Quite expectedly, within a couple of seconds kubectl top shows that the pod's CPU consumption went down to 100m.
And the cgroup cpu.max file gets updated as expected:

cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/cpu.max
10000 100000

But what if I try to reduce memory?

kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "150M"}, "limits": {"memory":"150M"}}}]}}'
pod/stress patched

Seems to work fine. Checking the cgroups I see the config has been updated:

cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max
149999616

And what if I need to free even more memory?

kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "100M"}, "limits": {"memory":"100M"}}}]}}'
pod/stress patched

Note that I'm reducing memory to 100M - less than the 128M the stress worker allocates - which should cause my container to get OOMKilled. And the patch seems to work:

kubectl get pod stress -ojsonpath="{ .spec.containers[0].resources }"

{"limits":{"cpu":"100m","memory":"100M"},"requests":{"cpu":"100m","memory":"100M"}}

But I see that the pod continues running!

kubectl get pod
NAME     READY   STATUS    RESTARTS   AGE
stress   1/1     Running   0          21m

And checking the cgroup memory.max file shows why:

cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max
149999616

The cgroup wasn't updated! Looks like something is getting in our way, protecting the container from getting less memory than it's already using. While this makes sense as a precaution - taking away memory from a running process may lead to irreversible corruption - it means the container limits now hold an incorrect value, which will surely puzzle anyone trying to understand why the container isn't getting OOMKilled.
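The discrepancy is at least visible through the API: with the feature enabled, containerStatuses[].resources is supposed to report what's actually applied on the node, so comparing it with the spec should reveal that the resize never took effect:

```shell
# Desired (spec) vs actually-applied (status) resources. With the alpha
# feature on, the status side should reflect the real cgroup configuration.
kubectl get pod stress -o jsonpath='{.spec.containers[0].resources}{"\n"}'
kubectl get pod stress \
  -o jsonpath='{.status.containerStatuses[0].resources}{"\n"}'
```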

I would expect some validating admission webhook to tell me that memory can't be reduced. Looks like a bug to me.

Saving Hungry Pods

Ok, we found out that, memory being an incompressible resource, we can't really reduce it in-place to a value lower than what the container is already using.

But can we save an OOMing container by giving it more memory?

Let's try that with a similar pod - but one that gets only 100M of memory from the get-go (while trying to allocate 128M):

apiVersion: v1
kind: Pod
metadata:
  name: hungry
spec:
  containers:
  - image: progrium/stress
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"]
    name: stress
    resources:
      requests:
        memory: 100M
      limits:
        memory: 100M
You can create the pod with:

kubectl create -f https://raw.githubusercontent.com/perfectscale-io/inplace-pod-resize/main/hungry.yaml

Quite expectedly the container gets OOMKilled almost instantly:

kubectl get pod hungry
NAME     READY   STATUS      RESTARTS     AGE
hungry   0/1     OOMKilled   1 (5s ago)   8s

And it will continue restarting and getting OOMKilled until we update its memory limits. So let's save it from this misery by giving it the memory it needs:

kubectl patch pod hungry -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "200M"}, "limits": {"memory":"200M"}}}]}}'
pod/hungry patched

This seems to work fine:

kubectl get pod hungry -ojsonpath="{ .spec.containers[0].resources }"
{"limits":{"memory":"200M"},"requests":{"memory":"200M"}}

But the pod continues getting killed:

kubectl get pod hungry
NAME     READY   STATUS      RESTARTS      AGE
hungry   0/1     OOMKilled   4 (33s ago)   60s

And if we check the cgroup memory.max file, we'll see why:

cat /sys/fs/cgroup/kubepods/burstable/pod708b8195-0ca0-45e0-9f2b-015f679c98da/memory.max
99999744

Its memory limit never actually got updated!
Why? I wasn't able to find an answer to this one. Why disallow saving containers from getting killed by giving them the memory they need? I'm not aware of any technical limitation that would prevent this, and I also didn't find anything about it in the KEP docs.

So it looks like the only way to fix the OOMKill is still by deleting the pod and creating a new one with more memory.

Summary

In-place pod resizing is a long-awaited feature. Still in Alpha since v1.27, it will hopefully make it to Beta by v1.32 - if the drawbacks and bugs get fixed.
Here are some of the ones I found:

  • Memory can't be reduced below what the container is currently using - and there's no notification about that.
  • Requesting more resources than any node can provide doesn't lead to pod eviction (true for both CPU and memory).
  • If a pod is getting OOMKilled, it's not possible to give it more memory to save it from being killed.

Will these eventually get fixed? I certainly hope so. Will the feature make it to Beta by v1.32? Let's keep our fingers crossed.

Something in this post isn't clear or correct? Let me know in the comments.

Thanks for reading and may your pods keep running!
