AKS from the trenches - Why zone topology might not be the best solution

#azure #kubernetes

It is Friday around 4 pm, the sun is shining, it is about 27 degrees Celsius outside, and all your friends are leaving work to go to the beach. The sprint demos are done and you are about to leave the office when you get that feeling in your stomach: your subconscious has been brewing on a problem that was reported earlier to another team.

Intermittent request failures for our customers, with no real pattern in availability. You scramble the team late on a Friday afternoon in an effort to settle your stomach before heading out the door.

Running Azure Kubernetes Service without any challenges for more than a year

Last year we decided to make the big move to a more robust service hosting platform. Coming from a single mother lode of a VM handling the workload of around 22 different Docker images, it was a big relief. No more sleepless nights over what happens if that single VM goes down, yay!

At the time our company had a huge focus on operational sustainability and reliability of services, and for that, Azure Kubernetes Service (AKS) had just the tool we needed: Zone-Affinity!

So after spinning up a new AKS cluster, we configured its node pool of 3 nodes to be spread across Zone 1, Zone 2, and Zone 3 of our local region.

To be highly available, we wanted 3 API services spread across all three zones, so we applied the following spec to our deployments:

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: {{ $fullname }}
          version: {{ $imageTag }}
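
To make `maxSkew: 1` a little more concrete, here is a minimal Python sketch of how skew can be computed per zone: the number of matching pods in a zone minus the lowest count found in any zone. The zone names and pod counts are hypothetical, and this is only an illustration of the arithmetic, not how the Kubernetes scheduler is implemented.

```python
def zone_skew(pods_per_zone: dict[str, int]) -> dict[str, int]:
    """Return each zone's skew: its matching pod count minus the global minimum."""
    global_min = min(pods_per_zone.values())
    return {zone: count - global_min for zone, count in pods_per_zone.items()}


def within_max_skew(pods_per_zone: dict[str, int], max_skew: int = 1) -> bool:
    """True when no zone exceeds the allowed skew."""
    return max(zone_skew(pods_per_zone).values()) <= max_skew


# Hypothetical zone names; one replica per zone is perfectly even.
even = {"zone-1": 1, "zone-2": 1, "zone-3": 1}
# All three replicas packed into a single zone.
packed = {"zone-1": 0, "zone-2": 3, "zone-3": 0}

print(zone_skew(even), within_max_skew(even))
print(zone_skew(packed), within_max_skew(packed))
```

With `whenUnsatisfiable: ScheduleAnyway` the scheduler uses skew to rank nodes and still places the pod, whereas `DoNotSchedule` would turn the same check into a hard filter.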

This setup has been running for more than a year, and we have been happy… so far.

Nodes running out of Memory

You’ve scrambled your team together and shared that sinking feeling in your stomach, and the team decides to have a look at the AKS instance…

Dread sets in. At this point we’ve had around 9000 pod restarts, and it seems like Kubernetes is doing what it is supposed to do: keeping a quorum of services running and rescheduling pods all over our services node pool.

After some additional investigation we saw that nodes in the services node pool had been given the following taint:

node.kubernetes.io/memory-pressure:NoSchedule

Which basically means: don’t schedule any more pods on this node.

Additionally, we saw that the taint was only applied to Node 1 (Zone 1) and Node 3 (Zone 3). So it made sense: our topology said “spread the application across all three zones”, we had ScheduleAnyway set, and with two of the three nodes tainted, Kubernetes was left trying forever to schedule workload on Node 2 (Zone 2).
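
The effect of the taint on scheduling can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the node names and zone names are hypothetical, taints are reduced to their keys, and the pods are assumed to carry no toleration for the memory-pressure taint. Real tolerations also match on operator, value, and effect.

```python
from dataclasses import dataclass, field

MEMORY_PRESSURE = "node.kubernetes.io/memory-pressure"


@dataclass
class Node:
    name: str
    zone: str
    # Keys of the NoSchedule taints currently on the node (simplified).
    taints: set[str] = field(default_factory=set)


def schedulable_nodes(nodes: list[Node], tolerations: frozenset[str] = frozenset()) -> list[Node]:
    """Keep only the nodes whose NoSchedule taints are all tolerated by the pod."""
    return [node for node in nodes if node.taints <= tolerations]


# The situation from that Friday: Node 1 and Node 3 are tainted, Node 2 is not.
nodes = [
    Node("node-1", "zone-1", {MEMORY_PRESSURE}),
    Node("node-2", "zone-2"),
    Node("node-3", "zone-3", {MEMORY_PRESSURE}),
]

candidates = schedulable_nodes(nodes)
print([node.name for node in candidates])
```

In this model only `node-2` remains as a candidate, which matches what we observed: every new pod was funneled towards Zone 2.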

Needless to say, customer tickets were starting to pop up left and right. Since we were unable to schedule the number of pods necessary to sustain throughput, we had to remedy the situation to make our customers happy again.

Topology constraints, Yikes.

The first option that came to mind was increasing the number of nodes available in the cluster to 4. And so we did.

After we added the node, the cluster had one additional node in Zone 1 of the Azure region, although we did not know that at the time. We then ran our kubectl command for cleaning up the failed pods:

kubectl delete pods --field-selector=status.phase==Failed -n umbraco-cloud

Secondly, the node taints had to be removed manually, which we did with this command:

kubectl taint nodes <nodename> node.kubernetes.io/memory-pressure:NoSchedule-

To our surprise, AKS continued to re-taint our nodes and kept trying to schedule workload onto Zone 2. At the time the team was a little bewildered, and it wasn’t until we had a look at our topology that things started to make sense.

And what we discovered was: oh well, if it isn’t the consequences of our own actions…

Due to our topology constraints, we had forced services to be available in all three zones no matter what, and with no room for more pods on a specific node, AKS would basically throw in the towel.

What we ended up doing

In order for us to get to the beach with our friends, we ended up increasing the total node count to 6 to achieve equilibrium across all three zones in the cluster. As soon as we did that, Kubernetes recovered on its own: it removed the taints from the existing nodes and distributed the workload as expected.
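
The node counts in this story can be checked with a small Python sketch. It assumes nodes are handed out to zones round-robin, starting with the first zone, which matches what we saw (the fourth node landed in Zone 1) but is an assumption rather than a documented guarantee. The zone names are hypothetical.

```python
def nodes_per_zone(total_nodes: int, zones: list[str]) -> dict[str, int]:
    """Spread nodes over zones round-robin, filling earlier zones first."""
    base, extra = divmod(total_nodes, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def is_balanced(distribution: dict[str, int]) -> bool:
    """True when every zone holds the same number of nodes."""
    return len(set(distribution.values())) == 1


zones = ["zone-1", "zone-2", "zone-3"]  # hypothetical zone names
for total in (3, 4, 6):
    spread = nodes_per_zone(total, zones)
    print(total, spread, is_balanced(spread))
```

Under that assumption, 4 nodes leave Zone 1 with twice the capacity of the other zones, while 6 nodes give every zone the same headroom.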

Of course, this cost us a little extra, but the alternative was to first bump our cluster down to 4 nodes, change our Helm charts to relax the topology, and then re-release all our services. Doing that on a Friday at 5 PM would have made everybody miss the beach trip, and I would rather have 6 nodes over the weekend and not worry about it. A job for the future Platform Team…

Closing remarks

This ended up being quite a good team-bonding experience, and one of those issues that are quite fun to solve. Additionally, I do not believe that this issue was something we could have foreseen when we introduced AKS. It is just one of those things you have to learn the hard way when operating Kubernetes.