
Florian Balling


Zero-downtime Kubernetes Ingress Controller swap


Every external request to an application running in a Kubernetes cluster is handled by a so-called ingress controller. If it is missing, nothing in the cluster is reachable from outside anymore, which makes the ingress controller a critical part of a Kubernetes-powered infrastructure.
We needed to exchange the ingress controller of our Kubernetes clusters because the one in use could not satisfy a new requirement. We managed the swap with zero downtime, and we were able to develop and test the new setup in production before releasing it to our customers.

TLDR

We swapped Traefik for Nginx with zero downtime by running both in parallel and making both externally available. Our strategy used weight-based and header-based routing to implement a blue-green deployment and a canary release.

Why we needed to switch the ingress controller

We were using Traefik as our ingress controller and were pretty happy with it. At some point, the following requirement popped up: we need to handle a list of around 4500 URLs that have to be redirected to different locations.
At the time of writing (08-01-23), Traefik only supports one regular expression per redirect middleware. So we mapped each of these redirects to its own Traefik middleware and chained them all together into one chain middleware object. This chain middleware could then be referenced by Traefik via a Kubernetes ingress annotation:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress  # illustrative name
  annotations:
    # format: <namespace>-<middleware name>@<provider>, assuming the chain
    # middleware is defined as a Kubernetes CRD in the default namespace
    traefik.ingress.kubernetes.io/router.middlewares: default-redirect-chain@kubernetescrd
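For context, a single redirect middleware and the chain that bundles all of them look roughly like this as Traefik CRDs. This is a hedged sketch: the names and URLs are illustrative, and it assumes the traefik.containo.us/v1alpha1 API group.

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirect-old-page            # illustrative name, one middleware per redirect
spec:
  redirectRegex:
    regex: ^https://example.com/old-page$
    replacement: https://example.com/new-page
    permanent: true
---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirect-chain               # the chain referenced in the ingress annotation above
spec:
  chain:
    middlewares:
      - name: redirect-old-page
      # ... one entry per redirect middleware, roughly 4500 in our case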

This setup worked, but we soon saw a huge increase in the CPU and memory consumption of the Traefik instances running in production. To handle regular customer traffic we needed multiple pods with 12 GiB of memory and 4 CPUs each, and during peak traffic hours even that was not sufficient. At this point, we decided to try out a different ingress controller.

Cluster layout before the migration

Nginx was a straightforward choice as a replacement, since prior experience on our team suggested that it can handle far more redirects without any issues. So we decided to try out the Nginx ingress controller. Our infrastructure layout before the migration looked like this:

Cluster layout before migration

As illustrated in the graphic, we were using an AWS Application Load Balancer with one listener rule that routes all incoming requests to a target group called the Traefik target group. All instances of our Traefik ingress controller are registered as targets in this target group and route every incoming request to the correct service in our cluster.

Cluster layout during migration

We wanted to migrate with zero downtime and with high confidence that nothing would break, or, if it did, that we could roll back to the old, working setup. To achieve this, we applied the concepts of blue-green deployment and canary release. Our setup for the migration looked like this:

Cluster layout during migration

Essentially, we deployed Traefik and Nginx side by side, so our cluster was now running two ingress controllers. We duplicated all of our ingress resources to provide Nginx with a configuration equivalent to the one Traefik uses. Afterward, we created a second target group for our application load balancer, in which all Nginx ingress controller pods are registered as targets. Then we modified the listener rules of our application load balancer in the following manner:

  1. If the request contains the header 'use-nginx', route it to the Nginx target group.
  2. Default:
    • route 100% of all requests to the Traefik target group
    • route 0% of all requests to the Nginx target group
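
Whether the listener is managed via the console, Terraform, or CloudFormation, the shape of these rules is the same. Below is a hedged CloudFormation-style sketch; all resource names, references, and the priority are placeholders rather than our actual template.

# Canary rule: requests carrying the 'use-nginx' header go to the Nginx target group.
HeaderCanaryRule:
  Type: AWS::ElasticLoadBalancingV2::ListenerRule
  Properties:
    ListenerArn: !Ref HttpsListener              # placeholder listener reference
    Priority: 10
    Conditions:
      - Field: http-header
        HttpHeaderConfig:
          HttpHeaderName: use-nginx
          Values: ["*"]                          # any header value activates the canary
    Actions:
      - Type: forward
        TargetGroupArn: !Ref NginxTargetGroup

# Default action: weighted forwarding, 100% Traefik / 0% Nginx until release.
HttpsListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref ApplicationLoadBalancer
    Port: 443
    Protocol: HTTPS
    Certificates:
      - CertificateArn: !Ref Certificate         # placeholder certificate
    DefaultActions:
      - Type: forward
        ForwardConfig:
          TargetGroups:
            - TargetGroupArn: !Ref TraefikTargetGroup
              Weight: 100                        # all regular traffic stays on Traefik
            - TargetGroupArn: !Ref NginxTargetGroup
              Weight: 0                          # flipped at release time

Flipping the weights from 100/0 to 0/100 is then the actual release step.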

The first rule implements the canary release strategy, which enables us to test the behavior of our Nginx setup without interfering with regular customer traffic on Traefik. Header-based routing is a simple way to make features available to a reduced user group; in our case, this group consists only of members of the infrastructure and QA teams.
The second rule gives us manual control over when to release Nginx as the ingress controller, and if errors appear after the release, we can easily roll back to the Traefik ingress controller (blue-green deployment). Running two ingress controllers in parallel and making both externally available made our zero-downtime migration a breeze.
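
Duplicating an ingress resource for Nginx mostly means switching the ingress class and translating any Traefik-specific annotations to their Nginx equivalents. A minimal, hedged sketch, with illustrative names and assuming ingressClassName is used to target the controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app-nginx                # duplicated copy of the Traefik ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"   # example Nginx annotation
spec:
  ingressClassName: nginx                # served by the Nginx ingress controller
  rules:
    - host: example.com                  # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service    # placeholder backend service
                port:
                  number: 80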

Cluster layout after migration

After a few days with Nginx live in production, we started to remove the now unused Traefik parts, such as the Traefik target group, the Traefik deployment, and all Traefik ingress resources, from our infrastructure:

Cluster layout after migration

Final thoughts

In my opinion, this approach is pretty nice because it is simple and generally applicable. It relies on standard components rather than more complex machinery like a service mesh, so every team working with Kubernetes can make use of it.
