An overview of how we manage Kubernetes clusters at Curve to allow for zero-downtime upgrades while handling live Curve card transactions.
When I joined Curve as the Lead SRE in January 2019, Kubernetes was already being used in production to manage the many microservices (and few monoliths) that make up the Curve estate. Quite bravely at the time, Curve was also using Istio in production - well before it had wider (aka "stable") adoption.
The clusters were being set up manually with Kops, and deployments happened with Jenkins and a bunch of scripts. This works, but it isn't ideal for a production setup: recreating the cluster, or even just tracking the current version of what's deployed in an environment, is slow and difficult.
The first step in tackling this mild chaos was to define the current reality of our clusters in one central location - and what better tool to track the state of something than Git? It's scalable, it has a change history, and everyone in the Engineering team already knows how it works.
We dabbled briefly with ArgoCD but eventually settled on Flux, by Weaveworks. Its simplicity, and its ability to manage Helm charts effectively with the Helm Operator, made it the winner for what we wanted to do.
Before Flux, Jenkins managed deployments from a Git repo of its own, but it templated values such as image versions into the manifests at deploy time, and those values were never committed back to the repo.
Additionally, everything defined in that repo was just raw YAML; engineers would copy/paste config from an existing service to define a new one, often copying bits of config that weren't relevant to their new service.
The Platform team started work on a Helm chart that would replace all of that - no more copy-pasting, just add your service name and the version of the image you want to deploy.
A bunch of sensible defaults would be established (resource requests and limits, health checks, rollout strategy), and the Platform team would encourage standardisation of services (ports, metrics endpoints, and so on).
Each service would be defined as a Helm Release: an instantiation of that Platform-managed chart. Values could be added to a release to override some defaults or add optional features, such as setting up an ingress route from the Internet.
With work underway to manage the services as code and standardise deployments with Helm, we began work on replacing Kops with clusters that were also managed by code. We chose to move to AWS's EKS, which we'd set up and configure with Terraform.
The Terraform module we wrote for EKS sets up the infrastructure, of course (the EKS control plane, the worker nodes, security groups and some IAM roles), but it also installs a few components we consider core onto the cluster: Terraform uses Helm to install Istio, Ambassador Edge Stack (our API gateway), and Flux with its Helm Operator.
When the Terraform module is applied, a fully working cluster will start up, with an Istio service mesh, an API gateway with a load balancer for ingress, and Flux preconfigured to connect to the Git repo that defines what should be deployed on that cluster. Flux will take over deploying everything else not deployed by Terraform, including monitoring tools and all of our production services.
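As a rough illustration of how Terraform can hand a cluster over to Flux, here is a minimal sketch using the Terraform Helm provider's helm_release resource. It is not our actual module: the resource names, the namespace and the flux_git_url variable are hypothetical, the chart repository and value names are those of the Flux v1 charts and should be checked against the chart version in use, and the set block syntax is the one used by the 1.x and 2.x Helm provider. It also assumes a helm provider that is already configured against the new EKS cluster.

```hcl
# Hypothetical sketch: names and variables are illustrative, not Curve's module.

variable "flux_git_url" {
  description = "SSH URL of the Git repo that defines what runs on this cluster"
  type        = string
}

# Flux (v1) watches the Git repo and applies what it finds to the cluster.
resource "helm_release" "flux" {
  name             = "flux"
  repository       = "https://charts.fluxcd.io"
  chart            = "flux"
  namespace        = "flux"
  create_namespace = true

  set {
    name  = "git.url"
    value = var.flux_git_url
  }
}

# The Helm Operator reconciles the Helm releases that Flux syncs from Git.
resource "helm_release" "helm_operator" {
  name       = "helm-operator"
  repository = "https://charts.fluxcd.io"
  chart      = "helm-operator"
  namespace  = "flux"

  set {
    name  = "helm.versions"
    value = "v3"
  }

  # Install after Flux so the namespace already exists.
  depends_on = [helm_release.flux]
}
```

Istio and Ambassador Edge Stack could be installed with further helm_release resources in the same way, which leaves everything else to Flux.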
Combining an EKS cluster that was easy to build with Terraform with services defined in code as Helm releases, which Flux would deploy as soon as the cluster started up, meant we could create, destroy and recreate our environments at will.
That's great for dev environments where we can recreate the cluster often, but how do we upgrade production without causing downtime? Any outage of our services means we decline card transactions, upset customers, and cause the business to lose revenue. We need to do seamless upgrades.
Like the old "cattle not pets" mantra, we decided to treat each of our clusters as something disposable - rather than try risky in-place upgrades of Istio or other core components, we'd simply start a new cluster configured the way we want, and switch.
The key to this was our simple cluster ingress - all customer-facing calls to our APIs go through Ambassador Edge Stack across one load balancer. Each cluster has its own load balancer, set up by Terraform.
We set the EKS Terraform module to output the DNS name of the load balancer to remote state, and created another Terraform module to handle weighted routing to those load balancers. This new module would create a fixed Route 53 entry with a CNAME that would resolve to a different load balancer address based on the weighting we gave each cluster.
    cluster_weighting = {
      "cluster-a" = "90"
      "cluster-b" = "10"
    }
For simplicity, we assign each cluster a percentage of the traffic, and handle the values as a map in Terraform. In the example above, CNAMEs are created for cluster-a and cluster-b, and 90% of traffic would resolve to cluster-a and reach its load balancer.
The fixed Route 53 record that served the weighted load balancer records was then used as the origin for the CloudFront distribution that sits in front of our APIs.
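To make the weighted routing idea concrete, here is a minimal sketch of what such a routing module could look like. It is an illustration under assumptions rather than our real module: the S3 backend, the state key layout, the ingress_lb_dns_name output and all variable names other than cluster_weighting are hypothetical. It relies on the AWS provider's aws_route53_record resource with a weighted_routing_policy block, and on for_each for data sources (Terraform 0.12.6 or later).

```hcl
# Hypothetical sketch: backend, key layout, and output name are assumptions.

variable "cluster_weighting" {
  description = "Map of cluster name to its share of traffic"
  type        = map(string)
}

variable "zone_id" {
  type = string
}

variable "record_name" {
  description = "The fixed name that CloudFront uses as its origin"
  type        = string
}

variable "state_bucket" {
  type = string
}

variable "state_region" {
  type = string
}

# Read each cluster's remote state to find the DNS name of its load balancer.
data "terraform_remote_state" "cluster" {
  for_each = var.cluster_weighting

  backend = "s3"
  config = {
    bucket = var.state_bucket
    key    = "${each.key}/terraform.tfstate"
    region = var.state_region
  }
}

# One weighted CNAME per cluster, all sharing the same record name.
resource "aws_route53_record" "api" {
  for_each = var.cluster_weighting

  zone_id        = var.zone_id
  name           = var.record_name
  type           = "CNAME"
  ttl            = 60
  set_identifier = each.key
  records        = [data.terraform_remote_state.cluster[each.key].outputs.ingress_lb_dns_name]

  weighted_routing_policy {
    weight = tonumber(each.value)
  }
}
```

Route 53 sends each record a share of queries equal to its weight divided by the sum of all weights, so 90 and 10 behave as percentages. Shifting traffic is then a matter of changing the map (for example to 50/50, then 0/100) and applying again; the TTL limits how quickly resolvers pick up each change.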
We practised this process many times in our non-production environments before we made the switch in production, to the point where we had destroyed and recreated dozens of clusters in the months before we went live.
In the end, in a single day we moved all of our API traffic and all card payment transactions from our old Kops cluster to EKS, without dropping a single payment. We stepped up the percentage of traffic gradually at first with weighted routing until all traffic was migrated.
This week we updated to the latest version of EKS and ran the same process again, but this time we did the whole thing between morning standup and lunch. We're continuing to refine the process to the point where, soon, it will be fully automated. I'll post more on how that journey goes!