Releasing applications into production always comes with a sense of nervousness, no matter how stable the application and automation have been in prior environments. The fear of causing unexpected disruptions for critical clients — and potentially driving them away — is a significant risk. As a result, many businesses still rely on manual, after-hours deployments, following long instruction pages that detail every manual step. While this may minimize the immediate business impact, it remains vulnerable to human error. Engineers can get distracted or miss a crucial step, and even a minor oversight can lead to significant issues — something automated procedures are designed to prevent.
On the other hand, some companies fully embrace the “fail-fast” approach. For example, Netflix runs its “Simian Army” in production environments to ensure everything is functioning as expected, with their little monkeys trying to break things. However, reaching this level of confidence requires organizational maturity, and it takes time to get there.
Modern production deployment strategies are evolving to address these challenges through automation, continuous delivery, and the use of advanced tools that implement deployment techniques such as blue-green deployments, canary releases, and progressive rollouts. These strategies not only reduce downtime but also ensure smoother, more reliable transitions for production workloads. Convincing management to adopt these approaches may take time, but a proof of concept (POC) and an adoption plan can help your organization achieve this while saving both engineers and management from sleepless, stressful nights.
In this article, I will explain and demonstrate how to implement modern deployment strategies, including canary deployments, blue-green deployments, A/B testing, and shadow (mirrored) deployments, in Kubernetes environments using Linkerd.
Traffic Management and Linkerd
Kubernetes natively supports traffic management features like timeouts, retries, and mirroring through the Gateway API’s HTTPRoute resource. This resource defines rules and matching conditions to determine which backend services should handle incoming traffic. By using the weight field, you can specify the proportion of requests sent to a particular backend, facilitating traffic splitting across different versions or environments.
Before version 2.14, Linkerd users had to rely on a custom resource definition (CRD) downstream of httproutes.gateway.networking.k8s.io, specifically httproutes.policy.linkerd.io, to instruct the Linkerd proxy on how to route requests. Starting with version 2.14, Linkerd extended its support to the native httproutes.gateway.networking.k8s.io. This means that regardless of which resource you use, the Linkerd proxy will route traffic based on either the Gateway API's or Linkerd's policy HTTPRoute resource. This functionality also applies to gRPC requests.
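Since the grpcroutes.gateway.networking.k8s.io CRD is typically installed alongside the HTTP one (see the listing below), the same weighted-backend pattern can be expressed for gRPC with a GRPCRoute. The sketch below uses hypothetical service names, and whether the proxy honors it depends on your Linkerd and Gateway API versions:

apiVersion: gateway.networking.k8s.io/v1   # may be v1alpha2 on older Gateway API installs
kind: GRPCRoute
metadata:
  name: grpc-example-split              # hypothetical name
  namespace: vastaya
spec:
  parentRefs:
    - name: grpc-example-svc            # hypothetical gRPC service
      group: core
      kind: Service
      port: 8080
  rules:
    - backendRefs:
        - name: grpc-example-svc
          port: 8080
          weight: 90
        - name: grpc-example-canary-svc # hypothetical canary service
          port: 8080
          weight: 10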
Note: By default, during installation, Linkerd attempts to install the Gateway API CRDs. However, if they are already present in the cluster, you can instruct Linkerd to skip this step by setting enableHttpRoutes to false in the Helm chart or CLI when installing the Linkerd CRDs.
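For example, if you install the Linkerd CRDs chart with Helm, a minimal values override to skip the Gateway API CRDs could look like this (a sketch based on the flag named above; verify it against the chart version you are using):

# values.yaml passed to the linkerd-crds chart (sketch)
enableHttpRoutes: false

Either way, you can confirm which Gateway API CRDs are present in the cluster: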
$ kubectl get crds | grep gateway
grpcroutes.gateway.networking.k8s.io 2024-09-25T01:01:18Z
httproutes.gateway.networking.k8s.io 2024-09-25T01:01:18Z
In this demonstration, I’ll use NGINX as the Ingress controller. By default, the NGINX Ingress controller retrieves the Endpoint resources for the services specified in the Ingress and forwards traffic directly to the IP addresses of the pods. However, this behavior doesn’t align with the HTTPRoute policy, which applies to traffic routed through the service itself. To solve this, we need to configure the NGINX Ingress controller to forward traffic to the service rather than directly to the pod endpoints. This can be achieved by adding the annotation nginx.ingress.kubernetes.io/service-upstream: "true" to the Ingress resource.
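As a sketch, an Ingress for the demo application with this annotation could look like the following (the Ingress name, host, and ingress class are placeholders; the service name matches the examples later in this article):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: projects-vastaya-ingress                        # placeholder name
  namespace: vastaya
  annotations:
    # Forward traffic to the Service instead of the pod endpoints,
    # so HTTPRoute policies attached to the Service are applied.
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: vastaya.example.com                         # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: projects-vastaya-svc
                port:
                  number: 80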
Additionally, since it is the Linkerd proxy that handles the redirection of traffic to the backend service, we need to inject it into the Ingress controller pod.
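One way to do this is the linkerd.io/inject annotation. Annotating the Ingress controller’s namespace, as in the sketch below, injects the proxy into every pod created in it; the namespace name depends on how you installed the controller, and annotating the controller Deployment’s pod template works as well:

apiVersion: v1
kind: Namespace
metadata:
  name: ingress-nginx            # adjust to your Ingress controller namespace
  annotations:
    # Linkerd injects its sidecar proxy into pods created in this namespace.
    linkerd.io/inject: enabled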
The overall traffic flow is the following:
- The user sends a request to the application.
- The inbound traffic is intercepted by the Linkerd proxy running in the Ingress controller pod and then forwarded to the NGINX Ingress controller for processing.
- Due to the annotation nginx.ingress.kubernetes.io/service-upstream: "true", the Ingress controller forwards the traffic to the service defined in the upstream configuration located at /etc/nginx/nginx.conf.
- The outbound traffic is intercepted again by the Linkerd proxy, which evaluates the destination based on its in-memory state, including discovery results, requests, and connections retrieved from the Linkerd destination service. Unused cached entries are evicted after a configurable timeout period (a tuning sketch follows this list).
- Once the target is determined, the proxy queries the Linkerd policy service for applicable routing policies and applies them as necessary.
- Finally, the Linkerd proxy forwards the request to the backend defined by the policy — in this case, the canary version of the service.
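The eviction timeout mentioned in the list above is configurable per namespace or workload. A minimal sketch, assuming the annotation name documented in the proxy discovery cache guide listed in the references (verify it against your Linkerd version):

apiVersion: v1
kind: Namespace
metadata:
  name: vastaya
  annotations:
    # Assumed annotation name; keeps unused outbound discovery results cached for 10s before eviction.
    config.linkerd.io/proxy-outbound-discovery-cache-unused-timeout: "10s"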
Now that we have an idea of what’s happening behind the scenes, let’s dig into the different types of deployment available and what should be expected in the future.
Canary Deployment
This deployment strategy involves deploying a new version of the service (referred to as the “canary”) alongside the current stable version running in production. A percentage of traffic is then redirected to the canary. By doing this, the development team can quickly test the service with production traffic and identify any issues with a minimal “blast radius” (the number of users affected by the change). During this triage phase, the team also collects key metrics from the service. Based on these results, they can decide to gradually increase traffic to the new version (e.g., 25%, 75%, 100%) or, if necessary, abort the release.
Below is an example of an HTTPRoute configuration using the Kubernetes Gateway API to implement a canary deployment, where the traffic targeting the service projects-vastaya-svc is split between two services:
- projects-vastaya-svc: receives 10% of the traffic.
- projects-canary-vastaya-svc: receives 90% of the traffic.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: project-vastaya-split
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
    - backendRefs:
        - name: projects-vastaya-svc
          port: 80
          weight: 10
        - name: projects-canary-vastaya-svc
          port: 80
          weight: 90
In the following image, you can see traffic being forwarded to both services by the Linkerd proxy. To visualize the inbound traffic to the services, I used the viz extension in Linkerd and injected Linkerd into both the canary and stable deployments. This allowed me to observe the traffic distribution using the command:
linkerd viz top deploy/projects-canary-vastaya-dplmt -n vastaya
Note: Injecting the Linkerd proxy into the destination pods is not required for traffic redirection, but I did it to collect detailed metrics on service performance.
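If the metrics reveal a problem instead, aborting the release is the same kind of edit: set the canary’s weight back to 0 so all traffic returns to the stable version. An illustrative update to the rules section of the HTTPRoute shown above:

rules:
  - backendRefs:
      - name: projects-vastaya-svc
        port: 80
        weight: 100
      - name: projects-canary-vastaya-svc
        port: 80
        weight: 0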
Blue-green Deployment
A Blue-Green deployment is similar to a canary deployment but takes a more drastic approach. Instead of gradually directing an incremental percentage of traffic to the new version, both the old (Blue) and new (Green) versions run in parallel. However, only one version is active and accessible to users at any given time.
The key difference is that the new version (Green) remains inactive and hidden from users while you make any necessary adjustments to ensure it’s stable and reliable. Once you’re confident in the new version’s performance, you swap all traffic over to it in a single, coordinated switch. This approach minimizes downtime and allows for a quick rollback if issues are detected.
In contrast to canary deployments — where users actively access both versions as traffic is incrementally shifted — the Blue-Green strategy keeps the new version isolated until it’s fully ready for production use.
In our case, we’ll implement a Blue-Green deployment by setting the stable (Blue) service’s weight to 0 and the new (Green) service’s weight to 1, directing all traffic to the new version. Here’s an example of the HTTPRoute configuration:
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-split
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
    - backendRefs:
        - name: projects-vastaya-svc
          port: 80
          weight: 0
        - name: projects-canary-vastaya-svc
          port: 80
          weight: 1
A/B Testing
A/B testing is a method of experimentation that involves running two versions of the same environment to collect metrics like conversion rates, performance, and user engagement. Similar to canary deployments, it allows you to compare different versions of a service, but with a focus on gathering specific data from targeted user groups.
In A/B testing, the second version (the “B” version) targets one or more groups of users defined by predetermined criteria such as location, device type, user behavior, or other factors. This method is widely used in user experience (UX) design. For example, you might notice something appearing on your Netflix dashboard that your friend doesn’t see, or subtle changes in the application’s interface.
In our case, we can achieve this by adding additional filters to our HTTPRoute. In the following configuration we will:
- Use the matches section to identify requests coming from users who have their locale set to Korean (Accept-Language: ko.*) and are using Firefox as their web browser (User-Agent: .*Firefox.*).
- For these users, split traffic evenly between the stable service (projects-vastaya-svc) and the canary service (projects-canary-vastaya-svc), each receiving 50% of the traffic.
- For all other users, direct traffic entirely to the stable service.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-traffic-split
  namespace: vastaya
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
    - matches:
        - headers:
            - name: "User-Agent"
              type: RegularExpression
              value: ".*Firefox.*"
            - name: Accept-Language
              type: RegularExpression
              value: "ko.*"
      backendRefs:
        - name: projects-vastaya-svc
          port: 80
          weight: 50
        - name: projects-canary-vastaya-svc
          port: 80
          weight: 50
    - backendRefs:
        - name: projects-vastaya-svc
          port: 80
By implementing this configuration, you can conduct A/B testing by routing 50% of the targeted users to the canary version while the rest continue to use the stable version. This allows you to collect specific metrics and assess the performance of the new version among a defined user segment.
Shadow Deployment (Mirrored Deployment)
In shadow deployment, also known as mirrored deployment, a new version of a service runs in the background and receives a copy of real-world traffic. Users are not impacted because only the response from the main (stable) service is considered; responses from the new version are ignored. This method allows the development team to test the new service against production traffic to observe how it behaves under real-world conditions without affecting users.
As of now, this feature is not fully supported by Linkerd, but the development team is actively working on it. You can track the progress through this GitHub issue: Linkerd Issue #11027.
Once this feature becomes available, you’ll be able to apply the following configuration without setting up a gateway, and the Linkerd proxy will handle the rest. The traffic sent to the service projects-vastaya-svc will be mirrored to projects-canary-vastaya-svc, but only the response from projects-vastaya-svc will be returned to the users.
Here’s an example of the HTTPRoute configuration:
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-traffic-split
  namespace: vastaya
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
    - backendRefs:
        - name: projects-vastaya-svc
          port: 80
          weight: 1
      filters:
        - type: RequestMirror
          requestMirror:
            backendRef:
              name: projects-canary-vastaya-svc
              port: 80
References
- Netflix and Canary Deployments: https://netflixtechblog.com/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69
- Linkerd 2.14 Release Notes: https://github.com/linkerd/linkerd2/releases/tag/stable-2.14.0
- Ingress configuration with Linkerd: https://linkerd.io/2.16/tasks/using-ingress/#nginx-community-version
- Gateway API Traffic Splitting: https://gateway-api.sigs.k8s.io/guides/traffic-splitting/
- Netflix A/B Testing: https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15
- Proxy Discovery Cache: https://linkerd.io/2.16/tasks/configuring-proxy-discovery-cache/