Intro
Why are so many software development organizations adopting agile in recent years? Probably because it provides a much better fit between R&D and ever-changing business goals. Agile lets you deliver new functionality fast, in small iterations that are easier to review, test and validate automatically, reducing the overall risk of delivering new functionality and taking the human factor and manual gating out of the equation.
In short, it allows us to iterate fast without breaking stuff. Building a Continuous Delivery pipeline with automated gates, checks and rollback ability builds confidence, creates a tighter feedback loop and allows us to innovate faster.
Key Components
Linkerd
Linkerd is a layer-7 proxy used as an abstraction layer for communication between components in our system. It also moves common logic, like timeouts and retries, out of our code and into a central, configurable control plane. It allows the system to make a dynamic decision when service A tries to communicate with service B. The decision may have to do with enforcing a security policy or, as in our case, with routing traffic to a specific service.
Read more about service-mesh here
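Under the hood, traffic shifting on Linkerd is expressed through SMI TrafficSplit objects; Flagger creates and updates these for you, but a rough sketch (the podinfo names are illustrative) helps show what a split looks like:
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: podinfo
  namespace: test
spec:
  # the apex service that clients call
  service: podinfo
  backends:
  # most traffic keeps hitting the stable (primary) version
  - service: podinfo-primary
    weight: 95
  # a small slice is routed to the canary under test
  - service: podinfo-canary
    weight: 5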
FluxCD
Flux is a GitOps system (a CNCF project) that helps us keep our cluster configs and deployments in sync across multiple environments through a simple git repository. GitOps keeps the flow you are already familiar with, like code reviews, to streamline the process.
Flux can also watch a container registry, detect new images and deploy them automatically.
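For illustration, a minimal sketch of enabling automated image updates (assuming Flux v1-style annotations and a container named podinfod) looks roughly like this on the workload manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
  annotations:
    # let Flux roll out new images for this workload automatically
    fluxcd.io/automated: "true"
    # only pick up image tags matching this semver range
    fluxcd.io/tag.podinfod: semver:~3.1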
Note: we use FluxCD for deployments, but you can use any other way to trigger deployments on Kubernetes and get the same outcome together with Flagger.
Flagger
Flagger is a Kubernetes operator for automating the promotion of canary deployments with progressive traffic shifting. It can leverage a number of proxies for traffic shifting like Linkerd, Istio, Nginx etc.
Flagger also runs canary analysis before each promotion step using Prometheus and other metric sources, including webhooks. The analysis process can run tests, check for an elevated error rate, high latency and any other requirement you consider necessary for a healthy deployment. These indicators help Flagger decide whether to continue the promotion or trigger a rollback.
Prometheus
We use Prometheus as our main monitoring system, which also plays a huge role in our canary-analysis process. Flagger will query it periodically to gain insights about the service it's trying to promote.
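For a sense of what those queries look like, the built-in success-rate check against Linkerd's proxy metrics boils down to PromQL roughly of this shape (a simplified sketch, not Flagger's exact query):
# percentage of non-failing inbound responses for the podinfo deployment
sum(rate(response_total{namespace="test", deployment="podinfo", classification!="failure", direction="inbound"}[1m]))
/
sum(rate(response_total{namespace="test", deployment="podinfo", direction="inbound"}[1m])) * 100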
Config Snippets
Linkerd (Service-Mesh)
# install linkerd on the target cluster
linkerd install | kubectl apply -f -
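Before moving on it's worth verifying that the control plane came up healthy; the linkerd CLI has a built-in check for that:
# verify the linkerd control plane is installed and healthy
linkerd check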
Flux
# kustomization.yaml
---
namespace: flux
bases:
- github.com/fluxcd/flux//deploy
patchesStrategicMerge:
- patch.yaml
# patch.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
    spec:
      containers:
      - name: flux
        args:
        - --listen-metrics=:3031
        - --manifest-generation=true
        - --memcached-hostname=memcached.flux
        - --memcached-service=
        - --ssh-keygen-dir=/var/fluxd/keygen
        - --ssh-keygen-bits=521
        - --ssh-keygen-type=ed25519
        - --git-url=git@github.com:<orgName>/<kubernetes-config-repo>
        - --git-branch=master
        - --git-path=production
        - --git-user=flux
        - --git-poll-interval=5m
        - --sync-interval=5m
        - --sync-timeout=2m
        - --sync-garbage-collection=true
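Assuming kustomization.yaml and patch.yaml live in the same directory, bootstrapping Flux is then just a matter of creating the namespace and applying the kustomization:
kubectl create namespace flux
kubectl apply -k .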
# take the generated ssh key and add it as a deploy key in your GitHub repo.
kubectl -n flux get secret flux-git-deploy -o json | jq -r .data.identity | pbcopy
Repository
Setting up the repo structure is up to you; we use kustomize for manifest generation, but one can leverage Helm or other tools for that matter.
/.flux.yaml
/base
/podinfo
/production
/podinfo
# .flux.yaml
---
version: 1
patchUpdated:
  generators:
    - command: kubectl kustomize .
  patchFile: flux-patch.yaml
Flagger
kubectl apply -k github.com/weaveworks/flagger//kustomize/linkerd?ref=0.23.0
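The linkerd overlay installs Flagger into the linkerd namespace, so a quick sanity check (assuming that default) is:
# wait until the Flagger deployment is ready
kubectl -n linkerd rollout status deploy/flagger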
Now to the real fun part:
We will run the podinfo container and then update it to test the canary rollout.
kubectl create ns test
kubectl annotate namespace test linkerd.io/inject=enabled
kubectl apply -k github.com/weaveworks/flagger//kustomize/tester
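The tester kustomization deploys the flagger-loadtester service that the webhooks below call; assuming it lands in the test namespace you can confirm it is ready with:
# the load tester serves the acceptance-test and load-test webhooks
kubectl -n test rollout status deploy/flagger-loadtester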
# production/kustomization.yaml
---
namespace: flux
bases:
- github.com/weaveworks/flagger//kustomize/podinfo
resources:
- podinfo/canary.yaml
patchesStrategicMerge:
- podinfo/patch.yaml
# production/podinfo/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  template:
    spec:
      containers:
      - name: podinfod
        image: stefanprodan/podinfo:3.1.0
# production/podinfo/canary.yaml
apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 30s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # Linkerd Prometheus checks
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      threshold: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      threshold: 500
      interval: 30s
    # testing (optional)
    webhooks:
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.test/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.test/
      metadata:
        cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"
Now you can commit and push the changes, then watch for new resources being created in the test namespace.
kubectl get all --namespace=test
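Flagger will also create a Canary object plus the primary/canary Deployments and Services it needs; you can follow its view of the rollout directly (resource names assume the podinfo example above):
# inspect the canary object and its recent events
kubectl -n test get canary podinfo
kubectl -n test describe canary podinfo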
Once the rollout has completed, you can deploy a new version by committing the following change.
# production/podinfo/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  template:
    spec:
      containers:
      - name: podinfod
        image: stefanprodan/podinfo:3.1.1
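Once that commit syncs, Flagger starts shifting traffic in stepWeight increments while running the analysis; an easy way to follow the status and weight (again assuming the test namespace) is:
# watch the canary status and traffic weight change as the promotion progresses
watch kubectl -n test get canaries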
Summary, Tips & Tricks
When implementing the above approach on an existing system, you will naturally need to plan how to do it gradually. There is a natural order of pre-conditions that have to be fulfilled.
- The ingress controller should call components via the service mesh, or be able to understand traffic splitting without it.
- Traffic splitting happens in the originating client, hence it needs to be configured at the origin and not at the receiving end. If you use a service mesh this is transparent, but do note that client-facing services will need a supported proxy1 or service-mesh2 that is able to understand traffic splitting.
- Prometheus - we can take "health metrics" from the service mesh, or we can make more intelligent decisions by having each component expose its own in-depth health metrics (see the custom metric sketch below).
1: See supported proxies on Flagger's website
2: Check the docs on how to inject linkerd into your ingress
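On the last point, Flagger's analysis is not limited to the built-in service-mesh checks; a hedged sketch of a custom metric entry under canaryAnalysis (assuming your Flagger version supports a raw PromQL query field, and that the service exposes a hypothetical http_requests_total metric) could look like:
metrics:
- name: "app-error-rate"
  # fail the check if more than 5% of requests return a 5xx
  threshold: 5
  interval: 1m
  # hypothetical query against metrics exposed by the service itself
  query: |
    100 * sum(rate(http_requests_total{namespace="test", deployment="podinfo", status=~"5.."}[1m]))
    / sum(rate(http_requests_total{namespace="test", deployment="podinfo"}[1m]))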
Top comments (4)
Hi Or,
Nice article! Thanks for writing it. Can you please change the Flagger install command to:
We are working on the v1.0 release and the master branch contains the v1beta1 CRDs. I'll post here when v1.0 is ready so you can remove the version pinning.
Thanks
Hi Stefan,
Thanks for taking the time to read my post, I'll change the command.
P.s: waiting for v1.0, thanks for your great work.
Hey Or - nice write up
I believe that this step should be revised to use the ssh key provided by
fluxctl identity --k8s-fwd-ns flux
Hey Gadi,
That's correct, I wanted to make it easier for beginners to just copy the ssh-key without installing the fluxctl cli.
This is out of scope for this post since you'd need to include installation details for the target OS of the reader.
Thank you for reading this :)