In this series I walk through several different open source offerings for performing chaos testing / engineering within your Kubernetes clusters.
Kube-Monkey
Introduction
Kube-Monkey is a simple implementation of the Netflix Chaos Monkey for Kubernetes which allows you to randomly delete pods during scheduled time windows (there has to be some manner of control, right? 😏), enabling you to test and validate the failure resiliency of your services.
The tool runs as a deployment in your cluster and deletes pods via the Kubernetes API. Unlike some more complex chaos offerings, it doesn't offer the ability to disrupt the nodes themselves or impact network or IO; it is purely a pod-killing tool. Nevertheless, it is quick to configure and deploy, and allows you to simulate and test your product's resiliency to pod failure - it can be valuable to know, when you have an outage, how quickly everything will come back online (if at all, should you face a multi-micro-service outage)!
The pod termination schedule is created once a day on weekdays (no weekend callouts, phew! 😅) at a configurable time (default 8am). For each target pod (configured by labels and an allowlist), the scheduler flips a biased coin to determine if a pod should be killed, and if so a random time is selected from the daily window (default 10am to 4pm).
When the time comes to terminate the pod, the eligibility of the pod and other settings are double checked for changes (which are honoured), and if all is still in order, the pod is terminated.
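To make the scheduling behaviour concrete, here is a rough bash sketch of the decision made for each target deployment - purely my own illustration of the algorithm described above, not Kube-Monkey's actual code, and the `schedule_kill` helper name is made up for this example:

```shell
# Illustrative sketch of the daily scheduling step: flip a coin with
# probability 1/mtbf, and on success pick a random minute inside the
# daily kill window (defaults: 10:00 to 16:00).
schedule_kill() {
  local mtbf=$1                    # kube-monkey/mtbf label value, in days
  local window_start=$((10 * 60))  # 10:00, in minutes since midnight
  local window_end=$((16 * 60))    # 16:00

  # Biased coin: a kill is scheduled today with probability ~1/mtbf.
  if (( RANDOM % mtbf == 0 )); then
    local offset=$(( RANDOM % (window_end - window_start) ))
    local kill_at=$(( window_start + offset ))
    printf 'Termination scheduled for %02d:%02d\n' \
      $((kill_at / 60)) $((kill_at % 60))
  else
    echo 'No termination scheduled today'
  fi
}

schedule_kill 1   # with mtbf=1 a kill is scheduled every single day
```

Note that with `mtbf: "1"` the coin always lands on "kill", which is exactly why it's the convenient setting for a tutorial like this one.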
Let's Get Hands-On
Further details about Kube-Monkey can be found on its GitHub repository, but for now let's get going deploying some chaos and see how it all works out!
Setting Up A Cluster
First thing we will need is a Kubernetes cluster to play with. You may already have one set-up (please don't follow this tutorial in a Production cluster! 😱), or you might have a favourite local setup you want to use - the exact cluster and provider shouldn't make much difference (unless you've already locked down security pretty tight!).
If you don't have a cluster to hand then I recommend using Minikube for a local Kubernetes development setup.
The installation instructions are in the link, but I will cover the highlights here as well:
- Check virtualization is supported by your computer.
- Install kubectl.
- Install a hypervisor - I recommend VirtualBox.
For macOS users who have Homebrew set up, this looks something like:
# Check if virtualization is supported on macOS.
# If you see VMX in the output (should be colored) then you
# are good to go!
sysctl -a | grep -E --color 'machdep.cpu.features|VMX'
# Install kubectl CLI for interacting with Kubernetes clusters.
brew install kubectl
# Install VirtualBox which we will use as our hypervisor.
brew install --cask virtualbox
Finally you can install Minikube, e.g.
brew install minikube
Once your installation is complete, you can create your local cluster with the following command:
minikube start --driver=virtualbox
This starts Minikube, instructing it to use the VirtualBox driver. It will start a control plane node, then create and boot the Minikube VM, onto which it downloads and installs the latest stable Kubernetes version.
Once complete you should have a new local cluster, and Minikube will have already configured kubectl to use the Minikube cluster. You can confirm using:
$ kubectl config current-context
minikube
Now we're good to start deploying some applications!
Deploying A Target Application
We can't kill pods if there are no pods to kill! 😅
Let's deploy some hello-world-like nginx pods (but equally feel free to use your own applications!). For this we're going to use Helm - a CLI that provides repository management, templating and deployment capabilities for Kubernetes manifests.
If you're on macOS you can install Helm using Homebrew; installation instructions for other OSes are available in the Helm Installation Docs.
brew install helm
We can now create a new Helm chart (a collection of templated Kubernetes manifests) which we will call nginx:
helm create nginx
The default chart created by Helm's create command is for an nginx image, and we will use this out-of-the-box setup as it suits us just fine!
Next we create a new namespace for our target application(s):
kubectl create ns nginx
And finally we deploy 10 replicas of our nginx application, using Helm, to our nginx namespace:
helm upgrade --install nginx ./nginx \
-n nginx \
--set replicaCount=10
We can check whether the deployment was successful using both Helm and kubectl:
helm ls -n nginx
kubectl get pod -n nginx
We should see our release is deployed and there should be 10 pods running in the cluster 🎉.
Making Our Application A Target
In order for pods to be considered by Kube-Monkey we need to add specific labels to the Kubernetes deployment manifest file.
Open up the ./nginx/templates/deployment.yaml Helm template in your favourite IDE and modify it to include new kube-monkey labels in both the metadata.labels and spec.template.metadata.labels sections as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "nginx.fullname" . }}
  labels:
    {{- include "nginx.labels" . | nindent 4 }}
    kube-monkey/enabled: "enabled"              # Enable termination of this deployment
    kube-monkey/identifier: "nginx-victim"      # Custom name for our target
    kube-monkey/mtbf: "1"                       # Average number of days between targeting one of these pods
    kube-monkey/kill-mode: "random-max-percent" # The killing method
    kube-monkey/kill-value: "100"               # Killing value, depends on chosen killing method
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "nginx.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "nginx.selectorLabels" . | nindent 8 }}
        kube-monkey/enabled: "enabled"          # See here also
        kube-monkey/identifier: "nginx-victim"  # See here also
    spec:
      ... rest of file
Let's break our additions down:

- First we've added `kube-monkey/enabled: "enabled"` in two locations. This tells Kube-Monkey that this deployment should be considered a target in its termination schedule.
- Next is `kube-monkey/identifier: "nginx-victim"`. This is a unique identifier label used by Kube-Monkey to determine which pods belong to which deployment (because deployment labels are inherited by the pods they create). Generally it's advised to use the same value as the deployment's name, but you don't have to (for instance we haven't here!).
- `kube-monkey/mtbf: "1"` is the next label. `mtbf` stands for "Mean Time Between Failure" and determines the average number of days between which the Kubernetes deployment can expect to have one of its pods killed. We've set this value to `1`, which means our nginx pods will be considered every day. Note this isn't an exact number of days between kills, only an average used when determining the likelihood of killing pods during the schedule phase.
- `kube-monkey/kill-mode: "random-max-percent"` is an option which allows you to detail how this deployment should be attacked by Kube-Monkey. There are several options, including:
  - `kill-all`, which will result in all pods being killed;
  - `fixed`, which will result in a fixed number of pods being killed;
  - `random-max-percent`, which allows you to define a percentage range for the number of pods to be killed - good if you want truly random behaviour. The value provided in the `kube-monkey/kill-value` label determines the maximum percentage that can be killed during a scheduled period;
  - `fixed-percent`, which is similar to `fixed` but defined as a percentage - better if you use horizontal pod autoscaling and want the number of pods killed to be relative to the total number available.
- Lastly there is `kube-monkey/kill-value: "100"`, which works alongside the `kube-monkey/kill-mode` option to determine the number / percentage of pods to be killed.
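To make the kill-mode / kill-value pairing concrete, here's a small bash sketch of how each mode maps to a victim count - my own illustration of the semantics described above, with a hypothetical `victims` helper that is not part of Kube-Monkey:

```shell
# Illustrative only: map a kill-mode + kill-value pair to the number of
# pods that would be terminated, given the current replica count.
victims() {
  local mode=$1 value=$2 replicas=$3
  case "$mode" in
    kill-all)           echo "$replicas" ;;                 # everything dies
    fixed)              echo "$value" ;;                    # exactly N pods
    fixed-percent)      echo $(( replicas * value / 100 )) ;;
    random-max-percent) echo $(( RANDOM % (replicas * value / 100 + 1) )) ;;
  esac
}

victims kill-all            0 10   # prints 10
victims fixed               3 10   # prints 3
victims fixed-percent      50 10   # prints 5
victims random-max-percent 100 10  # prints anywhere from 0 to 10
```

This also shows why `random-max-percent` with a kill-value of `100` is the most chaotic setting: on any given run it may kill none, some, or all of the replicas.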
Now we've added our labels, let's upgrade our deployment in the cluster and check its status using the same commands as before:
helm upgrade --install nginx ./nginx -n nginx
helm ls -n nginx
kubectl get pod -n nginx
Let There Be Chaos
Now let's introduce Kube-Monkey into our cluster and start creating some chaos.
First we clone the repo:
git clone https://github.com/asobti/kube-monkey
We can then create a new namespace for the Kube-Monkey deployment and deploy using Helm, same as we did for our Nginx application:
# Create the namespace
kubectl create ns kube-monkey
# Deploy Kube-Monkey - careful, dryRun=false means pods really get killed!
helm upgrade --install kube-monkey ./kube-monkey/helm/kubemonkey \
  -n kube-monkey \
  --set config.debug.enabled=true \
  --set config.debug.schedule_immediate_kill=true \
  --set config.dryRun=false \
  --set config.whitelistedNamespaces="{nginx}"
# Check the deployment status
helm ls -n kube-monkey
kubectl get pod -n kube-monkey
You may notice that we provided a few additional configuration options when we deployed Kube-Monkey with Helm:

- We enabled debug mode. In your own clusters you won't want to do this, as it generates verbose logs, but for this tutorial it is useful to see what is going on.
- We also set `schedule_immediate_kill` to `true`. This is a debug option that, instead of scheduling a new chaos window every day, schedules a new window every 30 seconds, so you can test out your configuration easily without having to wait a day between tests!
- We have set `dryRun` to `false` - this means Kube-Monkey will actually kill the target pods. Be sure to test first with `dryRun` set to `true` in any important clusters, so you can be sure that the correct pods will be targeted! (We want controlled chaos - bringing down prod is not the goal 😂)
- Finally we have added our `nginx` namespace to the `whitelistedNamespaces` array, which acts as the allowed list.
Because we have set `schedule_immediate_kill` to `true`, Kube-Monkey will immediately start applying the configured kill instructions. We can see this working by checking the Kube-Monkey logs:
$ kubectl logs -n kube-monkey -l release=kube-monkey -f
...
I0819 13:08:47.192768 1 kubemonkey.go:19] Debug mode detected!
I0819 13:08:47.192853 1 kubemonkey.go:20] Status Update: Generating next schedule in 30 sec
I0819 13:09:17.193689 1 schedule.go:64] Status Update: Generating schedule for terminations
I0819 13:09:17.215602 1 schedule.go:57] Status Update: 1 terminations scheduled today
I0819 13:09:17.215809 1 schedule.go:59] v1.Deployment nginx scheduled for termination at 08/19/2020 09:09:22 -0400 EDT
********** Today's schedule **********
k8 Api Kind Kind Name Termination Time
----------- --------- ----------------
v1.Deployment nginx 08/19/2020 09:09:22 -0400 EDT
********** End of schedule **********
I0819 13:09:17.218029 1 kubemonkey.go:62] Status Update: Waiting to run scheduled terminations.
I0819 13:09:22.632324 1 request.go:481] Throttling request took 103.486967ms, request: DELETE:https://10.96.0.1:443/api/v1/namespaces/nginx/pods/nginx-6bb5bbd776-fclgb
I0819 13:09:22.836478 1 request.go:481] Throttling request took 151.926056ms, request: GET:https://10.96.0.1:443/api/v1/namespaces/nginx/pods/nginx-6bb5bbd776-xkzg8
I0819 13:09:23.033219 1 request.go:481] Throttling request took 181.015105ms, request: DELETE:https://10.96.0.1:443/api/v1/namespaces/nginx/pods/nginx-6bb5bbd776-xkzg8
I0819 13:09:23.049849 1 kubemonkey.go:70] Termination successfully executed for v1.Deployment nginx
I0819 13:09:23.049869 1 kubemonkey.go:73] Status Update: 0 scheduled terminations left.
I0819 13:09:23.049876 1 kubemonkey.go:76] Status Update: All terminations done.
I0819 13:09:23.049999 1 kubemonkey.go:19] Debug mode detected!
I0819 13:09:23.050006 1 kubemonkey.go:20] Status Update: Generating next schedule in 30 sec
...
Here we can see that Kube-Monkey generated a termination schedule, targeted our Nginx deployment and then killed 2 of our 10 pods - it looks like it's working!
Let's just check that our pods are actually being killed:
$ kubectl get pod -n nginx -w
NAME READY STATUS RESTARTS AGE
nginx-6bb5bbd776-b5drk 1/1 Running 0 6m53s
nginx-6bb5bbd776-b6gqj 1/1 Running 0 5m28s
nginx-6bb5bbd776-b7pkj 1/1 Running 0 5m29s
nginx-6bb5bbd776-bl88b 1/1 Running 0 12s
nginx-6bb5bbd776-cswxk 1/1 Running 0 12s
nginx-6bb5bbd776-f84mr 1/1 Running 0 11s
nginx-6bb5bbd776-krkgn 1/1 Running 0 2m44s
nginx-6bb5bbd776-nvf42 1/1 Running 0 2m44s
nginx-6bb5bbd776-s4n7l 1/1 Running 0 12s
nginx-6bb5bbd776-w2chn 0/1 Running 0 11s
There we have it - it looks like 5 of our 10 pods have been killed and replaced in the last 12 seconds. I think we can call that a success 🎉.
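If you'd rather count than eyeball, you can pipe that same `kubectl get pod` output through a small awk helper. This is shown against a captured sample so it runs anywhere; on a live cluster you would pipe `kubectl get pod -n nginx` into it instead, and `count_ready` is just an illustrative name:

```shell
# Summarise `kubectl get pod` output as "ready/total pods ready" by
# comparing the two halves of the READY column (e.g. "1/1", "0/1").
count_ready() {
  awk 'NR > 1 { total++; split($2, r, "/"); if (r[1] == r[2]) ready++ }
       END { printf "%d/%d pods ready\n", ready, total }'
}

# Demonstrated against a captured sample; on a live cluster use:
#   kubectl get pod -n nginx | count_ready
count_ready <<'EOF'
NAME                     READY   STATUS    RESTARTS   AGE
nginx-6bb5bbd776-b5drk   1/1     Running   0          6m53s
nginx-6bb5bbd776-krkgn   1/1     Running   0          2m44s
nginx-6bb5bbd776-w2chn   0/1     Running   0          11s
EOF
# prints: 2/3 pods ready
```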
Clean Up
In its current state, Kube-Monkey will continue to kill our Nginx pods every 30 seconds until the end of time. Let's save some energy and do some clean-up!
helm delete kube-monkey -n kube-monkey
kubectl delete ns kube-monkey
helm delete nginx -n nginx
kubectl delete ns nginx
That should be both the Nginx and Kube-Monkey deployments removed from our cluster. We can also tear down the Minikube cluster by running:
minikube stop
minikube delete
And that should be us back to square one.
Next Steps
So today we've successfully:
- Created a new Kubernetes cluster.
- Deployed an Nginx application configured to be disrupted by Kube-Monkey.
- Deployed Kube-Monkey into the cluster to kill our Nginx pods on a scheduled basis.
- Cleaned up after our experiment by deleting all of our deployments and the Kubernetes cluster itself.
What's next is to use Kube-Monkey for chaos experiments in your pre-production (or even production, if you're brave!) Kubernetes clusters and start reviewing and validating your applications' resiliency. Here are some pointers:
- Update your existing Kubernetes manifests or Helm charts with the appropriate `kube-monkey` labels. Perhaps you can start with `kube-monkey/enabled: "disabled"` to gain confidence that your application still deploys without issue.
- Add the Kube-Monkey Helm chart to your collection of Helm charts. Update the `values.yaml` and various templates to meet your needs - for instance you might want to set an appropriate value for `timeZone` and `logLevel`, as well as your own custom schedule windows using the `runHour`, `startHour` and `endHour` options.
- Deploy your newly configured Kube-Monkey chart to your Kubernetes cluster - perhaps with `dryRun` set to `true` initially, so you can follow the logs and make sure that it is going to behave as expected.
- Set Kube-Monkey's `dryRun` to `false` and start regularly chaos testing! You should configure alerts to capture any undesired behaviour, and monitor cluster and application health regularly - I recommend checking out Prometheus for cluster telemetry and alerting (using Alertmanager) and Grafana for monitoring dashboards.
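As a sketch of that second pointer, a trimmed `values.yaml` override might look something like the following. The key names are taken from the options mentioned above, nested under the chart's `config` block as we used with `--set` earlier, but do double-check them (the `logLevel` placement in particular is my assumption) against the chart's own `values.yaml` before relying on this:

```yaml
# Hypothetical Kube-Monkey values override - verify key names against
# the chart's values.yaml before use.
config:
  dryRun: true              # start safe; flip to false once confident
  timeZone: Europe/London   # schedule times are interpreted in this zone
  logLevel: 1               # assumed key: quieter logging than debug mode
  runHour: 8                # when the daily schedule is generated
  startHour: 10             # kill window opens
  endHour: 16               # kill window closes
  whitelistedNamespaces:
    - nginx
```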
That's It!
That's all folks - hope that was a quick and useful tutorial into setting up Kube-Monkey for simple pod-killing based chaos testing.
What are you guys using for chaos testing in Kubernetes? Have any cool suggestions, questions or comments - drop them in the section below!
Till next time y'all! 👋