In this series I walk through several different open source offerings for performing chaos testing / engineering within your Kubernetes clusters.
In K8s Chaos Dive: Kube-Monkey I covered Kube-Monkey, a simple implementation of the Netflix Chaos Monkey for Kubernetes which allows you to randomly kill pods.
This tool is great for getting off the ground with Chaos testing in Kubernetes but has a couple of failings:
- It is only able to kill pods; it can't impact the cluster in any other way.
- It requires you to modify the system under test (SUT) by adding labels. This adds extra overhead pre-test for the engineering team and means you need to redeploy applications to enable / disable chaos testing.
In this post we cover a different tool that offers a richer set of features without the need to modify or redeploy existing applications.
Chaos-Mesh
Introduction
Chaos-Mesh is a chaos engineering toolkit that offers a wide range of testing capabilities, from simple pod killing to IO and Network disruption, for the purpose of validating the failure-resiliency of your services.
The tool runs as two main deployments in the cluster:
- controller-manager - used to schedule and manage the lifecycle of chaos experiments.
- chaos-daemon - a daemonset (runs on every node) with privileged system permissions over a node's network, cgroup, etc.
For some experiments the controller-manager also uses admission webhooks to dynamically inject a chaos-sidecar into pods, for example, in order to hijack the I/O of the application container.
The tests themselves are defined using Kubernetes manifests based on one of the six custom resource definitions that Chaos-Mesh provides:
- `PodChaos`
  - `pod-kill` - killing pods.
  - `pod-failure` - pods becoming unavailable.
  - `container-kill` - killing pods' containers.
- `NetworkChaos`
  - `netem chaos` - create network delay, duplication, loss, or corruption.
  - `network-partition` - simulate a network partition by separating pods into several independent subnets, blocking communication between them.
- `IOChaos` - simulate file system faults such as I/O delay or read / write errors.
- `TimeChaos` - inject clock skew into pods.
- `StressChaos`
  - `cpu-burn` - simulate pod CPU stress.
  - `memory-burn` - simulate pod memory stress.
- `KernelChaos` - inject kernel errors into pods.
To create and run an experiment, you write a Kubernetes manifest file and deploy it to the cluster. The controller-manager then detects the new experiment object and executes the defined chaos experiment.
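For example, a minimal `NetworkChaos` manifest injecting network delay into pods might look like the following. This is a sketch based on the `v1alpha1` API at the time of writing - the experiment name is hypothetical and exact field names may differ between Chaos-Mesh versions:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: nginx-network-delay   # hypothetical experiment name
  namespace: chaos-mesh
spec:
  action: delay               # netem chaos action: inject latency
  mode: all                   # target all matching pods
  selector:
    namespaces:
      - nginx                 # only pods in the nginx namespace
  delay:
    latency: "100ms"          # add 100ms latency to outbound packets
  duration: "30s"             # each chaos run lasts 30 seconds
  scheduler:
    cron: "@every 2m"         # repeat the experiment every 2 minutes
```

Deploying a manifest like this with `kubectl apply -f <file>.yaml` is enough for the controller-manager to pick it up and start scheduling the experiment.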
In addition to deploying chaos experiments using kubectl / helm, Chaos-Mesh also comes with its own dashboard through which you can create and monitor experiments - useful if you prefer a GUI!
See below for a high level overview of the setup:
Walk-through
Further details on Chaos-Mesh can be found in its GitHub repository and in the documentation.
Here we'll walk through setting up the first of three tests:
- A pod killing test using the Chaos-Mesh Dashboard - similar to the one covered in K8s Chaos Dive: Kube-Monkey for comparison.
- A CPU stress test using Kubernetes manifest files - covered in K8s Chaos Dive: Chaos-Mesh Part 2.
- A Memory stress test using Kubernetes manifest files - covered in K8s Chaos Dive: Chaos-Mesh Part 2.
Setting Up A Cluster
I covered local Minikube Kubernetes cluster setup in a previous tutorial, so I won't revisit it in full here - please refer to the link for details.
Once you're ready, start your cluster:
minikube start --driver=virtualbox
And this time we will also enable the Kubernetes Metrics Server so we can monitor pod resources later on:
minikube addons enable metrics-server
Deploying A Target Application
Let's deploy some hello-world like nginx pods to target in our experiments (but feel free to use your own applications!). For this we're going to use Helm - a CLI that provides repository management, templating and deployment capabilities for Kubernetes manifests.
If you're on macOS you can install Helm using Homebrew; installation instructions for other operating systems are available in the Helm Installation Docs.
brew install helm
We can now create a new Helm chart (a collection of templated Kubernetes manifests), which we'll call `nginx`:
helm create nginx
The default chart generated by the `create` command is for an `nginx` image, and we will use this out-of-the-box setup as it suits us just fine!
Next we create a new namespace for our target application(s):
kubectl create ns nginx
And finally we deploy 10 replicas of our nginx application, using Helm, to our `nginx` namespace:
helm upgrade --install nginx ./nginx \
-n nginx \
--set replicaCount=10
We can check whether the deployment was successful using both Helm and kubectl:
helm ls -n nginx
kubectl get pod -n nginx
We should see our release is deployed and there should be 10 pods running in the cluster.
Deploying Chaos-Mesh
Let's now deploy Chaos-Mesh. In this tutorial I'm going to install the latest version directly from the Chaos-Mesh GitHub repository using Helm, but you can also use the installation script provided by the Chaos-Mesh team - check out the Installation Documentation for further details.
First we clone the repository:
git clone https://github.com/chaos-mesh/chaos-mesh
We can then install the Chaos-Mesh custom resource definitions to our cluster which allow us to define and install our chaos experiments:
$ kubectl apply -f ./chaos-mesh/manifests/crd.yaml
customresourcedefinition.apiextensions.k8s.io/iochaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/kernelchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/networkchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/podchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/podnetworkchaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/stresschaos.chaos-mesh.org created
customresourcedefinition.apiextensions.k8s.io/timechaos.chaos-mesh.org created
Next we create a namespace for the Chaos-Mesh deployments:
kubectl create ns chaos-mesh
And finally, install Chaos-Mesh into the cluster:
helm upgrade --install chaos-mesh ./chaos-mesh/helm/chaos-mesh \
-n chaos-mesh \
--set dashboard.create=true
Note the `--set dashboard.create=true` flag, which lets Chaos-Mesh know you wish to use the new (experimental) dashboard.
And that's it! We can check that our installation worked successfully:
$ helm ls -n chaos-mesh
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
chaos-mesh chaos-mesh 1 2020-08-20 17:02:34.347893 +0100 BST deployed chaos-mesh-v0.1.0 v1.0.0
$ kubectl get pods -n chaos-mesh -l app.kubernetes.io/instance=chaos-mesh
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-fd568948-qzvv2 1/1 Running 0 12s
chaos-daemon-sdkq6 1/1 Running 0 12s
chaos-dashboard-6d8466f445-2dgk4 1/1 Running 0 12s
Chaos-Mesh Dashboard
Let's open the dashboard we installed in a browser and have an explore! First we can ask Minikube to tell us where it is running and launch it:
minikube service chaos-dashboard -n chaos-mesh
This should load the Chaos-Mesh Dashboard in a browser with an Overview page open showing "Total Experiments" and various other widgets.
Given we haven't created or run any experiments yet it isn't particularly exciting, so let's get cracking and start our first chaos experiment.
Experiment 1: Killing Pods
In this experiment we will create and run a new chaos experiment that will kill a random percentage of our Nginx pods every 30 seconds using the Chaos-Mesh Dashboard.
First we click the "New Experiment" button in the left side menu.
This opens up a "Create A New Experiment" page. Fill in the name of your experiment (e.g. `kill-percentage-nginx-pods`) and choose the `nginx` namespace.
There are also options to add labels and annotations to your chaos experiment object, which we will leave blank in this tutorial, but which you might find useful if your setup requires them for audit, automation or other purposes.
You will also notice on the right hand side of the screen there are options to load from previous experiments, archives (deleted experiments), as well as upload from a YAML file. These can be useful if you want to re-run an old test, or upload an existing Kubernetes manifest so you can modify the experiment using the GUI.
Once you have filled in the form, click the "Next" button to proceed.
The second experiment creation page allows you to set the scope of your experiment, i.e. which pods should be impacted. Here we will set the "Namespace Selectors" to `nginx` (this may already be pre-populated for you!) and for the second "Label Selectors" field, we will choose `app.kubernetes.io/name: nginx`.
These selectors ensure that our experiment will only target the `nginx` namespace, and within it only pods that have the `nginx` name label. For your future experiments you can target several namespaces and / or labels for more complex scenarios.
For the "Mode" dropdown, choose Random Max Percent
- this should cause a new "Mode Value" input field to appear in which we will enter "100". These two fields will configure our experiment to target a random percentage of the eligible pods between 0 and 100%.
Navigating down the page, you may notice there are some additional options which allow you to also select pods based on annotation as well as by phase, e.g. only "Running" or "Pending" pods. There is also a section in which you can manually exclude eligible pods from the experiment which we will leave as-is with all pods selected.
Click the "Next" button to navigate to the "Target" page. Here we can choose the exact type of chaos experiment we want to run from the six available offerings.
For this experiment we will use the default selected option of "Pod Lifecycle" and in the dropdown we will choose the "Pod Kill" `PodChaos` action. This will configure the experiment to target the previously selected pods for killing. Click the "Next" button.
The final page allows us to define a schedule for our pod killing experiment. Here we will type `@every 30s` into the "Cron" form field so that our experiment schedules the random pod killing every 30 seconds. The schedule accepts any valid cron syntax supported by the robfig/cron Go library.
Let's complete the experiment creation and click the "Finish" button! This will open an "All steps are complete" confirmation page from which you can either navigate back to previous steps, reset the config or submit. Let's submit our experiment by clicking the "Submit" button.
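For reference, the experiment we just built through the GUI corresponds roughly to the following manifest, which could instead have been deployed with kubectl. This is a sketch based on the `v1alpha1` API at the time of writing - exact field names may vary by Chaos-Mesh version:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-percentage-nginx-pods
  namespace: chaos-mesh
spec:
  action: pod-kill                   # the "Pod Kill" PodChaos action
  mode: random-max-percent           # the "Random Max Percent" mode
  value: "100"                       # the "Mode Value" - up to 100% of eligible pods
  selector:
    namespaces:
      - nginx                        # the "Namespace Selectors" field
    labelSelectors:
      app.kubernetes.io/name: nginx  # the "Label Selectors" field
  scheduler:
    cron: "@every 30s"               # the "Cron" schedule field
```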
If we now navigate to the "Experiments" tab in the left side menu we can see our new `PodChaos` experiment listed.
Clicking on the experiment we are taken to a details page where we can see the key experiment configuration, a timeline showing experiment execution, and an events table allowing you to view details on a particular scheduled event - in this case a scheduled pod kill every 30 seconds.
At the top there are also some options to pause the experiment and archive it (which will delete the experiment). In the "Configuration" section there is also an "Update" button which allows you to modify the experiment YAML in an editor modal.
From the timeline we can see that our experiment is running every 30 seconds, and we can confirm this by watching our Nginx pods in the cluster where we can see a random percentage of the pods are being killed every 30s:
$ kubectl get pods -n nginx -w
NAME READY STATUS RESTARTS AGE
nginx-5c96c8f58b-7cstm 1/1 Running 0 100s
nginx-5c96c8f58b-8f9n2 0/1 Running 0 10s
nginx-5c96c8f58b-8htvx 1/1 Running 0 100s
nginx-5c96c8f58b-9vw8v 1/1 Running 0 70s
nginx-5c96c8f58b-cczvx 1/1 Running 0 2m10s
nginx-5c96c8f58b-dnxbz 1/1 Running 0 100s
nginx-5c96c8f58b-p8svr 1/1 Running 0 10s
nginx-5c96c8f58b-plzf6 1/1 Running 0 10s
nginx-5c96c8f58b-ptlsz 1/1 Running 0 100s
nginx-5c96c8f58b-rk4ht 0/1 Running 0 10s
Awesome! We have set up a pod killing chaos experiment and can see it successfully killing our pods. Let's pause the experiment and archive it to remove the experiment from the cluster using buttons on the experiment details page (also available on the "Experiments" page).
You can still find information on your experiment by visiting the "Archives" tab on the left side menu which provides you with a full report on every chaos experiment you have run.
Clean-up
Let's clean-up and remove everything we've created today (skip this if you are progressing onto part 2 of this tutorial!).
helm delete chaos-mesh -n chaos-mesh
kubectl delete ns chaos-mesh
kubectl delete crd iochaos.chaos-mesh.org
kubectl delete crd kernelchaos.chaos-mesh.org
kubectl delete crd networkchaos.chaos-mesh.org
kubectl delete crd podchaos.chaos-mesh.org
kubectl delete crd podnetworkchaos.chaos-mesh.org
kubectl delete crd stresschaos.chaos-mesh.org
kubectl delete crd timechaos.chaos-mesh.org
helm delete nginx -n nginx
kubectl delete ns nginx
minikube stop
minikube delete
That's all folks for this tutorial!
There's a lot to take in, so I have chosen to separate the CPU and memory based experiments into a second follow-up post, K8s Chaos Dive: Chaos-Mesh Part 2.
Enjoy the tutorial? Have questions or comments? Or do you have an awesome way to run chaos experiments in your Kubernetes clusters? Drop me a message in the section below or tweet me @CraigMorten!
Till next time!