Karthik Satchitanand for LitmusChaos


Monitoring Litmus Chaos Experiments

Without Observability, there is no Chaos Engineering. This is a line that I picked from a nice article on chaos, and I couldn't agree more. The very nature of voluntary chaos injection demands that we have the right monitoring aids to validate our experiment's hypothesis around application/microservices behavior. Typically, that can be mapped to The Four Golden Signals.

Having said that, observability has many facets, just as chaos engineering does. Chaos not only helps test resiliency in terms of service availability (HA); it is also a means to refine alerting and notification mechanisms, streamline the incident response structure, and measure key performance indicators (KPIs) such as the mean time to detect an anomaly (MTTD), the mean time to recover to optimal performance (MTTR), and sometimes even the time to resolve (another MTTR!), whether via self-heal or manual effort, in cases where the chaos experiment is deliberately executed with a high blast radius. There are several tools you could employ today to obtain and visualize this data, which is the other facet of observability I mentioned earlier. Some tools can even help with automated root cause analysis. Check out this cool demo by the folks at Zebrium, which demonstrates automated detection of incidents induced via Litmus Chaos experiments.

While there is a lot to discuss and learn about the whys and hows of observability with chaos engineering, in this blog we shall start with a simple way of mapping application behavior to ongoing chaos, i.e., a way to juxtapose application metrics with chaos events. To do that, we will make use of the de-facto open-source monitoring stack of Prometheus & Grafana. This is intended to get you rocking on your chaos observability journey, which will only get more exciting with the continuous enhancements being added to the LitmusChaos framework.

Test Bed

What better than the sock-shop demo application to learn about microservices behavior? A quick set of commands should get you started. A Kubernetes cluster is all you need!

  • Obtain the demo artefacts
git clone https://github.com/litmuschaos/chaos-observability.git
cd chaos-observability/sample-application/sock-shop
  • Setup Sock-Shop Microservices Application
kubectl create ns sock-shop
kubectl apply -f deploy/sock-shop/
  • Verify that the sock-shop microservices are running
kubectl get pods -n sock-shop
  • Setup the LitmusChaos Infrastructure (a quick verification sketch follows this list)
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.6.0.yaml
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.6.0?file=charts/generic/experiments.yaml -n litmus 
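Before moving on, it helps to confirm that the chaos infrastructure actually came up. A quick check, assuming the operator and experiment CRs were installed into the litmus namespace as per the manifests above:

# Confirm the chaos-operator pod is running
kubectl get pods -n litmus

# Confirm the generic experiment CRs (e.g. pod-cpu-hog, pod-memory-hog) were created
kubectl get chaosexperiments -n litmus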

Monitoring Aids

The LitmusChaos framework generates Kubernetes events against the ChaosEngine & ChaosResult custom resources, right from the pre-chaos validation checks, through chaos injection, to the post-chaos health checks, so that you can trace what is happening over the course of the chaos experiment. Converting these events into metrics is a great way to integrate with existing off-the-shelf application dashboards and gain a clear understanding of application behavior through chaos injection and revert actions.
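For instance, the raw events can be inspected directly with kubectl even before any metrics conversion. A small sketch (the namespace assumes the ChaosEngines are created in litmus, as in this walkthrough):

# List Kubernetes events emitted against ChaosEngine resources
kubectl get events -n litmus --field-selector involvedObject.kind=ChaosEngine

# The same events also show up under the Events section of a describe
kubectl describe chaosengine <engine-name> -n litmus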

In this exercise, we make use of Heptio's event router to convert the chaos events into metrics and then instrument the standard sock-shop application's Grafana dashboard with appropriate queries to achieve our goal.
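To give a feel for what this enables, the dashboard queries end up looking roughly like the sketch below. It assumes the event router exports event counts as Prometheus counters (heptio_eventrouter_normal_total and the like) labeled with the involved object's kind, name and reason; check the metrics actually exposed by your event router build for the exact names.

# PromQL sketch: rate of chaos-related events, broken down by the targeted object and reason
sum(rate(heptio_eventrouter_normal_total{involved_object_kind=~"ChaosEngine|ChaosResult"}[1m])) by (involved_object_name, reason)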

Setup the Monitoring Infrastructure

  • Step-1: Let's set up the event router with the HTTP sink to convert the Kubernetes cluster events into metrics.
kubectl apply -f deploy/litmus-metrics/01-event-router-cm.yaml
kubectl apply -f deploy/litmus-metrics/02-event-router.yaml
  • Step-2: Set up the Prometheus & Grafana deployments, exposed via NodePort services (you could change them to LoadBalancer if you prefer). An illustrative scrape-config sketch follows this list of steps.
kubectl apply -f deploy/monitoring/01-monitoring-ns.yaml
kubectl apply -f deploy/monitoring/02-prometheus-rbac.yaml
kubectl apply -f deploy/monitoring/03-prometheus-configmap.yaml
kubectl apply -f deploy/monitoring/04-prometheus-alert-rules.yaml
kubectl apply -f deploy/monitoring/05-prometheus-deployment.yaml
kubectl apply -f deploy/monitoring/06-prometheus-svc.yaml
kubectl apply -f deploy/monitoring/07-grafana-deployment.yaml
kubectl apply -f deploy/monitoring/08-grafana-svc.yaml
  • Step-3: Access the Grafana dashboard via the NodePort (or LoadBalancer) service IP

Note: To change the service type to LoadBalancer, run kubectl edit svc prometheus -n monitoring and replace type: NodePort with type: LoadBalancer.

  kubectl get svc -n monitoring 

Default username/password credentials: admin/admin

  • Step-4: Add the Prometheus datasource for Grafana via the Grafana Settings menu


  • Step-5: Import the Grafana dashboard "Sock-Shop Performance" provided here

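For reference, the event router metrics reach Prometheus only if a scrape job picks its pod up. The actual configuration ships in deploy/monitoring/03-prometheus-configmap.yaml; the snippet below is just an illustrative sketch of one common way to do it (annotation-based pod discovery), not the literal contents of that file.

# Illustrative Prometheus scrape job: scrape any pod annotated with prometheus.io/scrape=true
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true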

Execute the Chaos Experiments

For the sake of illustration, let us execute a CPU hog experiment on the catalogue microservice & a memory hog experiment on the orders microservice in a staggered manner.

kubectl apply -f chaos/catalogue/catalogue-cpu-hog.yaml

Wait for ~60s

kubectl apply -f chaos/orders/orders-memory-hog.yaml
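The exact manifests live under the chaos/ directory of the repo; a ChaosEngine for the catalogue CPU hog looks roughly like the sketch below. Field values such as the app label, service account and duration are illustrative, following the litmuschaos.io/v1alpha1 schema used by operator 1.6.0.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: catalogue-cpu-hog
  namespace: litmus
spec:
  annotationCheck: "false"          # illustrative; the repo manifest may rely on annotation checks instead
  engineState: "active"
  appinfo:
    appns: sock-shop                # namespace of the application under test
    applabel: "name=catalogue"      # label selector for the catalogue deployment (illustrative)
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"           # chaos duration in seconds (illustrative)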

Verify execution of chaos experiments

kubectl describe chaosengine catalogue-cpu-hog -n litmus
kubectl describe chaosengine orders-memory-hog -n litmus
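In addition, each experiment records its verdict in a ChaosResult resource, named <engine-name>-<experiment-name> by convention (the name below assumes the catalogue engine runs the pod-cpu-hog experiment):

# List the ChaosResult resources created by the experiments
kubectl get chaosresults -n litmus

# Inspect the verdict and status of a specific result
kubectl describe chaosresult catalogue-cpu-hog-pod-cpu-hog -n litmus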

Visualize Chaos Impact

Observe the impact of chaos injection through increased latency & reduced QPS (queries per second) on the microservices under test.
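If you want to rebuild or extend these panels, the underlying queries are roughly of the form below. This assumes the sock-shop services export the request_duration_seconds histogram with a name label, which is what the standard Sock-Shop Performance dashboard relies on.

# 99th percentile request latency per service
histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{name=~"catalogue|orders"}[1m])) by (name, le))

# Queries per second (QPS) per service
sum(rate(request_duration_seconds_count{name=~"catalogue|orders"}[1m])) by (name)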


Conclusion

As you can see, this is an attempt to correlate application stats with the failures injected, and hence a good starting point in your chaos monitoring journey. Try this out & share your feedback! A lot more can be packed into the dashboards to make the visualization more intuitive. Join us in this effort and be a part of SIG-Observability within LitmusChaos!!

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?
Join Our Community On Slack For Detailed Discussion, Feedback & Regular Updates On Chaos Engineering For Kubernetes: https://kubernetes.slack.com/messages/CNXNB0ZTN
(#litmus channel on the Kubernetes workspace)
Check out the Litmus Chaos GitHub repo and do share your feedback: https://github.com/litmuschaos/litmus
Submit a pull request if you identify any necessary changes.
