Hamdi KHELIL

Chaos Engineering: Let's Break Everything! 😈

Introduction

Hey there! 👋 If you're running your applications on Kubernetes, you might already know that things can go wrong in unexpected ways. That's where chaos engineering comes in! Chaos engineering is all about intentionally injecting failures into your system to see how it behaves under stress. The idea is to discover weaknesses and fix them before they can cause real problems.

Today, we're diving into Chaos Mesh, an awesome tool that makes chaos engineering in Kubernetes super easy and fun (well, as fun as breaking things can be!). We'll go step-by-step through setting up Chaos Mesh and show you how to run some cool chaos experiments to test your app's resilience.

Setting Up Chaos Mesh

First things first, let's get Chaos Mesh installed on your Kubernetes cluster. Don't worry, it's straightforward!

1. Installing Chaos Mesh with Helm

🚀 Step 1: Add the Chaos Mesh Helm repository:

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

🚀 Step 2: Install Chaos Mesh:

kubectl create ns chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing

🚀 Step 3: Verify the installation:

kubectl get pods -n chaos-testing

You should see a bunch of pods up and running, like chaos-controller-manager, chaos-daemon, and chaos-dashboard. 🎉

2. Accessing the Chaos Mesh Dashboard

Chaos Mesh comes with a handy web dashboard where you can create and manage your chaos experiments.

🌐 Step 4: Port-forward the dashboard service:

kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333

🌐 Step 5: Access the dashboard by heading to http://localhost:2333 in your browser. You'll now be able to start breaking things… I mean, testing things! 😅
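Note: depending on your Chaos Mesh version (2.x enables security mode by default), the dashboard may ask you for a token when you log in. A quick-and-dirty way to generate one on a test cluster, assuming Kubernetes 1.24+ and using an intentionally broad cluster-admin binding (use a narrower role anywhere that matters), looks roughly like this:

# The service account name is just a placeholder
kubectl create serviceaccount chaos-dashboard-user -n chaos-testing
kubectl create clusterrolebinding chaos-dashboard-user --clusterrole=cluster-admin --serviceaccount=chaos-testing:chaos-dashboard-user

# Print a token to paste into the dashboard login form
kubectl create token chaos-dashboard-user -n chaos-testing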

Creating Chaos Experiments

Now that Chaos Mesh is up and running, let's start experimenting! Below are some chaos scenarios you can try out, along with the main options you can configure in each experiment.

1. Simulating Network Latency

🚦 What's the deal?

Network latency can happen for all sorts of reasons, and it can really mess with your app's performance. Let's see how your app handles it.

Example:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "my-app"
  delay:
    latency: "200ms"
  duration: "1m"
  scheduler:
    cron: "@every 3m"

Main Options:

  • action: Type of network fault. Options include delay, loss, duplicate, corrupt, partition.
  • mode: Specifies how the chaos is applied. Options include one, all, fixed, fixed-percent, random-max-percent.
  • selector: Used to select target pods based on namespace, labels, or fields.
  • delay.latency: Time to delay network packets.
  • duration: How long the experiment should last.
  • scheduler: Defines when the experiment should run (using cron syntax).
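If you prefer the CLI over the dashboard, you can run the experiment by saving the manifest above to a file and applying it with kubectl (the file name here is just an example):

kubectl apply -f network-delay.yaml

# Check the experiment's status and the events Chaos Mesh records on it
kubectl describe networkchaos network-delay -n chaos-testing

# Stop the experiment when you're done
kubectl delete -f network-delay.yaml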

📚 Learn more about NetworkChaos in the Chaos Mesh docs

2. Killing a Pod (Pod Chaos)

💥 What's the deal?

Sometimes, pods just die. It could be due to resource exhaustion, bugs, or something else. Let's simulate a pod crash and see what happens!

Example:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "my-app"
  duration: "10s"
  scheduler:
    cron: "@every 1m"

Main Options:

  • action: Type of pod fault. Common options are pod-kill, container-kill, pod-failure.
  • mode: Specifies how the chaos is applied. Options include one, all, fixed, fixed-percent, random-max-percent.
  • selector: Used to select target pods based on namespace, labels, or fields.
  • duration: How long the experiment should last.
  • scheduler: Defines when the experiment should run (using cron syntax).
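As a quick illustration of how mode and value work together, here is a variation of the example above that kills roughly half of the matching pods instead of a single one (the name pod-kill-half is just a placeholder):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-half
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent
  value: "50"   # value is required for the fixed, fixed-percent and random-max-percent modes
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "my-app"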

📚 Learn more about PodChaos in the Chaos Mesh docs

3. CPU Stress Test

🔥 What's the deal?

High CPU usage can slow things down or even cause crashes. Let's crank up the CPU usage and see how your app handles the heat.

Example:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "my-app"
  stressors:
    cpu:
      workers: 4
  duration: "30s"
  scheduler:
    cron: "@every 5m"

Main Options:

  • stressors.cpu.workers: Number of CPU workers to stress the target pods.
  • mode: Specifies how the chaos is applied. Options include one, all, fixed, fixed-percent, random-max-percent.
  • selector: Used to select target pods based on namespace, labels, or fields.
  • duration: How long the experiment should last.
  • scheduler: Defines when the experiment should run (using cron syntax).
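While the experiment is active, it's worth watching the target pods' resource usage from the CLI too. Assuming metrics-server is installed in your cluster, and reusing the example label from above:

# Snapshot of CPU/memory for the target pods; run it a few times during the experiment
kubectl top pod -n default -l app=my-app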

📚 Learn more about StressChaos in the Chaos Mesh docs

4. Simulating Disk Pressure

💾 What's the deal?

Running out of disk space or dealing with slow disk I/O can cause major issues. Let's simulate disk pressure and observe the impact.

Example:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: disk-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "my-app"
  # StressChaos's built-in stressors cover CPU and memory; for disk I/O this
  # example passes raw stress-ng flags instead (two workers writing temp files)
  stressngStressors: "--hdd 2 --hdd-bytes 1g"
  duration: "40s"
  scheduler:
    cron: "@every 7m"

Main Options:

  • stressngStressors: Raw stress-ng flags passed to the target pods; here --hdd 2 --hdd-bytes 1g starts two workers that continuously write temporary files.
  • mode: Specifies how the chaos is applied. Options include one, all, fixed, fixed-percent, random-max-percent.
  • selector: Used to select target pods based on namespace, labels, or fields.
  • duration: How long the experiment should last.
  • scheduler: Defines when the experiment should run (using cron syntax).
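By the way, if an experiment turns out to be more disruptive than expected, Chaos Mesh lets you pause it with an annotation instead of deleting it. A sketch using the disk-stress experiment above (the same annotation should work for the other chaos kinds too):

# Pause the experiment without removing it
kubectl annotate stresschaos disk-stress experiment.chaos-mesh.org/pause=true -n chaos-testing

# Resume it later by dropping the annotation
kubectl annotate stresschaos disk-stress experiment.chaos-mesh.org/pause- -n chaos-testing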

📚 Learn more about StressChaos in the Chaos Mesh docs

5. Network Partition

🌐 What's the deal?

Network partitions, where parts of your system can't talk to each other, can cause all kinds of chaos. Let's split your network and see what breaks!

Example:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "frontend"
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        "app": "backend"
  duration: "60s"
  scheduler:
    cron: "@every 10m"

Main Options:

  • action: Type of network fault. Here it's partition, which simulates a network partition.
  • mode: Specifies how the chaos is applied. Options include one, all, fixed, fixed-percent, random-max-percent.
  • selector: Used to select target pods based on namespace, labels, or fields.
  • direction: Defines the direction of the partition (from, to, both).
  • target.selector: Specifies the target pods that will be isolated from the selected pods.
  • duration: How long the experiment should last.
  • scheduler: Defines when the experiment should run (using cron syntax).

📚 Learn more about NetworkChaos in the Chaos Mesh docs
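One simple way to see the partition in action is to call the backend from a frontend pod while the experiment is running. The command below assumes a frontend Deployment, a backend Service reachable on port 80, and an image that ships curl, so adjust it to your setup:

# Should time out while the partition is active and succeed again once it ends
kubectl exec -n default deploy/frontend -- curl -s -m 3 http://backend/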

Monitoring the Impact of Chaos

While your chaos experiments are running, it's super important to keep an eye on your app's performance. Here's how to do it:

  • Use Monitoring Tools: Make sure you've got tools like Prometheus and Grafana set up to track things like response times, error rates, and resource usage. 📊
  • Check Your Logs: Keep an eye on your logs to spot any errors or warnings that pop up during the chaos experiments. 🕵️
  • Analyze Metrics: Look at the data you're collecting to understand how your app is handling the chaos. Are there timeouts? Increased latency? Use this info to improve your system's resilience. 🔍
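If you prefer the terminal over dashboards, here are a couple of commands worth keeping open during an experiment (the label and experiment name reuse the examples from this post):

# Tail the target app's logs to catch errors as they happen
kubectl logs -n default -l app=my-app -f --tail=100

# See the events Chaos Mesh records on the experiment object
kubectl describe podchaos pod-failure -n chaos-testing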

Best Practices for Chaos Engineering with Chaos Mesh

🌟 Start Small: Don't go all-in right away. Start with simple experiments and work your way up to more complex scenarios.

🌟 Test in Staging: Before you unleash chaos in production, run your experiments in a staging environment to avoid any nasty surprises.

🌟 Automate Tests: Integrate chaos experiments into your CI/CD pipeline. This way, you'll automatically test your app's resilience with every new deployment.
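For example, a post-deploy resilience check can be as small as a shell step in whatever CI system you use; the manifest path, deployment name, and timings below are placeholders:

#!/usr/bin/env bash
set -euo pipefail

kubectl apply -f chaos/pod-failure.yaml                              # start the experiment
sleep 90                                                             # let it kill a pod or two
kubectl rollout status deployment/my-app -n default --timeout=120s   # the app should recover on its own
kubectl delete -f chaos/pod-failure.yaml                             # clean up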

🌟 Monitor Everything: Make sure you have comprehensive monitoring in place so you can quickly spot and respond to any issues caused by the chaos experiments.

🌟 Iterate and Improve: Use what you learn from the experiments to make your app stronger. Keep refining your chaos tests as your system evolves.

Conclusion

And there you have it! 🎉 Chaos Mesh is an incredible tool that makes chaos engineering in Kubernetes not only possible but also enjoyable. By running these experiments, you'll uncover weaknesses in your system that you might never have found otherwise.

Remember, the goal of chaos engineering isn't to break things just for fun (though it can be fun 😜), but to learn how to build more resilient and reliable systems. So start small, experiment often, and keep improving your app's resilience. Happy chaos engineering! 🚀

For more detailed documentation on each type of chaos experiment and more advanced configurations, you can check out the official Chaos Mesh Documentation.
