It's all Chaos! And it Makes for Resilience at Scale

#sre #devops

By: Emily Arnott

Originally published on Failure is Inevitable.

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.

But integrating chaos engineering with other SRE tools and practices can be challenging. To get the most from your experiments, you’ll need to tie in learnings across all your reliability practices. You’ll also need to adjust your chaos engineering as your organization scales. In this blog post, we’ll look at:

How SRE and chaos engineering intersect
Best practices for chaos engineering
What experiments look like at different maturity levels

Chaos engineering and SRE

It’s clear how the goals of SRE and chaos engineering align. Both practices encourage teams to build resilience into their systems. But the connections don’t stop there. Many SRE practices integrate with chaos engineering to increase the effectiveness of both. Below are a few examples.

SLOs as chaos engineering scoreboards
When running chaos engineering experiments, it's important to determine how impactful the hypothetical failure would be. This can be difficult.

Consider a test that shows that an entire service would go offline if a certain server fails. You estimate that it would take an hour or so to return to normal operation. It sounds frightening, but what if that service was only used by a tiny fraction of your customers?

Another test might show that, when traffic surpasses a certain threshold, a page accessed by every customer loads 3 seconds slower. This scenario could have more customer impact than the other. Teams would want to focus on resolving this issue first.

SLOs allows you to compare these scenarios using the most important metric: customer impact. SLIs, or service level indicators, are built from the service metrics that matter most to customers. SLOs, or service level objectives, show the level of failure customers will tolerate.

When you run chaos experiments, you can determine how the experiment would affect the SLO. This gives you a triaging model for the lessons of different experiments. You can then focus on preventing the incidents that affected the SLO most.

Chaos engineering as runbook bootcamp
It's important to simulate the impact of a hypothetical failure and work to prevent it. But there will still be incidents, no matter how many experiments we run. Chaos engineering also gives teams the space to practice response measures. This can help responders work faster and with more confidence during a real incident.

In SRE, incident responses are codified as runbooks. These are guides broken down into modular checks and steps. Where possible, runbooks are automated to save toil. Of course, runbooks can never be perfect. Regular review is necessary to ensure that all information is up-to-date and comprehensive.

Chaos engineering can help improve runbooks by providing more opportunities to evaluate them. Teams will not use runbooks addressing a type of rare, catastrophic failure often. When the failure does occur, your team will need to know it can trust the runbook. By running chaos experiments of this scenario, you’ll find potential stumbling blocks.

Runbooks can also serve as inspiration for chaos experiments. If a runbook has been “gathering dust,” you can design an experiment to put it to use. This will ensure it’s up-to-date with your system and still useful.

Building a library of chaos engineering retrospectives
A valuable tool in your SRE tool belt is the incident retrospective. This is a document built by teams responding to an incident. It contains the incident timeline, key communications, follow-up actions, and more. Incident retrospectives form a valuable hub of knowledge. They are invaluable for onboarding and developing a culture of continuous improvement.

Chaos engineering can help build your library of retrospectives. Teams should write retrospectives about chaos experiments as they would for a real incident. Include details about why and how the experiment was conducted to be thorough. Reviewing these retrospectives can provide the same beneficial insights as a real incident.

Conversely, incident retrospectives can motivate good chaos experiments. Imagine your team had difficulty responding to a particular incident. Reviewing the incident retrospective revealed why the team stumbled. Your team creates a plan for incidents like these moving forward. “Replaying” the incident will give you a direct comparison between the new and old methods. It can help you avoid making the same mistakes.

A maturity model of chaos engineering

As organizations grow in maturity, adopting chaos engineering as a practice provides more opportunities. But there are also challenges. This chart breaks down what to expect at each stage.

No matter what maturity your organization is, the best time to try chaos engineering is now. The sooner you can build experimenting into your routines, the more time you’ll have to develop your expertise.

Blameless can help you make the most of your chaos engineering experiments. Our SLO, runbook documentation, and incident retrospective tools can help you get the most from every experiment. To see how, check out a demo.

If you enjoyed this blog post, check out these resources:

DEV Community

It's all Chaos! And it Makes for Resilience at Scale

Chaos engineering and SRE

A maturity model of chaos engineering

Top comments (0)

Read next

How to Thoroughly Cancel All AWS Services: A Step-by-Step Guide

Easily Share ComfyUI Online Using Pinggy

DevOps Engineer Career Path Guide

Headlamp: Microsoft’s Open-Source k8s Manager