A year ago, I got really interested in the idea of chaos engineering. I had read a couple of blog posts and was ready to start breaking things in production. In a controlled way, of course. I created a small application that would let us run some basic chaos experiments. In this post, I want to share some things I learned while taking our first steps in chaos engineering.
Chaos engineering is the practice of understanding and improving the resiliency of your systems through experimentation. First you form a hypothesis about how your system will behave when you put it under stress in some way. Then you verify whether the system behaves as you expected. For example, you could cause a network problem between the reader and writer nodes of your database cluster, and then verify that this does not cause any requests to the application that uses this database to fail.
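That hypothesis-then-verify loop can be sketched in code. This is a minimal, hypothetical skeleton (the `Experiment` class, `run_experiment` function, and the health-check/fault callables are all assumptions for illustration, not a real chaos tool): check that the system is healthy, inject the fault, verify the hypothesis, and always roll back.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str
    hypothesis: str
    steady_state: Callable[[], bool]   # e.g. "error rate is below 1%"
    inject_fault: Callable[[], None]   # e.g. block traffic between DB nodes
    rollback: Callable[[], None]       # undo the injected fault

def run_experiment(exp: Experiment) -> bool:
    """Return True if the hypothesis held: the system stayed healthy."""
    if not exp.steady_state():
        # No point injecting faults into a system that is already unhealthy.
        raise RuntimeError("system unhealthy before the experiment; aborting")
    exp.inject_fault()
    try:
        # Did the system keep behaving as we expected under the fault?
        return exp.steady_state()
    finally:
        # Always clean up, whether the hypothesis held or not.
        exp.rollback()
```

The important part is not the code but the discipline it encodes: a named hypothesis up front, a steady-state check before and after, and a guaranteed rollback.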
Why do this? Because it is really hard to understand all the components of your systems and how exactly they work together. Even if you think you have a good grasp of your system as a whole, there are too many details in each component for any one person to know them all. Although we can easily think in abstractions while developing our applications, in production everything is concrete and our assumptions are tested whether we want it or not.
Let's go over the lessons we learned.
There is no point in starting with chaos engineering if you don't have the right level of observability into your systems. It is the process of investigating why your system didn't do what you hypothesized that will make you understand your system's behavior better. Without proper logs and metrics, that is going to be hard.
To get started, you need to decide what you are going to experiment with. For us, running our workloads in AWS, certain scenarios are relatively easy to test. Rebooting or terminating an EC2 instance is a straightforward action. It is very tempting to list everything that you can easily do, and to create experiments for those. So that is what I did when I introduced our own chaos creator.
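A reboot experiment like that really is only a few lines with boto3, the official AWS SDK for Python. This is a hedged sketch, not our actual tool: the function names and the idea of picking a random target are my own illustration, and `DryRun=True` asks AWS to only verify permissions instead of actually rebooting anything.

```python
import random

def pick_target(instance_ids: list[str]) -> str:
    """Pick a random instance so the experiment doesn't always hit the same host."""
    return random.choice(instance_ids)

def reboot_instance(instance_id: str, dry_run: bool = True) -> None:
    """Reboot one EC2 instance; with dry_run=True AWS only checks permissions."""
    import boto3  # AWS SDK for Python; requires credentials to be configured
    ec2 = boto3.client("ec2")
    ec2.reboot_instances(InstanceIds=[instance_id], DryRun=dry_run)
```

The ease of writing this is exactly the trap the next paragraph describes: the actions that are simplest to script are not necessarily the failures your system will actually see.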
We ended up with a tool that was performing server-maintenance-type tasks on a schedule. Few of the reboots and failovers it performed resembled failures that actually happen in the wild. It's easy to make things break; it's harder to make things break in a realistic way, even though that is where most of the learning happens.
We care very much about completely automating repetitive tasks at Coolblue. When I started working on these experiments, I was convinced by the literature that mentioned the importance of automatically running your experiments. In practice, this didn't work out as well as I would have liked.
My mistake was automating the experiments immediately. Defining and creating experiments is a process of exploration. By automating right from the start, you slow down the rate at which you can try new things. I quickly settled on a small number of experiments and kept rerunning those daily or weekly. But this defeated the purpose of the whole exercise, which is to learn about the behavior of your systems.
As soon as you learn something, you want to cement that knowledge. This is where automation becomes very powerful. It allows you to regularly rerun experiments, so you can check if they still have the same outcome. This allows you to act on any regressions, in your application or in your understanding of your application.
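The regression check itself can be very simple. As a hypothetical sketch (the experiment names and the `known_outcomes` record are invented for illustration): keep a record of the outcomes you established during manual exploration, rerun the experiments on a schedule, and flag anything that no longer matches.

```python
# Hypothetical record of outcomes established during manual exploration:
# True means the hypothesis held the last time we investigated it.
known_outcomes = {"db-failover": True, "instance-reboot": True}

def find_regressions(latest: dict[str, bool]) -> list[str]:
    """Names of experiments whose latest outcome contradicts what we learned."""
    return [name for name, passed in latest.items()
            if name in known_outcomes and passed != known_outcomes[name]]
```

A regression here means one of two things, both worth acting on: either the system changed, or your understanding of it was wrong.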
For companies just starting out with chaos engineering, automation is a good way to detect regressions, but not a great way to start. In the end, it's 20% about breaking things in production and 80% about learning how your system behaves in production. And that learning you cannot automate.