Michael Levan

Posted on Aug 24, 2021

Three Tips To Understand Chaos Engineering

#chaosengineer #devops #sre #cloudnative

Chaos Engineering is new as a name and as a process, but it's not new to technology. Engineers have been doing the types of testing, experimentation, and research you'd see in Chaos Engineering for years. Tests have been run in production and development environments since the beginning of computing, but here's the key difference:

Chaos Engineering is planned and it's a role carved out for these types of tests.

The typical testing you'll see in most environments is done by the engineer working on the product. Whether they're in software, DevOps, cloud, architecture, etc., the testing is typically done by them.

The question then becomes: why do we need Chaos Engineering, and how can we understand it?

In this blog post, you'll learn just that.

Why Chaos Engineering

Most engineers who have worked in cloud, DevOps, SRE, or sysadmin environments have been woken up in the middle of the night. Whether it's for an application that went down or a server that couldn't handle the load, the engineer is woken up and has to solve the issue. If you've been through this, you can probably say with confidence that it's quite annoying.

A few questions pop up those nights:

Why wasn't this tested for the load?
Why is this server so small for this type of application?
What could've caused this? Where are the logs?
Does this need some sort of scaling or high availability?

And many other questions...

These questions, along with hundreds of others, are what a Chaos Engineer answers.

Chaos Engineering exists to answer the what-if and why questions. It's there to ensure that a distributed system, microservice architecture, cloud environment, or application can withstand whatever is thrown at it.

Chaos Engineers typically find problems in a step-by-step manner:

1. Start by defining the ready-state (often called the steady state): what the environment is supposed to look like and how you're expecting it to perform.
2. Create a hypothesis and conduct research for both the testing/staging experimental environment and the production environment.
3. Perform tests and experiments that reproduce real-world issues. Two perfect examples: what if a region goes down in AWS, and what if the Kubernetes cluster goes down?
4. Take what you learned from the hypothesis/research, compare it to the ready-state from step 1, and see if everything went as planned. If it didn't, you know you have to iterate. For example, if you took down a Kubernetes cluster and it didn't fail over to another cluster, you know that failover must be implemented.
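To make the steps above a little more concrete, here is a minimal sketch of that loop in Python. Everything in it is an assumption for illustration: `ReadyState`, `run_experiment`, and `FakeCluster` are hypothetical names, the metric thresholds are made up, and the "cluster" is a toy stand-in rather than a real environment or any platform's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class ReadyState:
    """Step 1: what 'healthy' looks like, expressed as metric limits (assumed values)."""
    max_error_rate: float
    max_p99_latency_ms: float

    def holds(self, m: Metrics) -> bool:
        return (m["error_rate"] <= self.max_error_rate
                and m["p99_latency_ms"] <= self.max_p99_latency_ms)


@dataclass
class ExperimentResult:
    hypothesis: str
    baseline_ok: bool  # was the system healthy before we touched it?
    survived: bool     # did the ready-state still hold during the fault?


def run_experiment(hypothesis: str,
                   ready_state: ReadyState,
                   measure: Callable[[], Metrics],
                   inject_fault: Callable[[], None],
                   rollback: Callable[[], None]) -> ExperimentResult:
    """Steps 2-4: state a hypothesis, inject a fault, compare against the ready-state."""
    # Never start an experiment on a system that is already unhealthy.
    if not ready_state.holds(measure()):
        return ExperimentResult(hypothesis, baseline_ok=False, survived=False)
    inject_fault()
    try:
        survived = ready_state.holds(measure())
    finally:
        rollback()  # always undo the fault, even if measuring raises
    return ExperimentResult(hypothesis, baseline_ok=True, survived=survived)


class FakeCluster:
    """Toy stand-in for a real environment (purely illustrative)."""

    def __init__(self, has_failover: bool) -> None:
        self.has_failover = has_failover
        self.primary_up = True

    def kill_primary(self) -> None:
        self.primary_up = False

    def restore_primary(self) -> None:
        self.primary_up = True

    def measure(self) -> Metrics:
        healthy = self.primary_up or self.has_failover
        return {"error_rate": 0.001 if healthy else 1.0,
                "p99_latency_ms": 120.0 if healthy else 30000.0}


if __name__ == "__main__":
    cluster = FakeCluster(has_failover=False)
    result = run_experiment(
        "Traffic fails over when the primary cluster goes down",
        ReadyState(max_error_rate=0.01, max_p99_latency_ms=300.0),
        cluster.measure, cluster.kill_primary, cluster.restore_primary,
    )
    if not result.survived:
        print("Hypothesis rejected: failover needs to be implemented, then iterate.")
```

In a real setup, `measure` would read from your monitoring system and `inject_fault` would call whatever tooling you use; the point of the sketch is only the order of operations and the guaranteed rollback.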

Chaos Engineering is all about testing, research, and making a production environment as stable as possible so no one gets the 2:00 AM calls, no applications go down, and users stay happy.

Experiments in and out of Production

You'll sometimes hear Chaos Engineers or DevOps/SRE folks claim that they just throw a bunch of hiccups/experiments into production and see what happens. They probably say this because it sounds cool, but it's not the right way to do it (and it's most likely not the way they do it either).

Chaos Engineering, and more importantly testing in and out of production, is all about controlled experiments. The end goal is to run the experiments in production, but you can't do that without first running those experiments in a dev/test/staging environment. If an untested experiment takes production down, management and engineering leads will never give you the opportunity to try this again.

Instead, think about Chaos testing in a straightforward, concise way:

The goal is to find holes and vulnerabilities in a system and/or application

The holes and vulnerabilities come in all shapes and sizes - the network isn't set up to take a packet storm hit, the servers won't scale out if the application requires more resources, Kubernetes pods won't come back online if they're all killed, and many other types of holes/vulnerabilities. Your job is to come up with a way to find/stop as many holes/vulnerabilities as possible.

Start in a controlled environment (testing/staging) and once you're comfortable, have all of the research, and the experiments are passing, you can move to production.
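One way to make "the experiments are passing" less of a gut feeling is to write the promotion rule down. The sketch below is only an assumption about what such a rule could look like: the function name `ready_for_production`, the run-history format, and the threshold of three consecutive passes are all hypothetical choices, not something a specific platform prescribes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def ready_for_production(staging_runs: Iterable[Tuple[str, bool]],
                         required_passes: int = 3) -> Dict[str, bool]:
    """Decide, per experiment, whether it has earned a production run.

    staging_runs: (experiment_name, passed) pairs in chronological order.
    An experiment qualifies only if its most recent `required_passes`
    staging runs all passed (assumed rule, tune it to your own comfort level).
    """
    history: Dict[str, List[bool]] = defaultdict(list)
    for name, passed in staging_runs:
        history[name].append(passed)

    return {
        name: len(results) >= required_passes and all(results[-required_passes:])
        for name, results in history.items()
    }


if __name__ == "__main__":
    runs = [
        ("kill-all-pods", False),   # first attempt found a hole
        ("kill-all-pods", True),
        ("kill-all-pods", True),
        ("kill-all-pods", True),    # three clean runs in a row since the fix
        ("region-outage", True),
        ("region-outage", False),   # still failing, stays in staging
    ]
    for experiment, ok in ready_for_production(runs).items():
        print(f"{experiment}: {'promote' if ok else 'keep in staging'}")
```

The rule doesn't replace the research and judgment described above; it just keeps an experiment with a recent failure from slipping into production by accident.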

How to Experiment

Experimenting on an application or system isn't as simple as running a test from your localhost for 5 minutes, seeing that the app works, and saying yep, we're good to go. It's far more complex than that. In most environments, there are tens or hundreds of things that can go wrong. You can't possibly know all of them up front, but it's your job to find them.

Think about experimentation for a system like a medical researcher would. Whether they're working on some sort of vaccine, cure, or controlled substance, it's their job to find any single thing that can go wrong with it. Why? Because lives are at stake.

In tech, no one's life is at stake. However, people's livelihoods are. If an application or platform constantly goes down, users will stop using it. Without users, the organization makes no money. Without money, the organization can't keep its doors open. Without its doors open, you can't get paid.

It's everyone's job, in one way, shape, or form, to help keep that from happening, but from an engineering perspective, it's your job to ensure it doesn't.

Automated Tests

From what you read about Chaos tests, it can almost sound like QA (Quality Assurance). It really shouldn't be looked at like that though. It should be looked at more as an R&D (Research and Development) type of role, but for tech folks. Because of that, you don't really want to do Chaos testing in a manual QA style. Instead, you want to automate the workflow.

Keep in mind that you cannot automate something that you've never done manually because you don't know how it's supposed to work. Manually running your first Chaos tests is perfectly normal.

A lot of Chaos Engineering platforms have schedules that you can set up to run tests automatically, and you can also build this yourself. If you find a platform that doesn't have schedules, it most likely has some sort of API. At that point, you can write some automation code to, say, kick off a test at a certain time. Maybe get a little experimental with it: put it in a Lambda function that runs on a cron-style schedule and have the Chaos test run automatically! The possibilities for automating the workflow are endless.
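As a rough sketch of the do-it-yourself route, the Python below shows a Lambda-style handler that kicks off a Chaos test over HTTP. Treat all the specifics as assumptions: the `/experiments/run` endpoint, the JSON payload, the bearer-token header, and the `CHAOS_API_URL`, `CHAOS_EXPERIMENT_ID`, and `CHAOS_API_TOKEN` environment variables are hypothetical and do not describe any particular platform's API. Check your platform's documentation for the real endpoint and authentication scheme.

```python
import json
import os
import urllib.request


def build_trigger_request(api_url: str, experiment_id: str,
                          token: str) -> urllib.request.Request:
    """Build the POST request that asks the (hypothetical) platform to run a test."""
    payload = json.dumps({"experiment_id": experiment_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/experiments/run",  # assumed endpoint
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def lambda_handler(event, context, send=urllib.request.urlopen):
    """Entry point invoked on a schedule; `send` is injectable so it can be tested offline."""
    request = build_trigger_request(
        os.environ["CHAOS_API_URL"],
        os.environ["CHAOS_EXPERIMENT_ID"],
        os.environ["CHAOS_API_TOKEN"],
    )
    with send(request, timeout=10) as response:
        return {"status": response.status}
```

To run it on a timer, you would attach a schedule to the function (for example, an Amazon EventBridge rule with a cron expression). Keeping the request-building separate from the sending makes it easy to check the payload without touching a real environment.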

Chaos Platform References

If you're interested in Chaos Engineering, I recommend a few platforms:

Gremlin
Chaos Mesh
Chaos Monkey

Chaos testing is fun, exciting, and genuinely helps an organization meet its goals. It's certainly an up-and-coming career, and I believe it will take off in the next 1-3 years.

Top comments (5)

Olaoluwa Mustapha • Sep 2 '21

I really enjoyed this. I have been very interested in chaos engineering because I believe it is one of the best ways to make highly fault-tolerant systems.

However, what happens when something messes up as a result of testing in prod?

Michael Levan • Sep 2 '21

The idea is that you conduct experiments prior to testing in prod. If something gets messed up in prod, you know that the earlier testing wasn't thorough enough and you must go back and iterate.

Jack • Sep 1 '21

This was fascinating, thanks for writing. How does one learn chaos engineering? Is it a branch of DevOps? Are there courses?

Michael Levan • Sep 2 '21

Chaos Engineering is a practice in itself and not really a position. It's about finding every single vulnerability, hole, and potential hiccup in a distributed system and/or application.

Jack • Sep 2 '21

Okay that makes sense, thank you for the explanation