Chaos Engineering

#learning #productivity #programming #development

A few years ago, if you were a software developer and asked your manager or team lead whether you could test the system in production, they would probably have given you a bombastic sideways glance, a clear sign of how absurd the idea seemed. But guess what? Today, thanks to better tooling and accumulated research, it is possible to test in production through a practice called Chaos Engineering. So what is Chaos Engineering, and why is it important?

Chaos engineering is a method of experimenting on a system by intentionally introducing failures or turbulent conditions, with the goal of building resilience and confidence in the system.
Imagine going on a boat ride with a friend who can swim but is not a very good swimmer, and you decide to push them into the lake to boost their confidence. They might struggle to stay afloat (and hate you, lol!), but the experience shows both of you what they can actually handle and where they need to improve, so the next time they are at a lake or swimming pool they are better prepared.
Chaos engineering vs Software testing
Chaos engineering differs from software testing in that it deliberately injects failures into a running system, and its goal is to build resilience. In contrast, software testing is more focused on verifying that the system works as expected under known conditions.

Chaos engineering can help answer questions such as:

What would happen if the application fails or receives too much traffic?
What happens if a particular service is unavailable?
Because chaos engineering allows us to test a system under "stressful circumstances," it provides insights into how applications can be improved for the future.
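
To make the second question concrete, here is a minimal sketch of what "a particular service is unavailable" might look like in code. Everything in it is hypothetical: the `fetch_recommendations` and `homepage` functions, the `service_up` switch, and the fallback titles are made up for illustration and do not come from any real system.

```python
class ServiceUnavailable(Exception):
    """Raised when the (simulated) dependency cannot be reached."""


def fetch_recommendations(user_id, service_up=True):
    # Hypothetical downstream call; service_up=False simulates an outage.
    if not service_up:
        raise ServiceUnavailable("recommendation service is down")
    return [f"personalized-title-for-{user_id}"]


def homepage(user_id, service_up=True):
    # The resilient behavior we hope to observe: degrade gracefully
    # instead of failing the whole page.
    try:
        titles = fetch_recommendations(user_id, service_up)
        return {"status": 200, "titles": titles, "degraded": False}
    except ServiceUnavailable:
        return {"status": 200, "titles": ["popular-title"], "degraded": True}


print(homepage(42))                    # normal path
print(homepage(42, service_up=False))  # dependency "killed" on purpose
```

If the second call crashed instead of returning a degraded page, the experiment would have found a weakness before a real outage did.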

History of Chaos Engineering
Chaos Engineering was developed at Netflix in 2010, as the company was moving to AWS cloud infrastructure to deliver its entertainment services to a growing number of customers. They built Chaos Monkey to test and ensure that a failed AWS component would not impact Netflix’s streaming experience. Why the name “Chaos Monkey”? It describes what would happen if a wild monkey were let loose in a data center and started chewing through the cables, causing a lot of “chaos,” right?

Assuming you are now convinced that chaos engineering is relevant, the next question is:
How does chaos engineering work? Here are some principles that govern it:
Define the normal state of your system - This means understanding how your system behaves when it is healthy, using metrics such as response times, error rates, and the number of services that must be running.
Create a hypothesis for stable behavior - Once you understand the stable state, you can form a hypothesis about how the system will behave if one of its components or microservices fails. Some guiding questions might be:

How does the I/O behave?
Does latency occur? If so, to what extent?

Simulate various real-world events - Think of some real-world scenarios that could cause disruptions in your system. For example, a server failure or a software error.

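Conduct experiments in production - Only the production environment, with its real traffic and real dependencies, shows how the system actually behaves.

Automate experiments and run them continuously - Systems change all the time, so a result from a one-off experiment goes stale quickly.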

Minimize the blast radius - Chaos engineering experiments can have negative effects on real users. It is therefore important to contain them by “starting small”: limit how much of the system an experiment can affect at first, then gradually widen the scope until you reach full scale.
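
The principles above can be sketched as a tiny experiment loop. This is only an illustration under made-up assumptions, not how any particular tool works: the `handle_request` function, the 90% retry success figure, the 5% error threshold, and the blast radius steps are all hypothetical.

```python
import random


def handle_request(inject_fault, rng):
    """Hypothetical request handler: returns True on success.

    When a fault is injected, we assume a retry rescues about 90% of
    requests (an invented number, purely for illustration).
    """
    if not inject_fault:
        return True
    return rng.random() < 0.9


def error_rate(blast_radius, requests, rng):
    """Fraction of failed requests when `blast_radius` of traffic is faulted."""
    failures = 0
    for _ in range(requests):
        inject = rng.random() < blast_radius
        if not handle_request(inject, rng):
            failures += 1
    return failures / requests


def run_experiment(max_error_rate=0.05, steps=(0.01, 0.05, 0.25, 1.0), seed=7):
    rng = random.Random(seed)
    # 1. Define the normal state: measure the error rate with no faults.
    results = [("baseline", error_rate(0.0, 1000, rng))]
    # 2. Hypothesis: the error rate stays at or below max_error_rate.
    # 3. Start small and widen the blast radius step by step.
    for radius in steps:
        rate = error_rate(radius, 1000, rng)
        results.append((radius, rate))
        if rate > max_error_rate:
            # Hypothesis violated: stop before affecting more traffic.
            results.append(("aborted_at", radius))
            break
    return results


for label, value in run_experiment():
    print(label, value)
```

With these assumed numbers the expected error rate is roughly 0.1 times the blast radius, so the small steps should pass and the run should stop only once most of the traffic is affected. The point of the abort check is that a violated hypothesis is discovered at the smallest scope that reveals it.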

Tools for Chaos Engineering
Some of the tools available for chaos engineering include Chaos Monkey, Litmus, Chaos Mesh, ChaosBlade, ChaosNative, and Gremlin.