Recently, we had the pleasure of sharing Reliably's ideas on proactive reliability practices with a fabulous group of devops, engineers, architects and SREs at a bank. In conversations that followed, we discussed how chaos engineering relates to the mathematical definitions of chaos.
In my view, chaos engineering principles do align with mathematical chaos, in which a chaotic dynamical system is highly sensitive to its initial conditions: small differences in the inputs, and in how those inputs evolve over time, can produce disproportionately different, non-linear outcomes.
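To make that sensitivity concrete, here is a minimal sketch using the logistic map, a standard textbook example of a chaotic system. It is not taken from the conversations above; the function name `logistic_orbit`, the starting value 0.2 and the perturbation of 1e-10 are illustrative assumptions.

```python
def logistic_orbit(x0: float, r: float = 4.0, steps: int = 50) -> list[float]:
    """Iterate the logistic map x -> r * x * (1 - x), which is chaotic at r = 4."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


# Two starting points that differ by one part in ten billion (illustrative values).
a = logistic_orbit(0.2)
b = logistic_orbit(0.2 + 1e-10)

# Distance between the two orbits at every step.
gaps = [abs(x - y) for x, y in zip(a, b)]

print(f"gap after 1 step: {gaps[1]:.2e}")
print(f"largest gap within 50 steps: {max(gaps):.2f}")
```

The two orbits start out indistinguishable and end up unrelated, which is the property that makes long-range prediction impractical even when the rule itself is simple and fully known.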
A good analogy is the weather. It exhibits many of the same characteristics as a complex computer system: the input conditions are themselves complex systems, so the outcome is complex and hard to model accurately (which is why 'weather predictions' should be renamed 'weather probabilities').
Any sufficiently complex system can exhibit the behaviours described by chaos theory, and chaos engineering is simply a way of exploring how the system might respond to turbulent conditions. 'Turbulence' was carefully chosen as the term in the Principles of Chaos because it evokes the weather example: any attempt to prove or predict a system's behaviour from its design principles alone runs into the same problems as predicting the weather.
The upside of computer systems over weather systems is that, thanks to tools like the Chaos Toolkit (CTK), we don't have to simulate the entire system (as you do with weather). Instead, we run a bounded, controlled experiment that lets us develop an appreciation for the probability of an outcome given certain inputs.
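As a rough illustration of what "bounded and controlled" can mean, the sketch below runs a fixed number of trials against a toy service while injecting a fixed failure rate, then reports how often requests still succeed. It borrows the idea of a steady-state hypothesis from chaos engineering but does not use the Chaos Toolkit API; the toy service, the names `request_succeeds` and `run_experiment`, the 20% failure rate and the 95% tolerance are all assumptions made for the example.

```python
import random


def request_succeeds(failure_rate: float, retries: int, rng: random.Random) -> bool:
    """Toy service call: succeeds if any attempt gets past the flaky dependency."""
    return any(rng.random() >= failure_rate for _ in range(retries + 1))


def run_experiment(failure_rate: float, retries: int,
                   trials: int = 10_000, seed: int = 42) -> float:
    """Bounded experiment: fixed trial count, fixed injected failure rate.

    Returns the observed fraction of successful requests.
    """
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    successes = sum(request_succeeds(failure_rate, retries, rng) for _ in range(trials))
    return successes / trials


# Hypothetical steady-state hypothesis: at least 95% of requests succeed.
TOLERANCE = 0.95

for retries in (0, 1, 2):
    ratio = run_experiment(failure_rate=0.2, retries=retries)
    verdict = "holds" if ratio >= TOLERANCE else "deviates"
    print(f"retries={retries}: success ratio {ratio:.3f}, steady state {verdict}")
```

The point is not the toy model but the shape of the exercise: the inputs are chosen deliberately, the blast radius is limited, and the result is an observed probability, not a proof.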
I believe that the goal of modern software architecture is to manage the amount of dynamism in a given system. That's a big reason for using things like 'Bounded Context' from DDD, decoupled service-oriented architectures and *-as-code tools like Terraform, Snyk and Reliably: they help us manage and understand the occurrence of complex events that might impact our system. By reducing the opportunity for an unknown, unplanned input, we reduce the opportunity for an unknown, unplanned output from our system.
These ideas have informed the development and roadmaps of both Reliably and the Chaos Toolkit.
Chaos Toolkit enables developer-first discovery of your system's weaknesses through the exploration and testing of your systems as code.
Reliably provides developer-first cloud native application reliability for teams. It enables developers to:
- Define service level objectives as code with your team and all stakeholders.
- Observe service level objective trends over time and surface detected reliability weaknesses as you code, with continuous gates and guardrails for both, right in your own CI/CD platform.
- Alert teams when your reliability is trending in the wrong direction.
- Explore the impact of chaotic conditions on your reliability through chaos engineering experiments.
- Fix reliability problems using the best advice for your infrastructure.
- Verify continuously whether your reliability fixes are having the right effects on your service level objectives.
- Learn, collaborate and share how reliability is managed amongst your team and across your entire organisation.
If you are interested in the concepts and best practices for proactive reliability and want to learn more, you may find the following resources useful:
- Get started with Chaos Toolkit here and join the Chaos Toolkit community on Slack here.
- Get started with Reliably here. The easiest and quickest way of getting started with Reliably is to run the Reliably CLI on your machine with your local source code files. You can find a link to the install guide here.
- Join the Reliability meetup group to learn and share skills and experience with fellow engineers who are introducing proactive reliability in their organisations. The meetup was paused for a little while due to Covid, but will resume in person and online from September 2021.
Are you introducing chaos engineering or proactive reliability practices in your organisation? I'd love to hear more about your experience! Ping me on Slack or share your thoughts below!