Chaos engineering is the practice of stressing an application in a test or production environment by creating disruptive events, such as a server outage, API throttling, or added latency, and then observing how the system responds.
We then implement improvements, and we do this to prove or disprove our assumptions about the system's ability to handle these disruptive events.
Rather than letting these events happen at 3 a.m., on a weekend, or somewhere in the cloud, we create them in a controlled environment during working hours, when all our teams and engineers are ready to tackle the issue.
- Chaos engineering improves the performance of applications
- Uncovers hidden issues
- Exposes monitoring, observability, and alarm blind spots
- Improves recovery time, operational scale, and much more
Now I will mention a few numbers from a survey that Gremlin ran this year on 2020 data.
- 34% of organizations are using chaos engineering officially and actively as a practice
- For 60% of teams, application availability has improved to between 99.5% and 99.99%
- Around 23% of teams/orgs have an MTTR (mean time to recovery from a system outage or fatal failure) of less than one hour
- 60% of them have run at least one chaos engineering attack
- For 61.4% of orgs, high-severity incidents have come down to 1–10 incidents per month
- There has been increasing interest in chaos engineering practices across the world, and the Google search numbers speak for themselves
- If I filter for orgs that have more than 10K employees, 70% of them run daily, weekly, or monthly attacks
Chaos engineering still faces some challenges in terms of industry adoption:
- 20% of orgs are still not aware of it
- Another 20% have other priorities over this
- A further 20% have no experience with it, and so on
- In 2020, Chaos Engineering went mainstream and made headlines in Politico (hacking) and Bloomberg (Pentagon security issues).
- Gremlin hosted the largest Chaos Engineering event ever, with over 3,500 registrants.
- GitHub has over 200 chaos-engineering-related projects with 16K+ stars.
- And most recently, AWS released their own public Chaos Engineering offering, AWS Fault Injection Simulator.
- 60% of the industry has run a Chaos Engineering attack.
- Along with large organizations, smaller teams are also adopting it
- DevOps, SRE, infrastructure, operations, and even development teams are embracing the practice
- Many orgs are moving experimentation to production: 459,548 attacks were run using the Gremlin platform (2020 data)
- Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
- If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
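The steps above can be sketched as a toy experiment in Python. Everything here is illustrative, not part of any real chaos toolkit: `healthy_service` stands in for a real system, the steady state is median response time, the injected 250 ms latency is the experimental variable, and the 50 ms tolerance is an assumed threshold.

```python
import random
import statistics

def healthy_service():
    # Stand-in for a real system: responses of roughly 100 ms under normal load.
    return random.gauss(100, 5)

def measure_steady_state(service, samples=100):
    # Steady state: median response time (ms), our measurable output.
    return statistics.median(service() for _ in range(samples))

def inject_latency(service, extra_ms=250):
    # Experimental variable: a real-world event such as a degraded network link.
    def degraded():
        return service() + extra_ms
    return degraded

def run_experiment(control, experimental, tolerance_ms=50):
    # Hypothesis: steady state holds in both the control and experimental groups.
    # We try to disprove it by comparing the two measurements.
    baseline = measure_steady_state(control)
    under_fault = measure_steady_state(experimental)
    hypothesis_holds = abs(under_fault - baseline) <= tolerance_ms
    return baseline, under_fault, hypothesis_holds

baseline, under_fault, holds = run_experiment(
    healthy_service, inject_latency(healthy_service)
)
# The injected latency far exceeds the tolerance, so the hypothesis is
# disproved: the experiment has uncovered a weakness to fix before it
# manifests in the system at large.
```

In a real experiment the fault would be injected into live infrastructure and the steady state read from monitoring, but the control-versus-experiment comparison works the same way.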
AWS Fault Injection Simulator is a fully managed service that allows you to execute fault injection experiments on AWS, making it simpler to improve an application's performance, observability, and resilience. Fault injection experiments are employed in chaos engineering, which is the technique of stressing an application in testing or production settings by introducing disruptive events, such as a sudden spike in CPU or memory usage, monitoring how the system responds, and making changes. Fault injection experiments help teams create the real-world conditions required to reveal hidden defects, monitoring blind spots, and performance bottlenecks that are difficult to detect in distributed systems.
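To make this concrete, here is a minimal sketch of an FIS experiment template, written as the Python dict you would hand to the service. The tag name, alarm ARN, and role ARN are illustrative placeholders, and the overall shape is an assumption based on FIS's template structure (targets, actions, stop conditions), not a drop-in configuration:

```python
# Sketch of an FIS experiment template: stop one tagged EC2 instance
# while a CloudWatch alarm guards the steady state.
# Tag, alarm ARN, and role ARN below are hypothetical placeholders.
template = {
    "description": "Stop one tagged EC2 instance and check the app stays healthy",
    "targets": {
        "appInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # only opted-in instances
            "selectionMode": "COUNT(1)",              # pick one of them
        }
    },
    "actions": {
        "stopOneInstance": {
            "actionId": "aws:ec2:stop-instances",     # built-in FIS action
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "appInstances"},
        }
    },
    # Guardrail: abort the experiment if the steady-state alarm fires.
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:steady-state-breached",
        }
    ],
    "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
}
```

With boto3, a template like this would be registered through the `fis` client's `create_experiment_template` call and then run with `start_experiment`; check the FIS documentation for the exact required fields before using it.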
Watch out for the next blog post for more details on AWS FIS.