Chaos engineering is the practice of stressing an application in a test or production environment by creating disruptive events, such as a server outage, API throttling, or added latency, and then observing how the system responds.
We then implement improvements, and we do this to prove or disprove our assumptions about the system's ability to handle these disruptive events.
Rather than letting these events happen at 3 a.m., on a weekend, or somewhere in the cloud, we create them in a controlled environment during working hours, when all our teams and engineers are ready to tackle the issue.
- Chaos engineering improves the performance of applications
- Uncovers hidden issues
- Exposes monitoring, observability, and alarm blind spots
- Improves recovery time, operational scale, and much more
Now I will mention a few numbers from a survey that Gremlin ran this year on 2020 data.
- 34% of organizations are using chaos engineering officially and actively as a practice
- For 60% of teams, application availability has improved to between 99.5% and 99.99%
- Around 23% of teams/orgs have an MTTR (mean time to recovery from a system outage or fatal failure) of less than one hour
- 60% of them have run at least one chaos engineering attack
- For 61.4% of orgs, high-severity incidents have come down to 1–10 incidents per month
- There has been increasing interest in chaos engineering practices across the world, and the Google search numbers speak for themselves
- If I filter for orgs that have more than 10K employees, 70% of them run daily, weekly, or monthly attacks
Chaos engineering still faces some challenges in terms of industry adoption:
- 20% of orgs are still not aware of it
- Another 20% have other priorities over this
- A further 20% have no experience with it, and so on
- In 2020, Chaos Engineering went mainstream and made headlines in Politico (hacking) and Bloomberg (Pentagon security issues).
- Gremlin hosted the largest Chaos Engineering event ever, with over 3,500 registrants.
- GitHub has over 200 chaos-engineering-related projects with 16K+ stars.
- And most recently, AWS released their own public Chaos Engineering offering, AWS Fault Injection Simulator.
- 60% of the industry has run a Chaos Engineering attack.
- Along with large organizations, smaller teams are also adopting it
- DevOps, SRE, infrastructure, operations, and even development teams are embracing the practice
- Many orgs are moving experimentation to production: 459,548 attacks were run using the Gremlin platform (2020 data)
- Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
- If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
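The steps above can be sketched as a toy experiment in Python. Everything here is illustrative, not part of any real chaos toolkit: `healthy_service` stands in for a real system, the steady state is median response time, the injected 250 ms latency is the experimental variable, and the 50 ms tolerance is an assumed threshold.

```python
import random
import statistics

def healthy_service():
    # Stand-in for a real system: responses of roughly 100 ms under normal load.
    return random.gauss(100, 5)

def measure_steady_state(service, samples=100):
    # Steady state: median response time (ms), our measurable output.
    return statistics.median(service() for _ in range(samples))

def inject_latency(service, extra_ms=250):
    # Experimental variable: a real-world event such as a degraded network link.
    def degraded():
        return service() + extra_ms
    return degraded

def run_experiment(control, experimental, tolerance_ms=50):
    # Hypothesis: steady state holds in both the control and experimental groups.
    # We try to disprove it by comparing the two measurements.
    baseline = measure_steady_state(control)
    under_fault = measure_steady_state(experimental)
    hypothesis_holds = abs(under_fault - baseline) <= tolerance_ms
    return baseline, under_fault, hypothesis_holds

baseline, under_fault, holds = run_experiment(
    healthy_service, inject_latency(healthy_service)
)
# The injected latency far exceeds the tolerance, so the hypothesis is
# disproved: the experiment has uncovered a weakness to fix before it
# manifests in the system at large.
```

In a real experiment the fault would be injected into live infrastructure and the steady state read from monitoring, but the control-versus-experiment comparison works the same way.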
AWS Fault Injection Simulator is a fully managed service that allows you to execute fault injection experiments on AWS, making it simpler to improve an application's performance, observability, and resilience. Fault injection experiments are employed in chaos engineering, which is the technique of stressing an application in testing or production settings by introducing disruptive events, such as a sudden spike in CPU or memory usage, monitoring how the system responds, and making changes. Fault injection experiments help teams create the real-world conditions required to reveal hidden defects, monitoring blind spots, and performance bottlenecks that are difficult to detect in distributed systems.
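To make this concrete, here is a minimal sketch of an FIS experiment template, written as the Python dict you would hand to the service. The tag name, alarm ARN, and role ARN are illustrative placeholders, and the overall shape is an assumption based on FIS's template structure (targets, actions, stop conditions), not a drop-in configuration:

```python
# Sketch of an FIS experiment template: stop one tagged EC2 instance
# while a CloudWatch alarm guards the steady state.
# Tag, alarm ARN, and role ARN below are hypothetical placeholders.
template = {
    "description": "Stop one tagged EC2 instance and check the app stays healthy",
    "targets": {
        "appInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # only opted-in instances
            "selectionMode": "COUNT(1)",              # pick one of them
        }
    },
    "actions": {
        "stopOneInstance": {
            "actionId": "aws:ec2:stop-instances",     # built-in FIS action
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "appInstances"},
        }
    },
    # Guardrail: abort the experiment if the steady-state alarm fires.
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:steady-state-breached",
        }
    ],
    "roleArn": "arn:aws:iam::111122223333:role/fis-experiment-role",
}
```

With boto3, a template like this would be registered through the `fis` client's `create_experiment_template` call and then run with `start_experiment`; check the FIS documentation for the exact required fields before using it.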
Watch out for the next blog post for more details on AWS FIS.