In the past, software systems ran in highly controlled on-premises environments, managed by an army of sysadmins. Today, migration to the cloud is relentless; the stage has completely shifted.
Systems are no longer monolithic and localized; they depend on many globally distributed, loosely coupled systems working in unison, often in the form of ephemeral microservices.
It is no surprise that Site Reliability Engineers have risen to prominence in the last decade. Modern IT infrastructure requires robust systems thinking and reliability engineering to keep the show on the road. Downtime is not an option.
A 2020 ITIC Cost of Downtime survey indicated that 98% of organizations said a single hour of downtime costs more than $150,000, 88% reported that 60 minutes of downtime costs their business more than $300,000, and 40% of enterprises reported that one hour of downtime costs their organizations $1 million to more than $5 million.
To increase the resiliency of these systems, the discipline of chaos engineering emerged. Stress testing a system with chaotic experiments, by randomly engineering failures, reveals its Achilles' heel. Simulating adverse conditions allows engineers to integrate safeguards, circuit breakers, and incident response mechanisms. This post will dive into this chaotic art form.
As elucidated by the Chaos Community:
"Chaos Engineering is the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production."
Most engineers' first exposure to this discipline most likely came from Chaos Monkey, a tool invented by Netflix that randomly terminates instances in production to ensure that engineers implement their services to be highly available and resilient to pseudo-random termination of instances and services within the Netflix architecture. At the time, Netflix had recently migrated to Amazon Web Services and needed a framework to prove its infrastructure could survive a fallout and automatically self-heal.
Netflix added more techniques to this framework, such as "Failure Injection Testing" (FIT), which causes requests between Netflix services to fail and verifies that the system degrades gracefully. And of course, tools within The Simian Army such as Chaos Kong, which simulates the failure of an entire Amazon EC2 (Elastic Compute Cloud) region.
This amalgamation of tools evolved into the discipline that we now know as chaos engineering.
Designing any experiment requires four things: a hypothesis, independent variables, dependent variables, and of course, context. These principles provide a guidepost for designing chaos engineering experiments:
- Construct a hypothesis around steady-state behavior.
- Trigger real-world behavior, utilizing both a control and an experimental group.
- Run experiments in production by injecting failures into the experimental group.
- Automate experiments to run continuously, attempting to disprove the hypothesis that your system is resilient.
A viable hypothesis may be in the following form:
"the deviation from steady state will not exceed 8% while injecting X, Y, and Z into the system."
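A hypothesis of that form can be evaluated programmatically. The sketch below is a minimal illustration, assuming a hypothetical latency probe (`measure_p99_latency_ms` stands in for a query to your real monitoring system); it compares a control group against an experimental group with a fault injected and tries to disprove the resilience hypothesis:

```python
import random
import statistics

def measure_p99_latency_ms(group):
    """Hypothetical probe: in a real experiment this would query your
    monitoring system for the group's p99 latency."""
    base = 120.0
    # Simulate the experimental group degrading under the injected fault.
    penalty = 15.0 if group == "experiment" else 0.0
    return base + penalty + random.uniform(-5, 5)

def run_experiment(samples=30, tolerance_pct=8.0):
    """Compare control vs. experimental group and check whether the
    deviation from steady state stays within the tolerance range."""
    control = [measure_p99_latency_ms("control") for _ in range(samples)]
    experiment = [measure_p99_latency_ms("experiment") for _ in range(samples)]
    baseline = statistics.mean(control)
    deviation_pct = (statistics.mean(experiment) - baseline) / baseline * 100
    # The hypothesis holds only if the deviation is inside the tolerance.
    return deviation_pct <= tolerance_pct

print("hypothesis holds:", run_experiment())
```

Here the simulated fault adds roughly a 12% latency penalty, so the 8% hypothesis is disproved; widening the tolerance or hardening the (real) system would flip the outcome.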
Robust experiments should trigger the loss of availability of several components within the system. Experiments need to mimic real-world events, avoiding the happy path. Tests should utilize all possible inputs while also recreating scenarios from historical system outages.
Types of testing include:
- Hardware failure (or virtual equivalent)
- Changes to network latency/failure (inject latency into requests between services)
- Resource starvation/overload
- Dependency failures (e.g., database)
- Retry storms (e.g., thundering herd)
- Functional bugs (exceptions)
- Race conditions (threading and concurrency)
- Regional outages (render an entire Amazon region unavailable)
- Service failures (fail requests between services or fail an internal service)
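As a minimal illustration of the latency-injection case, one could wrap a service call so that a fraction of requests are artificially delayed. All names here (`fetch_user`, the delay parameters) are hypothetical, for illustration only:

```python
import random
import time

def inject_latency(call, delay_ms=200, probability=0.1):
    """Wrap a service call so a fraction of requests are delayed,
    simulating degraded network conditions between services."""
    def wrapped(*args, **kwargs):
        if random.random() < probability:
            time.sleep(delay_ms / 1000.0)  # injected fault
        return call(*args, **kwargs)
    return wrapped

# Hypothetical downstream call used for illustration.
def fetch_user(user_id):
    return {"id": user_id, "name": "example"}

chaotic_fetch_user = inject_latency(fetch_user, delay_ms=200, probability=0.1)
print(chaotic_fetch_user(42))
```

The same wrapper pattern extends to the other fault types: raise an exception instead of sleeping for functional bugs, or return an error response for dependency failures.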
The steady state defines your environment's status before, after, and potentially during a chaos experiment's execution. Any deviation from it is a candidate for further investigation and, potentially, a place for improvements to be applied.
If your system does not return to its expected steady state after running an experiment, a red flag needs to be raised. A robust system will self-heal and recalibrate back to equilibrium. One can quantify deviation from equilibrium by defining a tolerance range.
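That red-flag check can be sketched in a few lines. This assumes you sample the same metric before the experiment and again after it ends; the tolerance value and metric readings below are placeholders:

```python
def within_tolerance(baseline, current, tolerance_pct=8.0):
    """Return True if the current reading sits inside the tolerance
    range around the pre-experiment baseline."""
    deviation_pct = abs(current - baseline) / baseline * 100
    return deviation_pct <= tolerance_pct

def check_recovery(baseline, post_experiment, tolerance_pct=8.0):
    """Raise a red flag if the system failed to return to steady state."""
    if not within_tolerance(baseline, post_experiment, tolerance_pct):
        return "RED FLAG: system did not return to steady state"
    return "OK: system self-healed back to equilibrium"

print(check_recovery(baseline=120.0, post_experiment=124.0))
print(check_recovery(baseline=120.0, post_experiment=150.0))
```

In practice the "red flag" branch would page an on-call engineer and halt any further automated experiments.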
An experiment requires manual testing at conception but should be added to an automation framework after that. Netflix runs Chaos Monkey continuously during weekdays but runs Chaos Kong exercises only once a month. Every organization requires its own nuanced approach.
It is essential to minimize the blast radius while designing chaos experiments, ideally one small failure at a time. Measure experiments carefully, ensuring they are low-risk: involve few users, limit user flows, limit the number of live devices, etc. As one begins, it is wise to inject failures that verify functionality for a subset or small group of clients and devices.
As these low-risk experiments succeed, you can then proceed to run small-scale diffuse experiments that will impact a small percentage of traffic, which is distributed evenly throughout production servers.
A small-scale diffuse experiment's main advantage is that it does not cross thresholds that could open circuits. This allows one to verify single-request fallbacks and timeouts while demonstrating the system's resilience to transient errors. It verifies the logical correctness of fallbacks, but not the characteristics of the system during large-scale fallout.
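Such a diffuse experiment might look like the sketch below: fail a small, evenly distributed fraction of requests and serve the fallback instead, keeping the error rate well under any circuit-breaker threshold. The service call and fallback names are hypothetical:

```python
import random

def diffuse_failure(call, fallback, failure_rate=0.01):
    """Fail a small, evenly distributed fraction of requests and serve
    the fallback instead, verifying the fallback's logical correctness
    without pushing the error rate past circuit-breaker thresholds."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            return fallback(*args, **kwargs)  # exercise the fallback path
        return call(*args, **kwargs)
    return wrapped

# Hypothetical service call and its fallback, for illustration only.
def get_recommendations(user_id):
    return ["personalized-1", "personalized-2"]

def default_recommendations(user_id):
    return ["popular-1", "popular-2"]

guarded = diffuse_failure(get_recommendations, default_recommendations,
                          failure_rate=0.01)
results = [guarded(7) for _ in range(1000)]
fallback_share = results.count(["popular-1", "popular-2"]) / len(results)
print(f"fallback served on {fallback_share:.1%} of requests")
```

With a 1% failure rate spread evenly across traffic, users on the fallback path validate its correctness while the overall system stays comfortably inside its steady state.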
The following is a list of tools to get you started:
Chaos Monkey: The OG of chaos engineering. The tool is still maintained and is currently integrated into Spinnaker, a continuous delivery platform initially developed by Netflix to release software changes rapidly and reliably.
Mangle: Enables one to run chaos engineering experiments against applications and infrastructure components and quickly assess resiliency and fault tolerance. Designed to introduce faults with minimal pre-configuration, it supports a wide range of tooling, including Kubernetes (K8S), Docker, vCenter, or any remote machine with SSH enabled.
Gremlin: Founded by former Netflix and Amazon engineers, who productized Chaos as a Service (CaaS). Gremlin is a paid service that provides a command-line interface, an agent, and an intuitive web interface that let you set up chaos experiments in no time. Don't worry: there's a big red HALT button that makes it simple for Gremlin users to roll back experiments if an attack negatively impacts the customer experience.
Chaos Toolkit: An open-source project that tries to make chaos experiments easier by defining an open API and a standard JSON format for describing experiments. There are many drivers for executing experiments against AWS, Azure, Kubernetes, PCF, and Google Cloud. It also includes integrations for monitoring and chat systems, such as Prometheus and Slack.
There are numerous reasons to invest in chaos engineering.
To start, it forces organizations to implement business continuity planning (BCP) and disaster recovery frameworks. Implementing these frameworks gives organizations a strategic advantage over their competitors because it demonstrates the organization's awareness of its operational vulnerabilities and a proactive approach to addressing them. This imparts trust to stakeholders and customers.
Additionally, organizations operating in critical infrastructure industries within the EU will have to abide by the requirements of the EU's Directive on the security of network and information systems, meaning they will be legally obliged to implement incident response capabilities.
As chaos engineering is an experimentation approach, it gives us a holistic view of the system's behavior and how all the moving parts interact in a given set of circumstances, allowing us to derive insights into the system's technical and soft aspects (aka, the human factor).
Chaos engineering will enable organizations to find security vulnerabilities that are otherwise challenging to detect by traditional methods due to distributed systems' complex nature.
This may include losses caused by human factors, poor design, or lack of resiliency. For example, conventional approaches may consist of red and purple team exercises that focus on an adversarial process, allowing organizations to test how security systems and teams respond to active threats.
This post originally appeared on the Xplenty blog here. Check it out on our website and subscribe to our newsletter if you want to hear more updates.