In a complex setup where hundreds of microservices talk to and depend on each other, things can always go wrong, especially when multiple teams are making changes. Sometimes multiple teams even make changes to the same microservice! This is a sure-fire recipe for issues that crop up anywhere from integration onwards, including weird production issues that may have no functional basis at all. Some issues stem purely from improper error handling, poor dependency management, or infrastructure glitches that cascade into wider failures. This is the problem that Resiliency Testing tries to solve. These tests need not always be automated, but for a mature product, automating them is likely to be beneficial.
Core Aims of Resiliency Tests:
- Identify resiliency issues with the application under test
- Help the team identify markers of issues that may otherwise take multiple hours to decode in a production environment
- Make the system more observable by introducing errors that expose the need for more logging and tracing
The two main areas for inducing uncertainty in a system are:
Infrastructure: Randomly shutting down instances and other parts of the infrastructure
Application: Introducing failures at runtime at a component level (e.g. endpoint/request level)
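As a rough illustration of application-level injection at the request level, here is a minimal Python sketch. The names `inject_failure`, `InjectedFailure`, and `get_price` are hypothetical and exist only for this example; a real setup would more likely rely on a dedicated fault-injection tool or a service mesh than on a hand-rolled decorator.

```python
import functools
import random


class InjectedFailure(Exception):
    """Raised in place of a real response when a fault is injected."""


def inject_failure(failure_rate, rng=None):
    """Wrap a request handler so that a fraction of calls fail on purpose.

    failure_rate: probability in [0.0, 1.0] that a call raises InjectedFailure.
    rng: optional random.Random, passed in so that runs can be made repeatable.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")
    rng = rng or random.Random()

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            # Decide per call whether this request is hit by the fault.
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {handler.__name__}")
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical endpoint handler, used only for illustration.
@inject_failure(failure_rate=0.2, rng=random.Random(42))
def get_price(item_id):
    return {"item_id": item_id, "price": 9.99}
```

Callers of `get_price` now see an `InjectedFailure` on roughly one call in five, which is a quick way to check whether the calling code handles, logs, and traces the error properly.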
You can then introduce uncertainty either randomly or intentionally, via experiments.
Random:
- More suitable for ‘disposable’ infrastructure (e.g. EC2 instances)
- Tests redundant infrastructure for impact on end-users
- Used when impact is well-understood
Experiments:
- Accurately measure impact
- Control over experimental parameters
- Suitable for complex failures (e.g. latency) when impact is not well understood
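To make the experiment approach more concrete, here is a minimal Python sketch that sends a controlled fraction of requests through a faulty path and compares error rates between the two groups. Everything in it (`GroupResult`, `run_experiment`, `checkout`, `checkout_with_fault`) is a hypothetical name made up for illustration, and the "requests" are plain function calls rather than real traffic.

```python
import random
from dataclasses import dataclass


@dataclass
class GroupResult:
    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def run_experiment(call, call_with_fault, total_requests, fault_fraction, rng):
    """Route a controlled fraction of requests through the faulty path.

    Returns (control, experiment) so that the impact can be compared directly.
    """
    control, experiment = GroupResult(), GroupResult()
    for request_id in range(total_requests):
        # fault_fraction is the experimental parameter under our control.
        in_experiment = rng.random() < fault_fraction
        group = experiment if in_experiment else control
        target = call_with_fault if in_experiment else call
        group.requests += 1
        try:
            target(request_id)
        except Exception:
            group.errors += 1
    return control, experiment


# Hypothetical service calls, used only for illustration.
def checkout(order_id):
    return "ok"


def checkout_with_fault(order_id):
    raise RuntimeError("dependency unavailable")  # the injected failure


control, experiment = run_experiment(
    checkout, checkout_with_fault,
    total_requests=1000, fault_fraction=0.05, rng=random.Random(7),
)
# Difference in error rate between the two groups is the measured impact.
impact = experiment.error_rate - control.error_rate
```

Keeping a control group alongside the experiment group is what makes the measurement meaningful: any difference in error rate can be attributed to the injected failure rather than to background noise.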
Finally, you can categorize failure modes as follows:
Resource: CPU, memory, IO, disk
Network: Blackhole, latency, packet loss, DNS
State: Shutdown, time, process killer
Many of these modes can be applied or simulated at either the infrastructure or the application level.
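For example, the latency mode can be approximated at the application level without touching the network. The sketch below is a minimal Python illustration: `with_latency` and `call_with_timeout` are hypothetical helpers, and the timeout-plus-fallback behaviour stands in for whatever the calling service actually does when a dependency is slow.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Return a version of func that sleeps before responding."""
    def slow(*args, **kwargs):
        time.sleep(delay_seconds)  # simulated network or dependency latency
        return func(*args, **kwargs)
    return slow


def call_with_timeout(func, timeout_seconds, fallback):
    """Call func in a worker thread; return fallback if it is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        pool.shutdown(wait=False)


# Hypothetical dependency call, used only for illustration.
def fetch_recommendations():
    return ["live-result"]


slow_fetch = with_latency(fetch_recommendations, delay_seconds=0.3)
result = call_with_timeout(slow_fetch, timeout_seconds=0.05, fallback=["cached-result"])
```

With the injected delay longer than the timeout, `result` holds the fallback value, which shows whether the caller degrades gracefully instead of hanging on the slow dependency.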
After running an experiment, there are typically two potential outcomes. Either the system is found to be resilient to the introduced failure, or a problem is identified that may need to be fixed. Both of these are good outcomes. In the first case, confidence in the system and its behaviour is enhanced. In the second, a problem has been found before it causes an outage.
By proactively testing and validating our system’s failure modes, we can reduce operational burden, increase resiliency, and eventually reduce the palpitations that come with production issues.