Simian Army - Netflix - Disaster Recovery - AWS

Simian Army is a set of tools developed by Netflix for simulating failure scenarios in cloud infrastructure. It includes several different tools, such as Chaos Monkey and Latency Monkey, which are designed to randomly terminate instances or introduce latency in order to test the resilience of a system. The goal of Simian Army is to help engineers build systems that can withstand failures and remain available even in the face of unexpected events.
It targets AWS first, and the code is designed to be extensible to other cloud providers.

The project is open source, and the code in its Git repository is written in Java.

Simian Army is not an AWS product. It is a set of open-source tools that helps you identify availability risks and improve the resilience of your AWS infrastructure. The tools simulate failures that can occur in a production environment, such as instance termination and network latency, so that you can find and fix weaknesses before they cause a real outage. Simian Army includes the following tools, described here from Netflix's point of view:

Chaos Monkey randomly terminates instances in an Auto Scaling group to test the system's ability to tolerate instance failures.
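To make the idea concrete, here is a minimal sketch of the selection step in Python. It is not the Simian Army implementation (which is written in Java): `AutoScalingGroup`, `pick_victim`, and `unleash` are hypothetical names, the group is an in-memory stand-in, and no cloud API is called.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AutoScalingGroup:
    """Hypothetical in-memory stand-in for an Auto Scaling group."""
    name: str
    instance_ids: list = field(default_factory=list)


def pick_victim(group, probability, rng):
    """Return one instance id to terminate, or None if this run is skipped."""
    if not group.instance_ids or rng.random() >= probability:
        return None
    return rng.choice(group.instance_ids)


def unleash(group, probability=1.0, rng=None):
    """Remove the chosen instance from the group and return its id (or None)."""
    rng = rng or random.Random()
    victim = pick_victim(group, probability, rng)
    if victim is not None:
        # A real tool would call the cloud provider's terminate API here.
        group.instance_ids.remove(victim)
    return victim


if __name__ == "__main__":
    web = AutoScalingGroup("web", ["i-1", "i-2", "i-3"])
    print("terminated:", unleash(web, probability=0.5))
    print("remaining:", web.instance_ids)
```

The `probability` parameter is an assumption of this sketch. It lets a run terminate nothing, so failures stay unpredictable without being constant.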

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
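The latency-injection idea can be sketched in a few lines of Python. This is an illustration under assumptions, not how Latency Monkey is built: `with_injected_latency` and `call_within_budget` are hypothetical helpers, and the budget check happens after the call returns instead of interrupting it.

```python
import random
import time


def with_injected_latency(call, delay_seconds, probability=1.0,
                          rng=None, sleep=time.sleep):
    """Wrap `call` so that a fraction of invocations is delayed before running."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < probability:
            sleep(delay_seconds)  # the artificial delay
        return call(*args, **kwargs)

    return wrapped


def call_within_budget(call, budget_seconds, fallback, clock=time.monotonic):
    """Run `call`; return `fallback` if it took longer than the budget.

    Simplification: the elapsed time is checked after the call returns.
    """
    started = clock()
    result = call()
    elapsed = clock() - started
    return result if elapsed <= budget_seconds else fallback


if __name__ == "__main__":
    def dependency():
        return "fresh data"

    degraded = with_injected_latency(dependency, delay_seconds=0.2)
    print(call_within_budget(degraded, budget_seconds=0.05, fallback="cached data"))
```

Because the delay wraps only the caller's view of the dependency, the dependency itself stays available to everything else, which is the property the paragraph above highlights.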

Conformity Monkey finds instances that don’t adhere to best practices and shuts them down. For example, an instance that doesn’t belong to an Auto Scaling group is trouble waiting to happen. We shut such instances down to give the service owner the opportunity to re-launch them properly.
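The Auto Scaling group rule from that example could be expressed as a simple filter. The sketch below assumes a hypothetical `Instance` record and covers only this one rule. A real conformity check would read the fleet from the cloud provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instance:
    """Hypothetical record describing one running instance."""
    instance_id: str
    auto_scaling_group: Optional[str]  # None: launched outside any group
    owner_email: str


def find_nonconforming(instances):
    """Return the instances that do not belong to an Auto Scaling group."""
    return [i for i in instances if i.auto_scaling_group is None]


if __name__ == "__main__":
    fleet = [
        Instance("i-1", "web", "web-team@example.com"),
        Instance("i-2", None, "batch-team@example.com"),
    ]
    for instance in find_nonconforming(fleet):
        print(f"{instance.instance_id}: notify {instance.owner_email}, then shut down")
```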

Doctor Monkey taps into the health checks that run on each instance and also monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once detected, unhealthy instances are removed from service. After the service owners have had time to root-cause the problem, the instances are eventually terminated.
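That lifecycle (detect, remove from service, wait, terminate) could look roughly like the following sketch. The CPU threshold, the grace period, and the names `InstanceHealth` and `decide` are assumptions made for illustration. The source does not state the actual values or logic.

```python
from dataclasses import dataclass
from typing import Optional

CPU_LOAD_LIMIT = 0.95        # assumed threshold for "unhealthy" CPU load
GRACE_PERIOD_SECONDS = 3600  # assumed time owners get to investigate


@dataclass
class InstanceHealth:
    """Hypothetical health snapshot for one instance."""
    instance_id: str
    health_check_ok: bool
    cpu_load: float
    unhealthy_since: Optional[float] = None  # when it was first seen unhealthy


def decide(state, now):
    """Return 'keep', 'remove_from_service', or 'terminate' for one instance."""
    healthy = state.health_check_ok and state.cpu_load < CPU_LOAD_LIMIT
    if healthy:
        state.unhealthy_since = None
        return "keep"
    if state.unhealthy_since is None:
        state.unhealthy_since = now  # start the grace period
        return "remove_from_service"
    if now - state.unhealthy_since >= GRACE_PERIOD_SECONDS:
        return "terminate"
    return "remove_from_service"


if __name__ == "__main__":
    sick = InstanceHealth("i-1", health_check_ok=False, cpu_load=0.20)
    for t in (0, 1800, 3600):
        print(t, decide(sick, now=t))
```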

Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.

Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal.

10–18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration and runtime problems in instances serving customers in multiple geographic regions, using different languages and character sets.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon Availability Zone. It verifies that our services automatically rebalance to the functional Availability Zones without user-visible impact or manual intervention.
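A toy model of the rebalancing check might look like this. It only redistributes instance counts in memory. The placement policy (least-loaded surviving zone first) and the function name `simulate_zone_outage` are assumptions of this sketch, not a description of how Chaos Gorilla or AWS places instances.

```python
def simulate_zone_outage(instances_by_zone, failed_zone):
    """Return a new placement with the failed zone's instances moved elsewhere.

    `instances_by_zone` maps a zone name to its instance count.
    """
    survivors = {z: n for z, n in instances_by_zone.items() if z != failed_zone}
    if not survivors:
        raise ValueError("no surviving zone to rebalance into")
    displaced = instances_by_zone.get(failed_zone, 0)
    # Assumed policy: put each displaced instance in the least-loaded
    # surviving zone, breaking ties by zone name.
    for _ in range(displaced):
        target = min(survivors, key=lambda z: (survivors[z], z))
        survivors[target] += 1
    return survivors


if __name__ == "__main__":
    before = {"us-east-1a": 4, "us-east-1b": 4, "us-east-1c": 4}
    after = simulate_zone_outage(before, "us-east-1a")
    print(after)
    # Total capacity should survive the loss of one zone.
    assert sum(after.values()) == sum(before.values())
```

A check like the final assertion is the in-memory analogue of verifying that capacity is preserved after a zone disappears. It says nothing about user-visible impact, which needs real traffic to measure.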

By running these tools, you can gain insight into the resilience and availability of your infrastructure and take steps to improve it.
