
Jason Butz for AWS Community Builders

Posted on • Originally published at jasonbutz.info

Chaos Engineering with AWS FIS and Lambda

AWS's Fault Injection Service (FIS) recently added support for AWS Lambda. Maybe it's the other way around, but either way, they now work together. I'd never given FIS much focus; most organizations I work with aren't ready for or interested in chaos engineering. However, the more I looked into FIS, the more I realized I had misjudged what it was capable of and where I could use it.

What is Chaos Engineering?

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production."
Principles of Chaos Engineering

I first heard about chaos engineering through Netflix and their Chaos Monkey tool, which later evolved into the now retired Simian Army and has since been split out again. I've actually worked in an environment where Chaos Monkey was running, and it does influence how you build applications, but once you lay your foundations and patterns, it's not too bad.

AWS primarily identifies FIS as a resiliency-focused tool, but it uses chaos engineering principles at its heart. You're deliberately degrading your system through an experiment to see how it responds to different failures.

Introduction to AWS FIS

AWS FIS enables you to run experiments on certain AWS resources to test how your system responds to different fault conditions. FIS has a limited selection of actions that can be performed against different targets, i.e., AWS resources. At the time of writing, FIS can target 18 different types of resources:

  • Aurora DB clusters
  • RDS DB instances
  • DynamoDB global tables
  • EBS volumes
  • EC2 Auto Scaling groups
  • EC2 instances
  • EC2 Spot Instances
  • ECS clusters
  • ECS tasks
  • EKS clusters
  • EKS node groups
  • EKS Kubernetes pods
  • S3 buckets
  • VPC subnets
  • Lambda functions
  • ElastiCache (Redis OSS) Replication Groups
  • IAM roles
  • transit gateways

In a FIS experiment, you combine actions and targets, generally running them for a specific duration. You can add additional parameters that help focus your experiment, such as stopping only 1% of EC2 instances.

During and after your experiment, you can inspect your logs and other metrics to see how the system performed. FIS offers a feature to generate a report that collects CloudWatch metrics and combines them with experiment details into a PDF report. I've looked at the example report, and I'm not sure it's worth the $5 cost to generate the report.

FIS and Lambda

With AWS's recent release, there are three actions available that can target Lambda functions:

  • Invocation delay (invocation-add-delay)
  • Invocation error (invocation-error)
  • Invocation HTTP integration response (invocation-http-integration-response)

Adding an invocation delay is what it sounds like. It's similar to a Lambda cold start, but the delay added by FIS is after any cold starts. In addition to simulating a cold start, you can create timeout events by configuring the added latency higher than the Lambda function's timeout.

The invocation error action allows you to mark function invocations as failed. That will be helpful for testing error handling and retry mechanisms. Interestingly, you can also decide if you want to allow the Lambda function to execute its handler. That could be useful if you want to test whether certain operations are actually idempotent, i.e., performing the same action multiple times has no effect after the first time the action was performed. It might also be a way to avoid interrupting known idempotent operations and allow them to process events despite the experiment.

The HTTP integration response action is intended to work with Application Load Balancers (ALBs), API Gateways, and VPC Lattice. You can select a content type and HTTP response code that are returned. You cannot set the response body, which I find very disappointing. You can also decide whether to allow the Lambda function to execute its handler with this action. Again, this could test idempotency or limit the disruption caused by the experiment.
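To make the three actions a bit more concrete, here is a rough sketch of the parameters each one might take in an experiment template, written as plain Python data. The parameter names (duration, invocationPercentage, startupDelayMilliseconds, preventExecution, statusCode, contentTypeHeader) reflect my reading of the FIS action reference, and the values are made up for illustration, so verify both against the current documentation before relying on them.

```python
# Parameter maps for the three Lambda actions, keyed by action ID.
# FIS passes action parameters as strings. The names and values here are
# illustrative assumptions; check the FIS action reference before use.
LAMBDA_ACTION_PARAMETERS = {
    "aws:lambda:invocation-add-delay": {
        "duration": "PT5M",                  # how long the action stays active
        "invocationPercentage": "50",        # share of invocations affected
        "startupDelayMilliseconds": "2000",  # latency added before the handler
    },
    "aws:lambda:invocation-error": {
        "duration": "PT5M",
        "invocationPercentage": "50",
        "preventExecution": "false",         # handler still runs, result is marked failed
    },
    "aws:lambda:invocation-http-integration-response": {
        "duration": "PT5M",
        "invocationPercentage": "50",
        "preventExecution": "true",          # skip the handler entirely
        "statusCode": "503",
        "contentTypeHeader": "application/json",
    },
}


def prevents_execution(action_id: str) -> bool:
    """Report whether an action is configured to skip the function handler."""
    params = LAMBDA_ACTION_PARAMETERS[action_id]
    # The delay action has no preventExecution switch in this sketch.
    return params.get("preventExecution") == "true"
```

Leaving preventExecution as "false" on the error action is the setting you'd want for an idempotency test: the handler does its work, the invocation is still reported as failed, and the retry exercises the same event again.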

How does FIS work with Lambda?

The Lambda functions targeted by your FIS experiments must have the FIS Lambda Layer added. This is central to how FIS can perform different actions. You also need an S3 bucket to store the experiment configuration.

When the FIS experiment is initializing, FIS uses the service role you define in the experiment template to write configuration files to the S3 bucket at a well-known path. The FIS Lambda Layer, using your Lambda function's execution role, checks that well-known path in the configured S3 bucket at a configured frequency. Those checks from your Lambda function to the S3 bucket happen regardless of any FIS experiments, so they are increasing the number of S3 API calls and, presumably, the billed duration for your Lambda functions.

Lambda service and FIS service interact with the defined S3 bucket to read and write configurations

The FIS Lambda Layer uses the AWS Lambda Runtime API proxy to intercept function invocations. This is before the invocation reaches the runtime, which makes the FIS layer runtime agnostic. As part of configuring your Lambda functions for FIS, you set the environment variable AWS_LAMBDA_EXEC_WRAPPER to /opt/aws-fis/bootstrap. This is what enables that Lambda Runtime API proxy. FIS uses some of the Lambda runtime environment modification capabilities to provide functionality. You don't need to worry too much about the details unless you are using additional extensions to the Lambda environment, in which case you'll need to set up a proxy chain.

Diagram showing the FIS Lambda extension wrapping the Lambda runtime and proxying calls to the Lambda runtime API

Setting up a Lambda function for FIS

As mentioned earlier, you must have an Amazon S3 bucket to use with the FIS experiment. It must be in the same region as the experiment you plan to run. You must also update your Lambda execution role with a policy that allows access to the S3 bucket at specific prefixes. You will also need an IAM policy on the role associated with your FIS experiment that grants access to the bucket, allows FIS to inspect Lambda functions, and enables FIS to do tag-based lookups. Details on what these IAM policies should look like are in the FIS documentation.
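For a sense of what the Lambda execution role side could look like, here is a minimal sketch that builds a read-only policy scoped to a configuration prefix. The bucket name, the FisConfigs prefix, and the statement IDs are assumptions for illustration, and the exact statements AWS recommends may differ, so treat the FIS documentation as the source of truth.

```python
import json


def fis_config_read_policy(bucket: str, prefix: str = "FisConfigs") -> dict:
    """Build a policy letting a function list and read objects under one prefix.

    The bucket and prefix are hypothetical; this is not the official policy.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListingConfigLocation",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
                # Restrict listing to the configuration prefix only.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "AllowReadingObjectFromConfigLocation",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/{prefix}/*"],
            },
        ],
    }


policy = fis_config_read_policy("my-config-distribution-bucket")
print(json.dumps(policy, indent=2))
```

The FIS experiment role needs its own, separate policy for writing to the bucket and looking up functions, which this sketch does not cover.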

Once you have all of that, you need to make a few minor modifications to the Lambda functions that will be involved in the experiment. First, add the FIS Lambda layer; details and ARNs for the layers in different regions are in the AWS documentation. Then add two environment variables to the functions. The first is AWS_FIS_CONFIGURATION_LOCATION, set to an S3 ARN that points to the FisConfigs prefix in the S3 bucket you set up, for example arn:aws:s3:::my-config-distribution-bucket/FisConfigs/. This lets the FIS layer know where it should look for configuration details. The second is AWS_LAMBDA_EXEC_WRAPPER with /opt/aws-fis/bootstrap as the value. This sets up the Lambda Runtime API proxy mentioned earlier.
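Those function changes can be sketched as a small pure function that takes an existing function configuration and returns a copy with the layer and both environment variables added. The dictionary keys mirror the keyword arguments boto3's update_function_configuration accepts, but nothing here calls AWS. The function name, the existing variable, and the layer ARN are placeholders I made up; use the real layer ARN for your region from the AWS documentation.

```python
import copy

FIS_WRAPPER = "/opt/aws-fis/bootstrap"


def with_fis_settings(config: dict, layer_arn: str, config_location: str) -> dict:
    """Return a copy of a function configuration prepared for FIS experiments."""
    updated = copy.deepcopy(config)  # leave the caller's configuration untouched

    # Add the FIS layer once, keeping any layers the function already uses.
    layers = updated.setdefault("Layers", [])
    if layer_arn not in layers:
        layers.append(layer_arn)

    # Merge the two variables into whatever environment already exists.
    variables = updated.setdefault("Environment", {}).setdefault("Variables", {})
    variables["AWS_FIS_CONFIGURATION_LOCATION"] = config_location
    variables["AWS_LAMBDA_EXEC_WRAPPER"] = FIS_WRAPPER
    return updated


existing = {
    "FunctionName": "queue-consumer",  # hypothetical function
    "Environment": {"Variables": {"LOG_LEVEL": "info"}},
}
prepared = with_fis_settings(
    existing,
    # Placeholder ARN, not a real FIS layer.
    layer_arn="arn:aws:lambda:us-east-1:111122223333:layer:example-fis-layer:1",
    config_location="arn:aws:s3:::my-config-distribution-bucket/FisConfigs/",
)
```

Merging rather than replacing matters because a Lambda configuration update overwrites the whole environment block, so passing only the two new variables would drop the ones the function already had.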

This configuration is only the most basic configuration; depending on your functions, additional considerations may exist. Some of these considerations are:

  • Short experiment action durations
  • Using SnapStart
  • Fast and infrequently invoked functions
  • Functions already using Lambda extensions
  • Functions using container runtimes

The AWS FIS documentation outlines what you need to know.

How about an example?

These details are great, but how about an example showing what you can do with FIS and Lambda?

AWS architecture sketch, showing an API Gateway with an arrow pointing to an SQS queue that has arrows pointing to both a Lambda function and an SQS DLQ. The SQS DLQ is connected to an EventBridge pipe

Let's take the architecture sketched out above. Our Lambda function is invoked with messages from the SQS queue, but if a message fails to process three times, it is sent to the DLQ and on to an EventBridge pipe, which leads to logic not relevant to this example. In this example, we follow AWS best practices and have the visibility timeout on the SQS queue set to six times the Lambda timeout. With a Lambda timeout of 10 seconds, our visibility timeout is 60 seconds.

For this example, our EventBridge pipe and the associated error-handling system are new and need testing. We've already tested our logic by sending messages directly to the DLQ, but now we need to test the entire thing. We've decided to use FIS to simulate the Lambda function taking too long to process messages. We've also decided to run this experiment in our development environment. We're lucky and can break the processing of these messages for a short period of time.

In our FIS experiment template, we'll configure an aws:lambda:invocation-add-delay action with a startup delay of 30,000 milliseconds (30 seconds). Because that is well beyond the function's 10-second timeout, it will ensure our Lambda function invocations time out. Since we can break the processing of these messages, we can set our action to apply to 100% of invocations. We need to do a little math to ensure we keep our experiment running long enough.
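Here is a rough sketch of what that experiment template could look like, built as a Python dictionary. The top-level keys follow the shape of the FIS experiment template as I understand it, but the role ARN, function ARN, target name, and the choice of no stop condition are all assumptions for illustration. The 5-minute duration is a starting guess that the math below revisits.

```python
import json


def build_delay_template(role_arn: str, function_arn: str, duration: str = "PT5M") -> dict:
    """Build an experiment template that delays every invocation by 30 seconds."""
    return {
        "description": "Force Lambda timeouts to exercise the DLQ path",
        "roleArn": role_arn,
        # No automatic stop condition; acceptable only because this is a dev environment.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "queue-consumer": {  # hypothetical target name
                "resourceType": "aws:lambda:function",
                "resourceArns": [function_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "add-delay": {
                "actionId": "aws:lambda:invocation-add-delay",
                "parameters": {
                    "duration": duration,
                    "invocationPercentage": "100",
                    # 30 s of added latency against a 10 s function timeout.
                    "startupDelayMilliseconds": "30000",
                },
                "targets": {"Functions": "queue-consumer"},
            }
        },
    }


template = build_delay_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",  # placeholder
    function_arn="arn:aws:lambda:us-east-1:111122223333:function:queue-consumer",  # placeholder
)
print(json.dumps(template, indent=2))
```

In a shared or production-like environment you would likely swap the "none" stop condition for a CloudWatch alarm so the experiment halts if something unexpected degrades.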

Once our experiment is running, our Lambda function will be invoked (and time out) three times for every message added to the queue. Between those invocations, the message will be delayed for at least the length of time of our visibility timeout. Because our Lambda function will be timing out, its duration doesn't matter. The SQS queue's visibility timeout is the important duration here.

60 seconds × 3 delivery attempts = 180 seconds

It will take at least 180 seconds before each message is added to the DLQ. Assuming we can cause messages to be added to the queue quickly, we should be able to get multiple messages to the DLQ within a 5-minute duration for this action. That gives us a little wiggle room. If getting messages into the queue takes longer, we will want a longer duration, for example, 10 minutes. We should get some information if the duration is longer than 3 minutes. At 3 minutes or less, we might not receive any messages in our DLQ. There can be a delay of up to a minute before the Lambda function follows the actions outlined in our experiment, and the visibility timeout on the SQS queue is a minimum duration of time before the message is placed back in the queue. It does not mean it will be immediately reprocessed at that time.
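The arithmetic above is small enough to capture in a few helper functions, which makes it easy to rerun if the Lambda timeout or the queue's maximum receive count changes. This is only a sketch of the reasoning in this section: the function names are my own, and the 60-second allowance for the experiment taking effect comes from the "up to a minute" delay mentioned above.

```python
def visibility_timeout_s(lambda_timeout_s: int, multiplier: int = 6) -> int:
    """Visibility timeout following the six-times-the-function-timeout guidance."""
    return lambda_timeout_s * multiplier


def min_seconds_to_dlq(visibility_s: int, max_receive_count: int) -> int:
    """Lower bound on how long one message takes to reach the DLQ."""
    # Each failed attempt hides the message for at least one visibility timeout.
    return visibility_s * max_receive_count


def min_experiment_duration_s(visibility_s: int, max_receive_count: int,
                              propagation_delay_s: int = 60) -> int:
    """Shortest action duration likely to land a message in the DLQ."""
    # Allow for the delay before functions pick up the experiment configuration.
    return min_seconds_to_dlq(visibility_s, max_receive_count) + propagation_delay_s


visibility = visibility_timeout_s(lambda_timeout_s=10)   # 60
to_dlq = min_seconds_to_dlq(visibility, 3)               # 180
duration = min_experiment_duration_s(visibility, 3)      # 240
print(visibility, to_dlq, duration)
```

The 240-second result is still a floor, not a target, since SQS may redeliver later than the visibility timeout. That is why a 5-minute action duration leaves only a little wiggle room.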

With all of this configured, our targets set up for the experiment, and the needed changes made to our Lambda function, we should be able to test this path in our application.

Final Thoughts

Digging into AWS FIS's Lambda support expanded my understanding of chaos engineering. It helped me see far more possibilities for where it can be used than I had ever thought about. Chaos engineering is much more than terminating instances; with carefully planned experiments, you can test exactly how your system responds to a variety of issues. I hope the FIS team will continue to expand support for additional functionality with Lambda and other AWS services.
