I have an API Gateway endpoint that routes requests to a Lambda function, which writes data into an S3 bucket.
In high-volume scenarios, a significant influx of requests increases the risk of failures when writing data to S3. This could be due to various factors such as network issues, S3 service disruptions, concurrent write conflicts, or exceeding capacity limits: Amazon S3 supports up to 3,500 PUT requests per second per prefix, and in some scenarios rapid concurrent PUT requests to the same key can result in a 503 response. Addressing and mitigating these potential failure points is therefore essential to keep data writes to S3 reliable under load.
A typical example is a webhook that hits your endpoint whenever a change happens in an external system; I've experienced this with tools such as Slack, GitHub, and Jira. Given the volume of events these tools generate, an endpoint that is invoked frequently at scale can run into trouble.
Settle in with your favourite beverage as we're about to explore some details together; this post is set to be a bit lengthier than usual.
Build the project
We will start by building a project with SST that provisions an API Gateway, a Lambda, and an S3 bucket. Once implemented, we'll look into testing for concurrent write conflicts or exceeding capacity limits.
The architecture looks as follows:
Run the following to create the project locally:
npx create-sst@latest
Ok to proceed? (y) y
? Project name s3-writes-reliability
✔ Copied template files
Next steps:
- cd s3-writes-reliability
- npm install (or pnpm install, or yarn)
- npm run dev
Delete the unnecessary template code under `stacks` and `packages`: the infrastructure lives under `stacks`, and the actual Lambda code lives under `packages`.
Create the following stack:
import { StackContext, Api, Bucket, Config } from "sst/constructs";
import * as iam from "aws-cdk-lib/aws-iam";

export function API({ stack }: StackContext) {
  const bucket = new Bucket(stack, "Uploads");

  const api = new Api(stack, "api", {
    routes: {
      "POST /todo": "packages/functions/src/lambda.handler",
    },
  });

  // Expose the generated bucket name to the function as a Config parameter.
  const BUCKET_NAME = new Config.Parameter(stack, "BUCKET_NAME", {
    value: bucket.bucketName,
  });

  // Allow the route's function to read and write objects in the bucket.
  api.attachPermissions([
    new iam.PolicyStatement({
      actions: [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:ListBucket",
        "s3:GetObject",
      ],
      effect: iam.Effect.ALLOW,
      resources: [bucket.bucketArn + "/*"],
    }),
  ]);

  api.bindToRoute("POST /todo", [BUCKET_NAME, bucket]);

  stack.addOutputs({
    ApiEndpoint: api.url,
  });
}
This stack provisions the API Gateway endpoint, the Lambda function, and the S3 bucket we will write to.
Now let's update the Lambda function under `packages/functions/src/lambda.ts`:
import { ApiHandler } from "sst/node/api";
import { Config } from "sst/node/config";
import { PutObjectCommand, S3Client } from "@aws-sdk/client-s3";

const client = new S3Client({});

export const handler = ApiHandler(async (_evt) => {
  const params = {
    Body: _evt.body,
    Bucket: Config.BUCKET_NAME,
    // A random key, so each request writes a new object.
    Key: Math.random().toString(),
  };

  const response = await client.send(new PutObjectCommand(params));

  return {
    statusCode: 200,
    body: JSON.stringify(response),
  };
});
We built an API endpoint that invokes a Lambda, which in turn puts an object in S3.
Deliberate Error Creation
We will explore the S3 capacity limit of 3,500 PUT requests per second per prefix, aiming to hit it deliberately. The purpose of this exercise is to mirror real-life scenarios where a Lambda function is invoked at scale, allowing us to explore the behaviour and performance of the system under heavy load.
First attempt - a Lambda making lots of requests - failed
To achieve this, I first tried to generate the significant volume of requests from within the Lambda itself, changing the function to the following:
import { ApiHandler } from "sst/node/api";
import { Config } from "sst/node/config";
import { PutObjectCommand, S3Client } from "@aws-sdk/client-s3";
const client = new S3Client({});
async function uploadObject(key: string, data: any) {
  const params = {
    Bucket: Config.BUCKET_NAME,
    Key: key,
    Body: data,
  };
  try {
    await client.send(new PutObjectCommand(params));
    console.log(`Successfully uploaded: ${key}`);
  } catch (error) {
    console.error(`Error uploading ${key}: ${error}`);
  }
}

export const handler = ApiHandler(async (_evt) => {
  try {
    // Fire off 5000 uploads in parallel to try to exceed the S3 request rate.
    const promises = [];
    for (let i = 1; i <= 5000; i++) {
      promises.push(uploadObject(String(i), _evt.body));
    }
    await Promise.all(promises);
    // Report success once all uploads have settled.
    return {
      statusCode: 200,
      body: JSON.stringify({ message: "Uploads finished" }),
    };
  } catch (error) {
    console.error("Unhandled error in handler:", error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: "Internal Server Error", details: (error as Error).message }),
    };
  }
});
Afterwards, I used `curl` to invoke the function (change the API URL to the endpoint SST output when starting the project):
curl -X POST -H "Content-Type: application/json" -d '{"key1": "value1", "key2": "value2"}' https://sdfsdfsdf.execute-api.ap-southeast-2.amazonaws.com/todo
I didn't get any `503 Slow Down` responses from S3, even though the number of requests is high. That's because the 5,000 uploads don't all complete within a single second; spread over several seconds, the effective request rate stays below the 3,500 PUT/second limit.
Second attempt - loading the API endpoint - successful
The second attempt invokes the Lambda function with lots of simultaneous requests. This would spin up multiple Lambda instances to cater for the load, and each instance would hit S3 at the same time. Let's see if that'll work.
I'll use Apache JMeter to do this experiment:
This test resulted in many errors related to the Lambda function being throttled for exceeding its rate limit: `"errorType":"ThrottlingException","errorMessage":"Rate exceeded"`.
I then changed the JMeter thread count to 1,000 to match the limit of 1,000 concurrent executions allowed by Lambda in a single region, while keeping `Promise.all` in the Lambda code to issue parallel requests to S3. With this combination, I managed to get a few `SlowDown: Please reduce your request rate.` errors.
Using JMeter is a better way to simulate a real-world scenario, such as a Slack or GitHub webhook hitting the endpoint.
Build a more reliable system
We managed to get the S3 service to return an error that we're exceeding the rate limit. How can we build a more reliable architecture that would help us avoid this kind of error?
We have a few options here:
- Distribute objects across multiple S3 bucket prefixes, since the rate limit is set per prefix (this means changing the behaviour of the application, which might not be something you're planning to do).
- Use a retry mechanism. This could be handled on the Lambda side, where we retry requests that return a `503` error. It's a great solution, but it depends on your traffic and whether postponing the request would work; it's worth testing, especially since it doesn't require a big effort (see the sketch after this list).
- Alleviate the load on S3 by introducing a queue between API Gateway and the Lambda function. We'll implement this solution.
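To give a feel for the retry option, here's a minimal sketch. `putWithRetry` is a hypothetical helper, not part of the project code, and note that the AWS SDK v3 already retries throttling errors like this a few times on its own (`maxAttempts` defaults to 3):

import { PutObjectCommand, PutObjectCommandInput, S3Client } from "@aws-sdk/client-s3";

const client = new S3Client({});

// Hypothetical helper: retry throttled PUTs with exponential backoff and jitter.
async function putWithRetry(params: PutObjectCommandInput, retries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await client.send(new PutObjectCommand(params));
    } catch (error: any) {
      // S3 signals throttling with a 503 SlowDown error; rethrow anything
      // else, or give up once we're out of retries.
      if (error.name !== "SlowDown" || attempt >= retries) throw error;
      // Back off 100ms, 200ms, 400ms, ... with jitter.
      const delay = 100 * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}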
The architecture looks as follows:
In this diagram, SQS serves as a buffer between API Gateway and Lambda. Instead of processing incoming requests directly, we enqueue them in an SQS queue. This decouples the incoming request rate from the rate at which Lambda and S3 process the requests, which smooths out spikes, reduces the immediate load on S3, and provides a more controlled flow of data. If there are temporary issues with S3 or Lambda, messages are retained in the SQS queue and retried, giving us a more resilient system.
Let's create this additional infrastructure. Modify the previously created stack as follows (remember to add `Queue` to the `sst/constructs` import):
const queue = new Queue(stack, "Queue", {
  consumer: {
    function: {
      handler: "packages/functions/src/lambda.handler",
      timeout: 10,
      environment: { BUCKET_NAME: bucket.bucketName },
    },
  },
});

// Allow the consumer function to read and write objects in the bucket.
queue.attachPermissions([
  new iam.PolicyStatement({
    actions: [
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:ListBucket",
      "s3:GetObject",
    ],
    effect: iam.Effect.ALLOW,
    resources: [bucket.bucketArn + "/*"],
  }),
]);
This creates a queue and repurposes the Lambda we created to poll that queue. In addition, we attached the permission to put objects in S3 to the consumer Lambda.
Now, let's update the API route to write into the queue rather than triggering the Lambda function directly. Note that the route is now `POST /` instead of `POST /todo`, so adjust the `curl` URL accordingly:
// HttpIntegrationSubtype, ParameterMapping, and MappingValue come from the
// "@aws-cdk/aws-apigatewayv2-alpha" package that SST v2 relies on.
const api = new Api(stack, "api", {
  routes: {
    "POST /": {
      type: "aws",
      cdk: {
        integration: {
          subtype: HttpIntegrationSubtype.SQS_SEND_MESSAGE,
          parameterMapping: ParameterMapping.fromObject({
            QueueUrl: MappingValue.custom(queue.queueUrl),
            MessageBody: MappingValue.custom("$request.body"),
          }),
        },
      },
    },
  },
});
Once done, let's update the Lambda function. We need to change the event type to `SQSEvent`, then pass each SQS record's body to the `uploadObject` function:
const promises = _evt.Records.map((record) => uploadObject(record.messageId, record.body));
await Promise.all(promises);
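Put together, a minimal sketch of the consumer could look like this. Note it reads the bucket name from the `BUCKET_NAME` environment variable we set on the consumer, since the queue's function isn't bound to the `Config` parameter the way the API route was:

import { SQSEvent } from "aws-lambda";
import { PutObjectCommand, S3Client } from "@aws-sdk/client-s3";

const client = new S3Client({});

async function uploadObject(key: string, data: string) {
  try {
    await client.send(
      new PutObjectCommand({
        Bucket: process.env.BUCKET_NAME, // set via the consumer's environment
        Key: key,
        Body: data,
      })
    );
    console.log(`Successfully uploaded: ${key}`);
  } catch (error) {
    console.error(`Error uploading ${key}: ${error}`);
  }
}

// Each invocation receives a batch of SQS records; the messageId makes a
// convenient unique object key.
export const handler = async (event: SQSEvent) => {
  const promises = event.Records.map((record) => uploadObject(record.messageId, record.body));
  await Promise.all(promises);
};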
Is that enough to alleviate the load from S3 so that we stop having request rate errors? Let's try it again using Jmeter.
Unfortunately, I still encountered a few `Error uploading 9: SlowDown: Please reduce your request rate.` errors. Setting the event source `batchSize` to 5 seems to have fixed the S3 rate limit issue. It might not be enough where requests are sent more aggressively; in that case, we can also set a `deliveryDelay` parameter on the queue, which postpones the delivery of new messages for some time.
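For reference, here's a sketch of where these two knobs live on the SST `Queue` construct; the specific values (a batch size of 5 and a 10-second delay) are just starting points to tune against your own traffic:

import { Duration } from "aws-cdk-lib";

const queue = new Queue(stack, "Queue", {
  consumer: {
    function: {
      handler: "packages/functions/src/lambda.handler",
      timeout: 10,
      environment: { BUCKET_NAME: bucket.bucketName },
    },
    cdk: {
      // Cap how many messages each Lambda invocation receives.
      eventSource: { batchSize: 5 },
    },
  },
  cdk: {
    queue: {
      // Postpone delivery of new messages (SQS allows up to 15 minutes).
      deliveryDelay: Duration.seconds(10),
    },
  },
});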
Conclusion
In addressing challenges related to high-volume data writes to S3, we've proposed a more reliable architecture involving queues. However, this solution isn't one-size-fits-all. The queue approach brings its own questions: how should failed uploads be handled, and should a dead letter queue be considered? Is the 256 KB SQS message size limit a problem for your payloads?
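For instance, if a dead letter queue does make sense for your workload, it can be wired through the underlying CDK queue; a sketch, with an assumed `maxReceiveCount` of 3:

// After 3 failed receives, SQS moves the message to the DLQ for later
// inspection or replay.
const dlq = new Queue(stack, "DeadLetterQueue");

const queue = new Queue(stack, "Queue", {
  cdk: {
    queue: {
      deadLetterQueue: {
        queue: dlq.cdk.queue,
        maxReceiveCount: 3,
      },
    },
  },
});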
Building a more reliable system is an iterative process, and this article provides a foundation for enhancing the robustness of serverless architectures in handling substantial data loads. Ultimately, developers are encouraged to adapt and refine these solutions based on their specific use cases.
Finally, I wouldn't recommend rushing into building this kind of solution from day one unless you're dealing with serious data traffic. As with all software development, it's important to be pragmatic and build your system up iteratively as the need arises.
How do you enhance the resilience of your system? Share your thoughts in the comments.