If you are using AWS Lambda rule actions with your AWS IoT Rules, then this post might be of interest to you. I have a question for you: do you know for a fact that all your events are processed successfully, and that the function execution is successful for all your events? How do you know that? If these questions make you curious, keep on reading.
This post will focus on explaining the asynchronous nature of the Lambda invocation from the AWS IoT Rules Engine, and the impact it can have on your IoT application, potentially leading to message loss. We will also walk through a working set-up of some resilience steps, using AWS SAM, AWS Lambda Powertools for TypeScript, and AWS X-Ray for observability.
Asynchronous Lambda invocations
When AWS IoT Rules Engine invokes AWS Lambda, the invocations are asynchronous. This means that AWS Lambda places the event in a queue and AWS IoT Rules records success, regardless of what happens next in your Lambda execution. A separate process reads events from the queue and sends them to your configured function.
Your AWS Lambda function consists of both your own code and the Lambda runtime itself, and both can be a source of errors that you should handle. So what happens if your function execution leads to an error, be it an error in the function’s code, or a runtime error, such as a timeout?
In the error case, as described in the AWS Lambda documentation on asynchronous invocation, AWS Lambda by default retries the function with the same event two more times, waiting one minute between the first and second attempts, and two minutes between the second and third attempts.
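To make the default retry timing concrete, here is a minimal sketch that computes when each attempt would start, counted from the first attempt. The helper name `attemptOffsets` is hypothetical (it is not an AWS API), and the sketch ignores the function's own execution time, so treat the numbers as approximate.

```typescript
// Waits Lambda applies before each retry of a failed asynchronous invocation:
// one minute before the first retry, two minutes before the second.
const RETRY_WAITS_SECONDS = [60, 120];

// Hypothetical helper: returns the offsets (in seconds, counted from the first
// attempt) at which each attempt would start, ignoring execution time.
function attemptOffsets(maximumRetryAttempts: number): number[] {
  if (
    !Number.isInteger(maximumRetryAttempts) ||
    maximumRetryAttempts < 0 ||
    maximumRetryAttempts > RETRY_WAITS_SECONDS.length
  ) {
    throw new RangeError("MaximumRetryAttempts must be 0, 1 or 2");
  }
  const offsets = [0]; // the first attempt starts immediately
  let elapsed = 0;
  for (const wait of RETRY_WAITS_SECONDS.slice(0, maximumRetryAttempts)) {
    elapsed += wait;
    offsets.push(elapsed);
  }
  return offsets;
}

console.log(attemptOffsets(2)); // [ 0, 60, 180 ]
```

With the default of two retries, a failing event can therefore stay in flight for roughly three minutes, plus the time the function itself runs on each attempt.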
Additional situations can occur:
- Your function does not have enough concurrency available to process all incoming events, which can well happen in an IoT application at scale. As a result, additional requests will get throttled (429 HTTP status) by the Lambda service, returned to the event queue, and retried as per Lambda strategy. The more events in the queue, the higher the retry interval set by Lambda, and the lower the rate at which it reads events from the queue.
- The same event is received multiple times, due to eventual consistency of the event queue.
- The function cannot keep up with the incoming events, and events are deleted from the queue without being sent to the function.
- With events at scale, if the AWS Lambda event queue is very long, new events might expire before AWS Lambda gets to send them to your function. If events expire or processing fails, AWS Lambda discards them.
So clearly this can lead to data loss in your application.
Can the Error Action help?
The short answer is no, not in this situation.
The Error Action of the AWS IoT Rules Engine is a great tool for errors that prevent AWS Lambda from accepting the invocation, such as the rule lacking permission to trigger your function, or AWS Lambda not being able to add the event to the queue. It will not help, however, with errors that happen in your Lambda function’s code or in the Lambda runtime (like timeouts). Remember the asynchronous nature of the AWS Lambda function trigger from the IoT Rule.
So what can you do?
Luckily, there are a few things you can do to ensure that your events do not get lost, and to mitigate the situations mentioned above:
- Ensure that you have enough concurrency to handle invocations. If you allow the default Lambda retries (2), this also means that these invocations will add to your concurrent invocation count, so keep this in mind.
- Make sure your IoT Cloud application can handle duplicate events as required by your design, and don’t let duplicate events catch you by surprise. For example, ensure that your application code is idempotent, and you design for eventual consistency in your data storage.
- You could reduce the number of retries that the Lambda service performs, or discard unprocessed events quicker. Here are the docs on how to do this. And if your application code is making calls to other services, you should build a retry strategy, as well as error handling in your application code, inside your Lambda function.
- You can configure destinations for asynchronous invocation, for both successful and failed events. Destinations are used by Lambda to send your events to other services, as per configuration. Setting up destinations for failed executions in particular allows you to decouple the handling of your failures without data loss, even if you cannot accommodate Lambda service retries or you figure that your errors are not recoverable with retries. You can configure Amazon SQS, Amazon SNS, Lambda or EventBridge as destinations.
An alternative to destinations is configuring a dead-letter queue, which is part of your function's version-specific configuration. Destinations, however, are more flexible (they are not locked in when you publish a version of your function), support additional targets, and include details about the function's response in the invocation record.
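For the retry strategy towards other services mentioned in the list above, a minimal sketch could look like the following. It assumes exponential backoff with full jitter, which is one common choice rather than anything prescribed by Lambda, and the helper name `withRetries` and its default values are hypothetical.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff
// and full jitter, then rethrows the last error if all attempts fail.
async function withRetries<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts) break;
      // Wait a random time in [0, baseDelayMs * 2^(attempt - 1)) before retrying.
      const delayMs = Math.random() * baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // Rethrowing fails the invocation, so Lambda can still apply its own
  // retries and, eventually, send the event to the failure destination.
  throw lastError;
}
```

Inside a handler you would wrap only the downstream call, for example `await withRetries(() => saveReading(event))`, where `saveReading` stands for your own function. Keep the total waiting time well below the function timeout.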
Using tracing tools like AWS X-Ray, or logging and metrics with Amazon CloudWatch, is great for visibility as to what happens to your AWS Lambda function invocations.
Let’s look at this in an example
The architecture of the example we are building is in the diagram below:
We have an AWS IoT serverless application built with AWS SAM, composed of an AWS Lambda function that the Rules Engine invokes on every message an IoT device publishes on the topic 'device/<thingName>/from-device'. For the purpose of our example, the Lambda function will simply throw an error.
Here is the AWS SAM template for this:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  rules-engine-to-lambda
  Sample SAM Template for rules-engine-to-lambda
Globals:
  Function:
    Timeout: 60
    Tracing: Active
Resources:
  RulesHandlingFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: app/
      Handler: app.lambdaHandler
      Runtime: nodejs14.x
      Architectures:
        - x86_64
      Tracing: Active
      Environment:
        Variables:
          POWERTOOLS_SERVICE_NAME: RulesToLambdaService
          POWERTOOLS_METRICS_NAMESPACE: rules-engine-to-lambda
          LOG_LEVEL: INFO
      EventInvokeConfig:
        DestinationConfig:
          OnFailure:
            Type: SQS
            Destination: !GetAtt FailedRequestsQueue.Arn
        MaximumEventAgeInSeconds: 120
        MaximumRetryAttempts: 1
      Events:
        IoTLambda:
          Type: IoTRule
          Properties:
            AwsIotSqlVersion: '2016-03-23'
            Sql: SELECT * FROM 'device/+/from-device'
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: true
        Target: "es2020"
        EntryPoints:
          - app.ts
  FailedRequestsQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: "FailedRequestsQueue"
      VisibilityTimeout: 300
      KmsMasterKeyId: alias/aws/sqs
Outputs:
  RulesHandlingFunction:
    Description: "RulesHandlingFunction ARN"
    Value: !GetAtt RulesHandlingFunction.Arn
  RulesHandlingFunctionIamRole:
    Description: "Implicit IAM Role for RulesHandlingFunction"
    Value: !GetAtt RulesHandlingFunctionRole.Arn
And here is the example AWS Lambda function (app.ts), in TypeScript. The function uses the Lambda Powertools for TypeScript for an easy, middleware-based integration with AWS X-Ray and Amazon CloudWatch.
import { Context } from 'aws-lambda';
import { captureLambdaHandler, Tracer } from '@aws-lambda-powertools/tracer';
import middy from '@middy/core';
import { injectLambdaContext, Logger } from '@aws-lambda-powertools/logger';

const tracer = new Tracer();
const logger = new Logger();

// Always fails, so that we can observe how Lambda handles a failed asynchronous invocation.
export const lambdaHandler = middy(async (event: any, context: Context): Promise<any> => {
    logger.info('Event', event);
    throw new Error('This is a test error');
})
    // Trace the handler with AWS X-Ray.
    .use(captureLambdaHandler(tracer))
    // Add Lambda context details to every log entry.
    .use(injectLambdaContext(logger, { clearState: true }));
If you publish an event from your IoT device on the topic mentioned above, you can have a look in AWS X-Ray. The service map will look like below.
You can see the light brown circle around your Lambda function invocation, showing the 4xx Error. However, if you look at the trace, you will see the Response code is 202, as expected for a successful asynchronous invocation. As a reminder here, setting up an Error Action will not help, as from the perspective of your IoT Rule, the AWS Lambda invocation was successful, because AWS Lambda responded with a 202 status code, and there is no reason to invoke the Error Action.
Because we have set MaximumRetryAttempts to 1 and MaximumEventAgeInSeconds to 120 seconds, if we open the trace we will see that Lambda attempted to invoke the function twice, both times with the same expected failure. If your application is processing large numbers of events, allowing the Lambda service to retry for you, especially with a high value of MaximumEventAgeInSeconds, might not be the best strategy, due to the potential loss of newer events. You can also notice the different log entries in Amazon CloudWatch for each invocation, in the image below.
Because we have configured an Amazon SQS queue for failure cases, our event is not lost, but was sent to the SQS queue. The destination is configured with DestinationConfig in the SAM template. Once the messages land in the SQS queue, you can process them with an idempotent consumer, such as another AWS Lambda function.
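As an illustration of such a consumer, here is a minimal sketch of a Lambda handler reading from the failure queue. It relies on the invocation record that Lambda sends to an on-failure destination, which wraps the original event in requestPayload and carries the request ID in requestContext. The sketch is not part of the sample repository: the names failedEventsHandler and reprocess are hypothetical, the types are trimmed-down local stand-ins for the ones in 'aws-lambda', and the in-memory set is an assumption standing in for a durable idempotency store.

```typescript
// Minimal local types; a real project would likely use SQSEvent from 'aws-lambda'.
interface SqsRecord { messageId: string; body: string; }
interface SqsEvent { Records: SqsRecord[]; }

// The fields of the on-failure invocation record that this sketch uses.
interface InvocationRecord {
  requestContext: { requestId: string; condition: string; approximateInvokeCount: number };
  requestPayload: unknown; // the original IoT message
  responsePayload?: { errorMessage?: string; errorType?: string };
}

// Assumption: in-memory only, so it is lost between cold starts.
// A durable store (for example a database table) would be needed in practice.
const processedRequestIds = new Set<string>();

// Hypothetical hook: replace with your own recovery logic.
async function reprocess(payload: unknown): Promise<void> {
  console.log("Reprocessing original IoT message", JSON.stringify(payload));
}

// Handles each failed event once, skipping records whose request ID was already seen.
async function failedEventsHandler(event: SqsEvent): Promise<number> {
  let handled = 0;
  for (const record of event.Records) {
    const invocation = JSON.parse(record.body) as InvocationRecord;
    const { requestId } = invocation.requestContext;
    if (processedRequestIds.has(requestId)) continue; // duplicate delivery
    await reprocess(invocation.requestPayload);
    processedRequestIds.add(requestId);
    handled++;
  }
  return handled;
}
```

Using the request ID as the idempotency key works here because all retries of one asynchronous invocation share it, so a message delivered twice by the queue is only reprocessed once.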
Have a look also at the GitHub repository with all the resources.
Conclusion
This post explains the behaviour of the AWS IoT Rules Engine invoking your AWS Lambda function, the asynchronous nature of the invocation, and the impact it can have on your IoT application, especially if it leads to potential data loss.
Tools like AWS X-Ray and the Lambda Powertools give you tracing visibility into your Lambda invocations. Setting up a destination for failed events is a good strategy to ensure data does not get lost, and it lets you handle failed events in a decoupled, parallel path in your application. Programmatic error handling and retries in your Lambda function's application code can also contribute to your fallback strategy.
Have a look at the links throughout the blog post, to understand better how asynchronous invocations with AWS Lambda work.
If you find this interesting or have suggestions for future topics, feel free to reach out here or @fay_ette on Twitter or LinkedIn.