
Vadym Kazulkin for AWS Community Builders


Amazon DevOps Guru for Serverless applications - Part 6: Continuing with anomaly detection on Lambda invocations

Introduction

In the 1st part of the series we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. To follow along, you also need to go through all the steps in the 2nd part of the series to set everything up. In the subsequent parts we saw DevOps Guru in action detecting anomalies on DynamoDB, API Gateway and, in the last article, Lambda. In this part of the series we'll continue with anomalies on Lambda functions, especially those occurring in conjunction with other AWS services.

Anomalies with Lambda polling the SQS queue

Let's enhance our architecture so that, when a new product is created, we send a message to an SQS queue and have another Lambda function which polls that queue (and then, as an example, informs the financial department to provide the price for the newly created product).

Image description

Now let's imagine that the polling Lambda runs into some kind of error (a timeout or a runtime error while processing the SQS payload). With that, the message remains in the SQS queue, and the polling Lambda retries reading from the queue according to the retry policy, always running into the same error. The message remains in the queue until it expires or until we reach the maximum number of retries and leave the message unprocessed (or move it to a dead-letter queue).
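To reproduce this, the polling Lambda only needs to throw an exception while handling the SQS batch. Below is a minimal sketch of such a handler in Java (the class name and the downstream call are hypothetical, not taken from the series' code); when the handler throws, the batch becomes visible in the queue again and is retried.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

// Hypothetical polling Lambda: any unhandled exception fails the whole batch,
// so the messages become visible in the queue again and are retried.
public class NewProductCreatedHandler implements RequestHandler<SQSEvent, Void> {

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage message : event.getRecords()) {
            context.getLogger().log("Processing message " + message.getMessageId());
            // Simulated failure point: a timeout or runtime error while
            // processing the payload, e.g. a broken downstream call.
            throw new RuntimeException("Could not process payload: " + message.getBody());
        }
        return null;
    }
}
```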

Image description

After setting up such a scenario we'd like to figure out whether DevOps Guru will detect this anomaly, and it does. We see the high severity Lambda error anomaly being recognized.

Image description

Digging deeper into the "Aggregated metrics"

Image description

and "Graphed anomalies"

Image description

we see that besides "Errors Sum" on the CreatedProduct Lambda function, DevOps Guru recognized other deviating metrics on the "new-product-created" SQS queue, like "NumberOfMessagesReceived Sum" and "ApproximateNumberOfMessagesNotVisible Sum", which both indicate that messages have remained unprocessed in the queue for a long period of time.
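If you want to cross-check what DevOps Guru is showing, the same signal is exposed directly on the queue as the ApproximateNumberOfMessagesNotVisible attribute. Here is a small sketch with the AWS SDK for Java v2 (the queue name is an assumption matching the example above):

```java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.GetQueueAttributesRequest;
import software.amazon.awssdk.services.sqs.model.QueueAttributeName;

public class QueueBacklogCheck {

    public static void main(String[] args) {
        try (SqsClient sqs = SqsClient.create()) {
            String queueUrl = sqs.getQueueUrl(b -> b.queueName("new-product-created")).queueUrl();

            // Messages currently "in flight": received by the Lambda poller but not yet deleted.
            String notVisible = sqs.getQueueAttributes(GetQueueAttributesRequest.builder()
                            .queueUrl(queueUrl)
                            .attributeNames(QueueAttributeName.APPROXIMATE_NUMBER_OF_MESSAGES_NOT_VISIBLE)
                            .build())
                    .attributes()
                    .get(QueueAttributeName.APPROXIMATE_NUMBER_OF_MESSAGES_NOT_VISIBLE);

            System.out.println("ApproximateNumberOfMessagesNotVisible: " + notVisible);
        }
    }
}
```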

Anomalies with Lambda polling the Kinesis Data Stream

Let's imagine we have a use case where Lambda polls a Kinesis Data Stream in order to store the data in an S3 bucket and analyze it with Athena or QuickSight.

Image description

Now let's imagine that the polling Lambda runs into some kind of error (a timeout or a runtime error while processing the Kinesis records), similar to the previous use case with SQS. With that, the records remain in the Kinesis Data Stream, and the polling Lambda retries processing them according to the retry policy, always running into the same error. The records remain in the stream until they expire or until we reach the maximum number of Lambda retries and leave them unprocessed (or, for such use cases, send them to an on-failure destination configured for the Kinesis event source).
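As a minimal sketch (the class name and bucket are hypothetical placeholders, not the series' actual code), the consumer could look like the Java handler below: it writes every Kinesis record to S3, and any unhandled exception makes Lambda retry the same batch, so the shard iterator does not advance and the records age in the stream.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;

// Hypothetical consumer: stores every Kinesis record in S3 for later analysis
// with Athena/QuickSight. If putObject (or anything else here) throws, the whole
// batch is retried and the iterator age keeps growing.
public class OrderedProductStreamHandler implements RequestHandler<KinesisEvent, Void> {

    private final S3Client s3 = S3Client.create();

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            byte[] payload = record.getKinesis().getData().array();
            // Simulated failure point: a timeout or runtime error here leaves the
            // records unprocessed in the stream until they expire or a retry limit is hit.
            s3.putObject(b -> b.bucket("ordered-products-analytics").key(record.getEventID()),
                    RequestBody.fromBytes(payload));
        }
        return null;
    }
}
```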

After setting up such a scenario we'd like to figure out whether DevOps Guru will detect this anomaly, and it does. We see the medium severity (I'd personally rate it as high) Lambda error anomaly being recognized.

Image description

Digging deeper into the "Aggregated metrics"

Image description

and "Graphed anomalies"

Image description

we see that besides "IteratorAge Maximum" on the OrderedProduct Lambda function, DevOps Guru recognized other deviating metrics on the Kinesis Data Stream, like "GetRecords.Bytes Sum" and "GetRecords.Records Maximum", which both indicate that Kinesis Data Stream records have remained unprocessed for a long period of time.

Anomalies with Step Functions invoking Lambda

Now let's imagine another use case with Step Functions calling a Lambda function as part of a task.

Image description

Now this Lambda function runs into some kind of error (a timeout or a runtime error while processing the payload), similar to the previous use cases with SQS and Kinesis Data Streams. This continues until we reach the maximum number of Lambda retries that we have configured in the task's retry policy of our state machine.
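The retry behaviour itself lives in the state machine definition, not in the Lambda function. As a minimal sketch, assuming the state machine were defined with the AWS CDK in Java (the construct and function names are hypothetical and may differ from the series' actual setup), the task's retry policy could be configured like this:

```java
import java.util.List;

import software.amazon.awscdk.Duration;
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.lambda.Function;
import software.amazon.awscdk.services.lambda.IFunction;
import software.amazon.awscdk.services.stepfunctions.RetryProps;
import software.amazon.awscdk.services.stepfunctions.StateMachine;
import software.amazon.awscdk.services.stepfunctions.tasks.LambdaInvoke;
import software.constructs.Construct;

// Hypothetical CDK stack: a single Lambda task whose retry policy bounds how often
// the failing function is re-invoked before the whole execution fails.
public class OrderWorkflowStack extends Stack {

    public OrderWorkflowStack(Construct scope, String id) {
        super(scope, id);

        IFunction orderFunction = Function.fromFunctionName(this, "OrderFunction", "CreateOrder");

        LambdaInvoke orderTask = LambdaInvoke.Builder.create(this, "CreateOrderTask")
                .lambdaFunction(orderFunction)
                .build();

        // Retry the task at most twice after the initial attempt (three attempts in total),
        // waiting 2s, then 4s between attempts.
        orderTask.addRetry(RetryProps.builder()
                .errors(List.of("States.ALL"))
                .interval(Duration.seconds(2))
                .maxAttempts(2)
                .backoffRate(2)
                .build());

        StateMachine.Builder.create(this, "OrderWorkflow")
                .definition(orderTask)
                .build();
    }
}
```

Once every retry attempt has failed, the execution itself fails, which is exactly the signal DevOps Guru picks up below.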

This high severity anomaly was also correctly recognized by DevOps Guru.

Image description

Digging deeper into the "Aggregated metrics"

Image description

and "Graphed anomalies"

Image description

we see that besides "Errors Sum" on our Lambda function, DevOps Guru recognized other deviating metrics on the Step Functions state machine, like "ExecutionsFailed Sum" and "ExecutionsAborted Sum", showing that the number of failed and, in the end, aborted executions significantly increased for a certain period of time.

Anomalies with Lambda communicating directly with the RDS service

Now let's imagine a scenario where we use RDS for PostgreSQL instead of DynamoDB.

Image description

And we use neither RDS Proxy nor Aurora Serverless (Data API). With that, there is a risk of exhausting the database connections when more Lambda functions run in parallel and connect to the database than the maximum number of database connections that RDS provides. For the sake of cost saving we'll use the smallest possible RDS instance, db.t3.micro, which provides a maximum of 83 database connections. For your instance size, please check the maximum number of database connections; for a PostgreSQL RDS instance, for example, you can execute "show max_connections".
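The problem arises because each concurrently running Lambda execution environment opens its own database connection(s). Below is a minimal Java sketch of the pattern that gets us into trouble (the endpoint, table, credentials and handler name are hypothetical placeholders, not the series' actual implementation):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// Hypothetical handler: every invocation opens its own JDBC connection to RDS.
// With high concurrency and no RDS Proxy, the ~83 connections of a db.t3.micro
// instance are exhausted quickly and further connection attempts fail.
public class GetProductFromRDSByIdHandler implements RequestHandler<String, String> {

    @Override
    public String handleRequest(String productId, Context context) {
        String url = "jdbc:postgresql://my-rds-endpoint:5432/products"; // placeholder endpoint

        try (Connection connection = DriverManager.getConnection(url, "app_user", "app_password");
             PreparedStatement statement = connection.prepareStatement(
                     "SELECT name FROM product WHERE id = ?")) {
            statement.setString(1, productId);
            try (ResultSet resultSet = statement.executeQuery()) {
                return resultSet.next() ? resultSet.getString("name") : null;
            }
        } catch (SQLException e) {
            // This is the error that surfaces as 5XX on API Gateway once connections run out.
            throw new RuntimeException("Database access failed", e);
        }
    }
}
```

In a real setup you would at least reuse the connection across invocations (initialize it outside the handler) or put RDS Proxy in between, which is precisely why this anti-pattern makes a good anomaly test here.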

Now let's run a stress test with the hey tool:

hey -q 50 -z 15m -c 20 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/productsFromRDS/2


With this (sending up to 50 requests per second per worker, with 20 workers in parallel, for 15 minutes) we'll quickly exhaust all database connections.

DevOps Guru correctly recognizes this high severity anomaly.

Image description

Digging deeper into the "Aggregated metrics"

Image description

and "Graphed anomalies"

Image description

we see that besides "5XXError Average" on our API Gateway, DevOps Guru correctly recognized the deviating metric "DatabaseConnections Sum" on RDS. Strangely, although our Lambda function GetProductFromRDSByID also threw errors, no deviating "Errors Sum" metric was detected for it.

Conclusion

In this article we described how DevOps Guru was able to detect different kinds of anomalies on Lambda, such as a Lambda function polling an SQS queue or a Kinesis Data Stream, but also Step Functions invoking a Lambda function that runs into an error. We also set up a test to detect an anomaly in a Lambda function communicating directly with RDS and running out of database connections. In the next part of the series we'll look into another DevOps Guru feature called "Proactive Insights".
