Amazon DevOps Guru for the Serverless applications - Part 10 Anomaly detection on Aurora Serverless v2

#aws #serverless #devops #aiops

Introduction

Throughout this series about Amazon DevOps Guru for Serverless applications, we have explored many different anomalies that DevOps Guru recognizes. For my other article series about the Data API for Amazon Aurora Serverless v2, I experimented a lot with Aurora Serverless v2 and naturally asked myself whether DevOps Guru would also recognize anomalies in an Aurora (Serverless v2) PostgreSQL database. So, let's explore this. In this article we'll use the AWS SDK for Java and focus on the "standard" way to connect a Lambda function to Aurora (Serverless v2) PostgreSQL, which is JDBC (Java Database Connectivity). We won't use Amazon RDS Proxy in our example. The same approach applies to the other programming languages supported by Lambda, as they all offer their own functionality for connecting to a database. We'll explore anomaly detection on Aurora Serverless v2 with the Data API in one of the next articles.

Anomaly detection on Aurora Serverless v2

Let's look at our sample application and use the SAM template to create the infrastructure and deploy the application, whose architecture is shown in the following picture:

The application creates products stored in the Aurora Serverless v2 PostgreSQL database and retrieves them by id using JDBC. The relevant Lambda function, which we'll use to retrieve a product by its id, is GetProductByIdViaAuroraServerlessV2WithoutDataApi, and its handler implementation is GetProductByIdViaAuroraServerlessV2WithoutDataApiHandler.
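To make the JDBC access path more concrete, here is a minimal sketch of what such a lookup might look like. It is not the actual handler code: the class name ProductJdbcSketch, the Product record, the column types, and the jdbcUrl helper are assumptions made for illustration. Only the table tbl_product and the columns id, name and price are taken from the SQL statement shown further down in this article.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;

// Hypothetical sketch, not the article's actual handler implementation.
public class ProductJdbcSketch {

    // Assumed shape of a product; the column types are guesses.
    public record Product(long id, String name, BigDecimal price) {}

    // The statement text matches the one reported by Performance Insights.
    static final String SELECT_BY_ID =
            "select id, name, price from tbl_product where id=?";

    // Builds a PostgreSQL JDBC URL; host, port and database are placeholders.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    // Looks up one product over an already opened connection.
    // Without RDS Proxy, every concurrently running Lambda execution
    // environment holds its own connection for calls like this one.
    static Optional<Product> findById(Connection connection, long id) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(SELECT_BY_ID)) {
            statement.setLong(1, id);
            try (ResultSet rows = statement.executeQuery()) {
                if (!rows.next()) {
                    return Optional.empty();
                }
                return Optional.of(new Product(
                        rows.getLong("id"),
                        rows.getString("name"),
                        rows.getBigDecimal("price")));
            }
        }
    }
}
```

With the PostgreSQL JDBC driver on the classpath, a connection for findById could be opened with DriverManager.getConnection(jdbcUrl(host, port, database), user, password). Where the host and credentials come from (for example environment variables or a secret) is left open here.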

DevOps Guru suggests enabling RDS Performance Insights to gain additional insights into the anomaly, which we did for AuroraServerlessV2Instance in the SAM template:

  AuroraServerlessV2Instance:
    Type: 'AWS::RDS::DBInstance'
    Properties:
      Engine: aurora-postgresql
      DBInstanceClass: db.serverless
      DBClusterIdentifier: !Ref AuroraServerlessV2Cluster
      ....
      EnablePerformanceInsights: true
      PerformanceInsightsRetentionPeriod: 7

As in the previous article, we use the hey tool to perform the load test like this:

hey -z 15m -c 300 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/productsWithoutDataApi/1

In this example we invoke the API Gateway endpoint with 300 concurrent workers for 15 minutes. Behind the prod/productsWithoutDataApi endpoint, the Lambda function GetProductByIdViaAuroraServerlessV2WithoutDataApi is invoked, which retrieves the product with id 1 from the Aurora Serverless v2 PostgreSQL database.
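For readers less familiar with hey, the following sketch illustrates what the -c and -z options mean in terms of load: a fixed number of workers, each sending one request after another until the duration has elapsed. This is not how hey is implemented. The class LoadSketch and its run method are hypothetical, and the request is passed in as a plain Runnable so that the sketch stays independent of any HTTP client.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the "-c <workers> -z <duration>" load pattern.
public class LoadSketch {

    // Runs `request` in a loop on `workers` threads until `duration` has
    // elapsed and returns the number of completed requests.
    static long run(int workers, Duration duration, Runnable request) {
        AtomicLong completed = new AtomicLong();
        long deadline = System.nanoTime() + duration.toNanos();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                // Each worker has at most one request in flight at a time,
                // so at most `workers` requests run concurrently.
                while (System.nanoTime() - deadline < 0) {
                    request.run();
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            // Allow some extra time for requests that are still in flight.
            pool.awaitTermination(duration.toMillis() + 5_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }
}
```

In this picture, the command above corresponds to run(300, Duration.ofMinutes(15), request), where request would issue the HTTP GET against the API Gateway endpoint. The important consequence for the database is that up to 300 requests can be in flight at the same moment.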

In our SAM template we configured the Aurora database cluster to scale from a minimum capacity of 0.5 ACU to a maximum capacity of 1 ACU under increased load. This is a very small database size, which we chose purely to save costs.

  AuroraServerlessV2Cluster:
    Type: 'AWS::RDS::DBCluster'
...
      ServerlessV2ScalingConfiguration:
        MinCapacity: 0.5
        MaxCapacity: 1

An Aurora (Serverless v2) database sets the maximum number of available database connections in proportion to the database size (in our case the ACU setting). For more information, please read the documentation about Maximum connections for Aurora Serverless v2. With the increased number of invocations, we therefore expect to quickly reach the maximum number of available database connections and a high database (CPU) load, so that the database can no longer respond to new Lambda function requests to retrieve the product by id. This provokes the anomaly, and we would like to find out whether DevOps Guru is able to detect it. And it is! The following insight was generated:

And the following aggregated anomalous metrics have been identified:

For the Aurora Serverless v2 database these are: DatabaseConnections Sum, DBLoad Average and DBLoadCPU Average.
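The database connections anomaly is consistent with a simple back-of-the-envelope estimate, sketched below. The sketch assumes that each concurrently running Lambda execution environment holds exactly one JDBC connection, which is plausible without RDS Proxy but is an assumption. The class ConnectionBudget is hypothetical, and the limit of 200 used in main is only a placeholder, not the documented value for 1 ACU, which should be looked up in the Aurora documentation.

```java
// Hypothetical back-of-the-envelope check; the connection limit is an input
// that has to be looked up, it is not computed here.
public class ConnectionBudget {

    // Connections needed if every concurrent request runs in its own Lambda
    // execution environment with `connectionsPerEnvironment` open connections.
    static int connectionsNeeded(int concurrentRequests, int connectionsPerEnvironment) {
        return concurrentRequests * connectionsPerEnvironment;
    }

    // True when the demand is above the database's maximum connections.
    static boolean exceedsLimit(int connectionsNeeded, int maxConnections) {
        return connectionsNeeded > maxConnections;
    }

    public static void main(String[] args) {
        int needed = connectionsNeeded(300, 1); // 300 comes from "hey -c 300"
        int assumedLimit = 200;                 // placeholder, NOT the documented value
        System.out.println("needed=" + needed
                + ", exceeds=" + exceedsLimit(needed, assumedLimit));
    }
}
```

Whenever the demand exceeds the limit, new connection attempts fail or wait, which matches the behaviour we set out to provoke with the load test.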

Graphed anomalies look like this:

And the RDS Performance Insights metrics for database load sliced by the SQL statement look like this:

So we see that the SQL statements:

select id, name, price from tbl_product where id=?

shown in purple and orange, caused a very high database load.

Conclusion

In this article we learned that DevOps Guru can successfully detect anomalies in an Aurora (Serverless v2) PostgreSQL database when a Lambda function with the Java 21 managed runtime is connected to it via JDBC. We created a very high load on the database by invoking the Lambda function to retrieve a product by id several hundred times concurrently for multiple minutes. As a result, the Aurora (Serverless v2) PostgreSQL database auto-scaled from 0.5 to 1 ACU, which was not enough to sustain such a load. We saw that DevOps Guru correctly pointed to the operational anomalies: the increased sum of database connections and the constantly high database (CPU) load.
