
Vadym Kazulkin for AWS Community Builders


Amazon DevOps Guru for the Serverless applications - Part 7 Proactive insights

Introduction

In the 1st part of the series we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. In the 2nd part we went through all the steps needed to set everything up. In the subsequent parts we saw DevOps Guru in action, detecting anomalies on DynamoDB, API Gateway and Lambda (also Lambda in conjunction with other AWS services like SQS, Kinesis, Step Functions and RDS). All of these have been so-called DevOps Guru “Reactive Insights”, which detect live anomalies. DevOps Guru also supports another kind of anomaly detection called “Proactive Insights”, which we’d like to introduce in this part of the series.

DevOps Guru “Proactive Insights” overview

DevOps Guru “Proactive Insights” provide information about misconfiguration of the monitored services in terms of violated operational best practices, as well as insights about overprovisioning of the used services. This information can become available a short period (between 1 hour and several days) after DevOps Guru starts to monitor the configured resources. For the sake of the overview, we’ll use the application that we introduced in the first part of the series. Here is how the “Proactive Insights” are displayed.

Image description

You see the name of each proactive insight, its status, severity and creation time.

Now let's go over the findings that DevOps Guru was able to detect for our sample application. Some of them were produced intentionally to see whether DevOps Guru would be able to detect them.

Let’s start with the insights in the “misconfiguration of the used services” category:

  • Lambda timeout exceeds recommended SQS visibility.

Image description

This happens when the “Default visibility timeout” of the SQS queue (for example 30 seconds) is lower than the timeout of the poller Lambda function (for example 35 seconds). The message can then become visible in the queue again while the function is still processing it, so we risk processing it more than once or losing it.

DevOps Guru also provides instructions on how to remediate this.

Image description
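
The comparison behind this finding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the returned fields are made up for this example, and the factor of 6 comes from the AWS Lambda guidance for SQS event sources (a visibility timeout of at least six times the function timeout), which is not necessarily the exact rule DevOps Guru applies.

```python
# Assumption: factor taken from the AWS Lambda guidance for SQS event sources.
RECOMMENDED_FACTOR = 6


def check_visibility_timeout(queue_visibility_s: int,
                             function_timeout_s: int,
                             factor: int = RECOMMENDED_FACTOR) -> dict:
    """Compare the queue visibility timeout with the Lambda function timeout.

    Hypothetical helper: in a real check the two values would come from the
    SQS queue attributes and the Lambda function configuration.
    """
    recommended_minimum_s = function_timeout_s * factor
    return {
        # True when the queue follows the "factor x function timeout" guidance.
        "ok": queue_visibility_s >= recommended_minimum_s,
        # True for the situation described above (30 s queue vs. 35 s function).
        "shorter_than_function_timeout": queue_visibility_s < function_timeout_s,
        "recommended_minimum_s": recommended_minimum_s,
    }


# Values from the example in the text: 30 s visibility, 35 s function timeout.
result = check_visibility_timeout(queue_visibility_s=30, function_timeout_s=35)
print(result)
```

With the values from the example, the check fails and reports 210 seconds as the minimum visibility timeout under that guidance.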

  • SQS triggered Lambda does not have a DLQ

Image description

DevOps Guru gives a clear recommendation on how to fix this problem.

Image description
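
As a rough sketch of what attaching a DLQ to the source queue involves, the snippet below builds the `RedrivePolicy` queue attribute. The helper name, the queue ARN and the `maxReceiveCount` of 5 are assumptions for illustration only; the right retry count depends on the workload.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the SQS attribute that points a source queue at a dead-letter queue.

    Hypothetical helper. max_receive_count is the number of receives after
    which SQS moves a message to the DLQ; 5 is an assumed default.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the redrive policy as a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical DLQ ARN, used only for this example.
attrs = build_redrive_attributes("arn:aws:sqs:eu-central-1:123456789012:orders-dlq")
print(attrs)

# The result could then be applied with boto3, for example:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=source_queue_url, Attributes=attrs)
```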

  • Lambda function consuming a DynamoDB or Kinesis stream without a failure destination.
    Insights similar to the one above are provided when a Lambda function that consumes a DynamoDB or Kinesis stream is configured without a failure destination.

  • Lambda function doesn’t have enough subnets.
    This insight was detected for the Lambda function that communicated directly with RDS (see part 6 of the series). In this case the Lambda function had to be placed in a VPC, because communication with RDS requires it. I intentionally used only 2 subnets, although the AWS region (in this case eu-central-1) has 3 availability zones and therefore 3 subnets available.
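
The subnet finding boils down to comparing the availability zones covered by the function's subnets with the zones of the region. A minimal sketch follows; the function name and subnet IDs are hypothetical, and in practice the inputs would come from the function's VPC configuration and the EC2 subnet descriptions.

```python
def uncovered_zones(subnet_zones: dict[str, str], region_zones: list[str]) -> list[str]:
    """Return the availability zones that none of the function's subnets covers.

    subnet_zones maps a subnet ID to its availability zone (hypothetical input).
    """
    covered = set(subnet_zones.values())
    return sorted(zone for zone in region_zones if zone not in covered)


# Setup from the text: 2 subnets configured, 3 availability zones in eu-central-1.
subnet_zones = {
    "subnet-aaa": "eu-central-1a",  # hypothetical subnet IDs
    "subnet-bbb": "eu-central-1b",
}
region_zones = ["eu-central-1a", "eu-central-1b", "eu-central-1c"]

missing = uncovered_zones(subnet_zones, region_zones)
print(missing)  # ['eu-central-1c']
```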

Let’s continue with the insights in the “under- and over-provisioning of the used services” category.

  • DynamoDB table reads are underutilized.
    DevOps Guru detected such insights when I used the “provisioned throughput” mode on DynamoDB with 3 RCUs/WCUs, but the application constantly consumed less. In this case DevOps Guru suggested that I either switch the DynamoDB table to the “on-demand” capacity mode or reduce the number of RCUs/WCUs.

  • Missing activation of point-in-time recovery for DynamoDB. DevOps Guru was able to detect this insight as well.

  • Lambda function has concurrency spillover.
    This insight was detected because I used a provisioned concurrency of 5 for test purposes on one of the Lambda functions and then ran a test with 9 parallel requests via API Gateway, so the number of provisioned Lambda execution environments wasn’t enough to serve all or most requests without cold starts.

Image description

In this case DevOps Guru recommended that I increase the provisioned concurrency to 8.

Image description
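
The arithmetic behind the spillover is simple and can be sketched as follows. The function name is hypothetical, and the snippet does not reproduce how DevOps Guru arrives at its recommendation of 8; it only shows how many parallel requests fall outside the provisioned environments.

```python
def spillover(concurrent_requests: int, provisioned_concurrency: int) -> int:
    """Number of parallel requests that cannot be served by provisioned environments."""
    return max(0, concurrent_requests - provisioned_concurrency)


# Test from the text: 9 parallel requests against 5 provisioned environments,
# then the same load against the recommended value of 8.
for provisioned in (5, 8):
    print(f"provisioned={provisioned} spillover={spillover(9, provisioned)}")
```

With 5 provisioned environments, 4 of the 9 requests spill over to on-demand environments; with 8, only 1 does, which matches the goal of serving all or most requests without cold starts.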

Unfortunately, DevOps Guru didn’t generate an insight when I configured provisioned concurrency for one of my Lambda functions but didn’t invoke it for a long period of time, clearly wasting money.
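
Until such an insight exists, a check like this could be done by hand. The sketch below is an assumption-laden illustration, not a DevOps Guru feature: the function name and the 7-day window are made up, and the daily invocation counts are assumed to be collected separately (for example from the Lambda `Invocations` metric in CloudWatch, with days without data counted as 0).

```python
def wasted_provisioned_concurrency(provisioned_concurrency: int,
                                   daily_invocations: list[int],
                                   min_idle_days: int = 7) -> bool:
    """Flag a function that has provisioned concurrency but no recent invocations.

    Hypothetical helper; min_idle_days=7 is an assumed threshold.
    """
    if provisioned_concurrency <= 0:
        return False  # nothing is provisioned, so nothing is wasted
    if len(daily_invocations) < min_idle_days:
        return False  # not enough history to judge
    # Idle means no invocation at all in the most recent window.
    return sum(daily_invocations[-min_idle_days:]) == 0


print(wasted_provisioned_concurrency(5, [0, 0, 0, 0, 0, 0, 0]))   # idle for a week
print(wasted_provisioned_concurrency(5, [0, 0, 0, 0, 0, 0, 12]))  # invoked recently
```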

Conclusion

In this article we looked into the DevOps Guru “Proactive Insights”, which provide detailed information about misconfiguration of the monitored services in terms of violated operational best practices and insights about under- and over-provisioning or utilization of the used services. We saw that the insights were mostly generated correctly, with only a few missing, such as overprovisioned Lambda concurrency. In the next part of the series we’ll look at the DevOps Guru capabilities in terms of integrations with other systems. These include incident management tools and services such as AWS Systems Manager OpsCenter, but also 3rd party tools like Atlassian Opsgenie and PagerDuty, with a more detailed example of how to use the latter.
