Introduction
In the first part of the series we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. You'll also need to go through all the steps in the second part of the series to set everything up for our experiments. Now it's time to see DevOps Guru anomaly detection on real examples, starting with DynamoDB.
Anomaly Detection on DynamoDB
We'll start by provoking an anomaly on DynamoDB. For the sake of experimentation we'll artificially create test cases that provoke such anomalies quickly and at the lowest possible cost. For the DynamoDB test case, let's reduce the "ReadCapacityUnits" capacity on the ProductsTable DynamoDB table from 5 to 1 as displayed in the image below.
Our goal is to provoke read throttling on this table more quickly. In a real-world scenario the configured ReadCapacityUnits (assuming the table uses the "provisioned throughput" mode) can be much higher, but throttling may still happen once you exceed them and burn through the burst credits.
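The screenshot shows the change being made in the console; if you prefer the command line, a roughly equivalent call would be the following sketch (assuming the table is named ProductsTable and currently has 5 write capacity units, so adjust the values to your own table):

# Reduce the provisioned read capacity to 1 RCU (the write capacity value here is an assumption)
aws dynamodb update-table \
  --table-name ProductsTable \
  --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=5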
Now we're ready to run our stress test. For this we execute the following command, which repeatedly retrieves the product with id 1.
hey -q 10 -z 15m -c 9 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
Please note that you will need to pass your own YOUR_API_GATEWAY_ENDPOINT generated when deploying the SAM template (see the explanation above on how to find it). In this example we run the test for 15 minutes (-z 15m), executing 10 queries per second (-q 10) with 9 concurrent workers (-c 9). A duration of 15 minutes is more than enough to burn through the entire burst credits on the DynamoDB ProductsTable and throttle it, as the table now only has 1 ReadCapacityUnit.
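A quick back-of-envelope calculation shows why throttling is inevitable here (a sketch assuming eventually consistent reads of items up to 4 KB, so each request consumes 0.5 RCU):

# 9 workers x 10 requests/s            = ~90 requests/s generated by hey
# 90 requests/s x 0.5 RCU per request  = ~45 RCU/s demanded
# Provisioned capacity                 =   1 RCU/s
# The burst bucket (roughly 300 s of unused capacity) is exhausted quickly,
# after which DynamoDB starts throttling read requests on the table.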
When DevOps Guru recognizes the anomaly (in our case this happens after 7 to 9 minutes), it generates a so-called (operational) insight. Let's explore what the general dashboard looks like when there is an ongoing reactive insight.
We see that we have "Ongoing reactive insights" and "Ongoing proactive insights", and that the DevOpsGuruDemoProductsAPI application stack has been marked as unhealthy. We'll look into the "Ongoing proactive insights" in one of the upcoming articles. Let's explore the "Ongoing reactive insights" first. By clicking on it we can view its details. As we see, it's the "DynamoDB read throttle events" insight, as expected.
By removing the filter "status = ongoing" we can also see the past insights (the default time range is 6 months).
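The same lists are available programmatically; a minimal sketch using the AWS CLI (assuming it is configured for the account and Region where DevOps Guru runs, and with example dates you would replace) looks like this:

# Ongoing reactive insights
aws devops-guru list-insights \
  --status-filter '{"Ongoing": {"Type": "REACTIVE"}}'

# Past (closed) reactive insights within a given time range
aws devops-guru list-insights \
  --status-filter '{"Closed": {"Type": "REACTIVE", "EndTimeRange": {"FromTime": "2021-11-01T00:00:00", "ToTime": "2021-12-01T00:00:00"}}}'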
Now let's dig deeper into the "DynamoDB read throttle events" insight by going to the "Insight Overview" page. Here is an example of such an insight.
Here we see the information of the individual insight, like severity (in the case of throttling it is of course "high"), status, and start and end time. An OpsItem ID is provided in case you integrate with the AWS Systems Manager OpsCenter service (also the topic of one of the upcoming articles). Generally speaking, DevOps Guru is an anomaly detection service which should be used in conjunction with professional incident management tools or services, such as AWS's own Systems Manager OpsCenter or third-party tools like PagerDuty (also the topic of one of the upcoming articles) and Opsgenie from Atlassian.
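These insight details can also be fetched from the API, for example with the AWS CLI (a sketch; the insight ID below is a placeholder you would take from the console or from list-insights):

# Returns the insight's severity, status, time range and, if OpsCenter integration is enabled, the related OpsItem
aws devops-guru describe-insight --id <insight-id>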
Further down the page you can find other categories like "Graphed anomalies", "Aggregated metrics" and "Relevant events list". Let's start with "Aggregated metrics". There you can see the metrics and the exact time frames where anomalies have been identified.
We can also group these metrics by "service name" in the upper right corner.
With this we have a much clearer picture: the incident started with an increased number of requests to DevOpsGuruDemoProductAPI (due to the execution of the stress test), which then led to ReadThrottleEvents on the DynamoDB ProductsTable. Increased latency metrics also appeared as a consequence of those throttle events. We can also take a closer look at the exact values of these metrics in the "Aggregated metrics" category.
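The anomalous metrics behind these graphs can also be retrieved programmatically; a minimal sketch with the AWS CLI (again with a placeholder insight ID) would be:

# Lists the metric anomalies (e.g. ReadThrottleEvents, latency) attached to the insight
aws devops-guru list-anomalies-for-insight --insight-id <insight-id>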
Now let's explore the "Relevant events list" category.
We see two kinds of such events: "Infrastructure" and "Deployment", which occurred in the past and can potentially be the reason for the incident. With "Deployment" you can verify when you last deployed, which might be the reason for the detected anomaly. In our particular case the "Infrastructure" events are more interesting, as they describe all changes in the configuration and settings of our application stack. If we click on the last grey circle for "Infrastructure" we can see the details of the event.
By clicking on the "ops event" we can view the complete details of this event in AWS CloudTrail (which of course has to be enabled to capture such events).
We see that it is an UpdateTable event on the DynamoDB ProductsTable.
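If you prefer not to click through the console, the same CloudTrail event can be located with the AWS CLI, for example with this sketch (assuming CloudTrail covers the Region of the table):

# Find recent management events that touched the ProductsTable resource
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceName,AttributeValue=ProductsTable \
  --max-results 10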
By digging deeper we can see more details and the complete event record as JSON, with the values before and after the change.
But with CloudTrail alone it takes time to compare the values in the "from" and "to" state. If AWS Config recording is enabled, we can click on "View AWS Config resource timeline" and see the difference directly:
With that we see the change of "ReadCapacityUnits" from 5 to 1 and can now link it to the DynamoDB "read throttle events" insight. Of course our example was artificial, since we manually reduced the RCUs; in other cases a natural increase in the number of requests may lead to the same anomaly as well.
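The same configuration history is available from the AWS Config API; a minimal sketch with the AWS CLI (assuming AWS Config records DynamoDB tables and that the resource ID equals the table name) would be:

# Show the recorded configuration changes of the ProductsTable, including the RCU change
aws configservice get-resource-config-history \
  --resource-type AWS::DynamoDB::Table \
  --resource-id ProductsTable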
The final piece of information, with linked documentation, provided by DevOps Guru is how to address the anomalies in this insight; see the particular recommendations below.
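These recommendations can also be pulled via the API, for instance with the AWS CLI (placeholder insight ID again):

# Returns the recommendations, with documentation links, for the given insight
aws devops-guru list-recommendations --insight-id <insight-id>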
Conclusion
In this article we described how DevOps Guru was able to detect an anomaly on DynamoDB caused by read throttling on the table. Even if our example was a bit artificial, this kind of anomaly can happen in real-world scenarios, also when auto scaling is enabled on the "Read capacity" and the "Maximum capacity units" are exceeded. We also went through all the additional information that DevOps Guru provided us, like "Graphed anomalies", "Aggregated metrics", "Relevant events list" and "Recommendations", to analyze and resolve this anomaly. In the next part of this series we'll explore anomaly detection on the API Gateway.