Amazon DevOps Guru for the Serverless applications - Part 4 Anomaly Detection on API Gateway
In the 1st part of the series we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. We also need to go through all the steps in the [2nd part of the series](https://dev.to/aws-builders/amazon-devops-guru-for-the-serverless-applications-part-2-setting-up-the-sample-application-for-the-anomaly-detection-167) to set everything up. In the 3rd part of the series we saw DevOps Guru in action by generating anomalies on DynamoDB and explained the general capabilities of the DevOps Guru service. In this part of the series we'll generate anomalies on API Gateway.
There are mainly two kinds of anomalies that we can experience with API Gateway: HTTP 4XX errors and HTTP 5XX errors. We'll see the latter in action when we provoke Lambda anomalies in the next part of the series. Let's see whether DevOps Guru can recognize HTTP 4XX errors as anomalies.
There are several kinds of such errors. We'll be looking at the following ones:
HTTP 429 "too many requests", returned when API Gateway throttles requests.
HTTP 404 "not found", returned when we ask for a non-existent product id.
Let's first look at the HTTP 429 error. The easiest and cheapest way to generate such errors is to set a low value for either the Request Rate, Burst or Quota of the DevOpsGuruDemoProductAPIUsagePlan associated with our DevOpsGuruDemoProductAPI. In this example we set the Quota to 500 requests per day.
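If you prefer to change the quota from a script instead of the console, the following is a minimal sketch using boto3's `update_usage_plan` call. The function names `build_quota_patch` and `apply_quota` and the usage plan id placeholder are my own assumptions for illustration. The sketch also assumes that the usage plan already has a quota configured, so that a `replace` operation is appropriate; if it doesn't, a different patch operation may be required.

```python
from typing import Dict, List


def build_quota_patch(limit: int, period: str = "DAY") -> List[Dict[str, str]]:
    """Build the patch operations that set the quota of a usage plan.

    Hypothetical helper: it only builds the request payload and does not
    talk to AWS, so it can be inspected or tested offline.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive number of requests")
    if period not in {"DAY", "WEEK", "MONTH"}:
        raise ValueError("period must be DAY, WEEK or MONTH")
    return [
        # Patch values are passed as strings.
        {"op": "replace", "path": "/quota/limit", "value": str(limit)},
        {"op": "replace", "path": "/quota/period", "value": period},
    ]


def apply_quota(usage_plan_id: str, limit: int, period: str = "DAY") -> dict:
    """Apply the quota to an existing usage plan (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("apigateway")
    return client.update_usage_plan(
        usagePlanId=usage_plan_id,
        patchOperations=build_quota_patch(limit, period),
    )


if __name__ == "__main__":
    # Only prints the payload; call apply_quota("YOUR_USAGE_PLAN_ID", 500)
    # yourself to actually change the usage plan.
    print(build_quota_patch(500))
```

Remember to set the quota back to its previous value after the experiment, as the low quota stays in effect for every client using this usage plan.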
Now let's run a stress test with the hey tool:
hey -q 10 -z 11m -c 5 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
With this command (each of 5 concurrent workers sends up to 10 requests per second for 11 minutes) we'll exhaust the quota of 500 requests per day on API Gateway pretty quickly and then receive HTTP 429 as a response. DevOps Guru also recognized the anomaly, as displayed in the image below.
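To make "pretty quickly" more concrete, here is a small back-of-the-envelope calculation. It is only a sketch under a few assumptions: every worker actually reaches its rate limit (in practice latency can keep it lower, so the numbers are upper bounds), the daily quota is untouched when the test starts, and no rate or burst throttling kicks in before the quota does. The names `LoadEstimate` and `estimate_load` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadEstimate:
    requests_per_second: float
    total_requests: float
    seconds_until_quota_exhausted: float
    requests_over_quota: float


def estimate_load(qps_per_worker: float, workers: int,
                  duration_seconds: float, daily_quota: int) -> LoadEstimate:
    """Estimate an upper bound of the load generated by a hey run."""
    requests_per_second = qps_per_worker * workers
    total_requests = requests_per_second * duration_seconds
    return LoadEstimate(
        requests_per_second=requests_per_second,
        total_requests=total_requests,
        # Time until the quota is used up at full speed.
        seconds_until_quota_exhausted=daily_quota / requests_per_second,
        # Requests sent after the quota is used up.
        requests_over_quota=max(0, total_requests - daily_quota),
    )


if __name__ == "__main__":
    # hey -q 10 -z 11m -c 5 against a quota of 500 requests per day
    estimate = estimate_load(qps_per_worker=10, workers=5,
                             duration_seconds=11 * 60, daily_quota=500)
    print(estimate)
```

Under these assumptions the quota is gone after roughly 10 seconds, and up to 32,500 of the at most 33,000 requests are sent after that point, so nearly the whole test run consists of HTTP 429 responses.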
We see that DevOps Guru is of the opinion that such an error has only medium severity (with which I personally disagree).
"Aggregated metrics" shows that "4XXError Average" was correctly recognized as the reason for the anomaly. Unfortunately, CloudWatch only exposes the generic 4XXError metric and not the concrete HTTP 429 status code, and DevOps Guru simply shows the CloudWatch graphs here. We'll need the help of CloudWatch Logs to identify the exact error.
And "Graphed Anomalies" shows the exact number of throttled requests in the time range of the anomaly.
There are also some recommendations on how to fix this kind of anomaly.
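As mentioned above, the 4XXError metric alone doesn't tell us which status code is behind the anomaly, so we have to look into the logs. The following sketch shows one way to break the 4XX errors down by concrete status code. It assumes that access logging is enabled on the API Gateway stage with a JSON log format that contains a `status` field (filled from `$context.status`), and that the log lines have already been exported from CloudWatch Logs. Neither is part of the sample application as described in this series, and the function name `count_4xx_by_status` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_4xx_by_status(log_lines: Iterable[str]) -> Counter:
    """Count HTTP 4XX responses per concrete status code.

    Each line is expected to be one JSON access log entry with a
    "status" field. Lines that are not valid JSON objects are skipped.
    """
    counts: Counter = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        status = str(entry.get("status", ""))
        if len(status) == 3 and status.startswith("4"):
            counts[status] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample entries, not real log output.
    sample = [
        '{"requestId": "a", "status": "429", "resourcePath": "/products/{id}"}',
        '{"requestId": "b", "status": "404", "resourcePath": "/products/{id}"}',
        '{"requestId": "c", "status": "200", "resourcePath": "/products/{id}"}',
        '{"requestId": "d", "status": "429", "resourcePath": "/products/{id}"}',
    ]
    for status, count in count_4xx_by_status(sample).most_common():
        print(status, count)
```

A breakdown like this makes it visible at a glance whether an insight was caused by throttling (429) or by requests for missing resources (404).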
To provoke the HTTP 404 "not found" error we simply have to query continuously for a non-existent product id, for example:
hey -q 3 -z 10m -c 5 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/200
After several minutes DevOps Guru will recognize this anomaly and create an insight. As CloudWatch doesn't differentiate between HTTP 4XX errors, the insight will look exactly like the one for the HTTP 429 errors explained above.
There is room for improvement here: HTTP 404 errors are application errors, whereas HTTP 429 errors are more of an infrastructure issue, so more precise information delivered by CloudWatch and DevOps Guru would lead to much quicker remediation.
In this article we described how DevOps Guru was able to detect anomalies on API Gateway caused by throttling after exceeding the requests-per-day quota (a real-world scenario would rather be exceeding the request rate and burst limits) and by querying for a non-existent product id. In the next part of this series we'll explore anomaly detection on Lambda.