A key metric for measuring how well you handle system outages is the Mean Time To Recovery or MTTR. It's basically the time it takes you to restore the system to working conditions. The shorter the MTTR, the faster problems are resolved and the less impact your users would experience and hopefully the more likely they will continue to use your product!
And the first step to resolve any problem is to know that you have a problem. The Mean Time to Discovery (MTTD) measures how quickly you detect problems and you need alerts for this - and lots of them.
Exactly what alerts you need depends on your application and what metrics you are collecting. Managed services such as Lambda, SNS, and SQS report important system metrics to CloudWatch out-of-the-box. So depending on the services you leverage in your architecture, there are some common alerts you should have. And here are some alerts that I always make sure to have.
You might have noticed the regional metrics (I know, the dashboard says
Account-level even though its own description says it's "in the AWS Region") in the Lambda dashboard page.
ConcurrentExecutions is an important metric to alert on. Set the alert threshold to ~80% of your current regional concurrency limit (which starts at 1000 for most regions).
This way, you will be alerted when your Lambda usage is approaching your current limit so you can ask for a limit raise before functions are throttled.
You may also wish to add alerts to the regional
Throttles metric. But this depends on whether or not you're using
Reserved Concurrency limits how much concurrency a function can use and throttling excess invocations shows that it's doing its job. But those throttling can also trigger your alert with false positives.
(Note: depending on the function's trigger, some of these alerts might not be applicable.)
Use CloudWatch metric math to calculate the error rate of a function - i.e.
100 * Errors / MAX([Errors, Invocations]). Align the alert threshold with your Service Level Agreements (SLAs). For example, if your SLA states that 99% of requests should succeed then set the error rate alert to 1%.
Unless you're using
Reserved Concurrency, you probably shouldn't expect the function's invocations to be throttled. So you should have an alert against the
For async functions with a dead letter queue (DLQ), you should set up an alert against the
DeadLetterErrors metric. This tells you when the Lambda service is not able to forward failed events to the configured DLQ.
Similar to above, for functions with Lambda Destinations, you should set up an alert against the
DestinationDeliveryFailures metric. This tells you when the Lambda service is not able to forward events to the configured destination.
For functions triggered by Kinesis or DynamoDB streams, the
IteratorAge metric tells you the age of the messages they receive. When this metric starts to creep up, it's an indicator that the function is not keeping pace with the rate of new messages and is falling behind. The worst-case scenario is that you will experience data loss since data in the streams are only kept for 24 hours by default. This is why you should set up an alert against the
IteratorAge metric so that you can detect and rectify the situation before it gets worse.
Even if you know what alerts you should have, it still takes a lot of effort to set them up. This is where 3rd-party tools like Lumigo can also add a lot of value. For example, Lumigo enables a number of built-in alerts (using sensible, industry-recognized defaults) for auto-traced functions so you don't have to manually configure them yourself. But you still have the option to disable alerts for individual functions should you choose to.
Here are a few of the alerts that Lumigo offers:
Predictions - when Lambda functions are dangerously closed to resource limits (memory/duration/concurrency, etc.)
Abnormal activity detected - invocations (increase/decrease), errors, cost, etc. See below for an example.
On-demand report of misconfigured resources (missing DLQs, wrong DynamoDB throughput mode, etc.)
Threshold exceeded: memory, errors, cold starts, etc.
Lambda's runtime crash
Furthermore, Lumigo integrates with a number of popular messaging platforms so you can be alerted prompted through your favorite channel.
Oh, and Lumigo does not charge extra for alerts. You only pay for the traces that you send to Lumigo, and it has a free tier for up to 150,000 traced invocations per month. You can sign up for a free Lumigo account here.
By default, API Gateway aggregates metrics for all its endpoints. For example, you will have one
5xxError metric for the entire API, so when there is a spike in 5xx errors you will have no idea which endpoint was the problem.
You need to
Enable Detailed CloudWatch Metrics in the stage settings of your APIs to tell API Gateway to generate method-level metrics. This adds to your CloudWatch cost but without them, you will have a hard time debugging problems that happen in production.
Once you have per-method metrics handy, you can set up alerts for individual methods.
When it comes to monitoring latency, never use Average. "Average" is just a statistical value, on its own, it's almost meaningless. Until we plot the latency distribution we won't actually understand how our users are experiencing our system. For example, all these plots produce the same average but have a very different distribution of how our users experienced the system.
Seriously, always use percentiles.
So when you set up latency alerts for individual methods, keep two things in mind:
Use the Latency metric instead of IntegrationLatency. IntegrationLatency measures the response time of the integration target (e.g. Lambda) but doesn't include any overhead that API Gateway adds. When measuring API latency, you should measure the latency as close to the caller as possible.
Use the 90th, 95th, or 99th percentile. Or maybe use all 3, but set different threshold levels for them. For example, p90 Latency at 1 second, p95 Latency at 2 seconds, and p99 Latency at 5 seconds.
When you use the Average statistic for API Gateway's
5XXError metrics you get the corresponding error rate. Set up alerts against these to alert yourself when you start to see an unexpected number of errors.
When working with SQS, you should set up alerts against the
ApproximateAgeOfOldestMessage metric for an SQS queue. It tells you the age of the oldest message in the queue. When this metric trends upwards, it means your SQS function is not able to keep pace with the rate of new messages.
There are a number of metrics that you should alert on:
They represent the various ways state machine executions would fail. And since Step Functions are often used to model business-critical workflows, I would usually set the alert threshold to 1.
Yes, you can!
Here at Lumigo, we have published an Open Source project called cloudwatch-alarms-macro to do just that. It's a macro that lets you define organization defaults for alert thresholds and generate alerts for resources you have in your CloudFormation stack.
The quickest way to get started is to deploy the macro to your AWS account via the Serverless Application Repository here. After that, you can follow the examples to configure your default thresholds and override them per stack.
Get a free Lumigo account and set up alerts instantly: Sign up to Lumigo.
Originally published at https://lumigo.io on August 17, 2020.