DEV Community

Cover image for Generating cloudwatch alarms using 'metric math' via CloudFormation and Terraform.
Simon Hanmer for AWS Community Builders

Posted on • Edited on

Generating cloudwatch alarms using 'metric math' via CloudFormation and Terraform.

I spend a lot of time working as a consultant with GlobalLogic UK&I with different client teams to deploy AWS infrastructure, and not surprisingly, I see differing levels of maturity and experience within these teams.

While we work with teams with a lot of knowledge, often they concentrate on deploying the applications and infrastructure, but they won't think about how they can understand how well an application is working. This is an important aspect of working within the Cloud, usually termed monitoring or observability.

There are several well-known 3rd party tools, such as Splunk or DataDog, but people often forget there is a handy set of tools provided by AWS under the CloudWatch banner. CloudWatch provides a range of services, such as exposing metrics, linking them to alarms, and X-ray services, to allow us to understand the performance of services in more depth.

CloudWatch metrics

Most services that we can use within AWS, such as EC2, Lambdas, DynamoDB, etc., will emit metrics that we can use to understand how well a resource is performing.

For example, if we're deploying EC2 instances, we can monitor many aspects of the instance, such as

  • CPU utilisation,
  • Disc read and write metrics,
  • Network traffic.

While we're working with lambda functions, we can monitor

  • Number of invocations, including how many concurrent invocations,
  • The duration of a function run,
  • How many errors are generated?
  • How many invocations were throttled (for example, if we exceeded the allowed number of lambda invocations in an account)?

Using the console, we can view historical graphs of our metrics, as shown below:


Cloudwatch metrics graph for a lambda function

CloudWatch metrics graph for a lambda function

This graph shows the metrics for a lambda function MyLambdaFunction, specifically how many times the function has run (invocations), along with how long the lambda takes to run (duration).

CloudWatch Alarms

Once we're aware of the metrics available for a particular resource, we can use them to set alarms. For example, we might want to be notified if the CPU usage on one of our EC2 instances went above 75% or if one of our lambdas was being throttled. Alarms have associated actions that are triggered when a metric goes into an alarm state or when it returns to normal, and we can, for example, send a message to a SNS topic, trigger a lambda to resolve something, or if we're using an EC2 metric, perform EC2 actions such as stopping or rebooting the instance.

If we're graphing a metric through CloudWatch, we'll have the option to create an alarm directly or create one from the console. When we create the alarm, we'll be asked for the following:

  • A name and description
  • Which namespace we are interested in, such as EC2 or Lambda
  • Which metrics do we want to monitor, i.e., CPU utilisation, throttle, etc.?
  • What is the dimension for the metric, i.e., which instance, DynamoDB table, function name, etc.?
  • How we'll measure the metric, such as using averages, sums, or minimum or maximum values. The AWS documentation for each service will explain the best statistics to use; for example, the Lambda documentation can be found here.
  • What period do we want to monitor, and how many anomalies do we need to trigger the alarm?
  • What the threshold is for triggering the alarm—this is a value and condition such as GreaterThanThreshold or LessThanThreshold.
  • What actions should be triggered when an alarm occurs or when it returns to its normal state?

Calculated alarms using metric math.

Generally, we want to alarm on the absolute value of a metric, for example, when CPU utilisation exceeds a certain amount or if a lambda is being throttled. However, sometimes we'll be more interested in a relative value—maybe the percentage of runs that generate errors. A single error is important if a lambda runs once or twice, but much less so if the lambda runs thousands of times a day.

Where this isn't available as a metric, we can often calculate them using what is known as metric math. So for example, with the example above, we can calculate the number of errors as a percentage of invocations

Rather than looking at the absolute value of the Error metric as in


Alarm configuration using absolute values

Alarm configuration using absolute values

we can use expressions to calculate a value


Alarm configuration using expressions

Alarm configuration using expressions

Deploying Alarms via Infrastructure as Code.

Of course, best practices today dictate that we should be deploying our infrastructure as code, using tools such as CloudFormation or Terraform.

Note: The examples below are not full deployments, they only show how to create alarms.

Deploying an alarm with absolute values using CloudFormation

The yaml CloudFormation below will deploy an alarm which will trigger when a lambda is throttled.

In this case, the definition is reasonably simple as we are looking at a single metric, in this case the Throttles metric in the AWS/Lambda namespace.

It depends on the following references

Variable Name Description
LambdaFunctionName The name of the deployed Lambda
LambdaThrottlePeriod What period should we use when assessing if a throttle has occurred
LambdaThrottleThreshold How many throttles do we need to see before triggering the alarm
aws_sns_topic.ErrorAlarmTopic The SNS Topic to send alarms to


rLambdaThrottledAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "lambda-function-${pLambdaFunctionName}-throttled"
AlarmDescription: !Sub "lambda-function-${pLambdaFunctionName} is throttled"
MetricName: Throttles
Namespace: AWS/Lambda
Dimensions:
- Name: FunctionName
Value: !Ref pLambdaFunctionName
Period: !Ref pLambdaThrottlePeriod
Statistic: Sum
EvaluationPeriods: 1
Threshold: !Ref pLambdaThrottleThreshold
ComparisonOperator: GreaterThanThreshold
TreatMissingData: ignore
AlarmActions:
- !Ref rAlarmTopic
OKActions:
- !Ref rAlarmTopic

Enter fullscreen mode Exit fullscreen mode




Deploying an alarm with absolute values using Terraform

The Terraform below will deploy an alarm which will trigger when a lambda is throttled. It depends on the following variables and resources:

Variable Name Description
LambdaFunctionName The name of the deployed Lambda
LambdaThrottlePeriod What period should we use when assessing if a throttle has occurred
LambdaThrottleThreshold How many throttles do we need to see before triggering the alarm
AlarmTopic The SNS Topic to send alarms to


resource "aws_cloudwatch_metric_alarm" "LambdaThrottledAlarm" {
alarm_name = "lambda-function-${var.LambdaFunctionName}-throttled"
alarm_description = "lambda-function-${var.LambdaFunctionName} is throttled"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
metric_name = "Throttles"
namespace = "AWS/Lambda"
period = var.LambdaThrottlePeriod
treat_missing_data = "ignore"
statistic = "Sum"
threshold = var.LambdaThrottleThreshold
alarm_actions = [aws_sns_topic.AlarmTopic.arn]
ok_actions = [aws_sns_topic.AlarmTopic.arn]

dimensions = {
FunctionName = var.pLambdaFunctionName
}
}

Enter fullscreen mode Exit fullscreen mode




Deploying an alarm with calculated values using CloudFormation

This time, the template is more complex. We cannot simply add the metrics to be measured, but we also need to include the expression that will be calculated and use that as the trigger for the alarm.

Note: We use two intermediate metrics, errors and invocations which we then reference in the expression. Both of the intermediate metrics are tagged with ReturnData: false to indicate they are intermediates.



rMyLambdaErrorsAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "lambda-function-${pLambdaFunctionName}-errors"
AlarmDescription: !Sub "lambda-function-${pLambdaFunctionName} errors exceed ${pLambdaErrorThreshold} %"
Metrics:
- Id: error_percentage
Label: Errors
Expression: (errors/invocations)*100
- Id: errors
Label: input
ReturnData: false
MetricStat:
Metric:
Namespace: AWS/Lambda
MetricName: Errors
Dimensions:
- Name: FunctionName
Value: !Ref pLambdaFunctionName
Period: !Ref pLambdaErrorPeriod
Stat: Sum
Unit: Count
- Id: invocations
Label: input
ReturnData: false
MetricStat:
Metric:
Namespace: AWS/Lambda
MetricName: Invocations
Dimensions:
- Name: FunctionName
Value: !Ref pLambdaFunctionName
Period: !Ref pLambdaErrorPeriod
Stat: Sum
Unit: Count
EvaluationPeriods: 1
Threshold: !Ref pLambdaErrorThreshold
ComparisonOperator: GreaterThanThreshold
TreatMissingData: ignore
AlarmActions:
- !Ref rErrorAlarmTopic
OKActions:
- !Ref rErrorAlarmTopic

Enter fullscreen mode Exit fullscreen mode




Deploying an alarm with calculated values using Terraform

Again, this is more complex than the Terraform used to deploy the alarm based on an absolute value, because again we need to use an expression to calculate the value for the alarm to monitor.



resource "aws_cloudwatch_metric_alarm" "MyLambdaErrorsAlarm" {
alarm_name = "lambda-function-${var.LambdaFunctionName}-errors"
alarm_description = "lambda-function-${var.LambdaFunctionName} errors exceed ${var.LambdaErrorThreshold} %"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
threshold = var.LambdaErrorThreshold
treat_missing_data = "ignore"
alarm_actions = [aws_sns_topic.AlarmTopic.arn]
ok_actions = [aws_sns_topic.AlarmTopic.arn]

metric_query {
id = "error_percentage"
expression = "(errors/invocations)*100"
label = "Errors"
return_data = true
}

metric_query {
id = "errors"
metric {
metric_name = "Errors"
namespace = "AWS/Lambda"
period = var.LambdaErrorPeriod
stat = "Sum"
unit = "Count"

  <span class="nx">dimensions</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">FunctionName</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">LambdaFunctionName</span>
  <span class="p">}</span>
<span class="p">}</span>
<span class="nx">return_data</span> <span class="p">=</span> <span class="kc">false</span>
Enter fullscreen mode Exit fullscreen mode

}

metric_query {
id = "invocations"
metric {
metric_name = "Invocations"
namespace = "AWS/Lambda"
period = var.LambdaErrorPeriod
stat = "Sum"
unit = "Count"

  <span class="nx">dimensions</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">FunctionName</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">LambdaFunctionName</span>
  <span class="p">}</span>
<span class="p">}</span>
<span class="nx">return_data</span> <span class="p">=</span> <span class="kc">false</span>
Enter fullscreen mode Exit fullscreen mode

}
}

Enter fullscreen mode Exit fullscreen mode




Metric and alarm pricing.

In this post, we're using standard metrics provided by AWS which are provided by AWS without charge.

Alarms are priced per metric, with each metric monitored costing $0.10 per month (depending on region). So our initial, absolute alarm would cost $0.10 per month as it monitors a single metric, whilst the calculated alarm would cost $0.20 per month. There is a free tier where the first 10 alarms per month are free.

Top comments (1)

Collapse
 
rdarrylr profile image
Darryl Ruggles

Great to see more examples using tools like Cloudwatch. People don’t need to send so much money to 3rd party tools.