This post was originally written on my own blog.
One of the important considerations when running an application is knowing when errors occur, whether the service is busy, and how your users are using the application, so you can build the best user experience and deliver more business value.
To do that, though, you need some monitoring, alerting and/or analytics tools to build up that picture. Recently, I helped a team which had built an application running in containers behind a load balancer. Some of the features they expected to be working were broken, and they were having trouble understanding what the problem was and where it was.
Running on AWS, we were already sending our application logs to a centralised log store, CloudWatch. To help them, I created a CloudWatch alarm which publishes to an SNS topic. A subscriber to that topic is a Lambda function. With Lambda, there are plenty of options for what you do with that data. For example, you could store the data in a DynamoDB table, evaluate the notification to decide if it's something that needs action (e.g. a scaling event), or send a Slack notification.
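To make that flow concrete, here is a minimal sketch of a Lambda handler reading the alarm out of the SNS event and deciding what to do with it. The `Records[].Sns.Message` shape and the `AlarmName`, `NewStateValue` and `NewStateReason` fields come from the notification CloudWatch sends; the `Action` type and the `scale-` naming rule are assumptions made up for illustration.

```typescript
// Only the parts of the SNS event and alarm message used in this sketch.
interface SnsEvent {
  Records: { Sns: { Message: string } }[];
}

interface AlarmMessage {
  AlarmName: string;
  NewStateValue: "OK" | "ALARM" | "INSUFFICIENT_DATA";
  NewStateReason: string;
}

// Hypothetical follow-up actions, mirroring the ideas in the text.
type Action = "store" | "scale" | "notify-slack";

// CloudWatch puts the alarm details in the SNS message body as a JSON string.
function parseAlarms(event: SnsEvent): AlarmMessage[] {
  return event.Records.map(
    (record) => JSON.parse(record.Sns.Message) as AlarmMessage
  );
}

// Hypothetical rule: alarms named "scale-..." trigger scaling, other alarms
// go to Slack, and anything not in ALARM state is only stored.
function chooseAction(alarm: AlarmMessage): Action {
  if (alarm.NewStateValue !== "ALARM") return "store";
  return alarm.AlarmName.startsWith("scale-") ? "scale" : "notify-slack";
}

export const handler = async (event: SnsEvent): Promise<Action[]> =>
  parseAlarms(event).map(chooseAction);
```

In a real function, each action would call out to DynamoDB, an autoscaling API or Slack; here the handler only returns the decision so the routing logic stays easy to test.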
To start, you need to create an alarm. If you've never done this before, you can use the console to get a feel for what kind of alarm metric you might be interested in. As an example, you could choose to be notified any time your Application Load Balancer responds with a 5XX HTTP status code. Then decide what counts as the "OK" state, so that your alarm is not constantly in the "ALARM" state.
For example, if you want to be notified in a Slack channel whenever your load balancer responds to the client with a 5XX error, you might choose a threshold of greater than or equal to 1 data point. Your "OK" state would then be no data points in the same time frame, because no data points for an HTTP response in the 500 range is a good thing (in the console, this is the option to treat missing data as good).
Once your alarm is set, you can choose to have CloudWatch send a message to the SNS topic of your choosing. So now you can set up your topic and subscribers.
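If you'd rather see the alarm as data than as console clicks, the settings described above could look roughly like this. The field names follow the CloudWatch `PutMetricAlarm` API, and `HTTPCode_ELB_5XX_Count` is the metric for 5XX responses generated by the load balancer itself (`HTTPCode_Target_5XX_Count` covers responses from your targets). The alarm name, period, load balancer identifier and topic ARN are placeholders I've assumed for the example.

```typescript
// A subset of the PutMetricAlarm parameters, typed locally for this sketch.
interface MetricAlarm {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Statistic: "Sum" | "Average" | "Minimum" | "Maximum" | "SampleCount";
  Period: number; // length of each evaluation window, in seconds
  EvaluationPeriods: number;
  Threshold: number;
  ComparisonOperator:
    | "GreaterThanOrEqualToThreshold"
    | "GreaterThanThreshold"
    | "LessThanThreshold"
    | "LessThanOrEqualToThreshold";
  TreatMissingData: "breaching" | "notBreaching" | "ignore" | "missing";
  AlarmActions: string[]; // ARNs notified when the alarm enters ALARM
}

const alb5xxAlarm: MetricAlarm = {
  AlarmName: "alb-5xx-errors", // hypothetical name
  Namespace: "AWS/ApplicationELB",
  MetricName: "HTTPCode_ELB_5XX_Count",
  Dimensions: [
    // Placeholder load balancer identifier
    { Name: "LoadBalancer", Value: "app/my-load-balancer/0123456789abcdef" },
  ],
  Statistic: "Sum",
  Period: 60, // assumed: evaluate one-minute windows
  EvaluationPeriods: 1,
  Threshold: 1, // one or more 5XX responses in the window
  ComparisonOperator: "GreaterThanOrEqualToThreshold",
  TreatMissingData: "notBreaching", // no data points counts as "OK"
  AlarmActions: [
    // Placeholder SNS topic ARN
    "arn:aws:sns:us-east-1:123456789012:alarm-notifications",
  ],
};
```

An object like this could be passed to whichever tool you use to create alarms. `TreatMissingData: "notBreaching"` is what keeps the alarm in "OK" while there are no 5XX data points.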
The next thing to do is create the Lambda function which sends the HTTP POST request to Slack with the contents of the CloudWatch alarm message. If you want to write your own, go for it! Alternatively, you're more than welcome to use my open source TypeScript lambda, which I wrote yesterday. Have a look here: https://github.com/jgunnink/cloudwatch-sns-slack-notifier. You'll need the Serverless Framework installed, the Slack endpoint URL you're going to be sending the message to, and permissions in your AWS account to deploy the lambda.
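If you do write your own, the core of it can be quite small. The sketch below is not the code from the repository above, just one way it might look: it turns an alarm message into the `{ text: ... }` JSON body that Slack incoming webhooks accept and POSTs it. The message wording, the emoji and the injected `HttpPost` function are my own assumptions; the HTTP call is passed in so the sketch has no dependencies and is easy to test.

```typescript
// The alarm fields used to build the Slack message.
interface AlarmMessage {
  AlarmName: string;
  NewStateValue: string;
  NewStateReason: string;
}

// The small slice of an HTTP client this sketch relies on.
type HttpPost = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number }>;

// Build the JSON body for a Slack incoming webhook.
function buildSlackPayload(alarm: AlarmMessage): { text: string } {
  const icon =
    alarm.NewStateValue === "ALARM" ? ":rotating_light:" : ":white_check_mark:";
  return {
    text: `${icon} *${alarm.AlarmName}* is now ${alarm.NewStateValue}\n${alarm.NewStateReason}`,
  };
}

// POST the payload to the webhook and fail loudly if Slack rejects it.
async function postToSlack(
  webhookUrl: string,
  alarm: AlarmMessage,
  post: HttpPost
): Promise<void> {
  const response = await post(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildSlackPayload(alarm)),
  });
  if (!response.ok) {
    throw new Error(`Slack webhook returned HTTP ${response.status}`);
  }
}
```

On a Node.js 18 or newer runtime, the global `fetch` should fit the `HttpPost` shape, so a handler could call `postToSlack(webhookUrl, alarm, fetch)` with the webhook URL read from configuration.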
With the lambda deployed, the SNS topic created and the CloudWatch alarm set, it's time to wire it all together.
- Choose the lambda as a subscriber to the SNS topic you've created.
- Choose the SNS topic as the action in your CloudWatch alarm.
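For anyone scripting the first of those steps rather than clicking through it, here is a rough sketch of the two pieces involved. The field names follow the SNS `Subscribe` and Lambda `AddPermission` APIs; the `wireTopicToLambda` helper, the statement id and the ARNs in the usage note are hypothetical.

```typescript
// Parameters for subscribing a Lambda function to an SNS topic.
interface SnsSubscription {
  TopicArn: string;
  Protocol: "lambda";
  Endpoint: string; // ARN of the subscribing function
}

// Resource-based permission that lets SNS invoke the function.
interface InvokePermission {
  FunctionName: string;
  StatementId: string;
  Action: "lambda:InvokeFunction";
  Principal: "sns.amazonaws.com";
  SourceArn: string; // restricts invocation to this one topic
}

// Hypothetical helper: derive both parameter sets from the two ARNs.
function wireTopicToLambda(
  topicArn: string,
  functionArn: string
): { subscription: SnsSubscription; permission: InvokePermission } {
  return {
    subscription: {
      TopicArn: topicArn,
      Protocol: "lambda",
      Endpoint: functionArn,
    },
    permission: {
      FunctionName: functionArn,
      StatementId: "allow-sns-invoke", // hypothetical statement id
      Action: "lambda:InvokeFunction",
      Principal: "sns.amazonaws.com",
      SourceArn: topicArn,
    },
  };
}
```

Calling `wireTopicToLambda("arn:aws:sns:us-east-1:123456789012:alarm-notifications", "arn:aws:lambda:us-east-1:123456789012:function:slack-notifier")` returns the two parameter objects. The console, and tools such as the Serverless Framework, typically add the invoke permission for you, so you only need to think about it if you wire things up by hand.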
You now have the beginnings of monitoring and alerting on AWS, which will help you better understand how your application is performing. Before you go, I'll leave you with some other examples you may want to consider:
- Depth of an SQS queue. If your application is processing a lot of messages and it's not able to keep up, you could decide to scale and bring up more infrastructure to help process the backlog.
- Specific endpoints being hit. If a request is made to a certain endpoint or path in your app, for example when someone has just made a purchase, you may want to store the details of your customer in a "Paying customers" table in DynamoDB so you can refer to it later and reach out from a customer service perspective.
- Average CPU utilisation of your scaling group. If it's too high, you could bring up additional nodes to meet demand.
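Two of those examples map directly onto metrics AWS publishes, so as a rough sketch they could be described like this. The namespaces, metric names and dimension names are the standard ones for SQS and EC2 Auto Scaling groups; the alarm names, queue and group names, thresholds and periods are assumptions you'd tune for your own workload. The endpoint example is left out because it needs a custom metric derived from your logs.

```typescript
// A trimmed-down alarm description, using PutMetricAlarm field names.
interface AlarmSketch {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Statistic: "Average" | "Maximum" | "Sum";
  Period: number; // seconds
  EvaluationPeriods: number;
  Threshold: number;
  ComparisonOperator: "GreaterThanThreshold" | "GreaterThanOrEqualToThreshold";
}

// Backlog building up in a queue (hypothetical queue name and threshold).
const queueDepthAlarm: AlarmSketch = {
  AlarmName: "orders-queue-backlog",
  Namespace: "AWS/SQS",
  MetricName: "ApproximateNumberOfMessagesVisible",
  Dimensions: [{ Name: "QueueName", Value: "orders" }],
  Statistic: "Maximum",
  Period: 300,
  EvaluationPeriods: 1,
  Threshold: 1000, // assumed: more than 1000 messages waiting
  ComparisonOperator: "GreaterThanThreshold",
};

// Sustained high CPU across a scaling group (hypothetical group name).
const cpuAlarm: AlarmSketch = {
  AlarmName: "web-asg-high-cpu",
  Namespace: "AWS/EC2",
  MetricName: "CPUUtilization",
  Dimensions: [{ Name: "AutoScalingGroupName", Value: "web-asg" }],
  Statistic: "Average",
  Period: 300,
  EvaluationPeriods: 2, // assumed: two consecutive five-minute windows
  Threshold: 70, // assumed: average CPU above 70%
  ComparisonOperator: "GreaterThanThreshold",
};
```

Either alarm can notify the same SNS topic as before, so the lambda on the other end decides whether to post to Slack, scale up, or both.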
Would love to hear any more examples you've got or how this could help you. Tweet at me with your ideas!