One of my jobs as a Site Reliability Engineer (SRE) is to ensure that our application has a solid monitoring system. You cannot guarantee reliability if you don't know when things are broken. Because of this, one of the first projects our SRE team added to its list when it formed was an overhaul of our monitoring system at Kenna.
In the Beginning...
Kenna's monitoring setup was a disaster. In retrospect, at least we had something, but it was far from ideal. Here is what we had going to monitor our infrastructure.
- New Relic for performance monitoring
- PagerDuty for application health monitoring
- Elastalert which used logs to alert on data discrepancies or site use anomalies
- Cron jobs that ran nightly and every 30 min looking for data anomalies
- Honeybadger for application/code errors
- Admin dashboards for background processing services like Sidekiq and Resque
The disaster doesn't end there! Not only did we have six different tools doing the monitoring, we also had them reporting to all sorts of different places.
- Slack channels - At our worst we had a different Slack channel for every individual environment, each with alerts being sent to it.
- SMS messaging
- Phone Call
As if all of those different alerting mediums weren't enough to make your head spin, the alerts we sent to all of them were incredibly inconsistent. Some alerts just reported data, but required no action. Many alerts would go off periodically and be false positives. And finally, some of the alerts actually needed someone to address them immediately.
Needless to say, those who were on-call were miserable! They had no idea what was important or what alerts were actionable. This was not a huge problem at first because most of our team had been around for a while and knew all the ins and outs of what alerts were relevant. However, as our team started to grow, we realized our monitoring system needed to change. Our newly minted SRE team quickly decided one of the first problems we were going to tackle was monitoring.
We overhauled our entire system over the course of a few months and the changes have paid off in spades. Here are the strategies we implemented that made a huge difference for our team.
Consolidate Monitoring To a Single Place
Everything has to be in one place. This becomes more important the larger your team gets. As more and more people join, it will be harder to onboard them if you have to teach them multiple systems. Teaching everyone a single system is much simpler. Then, when someone goes on-call, it's infinitely easier to tell them to open up a single webpage and that's it. You can have multiple reporting tools, but you need to send all their alerts through a single interface.
I'm sure you are wondering what the one place we consolidated to was. It was Datadog. We chose Datadog because it could hook into EVERY other service we had, which meant everything could live in one place. We hooked it up to Honeybadger to track application errors and to CircleCI to alert us of deploy failures. You name it, we run it through Datadog now. When someone goes on call, all they have to do is open up the Datadog monitoring page to know the state of the application.
Now, Datadog is not your only option. There are lots of other companies doing similar things to Datadog that you can look into. Here are a couple links to other recommendations.
You could also just hand roll your own. We considered this at one point because we had already created so many of our own alerting tools in our application. However, we decided against it because we did not have the time or resources to do it right.
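If you do hand roll something, the "single interface" idea might look roughly like the sketch below. This is only a minimal Python illustration, not what we run at Kenna: the `Alert` fields, the source names, and the payload shapes are all made up for the example, and real tools will send something different.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str       # which tool raised the alert
    title: str        # short human-readable summary
    actionable: bool  # does someone need to do something about it?


def from_error_tracker(payload: dict) -> Alert:
    # Hypothetical payload shape: {"error_class": ..., "environment": ...}
    return Alert(
        source="error_tracker",
        title=f"{payload['error_class']} in {payload['environment']}",
        actionable=True,
    )


def from_ci(payload: dict) -> Alert:
    # Hypothetical payload shape: {"job": ..., "status": ...}
    return Alert(
        source="ci",
        title=f"{payload['job']} {payload['status']}",
        actionable=payload["status"] == "failed",
    )


# One normalizer per tool; every tool ends up as the same Alert shape.
NORMALIZERS = {"error_tracker": from_error_tracker, "ci": from_ci}


def ingest(source: str, payload: dict) -> Alert:
    """Convert any tool's payload into the one shape on-call people see."""
    return NORMALIZERS[source](payload)
```

The point of the design is that whoever is on-call only ever reads `Alert` records, no matter how many tools sit behind `ingest`.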
Make ALL Alerts Actionable
The moment you let one piece of noise through, you set a precedent for everything else to be ignored. I cannot stress this point enough! Once you start letting false positives be ignored, you can very quickly forget what is important and what is not. If an alert goes off and there is no action to be taken, then that alert should not have gone off in the first place. If you want things to alert that are not actionable, you need to put them in a separate place far away from the actionable items.
For example, one way to accomplish this is with two different Slack channels. You could have a Slack channel for alerts that must be addressed and a second one that is just for status reports. However you choose to do it, make sure action items are separate from their "no action needed" counterparts.
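Here is a minimal Python sketch of that two-channel split. The channel names and the `(message, actionable)` pair format are assumptions for illustration, and the sketch only decides where each alert should go; actually posting to Slack is left out.

```python
ACTION_CHANNEL = "#alerts-action-required"  # hypothetical channel name
STATUS_CHANNEL = "#alerts-status-reports"   # hypothetical channel name


def channel_for(actionable: bool) -> str:
    """Pick the destination channel for a single alert."""
    return ACTION_CHANNEL if actionable else STATUS_CHANNEL


def route(alerts):
    """Group (message, actionable) pairs by destination channel."""
    routed = {ACTION_CHANNEL: [], STATUS_CHANNEL: []}
    for message, actionable in alerts:
        routed[channel_for(actionable)].append(message)
    return routed
```

Forcing every alert to declare whether it is actionable up front is what keeps the status reports from leaking into the channel people are paged on.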
Make Sure Alerts Are Mutable
This was huge for us! A lot of our hand rolled alerts in the beginning would trigger every 30/60/90 minutes. Even if we had acknowledged the alert and were working to fix it, it would still ping us. Nothing is more frustrating than trying to fix a problem while an alarm is blaring in your ear.
We first tried to solve this problem by making our hand rolled alerts mutable. This kinda worked, but had to be done through a console and the commands were not intuitive. In addition, we had alerts from tools like Elastalert that we couldn't mute. Given all the inconsistencies across all our tools, it was a breath of fresh air when we gained that functionality with Datadog.
Not only do you want alerts to be mutable, ideally, you want to be able to mute them for a specific timeframe. Nothing is worse than muting an alert, fixing the problem, and then forgetting to unmute it afterwards. This can, and likely will, lead to missed alerts at some point.
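To make the idea of a time-boxed mute concrete, here is a small Python sketch. It is not how Datadog implements muting; the `MuteList` class and its method names are hypothetical. The current time is passed in explicitly so the behavior is easy to reason about.

```python
from datetime import datetime, timedelta


class MuteList:
    """Tracks muted alerts, each with its own expiry time."""

    def __init__(self):
        self._muted_until = {}  # alert name -> datetime when the mute expires

    def mute(self, name: str, minutes: int, now: datetime) -> None:
        """Mute `name` for `minutes` minutes, starting at `now`."""
        self._muted_until[name] = now + timedelta(minutes=minutes)

    def is_muted(self, name: str, now: datetime) -> bool:
        """Return True only while the mute for `name` has not expired."""
        expires = self._muted_until.get(name)
        if expires is None:
            return False
        if now >= expires:
            # The mute has run out, so drop it; nobody has to remember to unmute.
            del self._muted_until[name]
            return False
        return True
```

Because every mute carries an expiry, a forgotten mute fixes itself instead of silently hiding the next incident.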
Track Alert History
This is one of those things you don't think about until you are staring at an alert and have no idea what is causing it. A lot of times, in order to figure out the cause of an alert you need to know what the previous behavior was. If you have history for an alert you can do this. By going back and looking for trends in data, you can get a better picture of the situation, which can help when it comes to finding the root cause.
Having alert history can also help you spot trends and find problems even before an alert is triggered. For example, let's say you are tracking database load. If you suddenly experience a large amount of growth, you can refer to your monitoring history for that alert to gauge what the load on the database is and whether you are approaching the alert threshold. You can then use this information to get ahead of the alert before it even goes off.
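As a rough illustration of getting ahead of an alert, the Python sketch below projects when a metric will reach its threshold from evenly spaced history samples. The function name and the simple average-slope projection are my own assumptions for the example; a real monitoring tool will likely offer something more sophisticated.

```python
def intervals_until_threshold(history, threshold):
    """Estimate how many more sampling intervals until `threshold` is reached.

    `history` is a list of evenly spaced samples, oldest first. The estimate
    uses the average change per interval across the whole history. Returns 0
    if the latest sample is already at or above the threshold, and None if the
    metric is flat or falling, since no crossing is projected.
    """
    if len(history) < 2:
        raise ValueError("need at least two samples to see a trend")
    current = history[-1]
    if current >= threshold:
        return 0
    slope = (history[-1] - history[0]) / (len(history) - 1)
    if slope <= 0:
        return None
    return (threshold - current) / slope
```

For example, with hourly database load samples of `[40, 45, 50, 55, 60]` and an alert threshold of 80, the load is growing by 5 per hour, so the projection is about 4 more hours before the alert would fire.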
Overhauling our monitoring system has paid off in many ways. For starters, on-call developers are a lot happier! By removing any ambiguity around which alerts were important and which weren't, we took a lot of confusion out of being on-call. We also removed a lot of noise. No one wants their phone buzzing all night long when they are on-call. Removing those false positives fixed this issue.
Since all of the monitoring is now in a single place, it is straightforward and easy for developers to understand and learn. This ease of use has caused a lot of developers to contribute to the alerting effort by making their own alerts and improving on the ones we already have in place. Having a reliable, easy to use system gave developers a good reason to buy into it and join the effort to improve it.
I hope that as you are building your own alerting and monitoring systems you keep these strategies in mind to help you build something that is enjoyable for everyone to work with!
Top comments (12)
Hi Molly, nice article! Consolidating monitoring tools is definitely a plus and for the on call people probably even something to preserve their sanity :D
On the note of consolidation and observability of a distributed system, have you had the chance to take a look at Honeycomb? I'd love to read your take on it.
I just looked at Honeycomb for the first time and it looks pretty nice! Couple things I consider when choosing a monitoring solution
Thanks for the reply!
NewRelic has a weird pricing model as well.
About HoneyComb, you should follow its CEO regardless, she's a very interesting voice in the observability/monitoring landscape: twitter.com/mipsytipsy
I think its Ruby client library is still raw (they are a Go shop IIRC) but the idea behind it is super solid.
Oh yeah, I do follow her!!! She seems pretty awesome 😃
Good post. I have this issue, too... there are too many alerts that show up that are false positives, which causes us to not trust any of the alerts.
I'm going to check out DataDog. I've tried to use Azure Application Insights for this, but it gets expensive really quickly.
Nice post !
I'm happy to see we're not the only ones to use DataDog for our monitoring. We use it as a central console, and with a few Twilio Webhooks we are now able to have customized Voice Call & SMS alerting for whoever is on-call.
But still, you have to be careful and closely manage your DataDog setup, otherwise it will end up like all the others: clogged and false-alerting for pretty much everything. To avoid that, we conduct a monthly review of the alerts and tune them so they are as accurate as possible :-)
So true, you definitely have to stay on top of all your alerts. I like the idea of a monthly review, I will keep that in mind. It is nice though when you do that monthly review you only have one place you have to go 😃
Good to see yet another SRE team taking ownership of monitoring! At HelloFresh we were in the same scenario, actually we had tons of infrastructure and product services w/out any sort of monitoring.
With our move to k8s, anything that runs on top of it can leverage system metrics (CPU, Memory, Network etc...). Services that expose HTTP endpoints at the k8s edge (ingress) can have RED metrics (Req, Err, Duration) automatically. Since edge metrics are common, we were able to automate away dashboards by creating one general dashboard that lets ppl filter by service name. Automating alerts was also possible.
We truly believe that w/out monitoring, software ownership is not possible. Now incidents are detected (MTTD) and recovered from (MTTR) much faster.
We tune alerts religiously, TBH I don't even know how we could be flying w/out the monitoring we have nowadays
Right?! Once you have a good monitoring system in place it's hard to envision life without it!
Some very good points on why having detailed visibility into an application is important, thanks Molly!
We were also exploring Datadog, which is comprehensive indeed but quickly ends up being very expensive, since it bills based on the number of hosts / APMs.
For small / medium scale organizations, cost is a major factor, so we eventually discovered AppSignal, which bills you only on the number of requests (both web & background) rather than hosts, with very effective alerting & a 30 day retention. Their pricing is also very affordable. They have lots of 3rd party add-ons integrated right out of the box, plus you can always add custom webhooks should you need more.
The downside? It's currently available only for Ruby / Elixir apps.
PS: They have a fantastic sales & engineering team who'll help you with any kind of query, so it's worth checking out.
Awesome! Thanks for the alternative suggestion and insight 😊
I've built a website monitoring tool that might be of interest to some of you, take a look here