Reducing alert fatigue starts with your monitoring platform: setting the right thresholds to trigger alerts and understanding which of them are essential enough to be sent to your on-call platform is a start. This post outlines some of the best practices that help you reduce alert noise and improve your on-call experience.
The word noise implies something unpleasant and unwanted. Combine that with on-call and it adds a layer of annoyance to an already overwhelming process.
And this feeling doesn’t change whether you’re an old hand at on-call or just starting out. It’s difficult to stay motivated and on top of things, especially when you get a ping or your phone rings for an incident that should never have paged you in the first place.
Sometimes, the louder-than-life phone call is just a “CPU has hit 50% usage” alert that you shouldn’t even be worried about. The sheer frequency of informational alerts can drown out the valid critical ones. This is called alert fatigue.
This post outlines the different ways you can minimize alert fatigue and ensure that you don’t get woken up for alerts that can wait.
- Setting the right alerts
Collecting metrics is an essential part of improving your observability, and hence your reliability. However, just because your sophisticated monitoring / observability platform can monitor 273 parameters doesn’t mean you need to set up alerts on all of them. Set up meaningful alerts that are core to your system’s reliability, and collect the rest as non-alerting data that can be used for preemptive analysis. This way, only alerts that need immediate action trigger a notification from your on-call tool, while the rest are simply recorded to add context.
You can always check your SLO metrics, on-call command centre or monitoring dashboard (as frequently as makes sense to you) to gauge the overall health of the system.
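For instance, assuming a Prometheus setup (the rule names and metric expressions below are illustrative, not taken from any real configuration), the split between actionable alerts and context-only data might look like this:

```yaml
groups:
  - name: api-reliability
    rules:
      # Actionable: pages on-call only when users are actually affected.
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
      # Context only: stored via a recording rule, never paged on.
      - record: job:cpu_usage:avg5m
        expr: avg by (job) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```

Only `HighErrorRate` would ever reach the on-call tool; the recording rule just stores CPU data for dashboards and preemptive analysis.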
- Setting the right threshold
Even if you set up the right alerts, you can end up with a flood of notifications only to see the metric return to normal quickly, usually due to flapping or a temporary spike in user activity during the busiest part of the day.
In these cases, observe the behavior for a while and raise the alert threshold slightly above the usual flapping range. Say you have a CPU alert at 70% but the value regularly flaps between 69.8% and 70.8%; you can safely move the threshold to 71% or even 72%, so that flapping and temporary spikes don’t generate unnecessary alerts.
It’s also a good idea to set up incremental alerts. For the same example, an additional alert at 80% CPU usage tells you that something is genuinely abnormal when it fires. This could be a sudden increase in users or system load that requires scaling your infrastructure. If these incremental alerts fire consistently, it’s a clear indication that urgent action is needed.
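Continuing the CPU example, a sketch in Prometheus syntax (assuming `instance:cpu_usage:avg5m` is a percentage produced by a recording rule; both names are hypothetical) could combine the raised threshold with an incremental one:

```yaml
groups:
  - name: cpu-thresholds
    rules:
      # Raised from 70% to 72% to sit above the observed flapping band,
      # and held for 15m so short spikes don't page anyone.
      - alert: CPUUsageHigh
        expr: instance:cpu_usage:avg5m > 72
        for: 15m
        labels:
          severity: warning
      # Incremental alert: 80% signals genuine load growth that may need scaling.
      - alert: CPUUsageCritical
        expr: instance:cpu_usage:avg5m > 80
        for: 5m
        labels:
          severity: critical
```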
Setting the right alerts and the right threshold values in your monitoring tool will eliminate a lot of noise. Another layer of noise reduction and alert optimization can be done in your on-call tool.
Here we focus on Squadcast-specific features that can help you reduce alert noise. However, you should find similar options in whichever incident management or on-call tool you currently use. If you’re yet to decide on an on-call tool, it’s wise to check whether it supports alert noise reduction before committing to one.
- Merging Duplicate Incidents
In most cases, the same alert comes in repeatedly, and if it’s configured to notify your on-call team, it gets annoying very quickly.
For example, say there is a Prometheus alert rule that checks disk usage every hour and fires if it is above 50%, and another rule that fires if disk usage is above 70%.
The first rule is a heads-up that disk usage has crossed the halfway mark, so in the coming days or weeks you should clean up log files to free space, or add more storage capacity. The second rule tells you to do that cleanup or capacity addition immediately, within a few hours, in order to maintain your system reliability.
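In Prometheus syntax, these two hypothetical rules (the metric name `disk_used_percent` is an assumption for illustration) might be written as:

```yaml
groups:
  - name: disk-usage
    interval: 1h   # evaluate hourly, as in the example above
    rules:
      - alert: DiskUsageWarning    # heads-up: plan cleanup in days/weeks
        expr: disk_used_percent > 50
        labels:
          severity: warning
      - alert: DiskUsageCritical   # act within hours
        expr: disk_used_percent > 70
        labels:
          severity: critical
```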
However, having the hourly 50% alert fire until usage crosses the 70% mark is not just annoying but also unhelpful, especially if it takes more than a couple of days to reach the next level. To stop these warning alerts from constantly calling the on-call engineer, define deduplication rules so the on-call system knows how to merge duplicate events and notify only the first time.
In Squadcast, for each monitoring tool attached to a service, you can set up deduplication rules based on any key-value pair in the alert JSON: the incident title, description, hostname or any other available field. You define these rules based on your monitoring needs, and Squadcast provides the platform to configure them your way, with support for regex matching, logical operations and combinations of multiple operations.
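As an illustrative sketch only (this is not exact Squadcast syntax, and the field names depend on your alert JSON), a deduplication rule for the disk-usage example could look like:

```
If, (past_incident.payload.alertname == incident.payload.alertname &&
     past_incident.payload.hostname == incident.payload.hostname)
then deduplicate into the existing open incident (time window: 24 hours)
```

With a rule like this, the hourly “disk above 50%” events all merge into one incident, and on-call is notified only once.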
- Set up Tagging to route incidents to the right person(s)
Each service has its own team and an escalation matrix associated with it. However, not all alerts are equal: some are less important, some are critical, some need people from different teams, some must go to the customer-facing team, and some require management involvement.
So, apart from the default escalation policy associated with a service, you can use our Tagging rules as an engine to classify incidents and automatically route them to the right responder. Similar to the deduplication rules above, you can set up key:value tags based on the alert JSON and assign a color to each tag. You can then use a tag to override the default escalation policy and replace it with a user, a different escalation policy or a squad.
This opens up a lot of possibilities in the way you handle incident management today. Just to explain the extent of flexibility this provides, here are a few examples:
Example 1. Default escalation policy for the service:
Service: Infrastructure (SRE)
1st Layer - Primary on-call person(s)
2nd Layer - Secondary on-call person(s)
3rd Layer - The entire SRE squad
4th Layer - Management
Let's say a CPU alert of 70% usage is received for your backend or billing systems. Note that this is a high-severity incident and definitely not the same as the “CPU usage above 50%” alert: here, your application is not able to serve your users and the billing portal isn’t functioning. This needs to be resolved immediately, with an SME involved. Waiting for the incident to progress through the regular on-call escalation would only delay resolution and worsen the customer impact. You can set up your tagging and routing rules to accommodate such high-severity scenarios. Here’s what the Tagging and Routing rules would look like:
If, (payload.meta.cpu >= 70 && re(payload.meta.hostname, "^backend-server.*")) then set tags, severity:critical (color:Red) notify:sre-team
If, (payload.meta.cpu >= 70 && re(payload.meta.hostname, "^billing.*")) then set tags, severity:critical (color:Red) notify:billing-critical-escalation
If, (tags.severity = "critical" && tags.notify = "sre-team") then route incident to, sre-squad
If, (tags.severity = "critical" && tags.notify = "billing-critical-escalation") then route incident to, Critical Billing Escalation policy
In this example, if the backend server reaches a critical level of CPU usage, the entire SRE squad is notified immediately. If it’s the billing server, a Critical Billing Escalation policy is notified, which may differ from the service’s default escalation policy shown above.
Example 2. In Example 1, we saw how the entire team is notified for a critical incident. Here, we implement a similar solution for less severe incidents, choosing to notify just one person instead of an entire escalation policy or team. This is an actual use case we follow within Squadcast.
We set up our MongoDB Atlas alerts, specifically for query targeting:
If the query targeting value is less than 2000, the Tag “severity:low” is attached to the incident and it is automatically routed to the junior engineer responsible for optimizing the database queries.
If the query targeting value is above 2000, the Tag “severity:high” is attached to the incident and it is automatically routed to the senior engineer who will then optimize the complex database queries.
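In the same expression style as Example 1 (the payload field names and metric key below are assumptions about the MongoDB Atlas alert JSON, not its exact schema), these rules could be sketched as:

```
If, (payload.metricName == "QUERY_TARGETING" && payload.currentValue < 2000) then set tags, severity:low
If, (payload.metricName == "QUERY_TARGETING" && payload.currentValue >= 2000) then set tags, severity:high
If, (tags.severity == "low") then route incident to, junior-db-engineer
If, (tags.severity == "high") then route incident to, senior-db-engineer
```

Note that `>= 2000` in the second rule also covers the boundary value of exactly 2000.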
These are just two of many ways you can choose to use Tagging and Routing rules. This will help you streamline your incident response process and get your MTTR down significantly.
- Suppress not-so-important incidents
If you still want some alerts sent to your on-call tool, ones that are worth recording but need not alert anybody, you can set up suppression rules in Squadcast.
You can define a suppression rule based on the content of the message or description of the incident. Any incident for that specific service matching the configured rules will be suppressed and nobody will be notified; the incident will still be recorded in Squadcast for future reference.
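As an illustrative sketch (again, not exact Squadcast syntax; the field names and match strings are hypothetical), a suppression rule might look like:

```
If, (re(payload.message, "backup job completed") || payload.severity == "info")
then suppress incident (recorded in Squadcast, nobody notified)
```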
Similarly, you can set up maintenance mode (one-time or recurring) for a service and any alerts for the service during such maintenance windows will be automatically suppressed.
We hope these practices help you reduce alert noise and improve your on-call experience. We’d love to hear from you about other best practices that can improve on-call.
Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.