Squadcast Inc for Squadcast

Posted on • Originally published at squadcast.com

Reduce Toil with Better Alerting Systems

If not tackled early, increasing toil can affect the morale and productivity of your SRE team. In this blog, we look at ways to counter toil by putting better alerting systems in place.

Are you an SRE or On-call engineer struggling to manage toil?

Toil is any repetitive or monotonous activity that can lead to frustration within an incident management team. At the business level, toil adds no functional value toward growth or productivity.

However, toil can be tackled with simple but effective automation strategies at every stage of the incident management process.

In this blog, we dig deeper into how to reduce toil by defining better alerting strategies within an alert management system.

Toil Defined

Google’s SRE workbook defines toil as,

"the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."

To reduce toil, we should first learn its characteristics (identify it) and then calculate the time spent resolving incidents manually (measure it).

Ways to Identify and Measure Toil

Identifying toil means understanding the overall characteristics of a routine task. It can be done by evaluating a task on the basis of:

  • what type of work is involved
  • who is responsible for executing it
  • how it can be completed, and
  • whether it is easy (less than an hour), medium (a few hours), or hard (up to a day) in terms of difficulty

Measuring toil is simply computing the human time spent on each toilsome activity.

It is done by analyzing trends in:

  • on-call incident response
  • tickets, and
  • survey data

With this analysis, we can prioritize toil and strike a balance between production work and routine operational tasks.
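As a minimal illustration, measuring toil from ticket data can be as simple as summing the minutes spent on each recurring task. The ticket records and task names below are hypothetical, a sketch of the idea rather than a real export:

```python
from collections import defaultdict

# Hypothetical ticket records: (task_name, minutes_spent) pairs pulled
# from an on-call log or ticketing-system export.
tickets = [
    ("restart-stuck-worker", 15),
    ("rotate-expired-cert", 40),
    ("restart-stuck-worker", 20),
    ("ack-disk-space-alert", 5),
    ("restart-stuck-worker", 25),
]

def measure_toil(records):
    """Sum the human minutes spent on each recurring task."""
    totals = defaultdict(int)
    for task, minutes in records:
        totals[task] += minutes
    # Highest-cost tasks first: these are the automation candidates.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(measure_toil(tickets))
# restart-stuck-worker tops the list at 60 minutes of manual work
```

Ranking tasks by accumulated minutes makes the prioritization described above concrete: the tasks at the top of the list are the first candidates for automation.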

Note: The goal is to ensure that toil does not occupy more than 50% of an SRE’s time. This keeps the team focused on engineering work that improves production.

Before we look into the causes of toil in detail, let’s briefly review its after-effects.

Effects of Toil

Whether it is an incident management task or any other activity, doing the same task repeatedly over a long period often breeds discontent with the job.

In some cases, toil even increases the attrition rate: burnout, boredom, and alert fatigue among SREs can eventually slow down the overall development process.

Let's find out ways to reduce toil by first looking into the various causes that contribute to it.

Causes of Toil Across an Alerting System

(1) Lack Of Automation In Alert Management Systems

If alerts are repetitive and must be resolved manually, managing them becomes tiring. Suppose your system notifies you that web requests at 6 AM are 3x higher than usual. That indicates healthy traffic to your website, but it poses no threat to the architecture; the alert merely reports on system performance and needs no manual intervention. Time spent suppressing such trivial alerts can mean missing the important ones that genuinely need attention, and manually suppressing too many alerts is itself a source of toil.

Automation is key to reducing toil at every stage of alert configuration. If an alert response can be automated, it should be, as a priority. This greatly helps in reducing alert noise.

(2) Poorly Designed Alert Configuration

A poorly configured alerting system generates either too many alerts or none at all. Both problems stem from sensitivity issues within the architecture.

Sensitivity comes in two forms: over-sensitivity (marginal sensitivity) and under-sensitivity. Over-sensitivity is a condition where the system sends too many alerts; it occurs when alert conditions sit marginally at threshold levels.

For example, if the alert for response-time degradation in a database service is set at exactly 100 ms (an absolute value), even the slightest excursion generates a flood of alerts. Rather than setting marginal conditions, we can use relative thresholds, such as alerting only when latency rises at least 50% above the baseline.
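The difference between the two styles can be sketched in a few lines of Python. The functions and the 100 ms baseline are illustrative, not any particular monitoring system's API:

```python
def should_alert_absolute(response_ms, threshold_ms=100):
    # Marginal: fires on the slightest excursion past 100 ms.
    return response_ms > threshold_ms

def should_alert_relative(response_ms, baseline_ms, min_increase=0.5):
    # Relative: fires only when latency is at least 50% above the baseline.
    return response_ms > baseline_ms * (1 + min_increase)

# A 105 ms reading against a 100 ms baseline:
print(should_alert_absolute(105))       # True  -> noisy, only 5% over
print(should_alert_relative(105, 100))  # False -> within normal variation
print(should_alert_relative(160, 100))  # True  -> 60% above baseline
```

The relative check tolerates the normal jitter around the threshold that makes the absolute check so noisy, while still firing on genuine degradation.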

Under-sensitivity, on the other hand, is a condition where the system sends no alerts at all, which is the bigger problem. An issue goes undetected, with the risk of running into a major outage and having no means of getting to the root cause. In this case, the system might require re-engineering to root out such sensitivity issues.

(3) Ignoring SRE Golden Signals While Configuring Alerts

Latency, Traffic, Errors, and Saturation are the golden signals of SRE that help in monitoring a system. Variations such as USE (Utilization, Saturation, and Errors) and RED (Rate, Errors, and Duration) can also be used to measure key performance characteristics of the architecture.

While setting up alerts, database, CPU, and memory utilization have to be estimated and optimized following these vital SRE signals.

For example, if the average load on a host consistently runs at 1.5x its CPU core count, the system will trigger an unusual number of alerts because saturation has not been properly accounted for. Ignoring such basic saturation levels generates abnormalities that can ultimately result in outages.

(4) Insufficient Information on Alerts

An alert with insufficient information tells you that the system is having difficulty, but not specifically what is happening or where. This leads to the unusual toil of figuring out where the problem exists and what is contributing to an outage.

Let’s say you receive an alert stating “instance i-dk3sldfjsd CPU utilization high". This alert conveys no useful context about the incident, such as the IP address or hostname. With such minimal information, the on-call engineer cannot respond directly; they might have to open the AWS console just to find the server's actual IP address before troubleshooting can begin. The time taken to log on to the server and resolve the issue becomes substantially higher.
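One common remedy is to enrich alerts automatically before they reach the on-call engineer. The sketch below assumes a hypothetical in-memory inventory; in practice the lookup might go to the AWS API or a CMDB:

```python
# Hypothetical inventory lookup keyed by instance ID.
INVENTORY = {
    "i-dk3sldfjsd": {"hostname": "web-04.prod", "private_ip": "10.0.3.17"},
}

def enrich_alert(alert):
    """Attach hostname and IP so the engineer can act immediately,
    without opening a cloud console to locate the server."""
    details = INVENTORY.get(alert["instance_id"], {})
    return {**alert, **details}

alert = {"instance_id": "i-dk3sldfjsd", "message": "CPU utilization high"}
print(enrich_alert(alert))
```

With the hostname and IP attached at ingestion time, the toilsome console lookup described above disappears from the response path.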

Ways to Reduce Toil With a Better Alerting System

(1) Set Alert Rules Based on Historic Performance of the System

While configuring alerts, instead of setting tight thresholds, look at the trend (the historical rolling numbers) of system performance. Calculating the rate of change in system performance gives a clear idea of where to set the right thresholds, and almost all modern monitoring systems record it.

For example, if CPU utilization is consistently above 70-80%, server response time exceeds 4-6 ms, or the log query count runs above 100-125, the alerts can be tuned to the system's actual performance range by expressing thresholds as percentile values, such as the 95th percentile. This drastically reduces alert volume while staying reliable.
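Deriving a percentile threshold from history is straightforward with the standard library. The utilization samples below are made up for illustration:

```python
import statistics

# Hypothetical historical CPU utilization readings (%).
history = [42, 55, 61, 58, 70, 64, 73, 68, 77, 80,
           59, 66, 71, 62, 75, 69, 83, 60, 72, 65]

def percentile_threshold(samples, pct=95):
    """Derive an alert threshold from the pct-th percentile of history."""
    # quantiles(..., n=100) returns the 99 percentile cut points.
    return statistics.quantiles(samples, n=100)[pct - 1]

def should_alert(reading, threshold):
    # Fire only when a reading exceeds what history says is normal.
    return reading > threshold

threshold = percentile_threshold(history)
print(round(threshold, 2))
```

Because the threshold is derived from the system's own history rather than picked by hand, it adapts as the workload changes: recomputing it periodically keeps alerts anchored to real behavior.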

Additional Reading: Optimizing your alerts to reduce Alert Noise

(2) Create Proactive Alert Checks

With their predictive characteristics, proactive alerts play a vital role in understanding system performance.

Before we expand further on proactive alerts, here’s a quick look at the different kinds of alerts and their implications.

Investigative Alerts, Proactive Alerts, and Reactive Alerts

In an alert management system, the foremost step is to categorize alerts so that we can monitor the system’s health in a strategic order. There are three alert categories:

Investigative Alerts are the ones that can cause harm to system health in the long run.

Whenever user behavior changes and falls beyond the scope of the defined SLO, a service failure can follow. For example, suppose an SRE configures the incident management tool to match conditions on regex and logical constraints alone, while developers express the same conditions with different parameterized expressions in different programming languages. The conditions can then drift outside the configured patterns, the system silently stops responding to the specified instructions, and an outage may follow in the long run.

Note that investigative alerts are also referred to as “cause-based alerts”; they can turn into toil if not properly aligned with other alerting strategies.

Proactive Alerts are those that warn of a future threat to the organization.

For instance, if an alert on storage utilization is configured at 100%, an engineer is notified only when the storage has already run out of space, and the situation might quickly turn into an outage. To avoid such incidents, configure the alert at 70% utilization and above. The system then warns the team while roughly 30% of capacity still remains, leaving buffer time to resolve the issue.

This practice of predicting system performance and configuring alerts accordingly is what makes an alert proactive.
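The storage example above can be sketched as a simple check; the function name and message format are illustrative:

```python
def proactive_storage_alert(used_gb, capacity_gb, warn_at=0.70):
    """Warn when utilization crosses 70%, well before the disk fills."""
    utilization = used_gb / capacity_gb
    if utilization >= warn_at:
        return f"WARNING: storage at {utilization:.0%}, plan capacity now"
    return None  # still within the safe range, no alert

print(proactive_storage_alert(75, 100))  # fires with ~25% headroom left
print(proactive_storage_alert(40, 100))  # no alert yet
```

The point of the 70% threshold is the buffer: the alert fires while there is still time to act, instead of announcing an outage that has already happened.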

Reactive Alerts are those that indicate an immediate threat to business goals.

These alerts arise when the system or service breaches its defined SLOs. They notify the team only after an outage has occurred, so the team responds reactively. An example is an unexpected blackout of a payment portal or any product feature: users cannot access the affected service at all, leaving the team a major incident to handle. That is a reactive alert.

It is the prime responsibility of an incident response team to segregate, prioritize, and categorize alerts to maintain a structured alert response procedure.

Therefore, setting up well-defined alert rules based on reliability targets and automating them is a convincing way to reduce toil.

Ways Proactive Alerts Help In Reducing Toil

  • Being predictive, they let the incident management team gather the required tools beforehand and prepare for response activities.
  • They reduce user-reported incidents.
  • They drastically reduce incident response time.
  • With response plans already in hand, the team can automate through runbooks or execute the necessary resolution steps, considerably increasing the overall productivity of the team and the business.
  • They also play an important role in increasing the velocity of innovation.

Additional Reading: Curb alert noise for better productivity: How-To's and Best Practices

(3) Configuring “Alert-as-Code”

In SRE practice, an alerting policy is the set of rules or conditions we define in a monitoring system to notify the engineering team when there is a system abnormality. Alerting policies play a vital role in maintaining the performance and health of the system architecture.

Alert-as-code is a technique for defining all system alerts, or entire alerting policies, in the form of code. This helps pinpoint incidents more precisely with a monitoring tool.

This alert-as-code configuration can be done while building the system with an infrastructure-as-code architecture.

As an example of alert-as-code configuration, consider our own Squadcast infrastructure. Internally, we use Kube-Prometheus to deploy Prometheus inside our architecture, and with that setup we create and modify all the alerting rules for our infrastructure. Every change to the monitoring setup is version-controlled with Git and stored on GitHub.

Alert-as-code also helps with predictive analysis and root cause analysis to scrutinize the underlying reason for an incident. Some of its other use cases are:

  • It offers a way to automate routine tasks and gain more control over infrastructure with version control platforms.
  • It saves a lot of time by standardizing complex and dynamic systems throughout the infrastructure.
  • It supports documentation processes for future reference.
  • Alerts can also be managed through cloud monitoring APIs, automating the creation, editing, and management of alert policies.
  • Alerting APIs help with real-time monitoring of system health and with identifying event triggers for categorizing alerts.
  • It supports the team by flagging potential issues within the system architecture.

Note: While detecting anomalies, a programmatic alerting policy creates alerts only when there is a deviation from the historical performance of the system.
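To make the idea concrete, here is a minimal sketch of alerting rules defined as code. The rule names, expressions, and the JSON output format are illustrative, not Prometheus's actual rule-file schema (which is YAML), but the workflow is the same: the rendered file is what gets reviewed and committed to Git:

```python
import json
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    expr: str       # monitoring-system query expression
    duration: str   # how long the condition must hold before firing
    severity: str

    def to_dict(self):
        return {
            "alert": self.name,
            "expr": self.expr,
            "for": self.duration,
            "labels": {"severity": self.severity},
        }

# The whole alerting policy lives in code, so changes are diffable
# and reviewable like any other commit.
rules = [
    AlertRule("HighCPU", "avg(cpu_usage) > 0.8", "5m", "warning"),
    AlertRule("DiskFilling", "disk_used / disk_total > 0.7", "10m", "critical"),
]

policy = json.dumps(
    {"groups": [{"name": "infra", "rules": [r.to_dict() for r in rules]}]},
    indent=2,
)
print(policy)
```

Because the policy is plain data rendered from code, adding a rule, tightening a threshold, or reverting a bad change all go through the same pull-request workflow as application code.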

The Squadcast Solution to Reduce Toil

Squadcast has distinctively configurable features that facilitate on-call teams to streamline high-priority alerts and stay productive.

Alert Suppression

Alert suppression is an automation technique for reducing alert fatigue. Non-critical alerts are suppressed, giving on-call engineers more time to focus on severe incidents that could cause serious damage to their system or infrastructure.
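The mechanics of suppression can be sketched as a set of predicates applied to incoming alerts. The rules and alert payloads below are hypothetical, not Squadcast's actual rule syntax:

```python
# Hypothetical suppression rules: an alert matching any rule is dropped
# before it pages anyone.
SUPPRESSION_RULES = [
    lambda a: a["severity"] == "info",
    lambda a: "traffic spike" in a["message"].lower()
              and a["severity"] != "critical",
]

def filter_alerts(alerts):
    """Drop non-critical noise; pass everything else through to on-call."""
    return [a for a in alerts
            if not any(rule(a) for rule in SUPPRESSION_RULES)]

incoming = [
    {"severity": "info", "message": "6 AM traffic spike, 3x normal"},
    {"severity": "critical", "message": "payment service down"},
]
print(filter_alerts(incoming))  # only the critical alert survives
```

Encoding the suppression decision once, as rules, replaces the manual ack-and-dismiss loop that the earlier sections identified as a source of toil.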

Contextual Tagging, Routing, and Customized Escalation policies

Squadcast allows for customized and refined tagging rules which helps to prioritize alerts by attaching severities to each incident. After tagging, each alert can then be routed to a specific user group or escalated to the concerned team, enabling faster response.

Incident Deduplication

Incident deduplication weeds out the multiple alerts generated for the same incident by different alert sources. Status-based deduplication within the platform goes a step further, giving granular control over all alerts received from various sources: it narrows down the list of past incidents (based on their status) against which deduplication is considered. This helps accurately diagnose problems in services with high failure rates.
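A simplified model of status-based deduplication: an incoming alert is suppressed only if a past incident with the same fingerprint is still in one of the configured statuses. The fingerprints, statuses, and record shapes here are illustrative, not Squadcast's actual data model:

```python
def deduplicate(incoming, past_incidents, statuses=("open", "acknowledged")):
    """Suppress an alert if an incident with the same fingerprint is
    already in one of the given statuses (status-based deduplication)."""
    active = {i["fingerprint"]
              for i in past_incidents if i["status"] in statuses}
    return [a for a in incoming if a["fingerprint"] not in active]

past = [
    {"fingerprint": "db-latency", "status": "acknowledged"},
    {"fingerprint": "cert-expiry", "status": "resolved"},
]
new = [
    {"fingerprint": "db-latency", "source": "prometheus"},   # duplicate
    {"fingerprint": "cert-expiry", "source": "datadog"},     # past one resolved
]
print(deduplicate(new, past))  # only the cert-expiry alert gets through
```

Note how the resolved cert-expiry incident does not block the new alert: a recurrence of an already-closed problem still opens a fresh incident, which is exactly the granularity that status-based control provides.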

Analyzing On-Call Traffic

Squadcast’s analytics dashboard gives a clear view of on-call traffic: the distribution of incidents across services, their status during recovery, and analysis of MTTR and MTTA. A periodic audit of the captured data can help identify and potentially rectify toilsome activities.

Less Toil, More Productivity!

The right alerts, backed by the necessary automation strategies, give way to a more effective, toil-free incident management ecosystem. These practices greatly reduce operational toil and can ultimately enhance the productivity of the team.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.
