By: Emily Arnott, Failure is Inevitable.
Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process.
Yet, MTTx metrics rarely tell the whole story of a system’s reliability. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data. In this blog post, we’ll cover:
- What are common MTTx metrics and why are they used?
- What are some problems with relying on MTTx metrics?
- How can I make MTTx metrics more helpful?
- How do I move away from shallow metrics?
- How better metrics help build a blameless culture
For each metric, trends can suggest where to focus improvement. For example, if the MTTD is increasing, you might work to improve your monitoring. But MTTx metrics alone are not enough to identify trends in reliability.
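To make the arithmetic concrete, here is a minimal sketch of how an MTTx value is calculated. It assumes MTTD runs from incident start to detection and MTTR from incident start to resolution; teams define these boundaries differently. The records, field names, and the `mean_time_minutes` helper are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records. The field names and timestamps are
# assumptions for illustration, not data from a real system.
incidents = [
    {"started": datetime(2023, 1, 1, 10, 0),
     "detected": datetime(2023, 1, 1, 10, 5),
     "resolved": datetime(2023, 1, 1, 11, 0)},
    {"started": datetime(2023, 1, 8, 14, 0),
     "detected": datetime(2023, 1, 8, 14, 15),
     "resolved": datetime(2023, 1, 8, 14, 45)},
    {"started": datetime(2023, 1, 20, 9, 0),
     "detected": datetime(2023, 1, 20, 9, 10),
     "resolved": datetime(2023, 1, 20, 12, 0)},
]


def mean_time_minutes(records, start_field, end_field):
    """Average minutes elapsed between two timestamps across incidents."""
    durations = [
        (r[end_field] - r[start_field]).total_seconds() / 60
        for r in records
    ]
    return mean(durations)


mttd = mean_time_minutes(incidents, "started", "detected")
mttr = mean_time_minutes(incidents, "started", "resolved")
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```

In this made-up data, the single three-hour incident pulls the MTTR well above the other two resolution times, which hints at why an average on its own can hide what is really happening.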
In an experiment detailed in the ebook Incident Metrics in SRE, author Štěpán Davidovič ran simulations of multiple systems with varying incident frequencies and durations. He generated sets of hypothetical data and compared the MTTx metrics from each. The goal was to determine whether changes made to improve incident response (such as buying a tool) would show up in the MTTx metrics.
The findings were conclusive: “MTTx metrics will probably mislead you.” As the experiment stated, “Even though in the simulation the improvement always worked, 38% of the simulations had the MTTR difference fall below zero for Company A, 40% for Company B, and 20% for Company C. Looking at the absolute change in MTTR, the probability of seeing at least a 15-minute improvement is only 49%, 50%, and 64%, respectively. Even though the product in the scenario worked and shortened incidents, the odds of detecting any improvement at all are well outside the tolerance of 10% random flukes.”
This means that even if your tool or process improvement is working, your MTTx metrics may not show it. That makes it hard to understand what actually improves incident response, and the metrics tell you little about overall system reliability.
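You can get a feel for this effect with a small simulation of your own. The sketch below is not the ebook’s method or data. It assumes incident durations follow a log-normal distribution, that a hypothetical improvement shortens every incident by 10%, and that you compare 50 incidents before the change with 50 after. All of those numbers are illustrative assumptions.

```python
import random
from statistics import mean


def share_of_trials_without_improvement(
    n_incidents=50, improvement=0.10, trials=2000, seed=0
):
    """Fraction of simulated comparisons where the measured MTTR did not drop.

    Assumption: durations (in minutes) are log-normally distributed, and the
    improvement shortens every "after" incident by the same percentage.
    """
    rng = random.Random(seed)
    no_improvement = 0
    for _ in range(trials):
        before = [rng.lognormvariate(4.0, 1.0) for _ in range(n_incidents)]
        after = [
            rng.lognormvariate(4.0, 1.0) * (1 - improvement)
            for _ in range(n_incidents)
        ]
        # The change genuinely works, yet the sample means may say otherwise.
        if mean(after) >= mean(before):
            no_improvement += 1
    return no_improvement / trials


fraction = share_of_trials_without_improvement()
print(f"Trials where MTTR looked no better: {fraction:.0%}")
```

Try changing `n_incidents` or `improvement` to see how many incidents, or how large a change, it takes before the average reliably reflects the improvement.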
MTTx metrics are more helpful when contextualized with other information about the incident. As Blameless SRE Architect Kurt Andersen suggests, “What can be enlightening is to combine these metrics with some form of incident categorization.” Using your incident classification process, you can analyze MTTx metrics for a smaller subset of incidents.
Here are some ways you can further categorize incidents to work with more meaningful data:
- The severity of the incident
- How the incident was discovered (internally or via customer report)
- The service area disrupted
- The resources used in responding to the incident (such as runbooks, backups)
- Other monitoring data about the system when the incident occurred (such as server load)
Here are some examples of how these combinations can lead to actionable change:
- If the MTTA for customer-reported incidents is much higher than for internally detected incidents, can you create a faster pipeline for processing customer reports? Or, is there a way your monitoring could detect the issue so customer reports are less frequent?
- If using a certain runbook leads to lower MTTR metrics, what about that runbook could be adopted into other runbooks?
- If one area of service has very high MTTD, what monitoring tools could you implement to catch incidents faster?
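A comparison like the first example above can be sketched in a few lines. The incident records, field names, and the `mttx_by` helper are hypothetical; in practice the data would come from your own incident tracker.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incidents, tagged with how they were discovered.
# "minutes_to_ack" is an assumed field holding the time to acknowledge.
incidents = [
    {"source": "customer_report", "service": "checkout", "minutes_to_ack": 40},
    {"source": "customer_report", "service": "search", "minutes_to_ack": 50},
    {"source": "internal", "service": "checkout", "minutes_to_ack": 5},
    {"source": "internal", "service": "search", "minutes_to_ack": 10},
    {"source": "internal", "service": "checkout", "minutes_to_ack": 15},
]


def mttx_by(records, category_field, duration_field):
    """Mean duration for each value of a category field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[category_field]].append(record[duration_field])
    return {category: mean(values) for category, values in groups.items()}


mtta_by_source = mttx_by(incidents, "source", "minutes_to_ack")
for source, minutes in mtta_by_source.items():
    print(f"MTTA for {source}: {minutes:.0f} min")
```

The same helper works for any of the categories listed earlier: pass `"service"` instead of `"source"` to compare service areas.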
As you conduct deeper analysis on your metrics, you’ll find there’s no single MTTx metric that can tell the whole story. However, there are better ways you can analyze your data to gain insight into your overall reliability and incident response processes.
One of the most important things you want to assess after an incident is customer impact. This can be difficult to determine. Reliability is subjective, based on how customers perceive your service.
To determine the impact on customer happiness, you can use SLIs and SLOs. SLIs, or service level indicators, measure how key areas of your services are performing against customer expectations. SLOs, or service level objectives, mark where customers begin to be pained by unreliability.
How you perform against your SLO is often a better indicator of reliability than MTTx metrics. This is because reliability is determined by your users. SLOs help you understand the effect that incidents have on customer happiness. Because SLOs are moveable goals that change as your customers’ needs change, you should never find yourself or your team held to an arbitrary number. Revision is part of setting good SLOs.
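As a rough illustration, here is how an availability SLI might be checked against an SLO. This is a sketch under stated assumptions: the SLI is the fraction of successful requests, the SLO target is 99.9%, and the request counts are made up. Your own SLIs and targets should come from what your customers actually expect.

```python
def availability_sli(good_requests, total_requests):
    """SLI as the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


# All numbers below are hypothetical.
SLO_TARGET = 0.999          # assumed objective: 99.9% of requests succeed
total = 1_000_000           # requests served in the measurement window
normal_failures = 500       # failures outside of any incident
incident_failures = 2_000   # additional failures caused by one incident

sli_without_incident = availability_sli(total - normal_failures, total)
sli_with_incident = availability_sli(
    total - normal_failures - incident_failures, total
)

for label, sli in [
    ("without incident", sli_without_incident),
    ("with incident", sli_with_incident),
]:
    status = "meets" if sli >= SLO_TARGET else "misses"
    print(f"SLI {label}: {sli:.4%} ({status} the SLO)")
```

Framed this way, the incident is measured by how far it pushed the service below what customers expect, not only by how long it took to resolve.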
Kurt also suggests looking at outliers instead of averages: “In general, I don't find the ‘central tendency’ to be as interesting as investigating outliers for a distribution.” Although they may not represent the typical incidents, outliers in your MTTx trends can be valuable.
Discover what was different about the incident that made it an outlier. Is it something that could occur again? You might need to focus on a qualitative rather than quantitative approach. Lorin Hochstein breaks this concept down in a blog post. Rather than relying on metrics to prevent major incidents, Lorin suggests looking for “signals.” Use your team’s expertise to catch and act on noteworthy data.
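Before the qualitative work can start, you need to know which incidents stand out. One simple way to surface candidates is sketched below. It assumes an outlier is any incident whose duration exceeds the third quartile by more than 1.5 times the interquartile range, which is a common rule of thumb and not something Kurt or Lorin prescribes. The incident IDs and durations are hypothetical.

```python
from statistics import quantiles

# Hypothetical resolution times in minutes, keyed by incident ID.
resolution_minutes = {
    "INC-101": 30, "INC-102": 35, "INC-103": 40, "INC-104": 42,
    "INC-105": 45, "INC-106": 50, "INC-107": 55, "INC-108": 60,
    "INC-109": 65, "INC-110": 480,
}


def long_outliers(durations_by_id, iqr_multiplier=1.5):
    """IDs of incidents whose duration is above Q3 + multiplier * IQR."""
    q1, _median, q3 = quantiles(durations_by_id.values(), n=4)
    threshold = q3 + iqr_multiplier * (q3 - q1)
    return [
        incident_id
        for incident_id, minutes in durations_by_id.items()
        if minutes > threshold
    ]


outliers = long_outliers(resolution_minutes)
print("Incidents worth a closer look:", outliers)
```

The code only produces a list of incidents to investigate. Understanding what made them different still depends on your team’s expertise.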
In a post for Adaptive Capacity Labs, John Allspaw looks at how to move beyond shallow data. His conclusion is that “meaningful insight comes from studying how real people do real work under real conditions.” Metrics alone cannot contain the many complicating factors in real work.
John shows how to build a “thicker” understanding of data. You can map out how an incident developed and was resolved. This is much “messier” than a single metric, but often more insightful. These more detailed representations are most worth examining for incidents that deviate from the mean.
When you rely on shallow metrics, it can become desirable to game the system or even give up trying to meet KPIs. Team members may feel that their performance is measured by a particular (and sometimes irrelevant) metric. They could be tempted to work to just improve that metric instead of actually improving the system. This phenomenon exists in many industries, from manufacturing to healthcare.
This causes a multitude of problems:
- Employees are hesitant to raise issues that might improve the system if it will negatively affect the metric
- When the metric reaches an undesirable level, employees may blame others to avoid being blamed
- Employees will hesitate to take risks or innovate if they fear it could negatively impact the metric
- Employees may even misreport data to artificially inflate the metric, especially if jobs, promotions, or bonuses depend on it
To empower and encourage employees, you need to cultivate a blameless culture. Moving away from shallow metrics is part of this transformation. Emphasize that everyone has a shared goal of customer satisfaction. Using SLOs as your guiding metric can help teams quantify this.
Emphasize that there is no single “score” for an employee or team’s performance. This encourages teams to see incidents as a chance to learn rather than a major setback.
If you’re looking to get more from your metrics, we can help. Blameless SLOs put incidents in the context of customer satisfaction, and Reliability Insights allows teams to sort MTTx metrics into more informative subsets of data. To see how, feel free to sign up for a demo.