Originally published on Failure is Inevitable.
Adopting SRE principles into your organization can be a big undertaking. You’ll need to develop new practices and procedures to minimize the costs of incident coordination. You’ll need to create a retrospective process that encourages continuous learning. You’ll need to shift culture to begin appreciating failure as an opportunity to grow. Your transition to the world of SRE will also require buy-in from all levels of your organization.
To achieve this buy-in, you’ll need to show the value of SRE. SRE is often framed as a mentality for teams looking to adopt a blameless culture. Reaching this mentality, where everyone across software teams feels accountable for quality, is the true goal of implementing SRE. But shifting culture takes time, and it can be challenging to quantify the business impact well enough to justify the necessary investments in cultural transformation. To prove that SRE makes sense for the bottom line, you’ll need to cite specifics. In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.
We live in an era of reliability. Customers depend on dozens of services every day. They expect services to be available when they need them. Moreover, customers know that an alternative exists if your service isn’t available. An unhappy customer can always switch to a competitor’s service. No cutting-edge feature will make a customer stay if the service is unusable. Reliability is the foundation of all other features.
Since reliability is so important to customers, we should look at it through their eyes. Reliability isn’t only the service’s availability or maintainability. It’s a subjective measure based on customers’ judgment. SRE tools such as SLIs and SLOs change this subjective quality into something actionable. You gain a new way to measure customer impact based on monitorable data.
The business value of looking at reliability through a customer’s eyes is twofold:
- Developing for reliability improves user happiness and retention. As development proposes new projects, reliability is considered a necessary feature. The project’s impact on reliability can be measured against the SLO. This ensures that reliability isn’t impacted enough to bother customers. New features can help attract customers, but reliability is key for retention.
- You avoid overspending or underspending on your most important feature: reliability. As the SLO is set to the customer’s pain point, you know that improvement past that point won’t make a major impact on their happiness. SLIs also reflect what aspects of the service the customer cares most about. Improving the availability of part of your service from 90% to 99% may seem like a worthy investment. But you should take into account how frequently customers use that service. Boosting the reliability of a frequently used service from 99% to even 99.25% may have more customer impact. SRE allows you to quantify and see the business value of investments in reliability.
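To make the frequency point concrete, here is a minimal sketch that compares the two improvements by the number of failed requests each one avoids per day. The traffic figures and the function name `failed_requests_avoided` are hypothetical, chosen only for illustration; the availability percentages are the ones from the example above.

```python
def failed_requests_avoided(requests_per_day: int, before: float, after: float) -> int:
    """Failed requests avoided per day when availability rises from `before` to `after`."""
    return round(requests_per_day * (after - before))


# Hypothetical traffic: a rarely used service vs. a frequently used one.
rare = failed_requests_avoided(1_000, 0.90, 0.99)
frequent = failed_requests_avoided(1_000_000, 0.99, 0.9925)

print(f"Rarely used service, 90% -> 99%: {rare} fewer failures per day")
print(f"Frequently used service, 99% -> 99.25%: {frequent} fewer failures per day")
```

Under these assumed volumes, the smaller percentage gain on the busy service avoids 2,500 failures a day against 90 for the larger gain on the quiet one. Your own traffic numbers will shift the comparison, which is exactly why usage belongs in the calculation.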
It can be difficult for organizations to figure out the true cost or value of a decision. For example, consider the cost of a service outage. It’s tempting to simply calculate the typical profit the service would generate during the outage. You might also factor in the wage expenses of disrupting engineering attention (the costs of ‘unplanned’ work). These costs are important, but they’re only the first steps. You also need to account for:
- Impact on customers’ perception of your product quality and top-line opportunity costs
- Reduced capacity and time spent on innovation
- Impact on on-call load balancing and heightened chances of burnout
- Depletion of your error budget, delaying other development projects
However, there are some benefits to acknowledge as well:
- Improvement of runbooks and other incident response tools
- Knowledge gained of incident patterns
The complete impact of one incident starts to become difficult to calculate. Some aspects, like the lessons learned, are positive and thus create business value. So how do we account for all these factors?
With SRE, many subjective and qualitative factors can be made objective and quantifiable. By using SLIs, teams can connect customer impact to metrics captured by monitoring tools. Teams then feed this information back into the software development lifecycle by determining how to prioritize work — either planned features or reliability investments. This helps drive improvements to the service, in addition to learning from incidents.
Example: For a webpage you maintain, the SLI you are concerned with is latency. If pages load too slowly, people will abandon the site. On average, each request on the webpage loads within 10 seconds. While you monitor this SLI, you notice that people begin to leave your website after only 5 seconds on average. Thus, your SLO should state that all web pages should load within 5 seconds in order to keep customers happy. You begin to divert development resources to improve the latency of your site.
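As a rough sketch of how this latency example could be measured, the snippet below computes the SLI as the share of page loads that finish within the 5-second threshold and compares it to a target. The sample load times, the function names, and the 95% target are assumptions made for illustration. The example above asks for all pages to load within 5 seconds, which would correspond to a target of 1.0.

```python
def latency_sli(load_times_s: list[float], threshold_s: float = 5.0) -> float:
    """Fraction of page loads that completed at or under the threshold."""
    if not load_times_s:
        raise ValueError("need at least one load-time sample")
    good = sum(1 for t in load_times_s if t <= threshold_s)
    return good / len(load_times_s)


def meets_slo(sli: float, target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target


# Hypothetical load times in seconds, as a monitoring tool might report them.
samples = [2.1, 3.4, 4.9, 6.2, 10.0, 1.8, 4.0, 5.0]

sli = latency_sli(samples)           # 6 of 8 loads are within 5 seconds
print(f"SLI: {sli:.0%}")             # prints "SLI: 75%"
print(meets_slo(sli, target=0.95))   # prints "False"
```

Expressing the SLI as a ratio of good events to total events keeps it comparable across pages with very different traffic.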
By building up these impactful indicators, you build a clear picture of organizational health. Understanding metrics such as engineering toil and development velocity is essential on an organizational scale. You can gain even more insights into them by connecting them to SLIs.
Look at how incidents impact these organizational health metrics: how much toil is generated by an outage, for example. This makes a connection between a very high-level metric of your organizational health and monitorable reliability data. Having this connection allows you to observe organizational health more easily and objectively.
In the other direction, you can see how these metrics impact your SLIs. This connects other organizational health metrics with customer happiness. Based on this connection, you can further refine your definitions of development velocity. You can contextualize it as “velocity towards customer satisfaction”.
SRE allows you to observe business value or cost as easily as you’d observe your service’s uptime. By using objective measurements, you can make key decisions around development and operations work with confidence.
The cost of downtime is higher than ever. A report created in 2015 gives some harrowing estimates. And reliability has only increased in importance in the last five years. By observing SRE metrics, you can see the impact incidents have on your bottom line today. Fortunately, SRE also gives you the tools to reduce that impact.
By developing for reliability, you can reduce the frequency and severity of incidents. Don’t expect to stop all incidents, though—a key lesson of SRE is that failure is inevitable. When an incident occurs, you have two goals: to restore the service, and to learn as much as you can.
SRE best practices enable teams to resolve incidents faster. Teams create processes to classify, alert on, and respond to incidents, which streamlines responses. As much of the response as possible is automated, minimizing toil and freeing up capacity for the things that matter, such as innovation.
SRE also helps teams learn from incidents. Practices such as incident retrospectives (AKA postmortems) may at first seem like overhead. But as you build up a library of incident lessons, you’ll find they’re worth the investment. These lessons can be put back into incident response procedures. You’ll find what works best for your runbooks, discover opportunities to automate, and more.
These learnings can also feed into development cycles. Development can account for areas with recurring reliability issues. Teams can decide to either make leeway for them within the error budget, or fix them if they impact customer happiness. The business value in learning grows with every incident you face.
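One way to ground the error budget decision is to track how much of the budget a recurring issue has consumed. The sketch below uses the common definition of an error budget as the share of requests the SLO allows to fail (1 minus the SLO target). The request counts, function names, and the simple decision rule are assumptions for illustration, not a prescribed policy.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Share of the error budget still unspent; negative once the budget is overspent."""
    allowed = allowed_failures(slo_target, total_requests)
    if allowed <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return 1 - failed_requests / allowed


def next_step(remaining: float) -> str:
    """Hypothetical rule: tolerate the issue while budget remains, otherwise fix it."""
    return "tolerate within the error budget" if remaining > 0 else "prioritize the fix"


# Hypothetical window: 1,000,000 requests against a 99.9% SLO, 250 of them failed.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
print(next_step(remaining))
```

A real policy would also weigh how much the issue bothers customers, as described above. The remaining budget is only one input to that decision.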
Once you understand the value SRE can bring to your organization, you’ll know it’s time to make the investment. Check out our demo to see how Blameless can help!