We live in the era of software convenience, where we take for granted that hundreds of services are always at our fingertips. These applications become part of our daily routines because they are so reliable. However, this consistency makes reliability work invisible to the end user. It can be difficult to appreciate the effort behind maintaining a high-availability service. Because of that, people may misunderstand exactly what makes a service reliable. If reliability has only a vague definition like “it always works,” there is nothing concrete to develop toward.
In order to improve the reliability of your software, you must understand your goals. In this blog post, we’ll break down what software reliability means. We’ll look at how the reliability of your software is perceived, how teams operate to improve reliability, and how to contextualize reliability with customer happiness and cultural lessons.
Everyone has an intuitive sense of reliability, something like: “when I use it, it works.” You might also have a technical intuition of how to measure reliability, perhaps the percentage of time a service provides the correct response. Truly understanding reliability means bridging the gap between this subjective experience and the data that your service provides.
Reliability is a complex and holistic value built upon a variety of factors. The availability and maintainability of a system provide important building blocks for understanding reliability. However, these metrics alone are insufficient. Users perceive reliability only through the availability and speed of the service while they are using it. High availability during times of low demand, or for infrequently used services, matters less than even a small outage of a user-critical feature. Service level indicators (SLIs) identify which of these metrics impact users most. They are the real basis of your reliability.
Once you understand that reliability exists more in the minds of your users than in contextless data, your development is empowered in a new way. It’s a suboptimal use of time to improve availability past the point where users will notice or care. That development effort could be better spent elsewhere. Once you set service level objectives (SLOs) for the minimum acceptable value of your SLIs, the gap between that target and perfect performance becomes your error budget. This error budget allows you to develop confidently, knowing that your users should not be meaningfully impacted by a decrease in reliability until the budget is consumed.
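To make the arithmetic concrete, here is a minimal sketch of how an error budget could be computed for a request-based SLI. The function names and the example numbers are assumptions for illustration, not part of any particular tool.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the measurement window.

    slo_target is a fraction such as 0.999 (99.9% of events must be good).
    A target of 1.0 is rejected because it leaves no budget at all.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # No traffic in the window, so nothing has been spent yet.
        return 1.0
    return 1.0 - bad_events / budget


if __name__ == "__main__":
    # Hypothetical window: one million requests against a 99.9% SLO.
    print(f"allowed failures: {error_budget(0.999, 1_000_000):.0f}")
    print(f"budget remaining: {budget_remaining(0.999, 1_000_000, 250):.0%}")
```

In this sketch a negative result from `budget_remaining` signals that the budget is overspent, which a team might treat as a cue to prioritize reliability work over new features.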
Essentially, your understanding of reliability is built from two directions. From the top, user experiences and pain points tell you which SLIs to choose. From the bottom, monitoring data measured against your SLOs gives an objective picture of how the service is performing. This bridges the gap between users’ confidence in your service and the data that your service can provide.
Another lesson of site reliability engineering (SRE) is that incidents are inevitable. Reliable software isn’t software with 100% availability, because such software doesn’t exist. Instead, it’s software that’s available enough that users aren’t pained. When incidents occur, teams must respond to minimize user impact and improve going forward.
Good incident response is consistent incident response. Just as important as the reliability of your code is the reliability of your procedures when something goes wrong. With consistent incident response, responding engineers can proceed more effectively and confidently. Each step of an incident response cycle reflects a commitment to reliability:
- New incidents are classified based on an established system of severity and area of impact. Each classification maps to a different response. This ensures that incidents are dealt with consistently, allowing you to confidently triage and assign.
- Engineers are alerted to respond to the incident based on the classification. By building reliable on-call systems, you ensure that the correct people are available, that backup plans exist, and that burnout is avoided.
- Respondents use runbooks to begin responding to the incident. Runbooks work best when they can account for many possibilities, minimizing the cognitive toil of finding new solutions. Engineers can confidently proceed through the runbook, relying on the completeness of its information.
- The entire response is collected into an incident retrospective (see our top 5 best practices on doing them well). The incident retrospective (also referred to as postmortem, post-incident report, etc.) will contain the steps taken, key correspondences, relevant monitoring data, and anything else valuable to learning from the incident. Reviewing the retrospective provides opportunities to improve the incident’s classifications, the alerting steps taken, and the runbooks implemented. These procedures are reliable not because they were created perfectly the first time, but because they undergo these cycles of continuous learning and revision.
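The first step in the cycle above, mapping each classification to a predefined response, can be sketched as a simple lookup. Everything here is hypothetical: the severity levels, area names, runbook paths, and the `plan_for` helper are assumptions chosen for illustration, and a real team would define its own.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # hypothetical: a user-critical feature is down
    SEV2 = 2  # hypothetical: degraded but usable
    SEV3 = 3  # hypothetical: minor issue, no immediate user impact


@dataclass(frozen=True)
class ResponsePlan:
    page_on_call: bool            # alert the on-call engineer immediately?
    runbook: str                  # where responders should start
    retrospective_required: bool  # must the incident end with a retrospective?


# Default response for each severity (illustrative values only).
DEFAULT_PLANS = {
    Severity.SEV1: ResponsePlan(True, "runbooks/general-outage.md", True),
    Severity.SEV2: ResponsePlan(True, "runbooks/degradation.md", True),
    Severity.SEV3: ResponsePlan(False, "runbooks/minor-issue.md", False),
}

# Overrides for specific (severity, area of impact) pairs.
AREA_OVERRIDES = {
    (Severity.SEV1, "payments"): ResponsePlan(True, "runbooks/payments-outage.md", True),
}


def plan_for(severity: Severity, area: str) -> ResponsePlan:
    """Return the same plan for the same classification, every time."""
    return AREA_OVERRIDES.get((severity, area), DEFAULT_PLANS[severity])


if __name__ == "__main__":
    print(plan_for(Severity.SEV1, "payments"))
    print(plan_for(Severity.SEV3, "search"))
```

Keeping the mapping in data rather than in someone's head is one way to get the consistency described above, and the retrospective is a natural place to revise the table.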
The reliability of your software ultimately depends on the resources you have available. Teams of all sizes can implement SRE best practices throughout their organization. As you grow, you may begin building a dedicated SRE team. SREs can fit many roles and help in many stages of the DevOps lifecycle. Structuring your team so that unplanned work can be accounted for is key to a reliable development cycle.
We’ve seen that reliability is a reflection of users’ confidence in your service. We’ve also seen that reliability is a goal that motivates high-performing DevOps teams in every step of the software lifecycle. These may seem like very different definitions of software reliability. One relates to how your software is perceived, the other to internal practices invisible to the user. The way to reconcile these two understandings of reliability is by contextualizing them with customer happiness and cultural lessons.
Ultimately, there is no more important metric of success than the happiness of your customers. Reliability provides business value by attracting and keeping customers. This link between reliability and business success extends to your practices, too. Every investment you make in reliability engineering can be connected to the bottom line of your organization.
At the same time, focusing on reliability instills cultural lessons throughout your organization. SRE advocates for a human (socio-technical) approach to systems, where the perspectives of both the user and the on-call engineer are empathetically considered. By embracing the inevitability of failure and celebrating it, your team will grow in confidence. Fair on-call systems reduce burnout and ensure that incident responders are working at their best. The benefits of reliability emerge naturally from this empathetic perspective. Working from a human perspective, whether through the lens of the user of your service or the engineers designing it, will lead you to reliability success.
Understanding what reliability means is an important first step on your journey to reliability excellence. If you want to see how Blameless can help you along your reliability journey through implementing best practices such as SLOs and incident retrospectives, check out our demo.