These are my notes from Chapter 4: Service Level Objectives, from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. You can find the previous post here:
Disclaimer: the notes from this chapter may look abstract if you’re not familiar with SLAs, SLOs, and SLIs. I will include some of the definitions to help with understanding.
An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
Most services consider request latency—how long it takes to return a response to a request—as a key SLI. Other common SLIs include the error rate…
Another kind of SLI important to SREs is availability, or the fraction of the time that a service is usable.
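To make these indicators concrete, here is a minimal sketch (my own, not from the book) that computes two common SLIs from a batch of request records, where each record is a hypothetical `(latency_ms, succeeded)` pair:

```python
# Sketch only: computing common SLIs from (latency_ms, succeeded) pairs.

def compute_slis(requests):
    total = len(requests)
    errors = sum(1 for _, ok in requests if not ok)
    latencies = sorted(latency for latency, _ in requests)
    # 99th-percentile latency (nearest-rank method)
    p99 = latencies[max(0, int(0.99 * total) - 1)]
    return {
        "error_rate": errors / total,
        "availability": 1 - errors / total,
        "p99_latency_ms": p99,
    }

# 100 requests: 98 fast successes, one slow success, one failure.
requests = [(20, True)] * 98 + [(150, True), (80, False)]
print(compute_slis(requests))
```

In practice these numbers would come from monitoring systems rather than in-process lists, and latency percentiles are usually computed over histograms, but the definitions are the same.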
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.
Choosing and publishing SLOs to users sets expectations about how a service will perform.
SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask "what happens if the SLOs aren’t met?": if there is no explicit consequence, then you are almost certainly looking at an SLO.
Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviors of your system unexamined.
Start by thinking about (or finding out!) what your users care about, not what you can measure. Often, what your users care about is difficult or impossible to measure, so you’ll end up approximating users’ needs in some way.
For maximum clarity, SLOs should specify how they’re measured and the conditions under which they’re valid.
Example:
99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
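The example SLO above can be sketched as a simple check over a window of latency samples. This is my own illustration, not code from the book, and the function name and sample data are hypothetical:

```python
# Sketch: does a one-minute window of Get RPC latencies meet the SLO
# "99% of calls complete in less than 100 ms"?

def slo_met(latencies_ms, threshold_ms=100.0, target=0.99):
    """Return True if the fraction of calls under threshold meets the target."""
    if not latencies_ms:
        return True  # no traffic in the window; nothing was violated
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms) >= target

# Hypothetical window: 1000 samples, one of them slow (250 ms).
window = [12, 35, 48, 250, 60] + [20] * 995
print(slo_met(window))  # 999/1000 fast calls, 0.999 >= 0.99, prints True
```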
It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment, require expensive, overly conservative solutions, or both. Instead, it is better to allow an error budget—a rate at which the SLOs can be missed—and track that on a daily or weekly basis.
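The error budget is just the complement of the SLO over some window. A back-of-the-envelope sketch (mine, not from the book) for an availability SLO:

```python
# Sketch: minutes of allowed unavailability per window implied by an SLO.

def error_budget_minutes(slo, window_days):
    """Allowed minutes of downtime per window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return round((1 - slo) * total_minutes, 1)

print(error_budget_minutes(0.999, 30))  # prints 43.2 (minutes per 30 days)
```

A 99.9% SLO leaves roughly 43 minutes of budget per month; a 99.99% SLO leaves only about 4.3, which is why tighter targets get expensive quickly.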
Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.
SLOs can—and should—be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about.
Using a tighter internal SLO than the SLO advertised to users gives you room to respond to chronic problems before they become visible externally. An SLO buffer also makes it possible to accommodate reimplementations that trade performance for other attributes, such as cost or ease of maintenance, without having to disappoint users.
Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services. If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally, throttling some requests, or designing the system so that it isn’t faster under light loads.
This reminded me of a tweet from Daniel Vassalo - thanks, Google, for helping me find it:
When I worked at Amazon we used to make some software slow on purpose.
Then when a problem caused real slowness, we’d remove the fake delays and things would feel normal.
The fake delays made users happier, even though they all wanted faster software.
Makes you think.
23:46 · 31 Oct 2022
It seems crazy, and definitely makes you think!
A lot of the “how” of Google’s indicator implementations, and the reasoning behind them, can only be fully grasped by reading the whole chapter. I’d recommend setting aside 15-20 minutes, preparing a coffee or tea, and reading it in one go. It’s worth it!
If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.
You can also follow me on Twitter and Mastodon.
Photo by Cytonn Photography on Unsplash