In this two-part blog series, Adam Hammond talks about how you can build sustainable SLOs that are appropriate for your users, your technology platform, and your business, which in turn will help you make your systems robust, your customers happy, and your business boom.
Service Level Objectives (SLOs) are a powerful operational tool that uses metric-based targets to constrain activities that may have a negative impact on users (such as maintenance or failed deployments). Traditionally, you may have heard the term used in contractual contexts within Service Level Agreements (SLAs), where SLOs define guarantees for IT platforms (SaaS, IaaS, PaaS, etc.). However, they are far more than that: SLOs are a powerful tool that can be used not only by the “business people” but also by technical staff to drive process improvement and technological advancement.

SLOs have a formidable use as metric-based indicators that show you what needs to be improved in your systems and their capabilities, and where you can get the best “bang for buck” when focusing your work efforts. However, SLOs must be influenced by data, and that data can only come from your customers. A lot of IT professionals tend to think that they know the best metrics, and they do; the problem is that they are the best metrics for monitoring systems, not for improving customer satisfaction. Today, we’re going to help you build sustainable SLOs that are appropriate for your users, your technology platform, and your business, and that will help you make your systems robust, your customers happy, and your business boom.
Now that we have an idea of what SLOs are, we need to establish a data-based approach that will result in positive user outcomes. This is a two-stage process: first gather your data, then use that data to build your SLOs. The source data for these questions comes from three main places: your users, your system, and your business processes. Prepare to go out and talk to clients on Zoom calls, trawl through logs, and understand the maintenance and support lifecycle of your system. There is no prescription for these questions; they are subjective, and everyone’s scenario will be different. It is also important to remember the Pareto Principle: roughly 80% of your users use about 20% of your system. Therefore, you will get the best value out of this exercise by targeting and providing SLOs for the most commonly used parts of your system.
Example Questions

- When do my users actively or passively use my system?
- How much maintenance do I need to perform, and how regularly does it need to happen?
- What tolerance would my users have for outages?
- Would my users consider my application critical to their business?
- How well is my system performing at the moment?
- What levels of performance do my users require?
When you have finished your data-gathering exercise, it is time to focus on actually setting your SLOs. SLOs will generally (but not always) fall into the following categories:

- Availability
- Latency
- Throughput
- Error rate
- Durability
These categories cover most of the things that people consider to be aspects of quality. They also translate easily into metrics that you can use to objectively measure your system against the requirements of your SLOs. Finally, when you define your SLOs, remember that a good SLO should be S.M.A.R.T.
Specific: an SLO should expressly state what it measures (e.g. “we want to measure availability by testing whether a request can be made to the server”, not “we want the server to be up”).
Measurable: the SLO should be something that can be measured (e.g. “disk latency should be less than 5ms”, not “the disk should be quick”).
Achievable: you should be able to meet your SLOs (e.g. if an underlying service has an SLO of 95%, you cannot guarantee 100%).
Relevant: your SLO should reflect the user experience (e.g. an appropriate metric for a web server is response time, not CPU activity).
Timebound: an SLO should cover a period that is appropriate for how your system is used (e.g. if your users only use your system between 9 AM and 5 PM, a 24-hour SLO will only dilute your actual metrics and hide issues).
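To make the Timebound point concrete, here is a minimal Python sketch (the sample data and the 9 AM to 5 PM window are invented for illustration) that filters measurements down to the hours users actually care about before computing availability, so off-hours noise does not dilute the figure:

```python
from datetime import datetime

# Hypothetical measurements: (timestamp, success_flag) pairs.
# Assumption: users are only active 9 AM - 5 PM, so samples outside
# that window are excluded before computing the SLO figure.
samples = [
    (datetime(2023, 5, 1, 8, 30), False),   # pre-business-hours failure
    (datetime(2023, 5, 1, 10, 0), True),
    (datetime(2023, 5, 1, 13, 0), True),
    (datetime(2023, 5, 1, 16, 45), True),
    (datetime(2023, 5, 1, 22, 0), False),   # overnight maintenance blip
]

def business_hours_availability(samples, start_hour=9, end_hour=17):
    """Availability computed only over the window users actually use the system."""
    in_window = [ok for ts, ok in samples if start_hour <= ts.hour < end_hour]
    return 100.0 * sum(in_window) / len(in_window)

print(business_hours_availability(samples))  # 100.0: off-hours failures don't dilute the SLO
```

A naive 24-hour calculation over the same data would report 60% availability and hide the fact that, from the users’ perspective, the system never failed.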
Now, let’s get down to creating an SLO. Whether an SLO is achievable or relevant does not affect its specific wording, but it does dictate whether a particular SLO should be set at all. For example, if the average time to retrieve a file is five minutes, you would not guarantee that files can be delivered faster than that (because, on average, they won’t). Alternatively, if your users only care that files are consistently (if eventually) delivered to them, then a retrieval-time-based SLO is probably not for you. In this case, the best SLO would be one that guarantees that a proportion of files is always delivered to users, regardless of the time taken to retrieve and deliver them (i.e. a percentage of successful retrievals).
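A percentage-of-successful-retrievals SLO like this is simple to compute. A minimal sketch, assuming a hypothetical log of retrieval outcomes and an illustrative 99.5% target (both made up for this example):

```python
# Hypothetical retrieval log: True = file delivered, False = retrieval failed.
retrievals = [True] * 997 + [False] * 3

def success_rate(outcomes):
    """Percentage of successful retrievals, regardless of how long each took."""
    return 100.0 * sum(outcomes) / len(outcomes)

slo_target = 99.5  # assumed target: 99.5% of retrievals succeed
rate = success_rate(retrievals)
print(f"{rate:.1f}% delivered; SLO {'met' if rate >= slo_target else 'missed'}")
# → 99.7% delivered; SLO met
```

Note that retrieval duration never appears in the calculation; that is the whole point of this style of SLO.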
Once we’ve determined that an SLO is appropriate, let’s get the SLO down on paper. Remember, we need to make sure that the wording is Specific and Timebound, and that it is Measurable. If it is not all of these things, then it simply cannot be used as an SLO. Let’s consider an example. A system processes stock trades, and all requests need to be finalised within 300ms, as dictated by a regulatory body. The company running the system wants to offer an SLO that requests, on average over 30 days, are completed faster than 250ms. The system currently responds to 98% of requests within 232ms on a 30-day rolling average. The SLO text would look like this:

“Over any rolling 30-day period, the average request completion time will be less than 250ms.”
Is this a good SLO? Yes. The system already exceeds the SLO, so it is Achievable. There is a legal requirement that requests are finalised within the SLO limits, so it is Relevant. We are Specific with the metric we want to guarantee our performance against, which is the request response time. We have limited our SLO to a 30-day period, which allows us to run reporting that is Timebound. Finally, our metric is Measurable via a Prometheus metric. We have met all the requirements for a SMART SLO that has been tailored to the user experience.
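As a rough illustration of the measurement itself, the 30-day check might look like the sketch below. The sample values are invented; in production, the figures would come from your Prometheus metrics rather than a hard-coded list.

```python
import statistics

# Hypothetical 30-day sample of request completion times in milliseconds.
# In a real system this would be derived from a Prometheus histogram or
# summary; here we fake a tiny sample to show the check itself.
response_times_ms = [210, 225, 232, 198, 240, 215, 229, 236, 244, 208]

SLO_TARGET_MS = 250        # the SLO we offer
REGULATORY_LIMIT_MS = 300  # the hard legal limit

avg = statistics.mean(response_times_ms)
print(f"30-day average: {avg:.1f}ms")
print(f"SLO (<{SLO_TARGET_MS}ms average) met: {avg < SLO_TARGET_MS}")
print(f"Regulatory limit (<{REGULATORY_LIMIT_MS}ms) met: {max(response_times_ms) < REGULATORY_LIMIT_MS}")
```

Keeping the regulatory limit as a separate, stricter check than the offered SLO mirrors the example: the SLO gives you headroom before you are anywhere near the legal boundary.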
How to account for maintenance and scheduled downtime in your SLOs

Everyone needs to maintain their systems: some are highly available and need no downtime, while others require regular maintenance windows. The simple answer is to bake your maintenance into the SLO. If you know you can provide 97% availability for a system over a month but need 14 hours of maintenance (roughly 2% of the month), then only offer 95%. It is better to underpromise and overdeliver than to be red-faced (and out of pocket) because your system has been offline for downtime you knew was coming.
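The arithmetic above can be sketched as follows. The 97% achievable figure and 14 maintenance hours come from the example; the 30-day month is an assumption for the calculation:

```python
# Bake planned maintenance into the availability figure you offer.
# Assumed: a 30-day month, 97% achievable uptime, 14 hours of maintenance.
HOURS_IN_MONTH = 30 * 24  # 720

def offered_availability(achievable_pct, maintenance_hours, period_hours=HOURS_IN_MONTH):
    """Subtract the planned maintenance window from what the platform can achieve."""
    maintenance_pct = 100.0 * maintenance_hours / period_hours
    return achievable_pct - maintenance_pct

print(round(offered_availability(97.0, 14), 1))  # → 95.1, so offer 95%, not 97%
```

Rounding down to a clean 95% rather than promising 95.1% leaves a little extra margin, which is the underpromise-and-overdeliver spirit of the section.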
Now that we have our SLOs, they’re SMART, but… we are just not meeting our targets (or want to exceed them). What do we do? We need to make our systems performant enough to overcome this challenge. While demanding in terms of effort, this is right in the SRE wheelhouse and will rely predominantly on your expertise and knowledge to improve your system performance. If users require faster requests, streamline your proxy config. If disk reads are too slow, consider high-IOPS or higher-throughput alternatives. If batch jobs are taking too long, right-size the instances so that they process in the correct amount of time. More drastic approaches may include changing your operating system, database platform, or even your development frameworks. It entirely depends on your ability to analyse and understand the factors in your system that affect your SLOs, and to mitigate those issues through proper SRE practice.
There are also other options aside from the more technical approach: improved monitoring and disaster recovery. By improving your monitoring, you can ensure that problems are caught before they affect your SLOs. Your disaster recovery plan is key to managing and maintaining your SLOs. Disasters come when we least expect them, so practising and improving DR procedures means that if disaster strikes, you are able to restore service as quickly as possible. This will limit the overall impact to SLOs by ensuring that any disaster downtime is limited to only that which is strictly necessary to recover your systems.
Using these processes, you can deliver SLOs that will please your users and make their experience with your systems a delight. By meeting (and hopefully, exceeding) their expectations, you will build lifelong customers that will evangelise your business and products.
In the second part of this blog, we will be looking at an example based on Bill from The Phoenix Project that will highlight how “achieving SLOs” is not always good for business if those SLOs aren’t derived from customer needs.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.