This article is a translation of SLI・SLO・SLAについて.
About SLIs, SLOs, and SLAs
This post summarizes what I learned while looking into SLIs, SLOs, and SLAs.
What are SLOs, SLIs, and SLAs?
SLIs, SLOs, and SLAs are, respectively, indicators, targets, and agreements related to service levels.
A service level is a specific measure of service provided over a period of time.
- SLI (Service Level Indicator)
  - Service level indicator: a metric used to measure the service level
  - e.g. availability, latency, error rate, throughput
- SLO (Service Level Objective)
  - Service level objective: the target value for a service level, quantitative or qualitative
  - Take external dependencies into account
    - e.g. communication with external services, or the SLOs of the managed services the system relies on
- SLA (Service Level Agreement)
  - Service level agreement: an agreement or guarantee on the service level between the service provider and its users
  - It is better to set the SLA's target value looser than the SLO's
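As a rough illustration of how the three terms relate, the following Python sketch computes an availability SLI from request counts and compares it with an SLO and a looser SLA. The function names, the targets, and the request counts are all assumptions made up for this example, not values from any real service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the share of successful requests (the SLI) as a ratio from 0.0 to 1.0."""
    if total_requests == 0:
        # No traffic in the window: treated as fully available (an assumption).
        return 1.0
    return good_requests / total_requests


def meets_target(sli: float, target: float) -> bool:
    """Return True when the measured SLI reaches the target."""
    return sli >= target


# Hypothetical targets: the SLA is deliberately looser than the SLO.
SLO_TARGET = 0.999  # internal objective
SLA_TARGET = 0.995  # agreement with users

# Hypothetical measurement window.
sli = availability_sli(good_requests=99_800, total_requests=100_000)
print(f"SLI: {sli:.3%}")
print("SLO met:", meets_target(sli, SLO_TARGET))
print("SLA met:", meets_target(sli, SLA_TARGET))
```

With these made-up numbers the SLO is missed while the SLA still holds, which is the margin that a looser SLA is meant to provide.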
How to set SLI/SLO
I find the best practices advocated by New Relic easy to work with.
newrelic.com - Best practices for setting SLI/SLO in modern systems
The article explains how to formulate SLIs/SLOs step by step: define the system boundaries, define the functions within each boundary, define what availability means for each function, and define the SLIs that measure that availability.
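The steps above (boundaries, functions, availability definitions, SLIs) can be written down as a small data structure. This is only a sketch of one possible way to record the result; the class names, field names, and the example service are hypothetical and do not come from the New Relic article.

```python
from dataclasses import dataclass, field


@dataclass
class SLISpec:
    name: str          # what is measured, e.g. "availability"
    good_event: str    # plain-language definition of a "good" event
    slo_target: float  # target ratio of good events, e.g. 0.99


@dataclass
class Function:
    name: str
    slis: list[SLISpec] = field(default_factory=list)


@dataclass
class Boundary:
    name: str
    functions: list[Function] = field(default_factory=list)


# Hypothetical example: one boundary, one function, one SLI.
api = Boundary(
    name="public API",
    functions=[
        Function(
            name="login",
            slis=[
                SLISpec(
                    name="availability",
                    good_event="request returns a non-5xx response",
                    slo_target=0.99,
                )
            ],
        )
    ],
)

# List every SLI with its target, grouped by boundary and function.
for function in api.functions:
    for sli in function.slis:
        print(f"{api.name} / {function.name} / {sli.name}: target {sli.slo_target:.2%}")
```

Keeping the structure this coarse at first makes it easy to split a function into finer units later if operation shows that it is needed.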
When you first start operating SLIs/SLOs, it is recommended to keep them as simple as possible and to begin with loose target values.
cf. sre.google - Chapter 4 - Service Level Objectives
When I formulated SLIs/SLOs at work, I followed this New Relic practice, but I adjusted the functional units so that they were not too fine-grained.
If the functional units are too fine from the beginning, operation becomes difficult, so I think it is better to adjust the granularity as needed once you are operating.
Tips
Notes on terms related to SLI/SLO.
The difference between reliability and availability
- Reliability
  - The degree to which a system tolerates failure
- Availability
  - The degree to which a system can continue to operate
Availability and the corresponding allowed downtime
Availability | Annual Downtime | Monthly Downtime |
---|---|---|
99.0% | 87.6 hours | 7.3 hours |
99.5% | 43.8 hours | 3.65 hours |
99.9% | 8.76 hours | 43.8 minutes |
99.95% | 4.38 hours | 21.9 minutes |
99.99% | 52.56 minutes | 4.38 minutes |
99.999% | 5.256 minutes | 26.28 seconds |
99.9999% | 31.536 seconds | 2.628 seconds |
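Downtime figures like these follow from (1 - availability) × period. Below is a minimal Python sketch of that calculation, assuming a 365-day year and a month defined as one twelfth of a year; the function name `allowed_downtime` is made up for this example.

```python
from datetime import timedelta

YEAR = timedelta(days=365)
MONTH = YEAR / 12  # average month, about 30.4 days (an assumption)


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted within `period` at the given availability."""
    unavailable_ratio = 1 - availability_percent / 100
    return period * unavailable_ratio


# Print annual and monthly allowed downtime for a few availability levels.
for percent in (99.0, 99.9, 99.99, 99.999):
    yearly = allowed_downtime(percent, YEAR)
    monthly = allowed_downtime(percent, MONTH)
    print(f"{percent}% -> per year: {yearly}, per month: {monthly}")
```

Because the ratio is a floating-point number, the printed values can differ from the exact figures by a few microseconds.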
What is an error budget?
An error budget is the amount of unreliability that can be tolerated, calculated from an SLO as 100% minus the SLO target.
e.g. SLO 99.99% → error budget 0.01% (errors must stay at or below this)
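To make the arithmetic concrete, here is a small Python sketch that derives the error budget from an SLO and reports how much of it is left after some failures. The function names and the request counts are hypothetical, and counting failed requests is only one possible way to spend a budget.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a ratio, i.e. 1 minus the SLO target."""
    return 1 - slo


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_bad = error_budget(slo) * total_events
    if allowed_bad == 0:
        # An SLO of 100% (or a window with no events) leaves no budget to spend.
        return 0.0
    return 1 - bad_events / allowed_bad


# Hypothetical window: 1,000,000 requests, 40 failures, SLO of 99.99%.
remaining = budget_remaining(slo=0.9999, total_events=1_000_000, bad_events=40)
print(f"Error budget remaining: {remaining:.0%}")
```

In this made-up window the SLO allows about 100 failed requests, so 40 failures use up roughly 40% of the budget.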
Impression
Making the service level measurable makes it possible to observe whether the users of a service (people or other systems) are being served satisfactorily, and I think it also gives service providers an indicator of whether the service level needs to be improved.
Reference
- newrelic.com - What are SLOs, SLIs and SLAs?
- newrelic.com - New Relic Hands-on: SLI/SLO Design Basics
- cloud.google.com - Thinking about SLO, SLI, SLA: What CRE learned in the field
- cloud.google.com - SRE Fundamentals (2021 Edition): Comparing SLIs, SLAs and SLOs
- cloud.google.com - SLOs, SLIs, SLAs, oh my—CRE life lessons
- cloud.google.com - how to deal with availability, that's the question: what CRE learned in the field
- engineering.mercari.com - 2018/12/25 SLI/SLO in Mercari's Web Microservices
- sre.google - sre-book
- qiita.com - Thoughts for formulating SLI/SLO
- qiita.com - Learn about SRE - Error Budget
- bongineer.net - The Difference Between Reliability and Availability
- mathwords.net - what is the downtime for 99% availability, 99.9% etc.
- wnkhs.net - Availability calculations and assumptions (in representative figures)