Today, people's expectations for both free and paid software services are high: speed, uptime, and a useful UX. Your user base therefore has the right to understand your SaaS availability, quality, and response plans in case disaster strikes. No one likes to fight over the spoils, but a Service Level Agreement provides cover when something goes wrong. Moreover, with system observability in place, the derived service metrics can be used as a baseline for setting higher service-excellence targets or OKRs.
Now, let's talk briefly about what SLAs, SLOs, and SLIs are.
An SLA (Service Level Agreement) is a description of what must happen if an SLO is not met. Generally, a service level agreement is a legal agreement between provider and customer, and it might even include terms of compensation.
For example: if the service does not provide 99% availability over 1 month, the service provider compensates the customer for every minute out of compliance.
SLA = SLOs + Written & Signed Consequences
See AWS S3's SLA for instance: https://aws.amazon.com/s3/sla/
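To make the compensation idea concrete, here is a minimal sketch of an SLA compliance check. The credit tiers and function names are illustrative assumptions, not taken from S3's (or any provider's) actual terms:

```python
# Hypothetical SLA credit calculation. The 99%/95% thresholds and the
# 10%/25% credit tiers are illustrative, not any real provider's terms.

def monthly_availability(total_minutes: int, downtime_minutes: int) -> float:
    """Availability as a percentage over the billing month."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def service_credit_percent(availability: float) -> float:
    """Map measured availability to a service-credit percentage."""
    if availability >= 99.0:
        return 0.0   # SLA met: no compensation owed
    if availability >= 95.0:
        return 10.0  # partial breach
    return 25.0      # severe breach

# A 30-day month has 43,200 minutes; 450 minutes of downtime is ~98.96%,
# which falls below the 99% promise and triggers a credit.
avail = monthly_availability(43_200, 450)
print(f"availability={avail:.2f}% credit={service_credit_percent(avail)}%")
```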
An SLO (Service Level Objective) is a scoped objective that the engineering team must hit in order to meet the agreement. Here are some considerations when setting one:
Identify key metrics (service level indicators — SLIs) from the user perspective, such as availability and latency.
Make it measurable, such as a 300 ms latency threshold.
Allow some slack (an error budget), such as 300 ms 99% of the time.
Be clear about what you promise, for example: 99% of the time (averaged over 1 month), HTTP calls that return status 200 complete in under 300 ms.
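The error budget above translates directly into an allowance of non-compliant time. A quick sketch of that arithmetic (the helper name is an assumption for illustration):

```python
# Translating an SLO target into an error budget over a 30-day window.
# The numbers mirror the 99% example above.

def error_budget_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed non-compliance for a given SLO target."""
    window_minutes = window_days * 24 * 60  # 43,200 for a 30-day month
    return window_minutes * (100.0 - target_percent) / 100.0

print(error_budget_minutes(99.0))  # 432.0 minutes (~7.2 hours) per month
print(error_budget_minutes(99.9))  # ~43.2 minutes per month
```

In other words, tightening the target from 99% to 99.9% shrinks the monthly budget by a factor of ten, which is why each extra "nine" is expensive.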
Example, combining the two SLIs:
The service shall be available 99% of the time and respond faster than 300 ms for 99% of all valid requests, measured over 1 month.
SLO = Availability SLI + Satisfying Latency SLI
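Evaluating that combined objective against real traffic is a simple aggregation. A minimal sketch, assuming request records are (status code, latency in ms) pairs; the record shape and function name are illustrative:

```python
# Checking the combined SLO ("available and under 300 ms for 99% of valid
# requests") against a batch of request records.

def meets_slo(requests, latency_ms=300, target=0.99):
    """True if at least `target` of requests succeeded within `latency_ms`."""
    if not requests:
        return True  # no traffic: nothing violated the objective
    good = sum(1 for status, latency in requests
               if status == 200 and latency < latency_ms)
    return good / len(requests) >= target

# 100 requests: 99 fast successes and 1 slow failure -> exactly 99%, passes.
print(meets_slo([(200, 100)] * 99 + [(500, 400)]))

# Slow responses and errors both count against the same objective.
print(meets_slo([(200, 120), (200, 250), (200, 310), (500, 90)] * 25))
```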
An SLI (Service Level Indicator) is a carefully defined and measurable performance metric, usually an aggregation of events.
Considering the SLO we promised to users, we measure multiple service indicators that contribute to user happiness while they use our app.
Example: possible definitions of SLIs for the "search" interaction might be as follows:
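One such definition, sketched as good events over valid events. The "search" endpoint, the 300 ms threshold, and the function name are illustrative assumptions:

```python
# A latency SLI for a hypothetical "search" endpoint, computed as
# good events / valid events. Threshold and data shape are assumptions.

def search_latency_sli(latencies_ms, threshold_ms=300):
    """Proportion of search requests served faster than threshold_ms."""
    if not latencies_ms:
        return 1.0  # no events: report full compliance
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)

# e.g. 97 of 100 searches came back under 300 ms -> SLI of 0.97
print(search_latency_sli([150] * 97 + [450] * 3))
```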
Notice that we only cover latency-type measurable SLIs here; there are others depending on the nature of your endpoints. See: https://sre.google/workbook/implementing-slos/#slis-for-different-types-of-services
Measuring your software quality is an ongoing process, simply because your software evolves over time. Find time to sit down with the stakeholders at your company to go over the numbers, know where you stand with the metrics, and explore how to improve those indicators. Instead of pointing fingers when things go south, why not be data-driven from the start, making sure the software stays reliable while engineers continuously ship features.