Understanding the 0.6-Second Detection Time for Full Outages

#sre #alerting #monitoring #metrics

If you’ve explored the widely read workbook on Site Reliability Engineering (SRE), you might have encountered Chapter 5, which walks through six methods for alerting based on Service Level Objectives (SLOs). In the first method, which is the most basic and not generally recommended, an alert fires when the target error rate meets or exceeds the SLO threshold. The book labels it "1: Target Error Rate ≥ SLO Threshold."

The book claims that, with this method, the detection time for a full outage is just 0.6 seconds. I found myself wondering where this 0.6-second figure comes from, since it looks calculated rather than measured.

The alert is designed to fire as soon as possible, and in a full outage the error rate is 100%. So why is the detection time cited as exactly 0.6 seconds?

Despite extensive searching, I could not find a clear explanation. I asked @Ray for help, and once everything was clear, I decided to write this blog post to explain the concept in my own way.

The book states:

if the SLO is 99.9% over 30 days, alert if the error rate over the previous 10 minutes is ≥ 0.1%:

- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate10m{job="myjob"} >= 0.001

Assumptions

To grasp the 0.6-second detection time, let’s make two assumptions:

Alerts are evaluated in real-time.
We have a consistent rate of 100 events per second.

Question

From the graph provided, it’s evident that with an error rate of 1%, the detection time is around 1 minute.

Since that case involves only whole seconds, let's start with the following question:

Why is the detection time 1 minute for an error rate of 1%?

The alert expression used is:

sum(rate(slo_errors[10m])) by (job)
/
sum(rate(slo_requests[10m])) by (job) >= 0.001

This computes the ratio of errors to requests over a 10-minute window.

So when:

1 second = 100 events
10 minutes = 600 seconds = 60,000 events

To achieve a 0.1% error rate (i.e. alert to fire), you need:

0.1% of the 60,000 events to fail in 10 minutes = 60 events

For an error rate of 1%:

At 100 events per second, that is 1 failed event every second. Accumulating the 60 failed events needed to reach the 0.1% threshold therefore takes 60 seconds (1 minute).

For a full outage (100% error rate):

At 100 events per second, that is 100 failed events every second. Accumulating the 60 failed events needed to reach the 0.1% threshold therefore takes only 0.6 seconds!
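
The counting argument above can be checked with a small simulation. This is only a sketch under the same assumptions (a steady 100 events per second, no errors before the incident, continuous evaluation). The function name `simulate_detection_time` and the 0.1-second tick are choices made for this illustration and do not come from the book.

```python
from fractions import Fraction


def simulate_detection_time(events_per_second, error_rate, threshold,
                            window_seconds, tick=Fraction(1, 10)):
    """Step forward from the start of the incident until the error ratio
    over the window reaches the threshold. Returns seconds, or None."""
    # With steady traffic the window always holds the same number of events.
    total_in_window = events_per_second * window_seconds
    errors = Fraction(0)
    elapsed = Fraction(0)
    # Once the incident fills the whole window, the ratio stops growing.
    while elapsed < window_seconds:
        elapsed += tick
        errors += events_per_second * error_rate * tick
        if errors / total_in_window >= threshold:
            return elapsed
    return None


# Full outage, then a 1% error rate (threshold 0.1%, 10-minute window).
print(float(simulate_detection_time(100, Fraction(1), Fraction(1, 1000), 600)))       # 0.6
print(float(simulate_detection_time(100, Fraction(1, 100), Fraction(1, 1000), 600)))  # 60.0
```

Exact fractions are used instead of floats so that the comparison against the threshold is not affected by rounding.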

Let's put this into an equation.

Detection Time Equation

Define Parameters:
- Event Rate (R): Number of events per second.
- Failure Rate Threshold (F): Error rate threshold in decimal form (e.g., 0.001 for 0.1%).
- Evaluation Window (W): Time window in seconds for the alerting calculation (e.g., 600 seconds for 10 minutes).
Calculate Total Events Needed for Threshold:
- Total events in the evaluation window = R * W
- Required number of failed events to hit the threshold = (R * W) * F
Determine Detection Time:
- Detection time (T) is the time required to accumulate the number of failed events necessary to meet the threshold.

The formula to calculate detection time is:

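T = Required Failed Events / Failed Events per Second

In a full outage every event fails, so the number of failed events per second equals the event rate R. The detection time then simplifies to:

T = ((R * W) * F) / R = W * F

More generally, if only a fraction E of events fails, the number of failed events per second is R * E, and T = (W * F) / E. For E = 1% this gives the 60 seconds from earlier.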
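
As a rough sketch, the formula can be written as a small function. The parameter `error_rate` (the actual fraction of events failing during the incident) and the function name `detection_time` are introduced here for illustration. The sketch assumes steady traffic and no errors before the incident, in which case the event rate R cancels out.

```python
def detection_time(window_seconds, threshold, error_rate):
    """Seconds until the error ratio over the window reaches the threshold.

    With R events per second, the alert needs R * W * F failed events and
    the incident produces R * error_rate failures per second, so
    T = (R * W * F) / (R * error_rate) = W * F / error_rate.
    """
    if error_rate <= 0 or error_rate < threshold:
        return None  # the ratio never reaches the threshold
    return window_seconds * threshold / error_rate


# 10-minute window, 0.1% threshold, at several actual error rates.
for error_rate in (1.0, 0.1, 0.01):
    print(error_rate, detection_time(600, 0.001, error_rate))
```

With these inputs the results come out at roughly 0.6, 6, and 60 seconds. Because R cancels, the detection time under these assumptions does not depend on how much traffic the service receives.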

Example

For instance, take an event rate of 100 events per second (R = 100), an error rate threshold of 0.1% (F = 0.001), and an evaluation window of 10 minutes (W = 600 seconds):

Calculate Required Failed Events:
- Total events in 10 minutes = R * W = 100 * 600 = 60,000
- Required failed events = 60,000 * 0.001 = 60
Calculate Detection Time:
- Detection time = 60 / 100 = 0.6 seconds (in a full outage, all 100 events per second fail)

Thus, for a full outage, where the failure rate is 100%, the detection time would be around 0.6 seconds.

I hope this explanation helps clear up the 0.6-second detection time!