GCP DevOps Certification - Pomodoro Twelve

#googlecloud #devops #sre #certification

System Complexity

The Coursera SRE programme shifts on to discussing system complexity and the introduction of the initial Service Level Indicators. Google recommends that around 1 to a maximum of 3 SLIs per user journey should be enough.

Thoughts behind limiting the number to a maximum of 3 SLIs:

Not all metrics make good SLIs
Each SLI increases the cognitive load on the operations team
More SLIs lower the signal-to-noise ratio (which can impact resolution time)

You may also have lots of user journeys through your complex system, still resulting in many SLIs. However, each journey should be assessed as to whether or not it is "important enough" to be tracked by an SLI.

An important caveat

Other metrics you might already be recording still have value. The above recommendation isn't a reason to ditch your existing metrics.

A deterioration in SLIs is an indicator that something is wrong. Once that deterioration is bad enough to provoke an operational response, the other monitoring systems will really help in ascertaining the cause.

Managing complexity with aggregation

You might have multiple user journeys.

Take, for example, an online store. People can view a home page that lists products. They can search for products. They can browse products by category and they can see the individual product details.

Each of those could be a separate user journey and result in multiple SLIs. However, if you aggregate what you collect (in terms of SLIs) from each journey, the result can be treated as an overall "Browse" SLI.

The Google course provides this example:

If all of them have availability and latency SLIs, then they could be aggregated.
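To make the aggregation idea concrete, here is a minimal Python sketch. It assumes each journey's SLI is the ratio of good events to valid events, and that aggregation means summing those counts across journeys. The journey names and the numbers are made up for illustration and are not taken from the course.

```python
from typing import Dict, Tuple


def sli(good: int, valid: int) -> float:
    """Return the proportion of valid events that were good."""
    if valid <= 0:
        raise ValueError("valid must be a positive number of events")
    return good / valid


def aggregate_sli(journeys: Dict[str, Tuple[int, int]]) -> float:
    """Sum good and valid events across journeys, then take one ratio."""
    total_good = sum(good for good, _ in journeys.values())
    total_valid = sum(valid for _, valid in journeys.values())
    return sli(total_good, total_valid)


# Hypothetical (good, valid) availability counts for each browse journey.
browse_journeys = {
    "home": (9_950, 10_000),
    "search": (4_900, 5_000),
    "category": (2_970, 3_000),
    "product": (1_980, 2_000),
}

browse_sli = aggregate_sli(browse_journeys)
print(f"Browse availability SLI: {browse_sli:.4f}")
```

The four journeys collapse into a single "Browse" number, which is one thing to watch instead of four.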

Another important caveat

Summing events together can work well for similar user journeys. However, it might not fit a scenario where the user journeys differ widely in rate, for example where their request rates differ significantly.

I.e. the number of valid events (thinking back to previous pomodoros) for a small but significant user journey could get lost in the noise of a higher-rate user journey.

If you face that, then multiplying the SLIs by a weight based upon their portion of the whole might be an option for normalising data across an aggregation.
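The sketch below illustrates both the problem and one possible reading of the weighting idea. It assumes two journeys with very different request rates and uses hypothetical equal weights. The weights, names and counts are all assumptions for illustration and are not taken from the course.

```python
from typing import Dict


def weighted_sli(slis: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-journey SLIs using weights, normalised to sum to 1."""
    total_weight = sum(weights[name] for name in slis)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(slis[name] * weights[name] for name in slis) / total_weight


# Hypothetical (good, valid) counts: one busy journey, one quiet one.
counts = {
    "home": (99_900, 100_000),  # high rate, healthy
    "product": (800, 1_000),    # low rate, clearly degraded
}

# Summing events: the quiet journey is drowned out by the busy one.
summed = sum(g for g, _ in counts.values()) / sum(v for _, v in counts.values())

# Per-journey SLIs combined with assumed equal weights.
per_journey = {name: g / v for name, (g, v) in counts.items()}
weights = {"home": 1.0, "product": 1.0}
weighted = weighted_sli(per_journey, weights)

print(f"Summed:   {summed:.4f}")
print(f"Weighted: {weighted:.4f}")
```

With these made-up numbers the summed SLI still looks healthy, while the weighted one makes the degraded journey visible. One thing worth noting: if the weights are simply each journey's share of total traffic, the weighted result works out the same as summing events, so the weights need to reflect something else, such as how much each journey matters to you.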