
James Heggs


GCP DevOps Certification - Pomodoro Twelve

System Complexity

The Coursera SRE programme moves on to discussing system complexity and the introduction of the initial Service Level Indicators (SLIs). Google recommends that between 1 and a maximum of 3 SLIs per user journey should be enough.

Thoughts behind limiting the number to a maximum of 3 SLIs:

  • Not all metrics make good SLIs
  • Each SLI increases the cognitive load on the operations team
  • More SLIs lower the signal-to-noise ratio (which can impact resolution time)

You may also have lots of user journeys through your complex system, which still results in many SLIs. However, each journey should be assessed as to whether or not it is "important enough" to be tracked by an SLI.

An important caveat

Other metrics you might already be recording still have value. The above recommendation isn't one that should be used to ditch your existing metrics.

A deterioration in SLIs is an indicator that something is wrong. Once that deterioration is bad enough to provoke an operational response, the other monitoring systems will really help in ascertaining the cause.

Managing complexity with aggregation

You might have multiple user journeys.

Take an online store, for example. People can view a home page that lists products. They can search for products. They can browse products by category and they can see the individual product details.

Each of those could be a separate user journey and result in multiple SLIs. However, if you aggregate what you collect (in terms of SLIs) from each journey, then the result could be treated as an overall "Browse" SLI.

The Google course provides this example:

[Image: Google SLI aggregation]

If all of the journeys have availability and latency SLIs, then they could be aggregated.
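As a rough sketch of what that aggregation could look like, the snippet below sums the good and valid event counts from each journey into a single "Browse" availability SLI. The journey names follow the online store example, but the counts and the `JourneyCounts` and `aggregate_sli` names are made up for illustration and do not come from the course.

```python
from dataclasses import dataclass


@dataclass
class JourneyCounts:
    good: int   # events that met the SLI criterion (e.g. successful responses)
    valid: int  # all events that count towards the SLI


# Hypothetical counts for the four browse-related journeys.
journeys = {
    "home": JourneyCounts(good=9_900, valid=10_000),
    "search": JourneyCounts(good=4_850, valid=5_000),
    "category": JourneyCounts(good=2_970, valid=3_000),
    "product": JourneyCounts(good=1_980, valid=2_000),
}


def aggregate_sli(counts):
    """Sum good and valid events across journeys, then take the ratio."""
    counts = list(counts)
    total_good = sum(c.good for c in counts)
    total_valid = sum(c.valid for c in counts)
    if total_valid == 0:
        return None  # no valid events, so the SLI is undefined
    return total_good / total_valid


browse_availability = aggregate_sli(journeys.values())
print(f"Browse availability SLI: {browse_availability:.3%}")
```

The same summing approach would apply to a latency SLI, with "good" meaning requests served faster than the chosen threshold.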

Another important caveat

Summing events together can work well for similar user journeys. However, it might not fit a scenario where there is a large disparity between the user journeys, such as request rates that differ significantly.

I.e. the number of valid events (thinking back to previous pomodoros) for a small but significant user journey could get lost in the noise of a higher-rate user journey.

If you face that, then multiplying the SLIs by a weight based upon their portion of the whole might be an option for normalising data across an aggregation.
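To make the problem and one possible fix concrete, here is a minimal sketch comparing a plain summed SLI with a weighted one. The journeys, counts and weights are all hypothetical, and the weights are assumed to be chosen by the team (summing to 1) rather than taken from the course. Note that weighting each journey purely by its share of traffic gives the same answer as summing, so the weights here deliberately differ from the traffic split.

```python
# Hypothetical (good, valid) event counts: one high-rate journey, one low-rate.
counts = {
    "browse": (99_000, 100_000),  # 99% good, very high request rate
    "checkout": (80, 100),        # 80% good, low rate but significant
}

# Assumed weights reflecting how much each journey matters; they sum to 1.
weights = {"browse": 0.5, "checkout": 0.5}


def summed_sli(counts):
    """Plain aggregation: total good events over total valid events."""
    total_good = sum(good for good, _ in counts.values())
    total_valid = sum(valid for _, valid in counts.values())
    return total_good / total_valid


def weighted_sli(counts, weights):
    """Weight each journey's own SLI ratio, then add the results together."""
    return sum(weights[name] * good / valid
               for name, (good, valid) in counts.items())


summed = summed_sli(counts)
weighted = weighted_sli(counts, weights)

print(f"Summed SLI:   {summed:.4f}")    # checkout problems barely register
print(f"Weighted SLI: {weighted:.4f}")  # checkout problems are clearly visible
```

With these numbers the summed SLI stays close to 99% even though one in five checkout events is failing, while the weighted SLI drops noticeably.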

Top comments (2)

Magnus Markling

Was this your last pomodoro? Did you sit the exam yet?

James Heggs

Hey @memark, unfortunately not. I put this one down but will revisit...