James Heggs

Posted on Sep 28, 2020 • Edited on May 9, 2021

GCP DevOps Certification - Pomodoro Three

#devops #certification #sre #googlecloud

How Google see DevOps

In the video content presented by @lizthegrey and @sethvargo, they discuss the potential friction between developers and operations: change vs stability.

Thank god I haven't been explaining it wrong! 🤓 Well, by this video at least.

Google outline DevOps as having 5 characteristics

Reduce organisational silos
Accept failure as normal
Implement gradual change
Leverage tooling and automation
Measure everything

Moving on to SRE

Google think of Site Reliability Engineering in a similar manner to the way an object-oriented class might implement an interface.

class SRE implements DevOps

Specifically, SREs will:

Share ownership of environments with developers (Reducing organisational silos)
Use Service Level Objectives and blameless post-mortems (Accepting failure as normal)
Reduce the cost of failure with techniques such as canary releases (Implementing gradual change)
Eliminate as much manual work as possible (Tooling and automation)
Measure toil and system reliability (Measuring everything)
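
To make the "class implements interface" idea a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not anything Google publish: the class and method names are made up, and Python has no `implements` keyword, so an abstract base class stands in for the interface.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps characteristics, modelled as an interface (hypothetical names)."""

    @abstractmethod
    def reduce_organisational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete way of satisfying the DevOps interface."""

    def reduce_organisational_silos(self) -> str:
        return "Share ownership of environments with developers"

    def accept_failure_as_normal(self) -> str:
        return "Service Level Objectives and blameless post-mortems"

    def implement_gradual_change(self) -> str:
        return "Reduce the cost of failure, e.g. canary releases"

    def leverage_tooling_and_automation(self) -> str:
        return "Eliminate as much manual work as possible"

    def measure_everything(self) -> str:
        return "Measure toil and system reliability"


sre = SRE()
print(sre.accept_failure_as_normal())
```

The point of the analogy is that `DevOps` only says what should happen, while `SRE` is one opinionated answer to how. Trying to instantiate `DevOps()` directly raises a `TypeError`, which fits the analogy quite nicely.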

Let's talk reliability

This part I love. Actually question how reliable your systems need to be, and the inverse: how much unreliability (your error budget) you can tolerate.
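
The arithmetic behind an error budget is simple enough to sketch. This is a rough illustration assuming a 28 day window; the function name and constant are my own, not from the course material.

```python
WINDOW_DAYS = 28  # assumed measurement window


def error_budget_seconds(availability_percent: float,
                         window_days: int = WINDOW_DAYS) -> float:
    """Return the downtime (in seconds) allowed in the window for a target."""
    window_seconds = window_days * 24 * 60 * 60
    unavailable_fraction = (100 - availability_percent) / 100
    return window_seconds * unavailable_fraction


for target in (99.9, 99.99, 99.999):
    budget = error_budget_seconds(target)
    print(f"{target}% -> {budget / 60:.1f} minutes ({budget:.0f} seconds)")
```

Each extra nine divides the budget by ten, which is why the jump from "a human can react" to "only a machine can react" happens so quickly.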

3 nines = 99.9% = 40 mins over 28 day period

So that's just about enough for a monitoring system to spot an issue, alert someone, and for the human to take action. Depending on the root cause, of course.

4 nines = 99.99% = 4 mins over 28 day period

Now you're into the world of machine-based detection and self-healing. Software updates and roll-outs probably need to be isolated to decoupled areas.

5 nines = 99.999% = 24 secs over 28 day period

Good luck! Your monitoring system could even miss this amount of downtime altogether. Imagine you're checking uptime every minute: the outage could start and finish between two checks, and you would falsely report that you are 'up'.
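
Here is a small sketch of why a once-a-minute check can miss a short outage. It is a toy model under assumptions of my own: probes are instantaneous, fire at exact multiples of the interval starting at t = 0, and there is a single continuous outage. The function names are hypothetical.

```python
import math


def probe_sees_outage(outage_start: float,
                      outage_seconds: float,
                      probe_interval: float = 60.0) -> bool:
    """Return True if any probe lands inside [outage_start, outage_end)."""
    outage_end = outage_start + outage_seconds
    # Time of the first probe at or after the outage begins.
    first_probe = math.ceil(outage_start / probe_interval) * probe_interval
    return first_probe < outage_end


def detection_probability(outage_seconds: float,
                          probe_interval: float = 60.0) -> float:
    """Chance a probe catches an outage that starts at a uniformly random time."""
    return min(1.0, outage_seconds / probe_interval)


# A 20 second outage starting 10 seconds after a probe slips through...
print(probe_sees_outage(outage_start=10, outage_seconds=20))
# ...but the same outage starting 50 seconds after a probe is caught.
print(probe_sees_outage(outage_start=50, outage_seconds=20))
print(detection_probability(20))
```

In this model, any outage shorter than the probe interval has a real chance of never being seen at all, so the probe interval needs to be well below the error budget before the uptime number means much.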

Now extend those thoughts to consuming public cloud services like Google Cloud. Introduce a support request round trip and you've likely consumed your reliability error budget.

Super interesting thoughts on just how "available" a system needs to be and what implications that has.

The StayPuft Man

OK, I admit it - I couldn't help myself... today I ran one of the mock tests. You know, just to understand my gaps.

I managed to get 7 out of 13 correct - about 54%.

My gaps at this stage were around specific Google APIs, particularly within Stackdriver, and recommended security practices.

To anyone that has worked with me - that probably isn't new information #cowboy
