How Google see DevOps
I've been watching the video content presented by @lizthegrey and @sethvargo, in which they discuss the potential friction between developers and operations: change vs stability.
Thank god I haven't been explaining it wrong! 🤓 Well, according to this video at least.
Google outline DevOps as having 5 characteristics:
Reduce organisational silos
Accept failure as normal
Implement gradual change
Leverage tooling and automation
Measure everything
Moving on to SRE
Google think of Site Reliability Engineering in a similar manner to the way an object-oriented class might implement an interface:
class SRE implements DevOps
Specifically, SREs will:
Share ownership of environments with developers (Reducing organisational silos)
Use Service Level Objectives and blameless post-mortems (Accepting failure as normal)
Reduce the cost of failure with techniques such as canary releases (Implementing gradual change)
Eliminate as much manual work as possible (Tooling and automation)
Measure toil and system reliability (Measuring everything)
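To make the "class SRE implements DevOps" analogy concrete, here is a minimal Python sketch. The method names are my own invention for illustration (they are not from the video), and each method simply returns the matching SRE practice from the list above.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps characteristics as an abstract interface.

    Method names are hypothetical, chosen only for this illustration.
    """

    @abstractmethod
    def reduce_organisational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete way of satisfying the DevOps interface."""

    def reduce_organisational_silos(self) -> str:
        return "Share ownership of environments with developers"

    def accept_failure_as_normal(self) -> str:
        return "Service Level Objectives and blameless post-mortems"

    def implement_gradual_change(self) -> str:
        return "Reduce the cost of failure, e.g. canary releases"

    def leverage_tooling_and_automation(self) -> str:
        return "Eliminate as much manual work as possible"

    def measure_everything(self) -> str:
        return "Measure toil and system reliability"
```

The abstract base class cannot be instantiated on its own, which mirrors the idea that DevOps describes what should happen while SRE is one prescription for how to do it.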
Let's talk reliability
This part I love. Actually question how reliable your systems need to be and, inversely, how much unreliability (your error budget) you can tolerate.
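The arithmetic behind an error budget is simple enough to sketch. This assumes a 28-day measurement window and treats the budget purely as allowed downtime; the function name is my own.

```python
def error_budget_seconds(availability_percent: float, window_days: int = 28) -> float:
    """Allowed downtime, in seconds, for an availability target over the window."""
    window_seconds = window_days * 24 * 60 * 60
    unavailability = 1 - availability_percent / 100
    return window_seconds * unavailability


for target in (99.9, 99.99, 99.999):
    budget = error_budget_seconds(target)
    print(f"{target}% -> {budget / 60:.1f} min ({budget:.0f} s)")

# 99.9% -> 40.3 min (2419 s)
# 99.99% -> 4.0 min (242 s)
# 99.999% -> 0.4 min (24 s)
```

Each extra nine cuts the budget by a factor of ten, which is why the operational picture changes so sharply from one level to the next.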
3 nines = 99.9% = 40 mins over a 28-day period
So that's just about enough for a monitoring system to spot an issue, alert someone and for the human to take action. Depending on the root cause, of course.
4 nines = 99.99% = 4 mins over a 28-day period
Now you're into the world of machine-based detection and self-healing. Software updates and roll-outs probably need to be isolated to decoupled areas.
5 nines = 99.999% = 24 secs over a 28-day period
Good luck! There is even potential for your monitoring system to miss this amount of downtime entirely. Imagine you're checking uptime every minute: you could have just missed your downtime issue and falsely reported that you are 'up'.
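A rough sketch of why a once-a-minute check can miss a short outage. The assumptions here are mine: probes fire at exactly t = 0, 60, 120... seconds, each probe is instantaneous, and the outage lasts an illustrative 24 seconds.

```python
def probe_detects_outage(outage_start: int, outage_seconds: int,
                         probe_interval: int = 60, window: int = 3600) -> bool:
    """True if any probe at t = 0, interval, 2*interval, ... falls inside the outage."""
    outage_end = outage_start + outage_seconds
    return any(outage_start <= t < outage_end
               for t in range(0, window, probe_interval))


# Outage from t=70 to t=94: probes at t=60 and t=120 both see a healthy system.
print(probe_detects_outage(70, 24))  # False

# Outage from t=50 to t=74: the probe at t=60 catches it.
print(probe_detects_outage(50, 24))  # True

# Try every possible start second within one probe interval.
detected = sum(probe_detects_outage(start, 24) for start in range(60, 120))
print(f"{detected}/60 start offsets detected")  # 24/60 start offsets detected
```

Under these assumptions a 24-second outage is only seen 40% of the time, so at this level of reliability the probe interval needs to be much shorter than the budget you are trying to measure.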
Now extend those thoughts to consuming public cloud services like Google Cloud. Introduce a support request round trip and you've likely consumed your reliability error budget.
Super interesting thoughts on just how "available" a system needs to be and what implications that has.
The StayPuft Man
OK, I admit it - I couldn't help myself... today I ran one of the mock tests. You know, just to understand my gaps.
I managed to get 7 out of 13 correct - about 54%.
My gaps at this stage were around specific Google APIs, particularly within Stackdriver, and recommended security practices.
To anyone that has worked with me - that probably isn't new information #cowboy