These are my notes on Chapter 18, "Software Engineering in SRE", from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post is "SRE book notes: Postmortem Culture" (Hercules Lemke Merscher, Feb 1).
One of SRE’s guiding principles is that "team size should not scale directly with service growth."
Long-term project work provides much-needed balance to interrupts and on-call work, and can provide job satisfaction for engineers who want their careers to maintain a balance between software engineering and systems engineering.
The desirability of team diversity is doubly true for SRE, where a variety of backgrounds and problem-solving approaches can help prevent blind spots. To this end, Google always strives to staff its SRE teams with a mix of engineers with traditional software development experience and engineers with systems engineering experience.
Don’t underestimate the effort required to raise awareness and interest in your software product—a single presentation or email announcement isn’t enough. Socializing internal software tools to a large audience demands all of the following:
- A consistent and coherent approach
- User advocacy
- The sponsorship of senior engineers and management, to whom you will have to demonstrate the utility of your product
This is particularly challenging when reliability is already a serious problem and the company is trying to contain the damage by limiting the blast radius. On one hand, the concerns are legitimate: a new piece makes the puzzle more complex. On the other hand, it might be the very piece that brings some order to the chaos.
Advocacy, cross-team collaboration, and sometimes politics will end up consuming more time than expected.
Context awareness is key!
A diversity of experiences helps cover blind spots and avoid the pitfall of assuming that every team’s use case is the same as yours.
The majority of software products developed within SRE begin as side projects whose utility leads them to grow and become formalized.
This is a point worth stressing: it’s essential that the SREs involved in any development effort continue working as SREs instead of becoming full-time developers embedded in the SRE organization. Immersion in the world of production gives SREs performing development work an invaluable perspective, as they are both the creator and the customer for any product.
The chapter also explains how SREs at Google do capacity planning and how they built a tool called Auxon to help with it. It's very much worth reading!
If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.
You can also follow me on Twitter and Mastodon.
Photo by Alain Pham on Unsplash