Which of the following three scenarios do you experience the most when a new incident occurs?
• Surprised: You didn’t see the problem coming and it’s sheer luck you get the system back online.
• Prepared: You’ve seen similar incidents happen enough times that you are not surprised, so you know where the issue is and how to fix it.
• Proactive: The incident is novel. You detect it immediately, remediate it swiftly, and prevent it from happening again.
For many teams, incidents unfortunately fall into the first scenario, with some classes of incidents catching them by surprise. It's astonishing that despite the vast amount of time we spend working on and thinking about our systems, we seem to have very little control over them. If we can’t predict where the next incidents will come from, we will be forever stuck in a reactive cycle of repair.
An analogous example is the famous fable of the Three Little Pigs. We can imagine the various compositions of the three scenarios above as houses of straw, sticks, and bricks. Which house do you currently live in? Which house would you like to live in?
“The Straw House”: In the face of incidents, you are most often surprised, sometimes prepared, rarely proactive. This is a system easily taken down by incidents. You are stuck scrambling to build the house back up after each rain storm or blow from the wolf. You feel stressed and bogged down in pager hell.
“The Stick House”: In the face of incidents, you are often prepared, sometimes surprised, rarely proactive. You see the house creaking and leaking when it rains, so you know which patches are weak. You replace each stick as it breaks, fixing each incident as it happens. But hurricanes still take you by surprise, tearing down the house because you haven’t had the time, money or partners to strengthen the structure overall. You feel stuck doing repetitive work.
“The Brick House”: You are mostly proactive about incident prevention, often prepared for incidents, and rarely surprised by them. You designed the structure of this house to withstand wolves and hurricanes. Before the house was even built, inked in the blueprint were not one, but two layers of brick, sturdy pillars, and an arch that would be tricky to install. Your team designs the system up front to be sturdy and scalable. You prevent classes of issues and quickly mitigate unexpected incidents. You feel calm and in control.
Moving to a brick house never happens by accident. Rigorous culture practices are the bricks that make up a sturdy house. Writing postmortems well is one such culture practice (it was covered at great length in Part 1 of this two-part series). Here we will elaborate on four additional culture practices that will fortify a company’s reliability.
This is Part 2 of the interview with the 10+ year veteran Google SRE, Steve McGhee. The practices below will take more effort to implement than the postmortem practices introduced in Part 1. Much like in recruiting, short-term productivity loss is a necessary sacrifice for long-term execution gain.
(Disclaimer: Please note that Steve gave this interview prior to re-joining Google, so this interview is not a statement from Google nor does it represent Google’s view.)
As mentioned in Part 1 of the article, leading teams out of pager hell and into a sturdy house of bricks is only possible with thorough and rigorous processes. One such process is the prioritization of incident-causing bugs. Any existing bug that "touches" a postmortem is to be given a +1 importance (e.g. P2->P1). Of course, this needs to be agreed to by management first, as priority-inflation left unchecked can result in significant impact to existing schedules. Remember, if everything is a P1, nothing is.
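The bump rule above can be sketched in a few lines. This is only an illustration, not a description of any particular bug tracker: the `Bug` record, its field names, and the P0-to-P4 scale are assumptions, and the "bump only once" guard is an added assumption meant to keep priority inflation in check.

```python
from dataclasses import dataclass

# Assumed priority scale: P0 is the most urgent, larger numbers are less urgent.
HIGHEST_PRIORITY = 0


@dataclass
class Bug:
    bug_id: str
    priority: int                 # 2 means P2
    postmortem_ids: tuple = ()    # postmortems this bug has "touched"
    bumped: bool = False          # assumption: a bug is bumped at most once


def bump_if_postmortem_tainted(bug: Bug) -> Bug:
    """Raise importance by one level (e.g. P2 -> P1) if the bug touches a postmortem."""
    if bug.postmortem_ids and not bug.bumped:
        bug.priority = max(HIGHEST_PRIORITY, bug.priority - 1)
        bug.bumped = True
    return bug


bug = bump_if_postmortem_tainted(Bug("bug-1", priority=2, postmortem_ids=("pm-7",)))
print(f"{bug.bug_id} is now P{bug.priority}")
```

Bugs that no postmortem references keep their priority, which is what keeps "everything is a P1" from happening by default.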
This practice feeds into the culture of bug triage and hygiene. You can adopt a separate periodic review of "postmortem-tainted bugs" to assess the speed of their resolution. One way to monitor the speed is to set an SLO, an internal objective. For example, "all incident-causing bugs will be resolved within 30 days."
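As a rough sketch of what such a periodic review could compute, the function below checks a list of incident-causing bugs against the 30-day example objective. The input shape (pairs of opened and resolved dates) and the three buckets are assumptions made for illustration.

```python
from datetime import date, timedelta

RESOLUTION_TARGET = timedelta(days=30)  # the example objective from the text


def slo_report(bugs, today):
    """Classify bugs given as (opened, resolved_or_None) date pairs.

    Returns (met, breached, pending):
      met      - resolved within the target
      breached - resolved late, or still open past the target
      pending  - still open but inside the target window
    """
    met = breached = pending = 0
    for opened, resolved in bugs:
        age = (resolved or today) - opened
        if age > RESOLUTION_TARGET:
            breached += 1
        elif resolved is not None:
            met += 1
        else:
            pending += 1
    return met, breached, pending


bugs = [
    (date(2024, 1, 1), date(2024, 1, 20)),  # resolved in 19 days
    (date(2024, 1, 1), date(2024, 2, 15)),  # resolved in 45 days
    (date(2024, 1, 1), None),               # still open, long past target
    (date(2024, 3, 1), None),               # still open, 9 days old
]
print(slo_report(bugs, today=date(2024, 3, 10)))  # (1, 2, 1)
```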
A good monitoring system can be a game changer for speeding up incident resolution. Imagine a monitoring dashboard system that can be easily customized on-demand by a team, for each incident. When an incident happens, you can slice data in different ways to compare the resulting graphs on the same interface. A good example is Google’s Viceroy console system, which provides a consistent interface for defining, viewing, and drilling-down into graph sets.
However, avoid trying to pre-build “the perfect console”. Instead, build a flexible console system that allows for quick development of an ephemeral console for a specific alert. In an ideal world, all relevant team members can collaborate in real time in the console for an incident. Graphs from the console are then included in the postmortem to help readers understand what happened.
What are the hardest incidents to detect?
Pervasive slow burns. These issues often start off slow but hit a tipping point that cripples the system. We often see this with capacity monitoring. Even though we are not at capacity for the metric we care about (e.g. RAM), there’s a different resource (e.g. threads) that is limiting us. So we would unexpectedly run out of threads before we run out of RAM. Issues like these can only be uncovered by deep monitoring. But deep monitoring is not enough: high-signal, low-noise SLOs are critical to maximizing the utility of deep monitoring. And improvements to monitoring and SLOs cannot be accomplished by SREs alone.
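The RAM-versus-threads example can be made concrete with a small calculation: estimate how long each resource has left, then look at the one that runs out first rather than the one you happen to be watching. This is a minimal sketch that assumes linear growth; the resource names and numbers are invented for illustration.

```python
def hours_to_exhaustion(current, limit, growth_per_hour):
    """Linear estimate of the hours left before a resource hits its limit."""
    if growth_per_hour <= 0:
        return float("inf")  # flat or shrinking usage never exhausts
    return max(0.0, (limit - current) / growth_per_hour)


def first_to_run_out(resources):
    """resources maps name -> (current, limit, growth_per_hour)."""
    eta = {name: hours_to_exhaustion(*vals) for name, vals in resources.items()}
    name = min(eta, key=eta.get)
    return name, eta[name]


resources = {
    "ram_gb": (32, 64, 0.5),      # the metric we care about: 64 hours left
    "threads": (900, 1000, 10),   # the one nobody is watching: 10 hours left
}
print(first_to_run_out(resources))  # ('threads', 10.0)
```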
SREs and the development team need to work together to constantly update and fill gaps in monitoring, as uncovered by postmortems, to create a cycle of continuous learning and growth.
SREs are good at discovering and highlighting good metrics based on infrastructure (both hardware and software), but might not immediately know what to look for in a given piece of application software. For example, exposing a counter that tracks the number of times you reshard an internal data structure is something that the developer of a service would easily recognize as a good idea. An SRE who is new to a product might have to read and understand every line of code before realizing that the counter is an important metric to track.
Having the development team expose service-specific metrics, plus an infrastructure/platform/SRE team exposing service-agnostic metrics will give you the best chance of having the right metrics in place, findable during an outage.
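To illustrate the reshard counter mentioned above, here is a toy sketch of a developer exposing a service-specific metric from inside their own data structure. `METRICS` is a stand-in for whatever metrics library a real service would use, and `ShardedMap`, its thresholds, and the metric name are all hypothetical.

```python
from collections import Counter

# Hypothetical in-process registry standing in for a real metrics library.
METRICS = Counter()


class ShardedMap:
    """Toy key-value store that doubles its shard count when it gets too full."""

    def __init__(self, shards=2, max_per_shard=4):
        self.shards = [dict() for _ in range(shards)]
        self.max_per_shard = max_per_shard

    def __len__(self):
        return sum(len(shard) for shard in self.shards)

    def _shard_for(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value
        if len(self) > self.max_per_shard * len(self.shards):
            self._reshard()

    def _reshard(self):
        # The service-specific metric: only the developer knows this event matters.
        METRICS["sharded_map_reshards_total"] += 1
        items = [item for shard in self.shards for item in shard.items()]
        self.shards = [dict() for _ in range(len(self.shards) * 2)]
        for key, value in items:
            self._shard_for(key)[key] = value


store = ShardedMap(shards=2, max_per_shard=2)
for i in range(9):
    store.put(i, i)
print(METRICS["sharded_map_reshards_total"])  # 2
```

A sudden jump in a counter like this during an outage is the kind of signal that is easy to find if it was exposed ahead of time and nearly impossible to reconstruct afterwards.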
Lastly, Steve’s favourite analogy for the incident resolution process is “pruning the cause tree”. Looking for the cause(s) of an incident is like running a tree traversal algorithm. You don't want to execute a depth-first search. Instead, you want to find ways to prune entire subtrees of causes as early as possible, so a breadth-first approach with pruning in mind is a good way of reducing the possible search space.
This is much like symptom-based monitoring, not cause-based. Don’t guess at a cause and look for evidence. Instead, start with the user-pain, and follow the evidence trail. From searching within a group of microservices to locating the error within a specific function, every step narrows the problem search space.
Be sure to go through the narrowing process in the open so others benefit from your pruning. For example, in the working postmortem document, an on-caller can say, "I've looked into and I know that it's not there for ." This way other on-callers don't waste time (potentially in parallel) searching the same path.
Starting your analysis with a proposed cause for an outage is like traversing a tree from arbitrary leaves. Instead, you should start from Symptoms (SLOs) at the root, and prune the tree until you arrive at the real Cause. Science!
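The traversal described above can be sketched as a breadth-first walk that starts at the symptom and drops any subtree whose top node looks healthy. This is an illustration of the analogy, not a tool Steve describes: the tree contents, the `shows_symptom` check, and the function name are all assumptions.

```python
from collections import deque


def find_causes(tree, root, shows_symptom):
    """Breadth-first walk from the symptom at the root of the cause tree.

    tree maps a node to its children; shows_symptom(node) is a cheap check
    (a graph, a log query). A node that looks healthy is pruned together
    with its entire subtree. Returns (suspect_leaves, pruned_nodes).
    """
    suspects, pruned = [], []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if not shows_symptom(node):
            pruned.append(node)  # note it in the open so nobody re-checks it
            continue
        children = tree.get(node, [])
        if children:
            queue.extend(children)
        else:
            suspects.append(node)
    return suspects, pruned


tree = {
    "checkout errors": ["frontend", "payments", "inventory"],
    "payments": ["card-api", "ledger"],
    "ledger": ["write_entry()", "read_balance()"],
}
unhealthy = {"checkout errors", "payments", "ledger", "write_entry()"}
suspects, pruned = find_causes(tree, "checkout errors", lambda n: n in unhealthy)
print(suspects)  # ['write_entry()']
print(pruned)    # ['frontend', 'inventory', 'card-api', 'read_balance()']
```

The `pruned` list is the part worth writing into the working postmortem document: it is the record of which paths have already been ruled out.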
Even with exceptional reliability and rigorous culture practices, it’s vital to point out that not every incident is preventable. The truth is, you are always going to be unavailable sometimes. However, you can either resolve incidents and forget about them, or you can try to understand why the house fell down and systematically fix deeper causes.
Our ultimate goal is less about maximizing reliability and more about gaining greater control over our reliability. We want to deeply understand why we are unavailable through postmortems. Only then can we prevent some of the incidents and build a stronger house through culture practices, brick by brick.
This is the second article of a two-part series. Click here for part 1 of the interview with Steve McGhee.
Edited by: Charlie Taylor
Written by: Steve McGhee, Rui Su