Originally published on Failure is Inevitable.
On-call: you may see it as a necessary evil. When fast incident response can make or break your reputation, designating people across the team to be ready to react at all hours of the day is a necessity. But, this often creates immense stress while eating into personal lives. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager.
But does on-call have to be so dreadful? No way. Here are five best practices to help your team respond quicker and build more resilient systems.
Not all incidents are created equal. On-call escalations should start only when an incident is worth getting out of bed for. The low-level metrics you can monitor and alert on might not capture the actual severity of an incident. Instead, consider the impact different types of incidents have on your customers, and create severity tiers based on that impact.
To determine impact, use techniques such as user journeys (where metrics are consolidated based on typical usage patterns) and black box monitoring (where metrics are gathered only using what external customers can see). These will help you break down an incident into specific metrics you’ll monitor to trigger alerts. This also helps you cut out metrics that only make things noisier.
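As a rough sketch, impact-based severity tiers might be encoded like this. The metric names and thresholds below are invented for illustration, not a recommendation:

```python
# Hypothetical severity classifier: maps customer-facing, black-box
# measurements (what external users actually experience) to a severity tier.
# All thresholds and metric names are illustrative only.
def classify_severity(error_rate: float, p99_latency_ms: float,
                      checkout_available: bool) -> int:
    """Return a severity tier (0 = page immediately, 3 = next business day)."""
    if not checkout_available or error_rate > 0.25:
        return 0  # a core user journey is broken: wake someone up
    if error_rate > 0.05 or p99_latency_ms > 2000:
        return 1  # degraded experience: page during waking hours
    if error_rate > 0.01 or p99_latency_ms > 1000:
        return 2  # noticeable but tolerable: file a ticket
    return 3      # informational: review at the next triage meeting

print(classify_severity(0.30, 150, True))   # high error rate -> Sev 0
print(classify_severity(0.02, 1200, True))  # mild degradation -> Sev 2
```

The point of encoding tiers this way is that the thresholds become something the team can review and argue about in retrospectives, rather than folklore.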
Once you have your metrics, make sure your team agrees on how to classify incidents and what response each class requires. Schedule time to review these choices based on retrospectives of previous incidents. Was that Sev 0 actually a Sev 0? Does a Sev 3 need all those people alerted? Your classification system should be logical and consistent.
Knowing the difference between a Sev 0 and a Sev 3 incident can save you from opening your laptop at 2 AM. It can also save you from underestimating a critical, customer-facing incident.
Imagine an incident that is crucial enough to rouse a team member in the wee hours of the morning. What can your team do to help them resolve the incident and get back to bed as fast as possible? The answer is a runbook.
A runbook is a set of detailed instructions for resolving each type of incident. This guidance eases the cognitive burden of on-call troubleshooting: it lists specific commands to execute and places in the code to check, and it covers processes such as:
- Escalating incidents - whom to notify and when
- Assigning roles - who will handle what if things escalate
- Retrospective creation - document decisions made and communications
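A minimal runbook sketch, expressed as structured data so that steps, escalation, and roles can be rendered or checked programmatically. All service names, steps, and roles below are hypothetical:

```python
# A toy runbook as structured data. Keeping runbooks machine-readable makes
# them easier to render, validate, and partially automate. Everything here
# (service names, commands, escalation targets) is invented for illustration.
RUNBOOK = {
    "incident_type": "elevated checkout errors",
    "severity_default": 1,
    "steps": [
        "Check the checkout-api dashboard for error spikes",
        "Run `kubectl rollout status deploy/checkout-api` to verify the last deploy",
        "If the latest deploy correlates with the spike, roll it back",
    ],
    "escalation": {"after_minutes": 30, "notify": "payments-team-lead"},
    "roles": {"commander": "on-call primary", "scribe": "on-call secondary"},
}

def next_step(runbook: dict, completed: int) -> str:
    """Return the next step, or the escalation instruction once steps run out."""
    steps = runbook["steps"]
    if completed < len(steps):
        return steps[completed]
    esc = runbook["escalation"]
    return f"Escalate: notify {esc['notify']}"

print(next_step(RUNBOOK, 0))  # first troubleshooting step
print(next_step(RUNBOOK, 3))  # all steps exhausted -> escalate
```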
Creating runbooks will also help you discover procedures that can be automated, or toil that can be alleviated through tooling. Give engineers time to vent about which processes are most tedious, and seek to improve them. Working through these steps at 3 AM is way harder than writing them down in the afternoon.
Despite all the planning required, incidents often need more than a standard solution. Good runbooks should also leave space for the engineers’ creativity.
It can be tough to strike this balance of freedom and guidance, so aim for continuous improvement. Engineers using the runbooks should review them on a regular cadence. Make analyzing the runbook's performance part of your retrospective review. Shared ownership of the runbook ensures that every engineer is confident executing it.
Load balancing people isn’t like load balancing servers. Numbers never tell the full story, and fairness isn’t as simple as giving everyone an equal number of shifts. The goal is to ensure engineers don’t burn out or feel they’re receiving unfair treatment. Any given on-call shift could be blissful silence or a sprawling, maddening disaster. When people feel they’re taking on a disproportionate burden, morale drops and on-call dread rises.
A good first step in distributing the most challenging incidents is using your severity classifications to estimate the workload of on-call shifts. An incident’s severity doesn’t always reflect the difficulty in resolving it, though, so also incorporate metrics such as time to resolution.
Most importantly, listen to your on-call engineers. Use incident retrospectives to discuss the impact of on-call incidents, and create qualitative metrics that capture how burned out responders feel after their shifts.
Establishing a system to manage on-call load isn’t easy, so continuous iteration is key. Buy-in from your on-call teams is essential, so make sure they’re involved in evaluating the load. Try techniques originally used to plan development time, such as [story point estimation](https://www.atlassian.com/agile/project-management/estimation), which help teams collaborate on estimates of on-call load.
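One possible way to turn these ideas into numbers is to weight each incident by its severity and resolution time, with an extra multiplier for pages that arrive overnight. All weights and example data here are assumptions to be tuned with the team, not a standard:

```python
# Sketch of a per-shift on-call load estimate. Incidents are weighted by
# severity (assumed: lower number = more severe) and hours to resolve, with a
# multiplier for night pages to reflect their higher human cost.
# The weights and multiplier are invented starting points.
SEVERITY_WEIGHT = {0: 8, 1: 4, 2: 2, 3: 1}

def shift_load(incidents, night_multiplier=1.5):
    """Sum weighted load over (severity, hours_to_resolve, at_night) tuples."""
    total = 0.0
    for sev, hours_to_resolve, at_night in incidents:
        load = SEVERITY_WEIGHT[sev] * hours_to_resolve
        if at_night:
            load *= night_multiplier  # night pages carry extra weight
        total += load
    return total

quiet_shift = [(3, 0.5, False)]                  # one minor daytime ticket
rough_shift = [(0, 2.0, True), (2, 1.0, False)]  # a 2 AM Sev 0 plus a Sev 2
print(shift_load(quiet_shift))  # 0.5
print(shift_load(rough_shift))  # 8*2.0*1.5 + 2*1.0 = 26.0
```

Comparing scores like these across rotations can flag a lopsided schedule, but they should only ever start the conversation with the team, not replace it.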
Blameless retrospectives can help uncover the true systemic causes of incidents, allowing teams to proactively address reliability issues. Aggregate metrics such as time to resolution and severity across retrospectives to find recurring root causes, then prioritize development work to resolve them. Remember that reliability is a feature: reducing unplanned work through reliability engineering is as important as shipping new features.
Incorporating other SRE principles will be instrumental in reducing your on-call load. Building service level objectives provides a safety net, warning you of potential crises before you're paged. Chaos engineering techniques, such as simulating incidents and practicing responses, can also help uncover areas of underpreparedness.
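As an illustration of the SLO safety net, here is a toy error-budget check that warns before the budget is exhausted. The 99.9% target and request counts are example values, not prescribed ones:

```python
# Toy error-budget calculation. A 99.9% availability target over N requests
# allows 0.1% of them to fail; alerting on budget consumption warns the team
# before customers feel a full SLO breach. Numbers here are examples only.
def error_budget_remaining(target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative means SLO breached)."""
    allowed_failures = (1 - target) * total_requests
    return 1 - failed / allowed_failures

# A 99.9% target over 1,000,000 requests allows 1,000 failures.
print(round(error_budget_remaining(0.999, 1_000_000, 400), 3))   # 0.6
print(round(error_budget_remaining(0.999, 1_000_000, 1200), 3))  # -0.2
```

Paging when, say, half the budget burns in a few hours gives responders a head start on a crisis instead of a 2 AM surprise.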
You can also be considerate of on-call engineers by looking for patterns of incident timing. Although the business impact of an outage happening in the afternoon or in the middle of the night could be around the same, the outage that wakes up an on-call team has a greater human impact. Include these human considerations when looking for reliability areas to address.
At the core of all our best practices is empathy towards the on-call engineers. Bake these empathetic practices into your culture to ensure that on-call decisions keep the human in mind.
Celebrate the on-call team's successes, emphasizing the challenges team members had to face. On-call incidents can begin and end in a single night, leaving others unaware and the responders feeling unappreciated. Recognizing on-call efforts can help motivate engineers and reduce burnout.
Try to shift the perception of incidents from unavoidable setbacks to unplanned investments. Every incident is an investment in learning and an opportunity to improve the response to future incidents. Likewise, every on-call shift is an investment in improving on-call going forward. Championing this attitude goes a long way toward making on-call a meaningful challenge.
Sure, on-call might never be something engineers look forward to. But it shouldn’t be something you dread, either. The most important thing is to reduce the pain of on-call, and these best practices are a great place to start.