On-call is about more than just reducing mean time to acknowledge and mean time to resolve (MTTA and MTTR, respectively); it's about improving the human experience on your teams. That might seem odd; after all, doesn't a shift to on-call usually mean teams start working unfamiliar hours, possibly even outside the work day and on weekends?
It's true that being on-call can mean changing hours, but it also means shifting workflows from a difficult or "frictionful" state to something easier to use and understand. This will in turn make it easier for your teams to work together, solve cross-team incidents, etc.
What is on-call?
Before you get started, it is very important to decouple what you need from on-call from any preconceived notions of what on-call must be. Succinctly:
On-call is a response model that codifies hours of coverage, ownership, and escalation.
So what does on-call look like? It can vary from company to company, or from service to service, depending on criticality. Here are some examples of simple on-call models:
- One person, with a backup, is on-call for 24 hours per day for a given number of days (usually one week).
- "Follow the sun" for geographically dispersed teams, where people in each timezone work their normal working hours but in combination they provide coverage for longer than an 8-hour day.
- Shift schedules where there are two or more 8-hour shifts that in combination cover longer than an 8-hour day.
- Business-hours only coverage for services that do not require extended coverage.
In the above examples, the first three are for services that require longer support than an 8-hour working day can provide. This is common for Tier 1 services that are critical business functions. Business-hours coverage is common for services that have same- or next-business-day Service Level Agreements. Another benefit of a business-hours-only schedule is that you can create one for InterruptDuty, giving physical or virtual "walk-ups" a person to ask questions of while others on the team continue focus work.
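To make the first model concrete, here is a minimal sketch of a weekly rotation with a backup. It assumes the backup is simply the next person in the list; the team names, start date, and the `weekly_rotation` helper are hypothetical and not tied to any scheduling tool.

```python
from datetime import date, timedelta


def weekly_rotation(people, start, weeks):
    """Return (week_start, primary, backup) tuples, one per week.

    Assumption: the backup is the next person in the list, so this
    week's backup becomes next week's primary.
    """
    schedule = []
    for week in range(weeks):
        primary = people[week % len(people)]
        backup = people[(week + 1) % len(people)]
        schedule.append((start + timedelta(weeks=week), primary, backup))
    return schedule


team = ["ana", "bo", "chen"]  # hypothetical team members
schedule = weekly_rotation(team, date(2024, 1, 1), 4)
for week_start, primary, backup in schedule:
    print(week_start.isoformat(), "primary:", primary, "backup:", backup)
```

Having the backup roll into the primary slot is only one possible design; it gives each person a quieter week to get context before they carry the pager.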
Why you might not have on-call today
If you're reading this, it's very likely that you either don't have on-call established yet but are looking to start, or that you've only very recently started your on-call journey. Commonly, this is due to a combination of company age and size. Response times and incident patterns today are very different from those of a few years or a decade ago. On top of that, software release cycles were slower and usage didn't peak in the same ways that it can now. As a result, the needs of even a tenured company would have been very different less than a decade ago. Essentially, for new companies, a lack of on-call can mean that the business hasn't yet grown enough to necessitate it, and for established companies it can mean they grew with a different structure.
Age is not the only reason that a company might not have an on-call rotation. Two other important factors are company size and customer size. These usually go hand-in-hand. Smaller businesses and startups don't have the same scale and incident impact as larger corporations, and they have significantly fewer employees. For the applications and services, this can mean that the agreement promised is the same or next business day, so there is no out-of-hours coverage. In terms of "knowing who to contact for an incident," when a company is perhaps literally the size where everyone is working out of a garage, basement, or similar, then there simply aren't complicated logistics around contacting your teammates. Everyone who knows everything there is to know about that company is within shouting distance, so you give a shout, or you can @here in #general to all 5 people in there. But then the company grew out of the garage, out of the basement, into an office space, into several office spaces. The customer base grew. The demands grew. Suddenly knowing who to contact and how to reach them, as well as how to meet increasing uptime demands, became problems to solve.
Design for people, with people
There are two top-level needs that on-call must meet: the needs of the business and the needs of the people who make up the business. To put this another way, it is equally important to understand which business requirements on-call is solving and what people need in order to implement these on-call schedules.
Understand the what, why, and how of on-call
In a shareable brainstorming document, start to think about the "what and why" of on-call: which issues or gaps in the business you are looking to solve. Some common reasons include, but aren't limited to:
- Improving team communication
- Reducing response times
- Improving response quality
- Reducing response stress
- Codifying ownership structure
- Addressing gaps and/or bottlenecks in workflows and processes
Next, start to dissect why what you discovered needs to be improved or changed. This will guide later discussions for what changes need to be made in order to genuinely improve the top-level concern.
A quick example to use as a guide: let's say that "reducing response and resolution time" is on the list of what to improve. Typically this is measured by MTTA and MTTR. The broader context of why to improve this is how the incidents are impacting others. Depending on the conditions of the incident, it could be inhibiting or completely blocking work that is built on top of that service by either internal or external users, i.e. colleagues and/or customers. So in this case, a complete statement with context would be "improve service reliability for internal and/or external use by reducing MTTA and MTTR."
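If you want a baseline before setting a target, the two metrics can be computed from incident timestamps. The following is a rough sketch with made-up incident data; the field names are hypothetical, and it assumes both metrics are measured from the moment the incident was triggered (some teams measure MTTR from acknowledgment instead).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; in practice these would come from
# your alerting or ticketing tool's export.
incidents = [
    {
        "triggered": datetime(2024, 1, 1, 10, 0),
        "acknowledged": datetime(2024, 1, 1, 10, 5),
        "resolved": datetime(2024, 1, 1, 11, 0),
    },
    {
        "triggered": datetime(2024, 1, 2, 2, 0),
        "acknowledged": datetime(2024, 1, 2, 2, 15),
        "resolved": datetime(2024, 1, 2, 2, 40),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumption: both metrics start the clock at "triggered".
mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```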
Qualifying information for more subjective needs is equally important. When you ask for "improved quality", both the current and the desired quality need to be outlined, or else you and your teams won't know what specifically to improve. An example in this area might be the postmortem process, where the process occurs but perhaps doesn't capture enough detail to learn from in the future, or lacks a single owner, so the documentation might never be completed.
Once you have this level of detail, you can explore how on-call will solve what's on the list. A necessary component of creating on-call rotations is having an ownership model, as this determines who is contacted when there's an incident or issue. So an example statement would be "as part of creating on-call, we will need to create and implement an ownership model, which will improve inter-team communication by documenting who to contact for a given issue". Another statement would be "incidents are significantly delayed by a lack of internal documentation and knowledge transfer; we will implement a knowledge-sharing source as part of our on-call implementation and update alerts to include links to the documentation in the body text". An aggregate of these solution statements will determine how you build your on-call culture to meet business needs.
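As an illustration of the two example statements above, an ownership model can start as a simple lookup from service to owning team, escalation order, and documentation link. This is only a sketch: the service name, contacts, URL, and the `who_to_contact` and `alert_body` helpers are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class ServiceOwnership(NamedTuple):
    team: str
    escalation: List[str]  # ordered: first responder first
    runbook_url: str


# Hypothetical ownership records for illustration only.
OWNERSHIP = {
    "checkout-api": ServiceOwnership(
        team="payments",
        escalation=["payments-primary", "payments-backup", "payments-manager"],
        runbook_url="https://wiki.example.com/runbooks/checkout-api",
    ),
}


def who_to_contact(service: str, level: int = 0) -> Optional[str]:
    """Return the contact at the given escalation level, or None if unowned."""
    owner = OWNERSHIP.get(service)
    if owner is None:
        return None  # an ownership gap worth recording
    # Stay on the last contact once the chain is exhausted.
    return owner.escalation[min(level, len(owner.escalation) - 1)]


def alert_body(service: str, summary: str) -> str:
    """Build alert text that names the owner and links the documentation."""
    owner = OWNERSHIP[service]
    return f"{summary}\nOwner: {owner.team}\nRunbook: {owner.runbook_url}"


print(who_to_contact("checkout-api"))
print(alert_body("checkout-api", "Error rate above threshold"))
```

Returning `None` for an unknown service is a deliberate choice here: services with no documented owner are exactly the gaps this exercise is meant to surface.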
Understand the what, why, and how of people
When implementing a massive change, there will likely be resistance, and on-call is no different. Some of this will be solved by having a clear understanding of what you hope to accomplish and how you hope to accomplish it, and by working with your teams and leads to build this understanding. But even once you have all this in place, there will be concerns around the implementation and what it means for the people who are doing the work. To provide context for this next conversation, it's important to understand what motivates people. A common framework for intrinsic motivation states that people are motivated by:
- Autonomy
- Mastery
- Purpose
Breaking these down, let's look at how they impact on-call scheduling. Autonomy comes into play when discussing not only incident resolution, but also ownership of the structure of the on-call rotation itself. Basically: if the on-call structure is unsustainable, who has the autonomy to fix it? The most authority resides in management, so management would need to gather feedback, provide tools, and empower their teams to correct issues with the on-call structure itself. What are some situations that create an unsustainable on-call?
- Small team sizes resulting in people being on-call too often.
- High frequency of incidents and/or long duration of those incidents.
- Not addressing known peaks or troughs. For example, a hotel chain during high travel season or a retailer in gift-giving seasons experiences known surges in traffic, as well as lulls during the "off season". Are the same people on-call for all of the peaks? If the peak is prolonged, for weeks or more, is the schedule adjusted to accommodate?
- Other planned work not being deferred / adjusted to account for on-call duties.
- A small pool of responders at more senior positions, meaning a shorter rotation at higher levels of escalation.
Many of these issues are problems that teams can resolve with management approval, mostly centering around ensuring that the additional on-call duties do not result in burnout. Using the specific example of other planned work, teams can plan their overall workload around the known on-call rotation, and those on rotation could take less time-critical work, be on InterruptDuty, or do whatever the team needs.
For mastery, IT specialists spend a lot of time learning, iterating, and improving their craft. One of the tangential benefits of on-call is that on-call teams will have greater visibility into design decisions, as well as shared ownership and workload. When development and operations specialists are on-call, they can use this to improve their mastery of their craft. Specifically, that increased visibility means that they will know which decisions led to more, or longer, incidents or created other downstream problems that might not have been visible when duties were separated. Responding to and resolving incidents also helps people develop a more holistic view of the overall environment and plan future work that can improve design.
Purpose is a mixture of the direct issues that on-call is resolving, e.g. specific issues around response and service ownership, and how these tie into the overall purpose of the organization. This means explicitly connecting the changes in engineering to the organization's goals. This can start with discussing how improved response time and quality improve user trust and experience, as well as what reliability means for the organization itself.
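A quick back-of-the-envelope check can show whether team size alone makes a rotation unsustainable. The sketch below assumes an evenly shared rotation in which each shift needs a fixed number of roles (for instance, a primary and a backup); the `on_call_share` helper is hypothetical, and what counts as "too often" is for your teams to decide.

```python
def on_call_share(team_size: int, roles_per_shift: int = 2) -> float:
    """Fraction of shifts in which each person holds an on-call role.

    Assumption: roles are spread evenly across the team, and one person
    fills at most one role per shift.
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    return min(1.0, roles_per_shift / team_size)


for size in (3, 5, 8):
    print(f"team of {size}: on-call {on_call_share(size):.0%} of shifts")
```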
Nothing is 100%: An acknowledgment
With everything in place, you will have a smoother transition and more buy-in for adoption, but there will likely still be people who do not fully want to start on-call, and that is both expected and okay. There are going to be people who chose to work at your company because it didn't have on-call, or didn't have on-call past a certain level of seniority, and they might still want that. That doesn't mean that the transition won't be able to move forward, or that things won't get easier over time. They will.
Where to go from here?
There is a lot of reading that can help you and your teams prepare for on-call. For the rotations themselves, I recommend our Best Practices for On-Call Teams Ops Guide and the On-Call section of the Google SRE Handbook. When you start taking a look at the service level agreements, I also recommend taking a look at the Service Level Objectives section of the Google SRE Handbook to ensure that the objectives and agreements are in alignment, doable with the team and on-call structures, and that the indicators in place measure what is needed.