Who’s bug is it anyway? Aka handling the on-call rota

#engineering #devops #oncall #management

Being the person responsible for dealing with issues as they come in (aka on-call) is probably one of the most disliked aspects of the job, with feedback ranging from “it sucks” to “it’s alright”, so you know it’s always a challenging topic of conversation.

I’ve spoken to a few people about on-call, and the general feedback is that it doesn’t work for them. From the feedback, complaints ranged from, the rotas aren’t right to, not being paid enough for the volume of issues especially out of hours calls, to, people don’t know how to fix all the problems so don’t want to get involved.

The reality is, you won’t be able to solve these problems head on without compromise. Instead, it’s best to accept this and optimise for what will encourage the right behaviours.

Encouraging the right behaviours

The burden of issues should not be on one individual or a siloed team. Supporting products, services, and features should belong to the teams that build them.

Teams need both the autonomy to release and operate new products and features and the responsibility for supporting them. We shouldn’t treat support any differently than application configurations or infrastructure. It’s a requirement.

Developers should be responsible for responding to issues, outages and being on call, as well as designing and evolving the monitoring and alerting systems, and the metrics behind them. This means that the developers control the entire lifecycle, including operational tasks like backup, monitoring and, of course, on call rotation.

This way, they become more aware of the pitfalls of running code in production as they are the ones handling the incidents (waking up to alerts, etc).

This approach means that they are incentivised to put sufficient measures in place to avoid it happening. Things like monitoring (metrics and alerts), logging, maintenance (updating and repairing) and scalability become key considerations behind every line of code they write.

In the event of an incident that touches multiple teams, the issue is manually escalated to an Incident Manager, typically a Department Head or nominated representative, who would remain involved until the incident is addressed. We then utilise the root cause analysis approach to a blameless post mortem, to diagnose and address the problem.

I appreciate that this approach might not be popular with everyone and that people may be sceptical but this is a tried and tested option that should be considered.

What does not work

Let's face it, nobody likes being on-call. In the feedback, nobody said anything good about it, on a scale of OK to it sucks, you can guess where most people landed.

By attempting to manage on-call rotas at management level we set ourselves up for failure as it doesn't empower the people on-call, it requires continual intervention and will not scale. Sounds great in theory, but can it work in practice? Encouraging the right behaviours would only be theoretical if it wasn't feasible, let me explain...

One option to encourage the right behaviour would be to pay more for on-call duty. The problem is that this is favoured in only one way. That is, the developer gets paid for issues. This is not encouraging.

An issue not addressed by many rotas is that the switchover at midnight isn't practical, although sunrise is suggested instead, it doesn't solve the actual problem.

Two issues that most people are concerned about are that they feel that they don't know enough to solve the problems or it doesn't fit with their lifestyle. So how do we solve those issues?

What happens elsewhere

At Google, they hire dedicated Site Reliability Engineers. Once products or services become stable they are handed over to them to manage. This is a very practical option but it is probably more expensive and can lead to developers simply throwing work over the wall to operations, rather than taking responsibility. It can also cause an “us” and “them” type culture, which is not great for culture.

At Spotify they tried this and found that it led to a bottleneck in their Ops team. They instead opted for a redistribution of ownership. Now each team is responsible for support.

At UK government digital services, they too had a dedicated technical support team but found that in practice it didn't work. One of the problems was knowledge sharing. They have worked really hard to try and solve this problem but admit that it's still work in progress.

Netflix is probably most famed for their approach of "NoOps" to this problem. An engineer will generally have pagerduty 1 out of N weeks where N is the number of engineers. Some teams include QA engineers and managers in the on-call rotation and others don't. Some teams only include key engineers on pagerduty rotation while others will include all engineers on the team.

At Etsy, all their engineers are on call too. They have created an open source tool called Opsweekly that allows you to both organise your team into one central place, but also helps you understand and improve your on call rotations through the use of a simple on call "survey", and reporting as a result of that tracking.

Meanwhile at Facebook, two or three times a year, Facebook engineers have to completely rearrange their lives for two weeks while they serve something called "oncall duty." Basically, during that time, they are responsible for keeping the service up-and-running. Oncall duty lasts for two weeks and it rotates between Facebook's engineering teams. a Facebook engineer named Keith Adams says it is "the worst thing about working at Facebook."

At Amazon, it's less clear, they seem to rotate on-call to teams every few months for a week at a time and this approach isn't very popular from what I've read.

How to solve the problem

We want to reduce the burden on the person in the team with the most Site Reliability Engineering experience and encourage more engineers on the rota.

All of the platforms I mentioned seem to have one thing in common, that is on-call is a mandatory part of the job.

You'll notice another theme is that on-call is assigned at team level and the teams take responsibility for choosing their rota.

My view is that every engineer is responsible for support and monitoring including on-call and that rotas are coordinated at team level. My suggestion would be to rotate them every two weeks across teams in line with a typical sprint length.

If it’s not practical to get everyone on-call, then you've really got to question why it’s not practical. If you’re not addressing the underlying issues head on, then any alternative solution is just a compromise and is going to be more expensive and will be to the detriment of the team and culture.

If it’s not practical to get everyone on-call, then you will need more budget and resources to cope with the same recurring issues.