
Making On-Call Not Suck

Molly Struve (she/her) ・ 7 min read

Back when our team was small, we put together a single on-call rotation. Every dev was in the rotation and would go on-call for one week at a time. When we first started the rotation our team had 5 devs on it. Year after year passed, and despite our team's growth, we stuck with the single rotation. Eventually, the team was so big that people were going on-call only once every 3-4 months. This may seem like a dream come true, but in reality, it was far from it.

A Broken On-Call System

The single on-call rotation was miserable for just about everyone for a variety of reasons.

Large Rotation

The large rotation meant that on-call shifts were so infrequent that devs were not able to get the experience and reps they needed to know how to handle on-call issues effectively. In addition, our code base had grown tremendously and there were so many things being developed at once that when a problem arose there was a good chance the on-call dev knew nothing about it or the code that was causing it.

This led to panicked developers often turning to the Site Reliability Engineering (SRE) team for help with issues. Constantly having to jump in and help with on-call issues quickly began to drain a lot of the SRE team's time and resources. Essentially, the team began to act as if they were on-call 24/7. The constant bombardment of questions and requests came very close to burning out the entire team and took away valuable time they needed to work on their own projects.

No Ownership

Beyond burning out and inefficiently using the SRE team, the single rotation left developers feeling like they had no ownership over the code they were supporting. One person would write code and another person would be the one debugging it if it broke. The app was so big that no one could have a sense of ownership over the production code; there was just too much of it, and everyone was expected to support all of it.

3 Teams, One Application

Due to the size of our engineering organization, we now have 3 separate dev teams. Each team has 5-7 devs on it plus a manager. Each team is also given its own set of projects. However, our main application is still a single monolithic Rails app. All three teams work equally across the entire codebase. Unlike other apps which have very separate backend components owned by individual teams, there are no clear or obvious lines of ownership. Solving this issue would prove to be the hardest task when it came to fixing our on-call system.

The Solution

3 Rotations

We knew we had to break up the rotation if we wanted to continue growing, but the question was how? Despite all of the developers working across a single application with no clearly defined lines of ownership, we devised a plan that broke our single rotation into 3, one for each of our 3 dev teams. This led to smaller rotations, which meant more frequent shifts and more reps for devs. As backward as it may sound, being on-call more often is a benefit: devs have become a lot more comfortable with it and have been able to figure out a strategy that works best for them.

Divided Application Ownership

3 rotations allowed the devs to get more reps being on-call, but that still left the biggest problem of all: ownership. No one wants to support something they don't feel like they own. To fix this, we chose to split up the on-call application ownership amongst the 3 dev teams. It didn't happen overnight, but with a few meetings and a lot of team discussions, we were able to break up everything in our application between the 3 teams.

  • We broke up all the background workers, for example:
    • Team 1: Indexing jobs
    • Team 2: Overnight reporting jobs
    • Team 3: Client communication jobs
  • We broke up all the individual service alerts, for example:
    • Team 1: Redis alerts, Queue backup alerts
    • Team 2: Elasticsearch alerts, API traffic alerts
    • Team 3: MySQL alerts, User load page alerts
  • We broke up the application components, for example:
    • Team 1: Users and Alert models and controllers
    • Team 2: Asset and Vulnerability models and controllers
    • Team 3: Reporting and Emailing models and controllers
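The post doesn't name the alerting stack, so as a purely illustrative sketch: in a Prometheus-style setup, each alert rule can carry a `team` label so that pages route to the owning team's rotation. The metric names, thresholds, and team labels below are hypothetical, not Kenna's actual configuration:

```yaml
# Illustrative Prometheus-style alert rules. The `team` label lets the
# alert router (e.g. Alertmanager) page the owning team's rotation.
groups:
  - name: team-1-redis
    rules:
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          team: team-1          # Team 1 owns Redis alerts
        annotations:
          summary: "Redis memory above 90% for 5 minutes"
  - name: team-2-elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          team: team-2          # Team 2 owns Elasticsearch alerts
        annotations:
          summary: "Elasticsearch cluster health is red"
```

Encoding ownership directly in the alert definition means the split survives tooling changes: whoever receives the page is, by construction, the team that owns the component.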

Once the lines had been drawn, we made sure to stress to each of the dev teams that despite doing our best to balance the code equally we might still have to move things around. This showed the devs that we were fully invested in making sure this new on-call rotation was fair and better for everyone.

After the code was split up the SRE team took time to sit down with each dev team to thoroughly review the app components, workers, and alerts they now owned. We went over everything from common issues to exactly what every single piece of code did and how it affected the rest of the application. These sessions have given devs a lot more confidence in their ability to handle on-call situations because they now have a clear picture of what they own and how to handle it. Even though they haven't built some of the code themselves, they have an understanding of exactly how it works and what it is doing.

In addition to giving each team an education on their section of code, we also took advantage of GitLab's CODEOWNERS file. The CODEOWNERS file allows you to specify which users or teams in your organization own a file. When anyone updates an owned file in a PR, the owner of the file is automatically tagged for review.
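A minimal sketch of what such a file might look like, mapping the example split above onto GitLab's CODEOWNERS syntax. The paths and group handles here are hypothetical, not Kenna's actual repo layout:

```
# .gitlab/CODEOWNERS (can also live at the repo root)
# Hypothetical paths and group handles for illustration.

# Team 1: Users/Alert models and indexing jobs
/app/models/user.rb        @org/team-1
/app/models/alert.rb       @org/team-1
/app/workers/indexing/     @org/team-1

# Team 2: Asset/Vulnerability models and overnight reporting jobs
/app/models/asset.rb       @org/team-2
/app/workers/reporting/    @org/team-2

# Team 3: Reporting/Emailing and client communication jobs
/app/mailers/              @org/team-3
/app/workers/client_comm/  @org/team-3
```

With this in place, any merge request touching an owned path automatically tags the owning group for review, so the supporting team sees every change to their code.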

Reasonable Fallbacks

Originally, the SRE team was the fallback for the on-call dev. If the on-call dev had questions or needed help, they would talk to the SRE team member who was on-call that week. Our SRE team currently has only 3 members, so you can see why we burned out as the constant fallback. With the new system, the 3 on-call devs all act as fallbacks for each other. If any of them get overwhelmed or stuck on an issue, they are encouraged to reach out to one of the other on-call devs for help.

More Focus

In addition to the above changes, we also removed some duties from the on-call devs. Prior to these on-call rotation changes, the on-call devs were responsible for determining whether a status page or any customer messaging was needed during an incident. We have since moved that responsibility to the support team. The support team is the closest to the customer and is therefore best equipped to communicate any problems. When an incident occurs that affects customers, the support team is notified and is responsible for determining if a status page or any customer communication is needed. Giving this responsibility to the support team allows devs to focus solely on diagnosing and solving the problem at hand.

The Payoff

Improved Alerting

Originally, the SRE team had set up all the alerting and monitoring tools. However, once we turned the alerts over to the dev teams, they ran with them. Because each team felt a renewed sense of ownership over their alerts, they started to improve and build on them. Not only did they make more alerts, but they also improved the accuracy of the existing ones.

Sense Of Ownership

Even though one team might edit the code that another team supports, there is still a keen sense of ownership for the supporting team. The supporting team acts almost as the domain experts over their section of code. Using the CODEOWNERS file ensures that the supporting team is made aware and can sign off on any changes made by other teams. Because each chunk of code a team supports is small, they can actually learn and support it all unlike before where it was all just too much for anyone to handle.

Faster Incident Response

With 3 devs on-call at once, each focusing on a smaller piece of the application, problems get spotted a lot faster. Each team also wants to ensure that when things do go wrong they are caught quickly, which is why many have tuned their alerts to warn of problems even earlier than they were originally configured to.

Triaging and figuring out the root cause of issues is also a lot faster. Teams are intimately familiar with their alerts and the pieces of code they own which allows them to figure out problems quicker than before.

Never Alone

Having 3 devs on-call at once means that none of them ever feel alone. If things start to fall apart in one section of the application, the dev who owns that part knows there are two others available to help if needed. Just knowing you have someone else easily accessible can do wonders for your confidence when you are on-call.

Improved Cross-Team Communication

As I stated before, each of the 3 dev teams works across the entire application. This means that there are times when a team might work on code that ends up causing an alert for another team to go off.

For example, let's say a high-load Redis alert goes off. There is no guarantee that the team that owns that alert also owns the piece of the application causing the problem. However, because the Redis alert team is experienced with the alert, they know how to triage it quickly and efficiently. Then, the triaging team can easily hand it off to the team that owns the problem component. This cross-team communication has helped teams stay current with each other's work, but they never feel like they are having to fix other teams' code.

On-call Shouldn't Suck

On-call is something that many people in this industry dread and it shouldn't be that way. If people are dreading on-call then something is broken with your system. Sure, everyone at some point will get that late night or weekend page that is a pain, but that pain shouldn't be the norm. If on-call makes people want to pull their hair out ALL the time, you have to figure out the problem and fix it.

Posted on Jul 2 '19 by Molly Struve (she/her) (@molly_struve)

Discussion


Even with the best escalation and call-rotation structures, you can still burn out your engineers if you have engineers who are "know-it-alls" (whether by design or by circumstance). And it's not merely a case of "well, they should document and train the rest of the team better": some people just naturally remember minutia and esoteric information about technical solutions - stuff that, even if you did document it, no one would really know how to find that documented solution (or understand it if they did).

At a prior job in the early 2000s, we had a multi-tier, 24/7/365 on-site support staff. Supplementing them was a call-rotation for the senior engineers. Unfortunately, even with all of that, some problems always ended up in my lap because I was the only person who knew the system better than the vendor did. On the one hand, it literally meant an extra car's worth of OT-pay in the span of 12 months. On the other… I can say without hyperbole that 15+ years later, you can still make my wife shudder by playing a sound sample of the old NexTel phone's ringer.

In any case, massive OT-pay opportunities aside (and a year-end spot-bonus), it meant that I pretty much had the option between divorce and finding a new job.

 

I feel this so much! I was that person with all the domain knowledge that everyone would turn to and it eventually did burn me out. Luckily my coworkers intervened.

 

Fortunately, it was more my wife that got burnt out by the long, 3AM/weekend/holiday/vacation phonecalls rather than me. I generally just lumped it under the "meh: it's time-and-a-half and whether this call lasts 1 minute or 50, I get to bill the entire hour" (which can go a long way towards not getting burnt out when you're still in your early 30s). =)

Burning out your family or yourself can't be compensated for... This industry needs to change its approach or the big turnover will keep existing.

I've been in IT for 25 years now. Haven't really seen a change in the tendency to throw more and more work at the people that best demonstrate the ability to get things done.

 

Nicely written Molly! Doing on-call is truly the tour-of-duty of our industry. An important, and often thankless, responsibility that has many hidden heroes. It is important to constantly make sure everyone in the org recognizes who is doing a good job of keeping the lights on 99.99% of the time...because everyone knows when there is an outage. Sadly, that is often what users remember the most. Example: The AWS US-EAST-1 Region outage in March of 2017 that was blamed on an employee's error. Ouch.

 

Hey!

This is a great write-up, thank you for sharing!

We had some similar problems, though at a different scale, at Intercom and used a bunch of similar techniques to improve our out-of-hours oncall. We also emphasized ownership, though basically made the call to have teams oncall for their own stuff during office hours, and a shared oncall team out of hours. We also have a Rails monolith, which makes it a bit easier to share the oncall work :)

I wrote about it here if you are interested: intercom.com/blog/rapid-response-h...

Looking forward to giving your Oncall Nightmares podcast a listen :)

 

Thanks! Def will give your post a read over break!

Podcast was just released this morning πŸ˜ƒ Hope you enjoy! podomatic.com/podcasts/oncallnight...

 

We are also a team of 5 and each member is on-call for one week.
This works well if everybody is working full time (100%). How about if one or more members want to reduce their activity and go part-time (from 100% to 80% or 60%, working only 3 or 4 days per week instead of 5)? The one-week on-call is no longer possible.
Is one-day on-call duration a solution in this case? Any idea is welcome.
Thanks.
Alex

 

One day, I would think, would be too short and would require too much context switching. Half weeks might be a good option. Definitely not easy when not everyone is on the same work schedule.

 

Nice article, Molly!

I have a question for you. How do your on call members get compensated?

The companies I worked for (in Germany) typically had a one week rotation, and the person on call got compensated for being on call as well as the times they actually had to work. These companies also never paid for more than one person to be on call at the same time (so no fallback).

So now we still face the problem that the devs are on call maybe once every three months. This means, they don't get the required experience for general ops-stuff as well as being on call.

Thanks!

 

Being on call is part of our job so we do not make extra money for it, unfortunately.

 

Hi Molly, great article!

Let me expose my, admittedly controversial, view. Isn't the mere existence of a dedicated Team of SREs a problem?

IMO several companies have been drinking the Google Kool-Aid and "rebranding" Infrastructure / Networking / Support people and eventually Devs with some tooling expertise into "SREs". Companies promise more time for people to work on automation and product optimization, but, in reality, they are still a small number of people mostly putting out πŸ”₯ on projects that they know little about. This is still Silo thinking under disguise.

On the other side of the wall, Devs without the required Ops experience are churning out half-baked, overly complex, wasteful and insecure deployments. Great, now Developers "own" it and everyone can be on call stressed, burned-out and panicking. Hooray, DevOps! Quick, let's all send our CVs to the next company! We'll all get a better paying job until the cycle repeats itself.

Alternative solution: Hire more people with Ops experience and a SRE mindset, place them in real DevOps teams (yes, you heard me: At least one talented Ops person per team). Let people with Ops backgrounds and Dev backgrounds really work together and learn from each other from the beginning. Make sure that SREs (as well as Devs) have a good relationship with their respective Chapters so that standards naturally emerge. Devs learn some tricks from Ops experts, write better infrastructure as a code and get to understand / account for what is required to monitor and troubleshoot the product that they are building from the start. Ops actually get the time to work on automation and learn the feature side of things as it's being built (plus they also get proper code reviews and learn a trick or two from experienced Devs. No disrespect but Going through Python / Bash / Ruby scripts written by Ops guys can be a nightmare. As bad as the Terraform / Ansible stuff that Devs put together). With enough time the team finds its pace and agree on a sane on call schedule.

SRE guy/girl is a very senior expert with a great overall picture of the product and the ability to optimize things across the board? No worries... Give him/her (and everyone else) the freedom to move across teams. But still make sure that he/she's part of a team that actually delivers features. Make sure that he/she stays in the team just long enough to spread some of their knowledge to the team, as well as learn more of the specifics of what is being built ATM.

I'm not saying that there's no place for dedicated Infrastructure and First Line Support teams with skilled engineers and innovative solutions. Nor am I saying that we can neglect the overall picture and the few Devs that can actually handle it. I'm just saying that Google's SRE model is not for everyone. As it stands, it feels like most companies are getting DevOps wrong (as much as they get the core values of Agile wrong). IMO a team of SREs, huge or small, will always be simultaneously overworked (responsibility-wise) and underutilized (in terms of their actual potential). Even if you are blessed with a few people with the rare combination of domain / infrastructure and development expertise to do the job properly, it still sounds like a huge waste of their time and brainpower.

What do you think? Am I right? Or am I getting the SRE side of things completely wrong?

 

I can't speak for other companies, but I do know that for us at Kenna our SRE team has been a godsend! Having a team that can focus solely on the reliability and scalability of our system has been a big win for us, and it has greatly improved the quality of our platform for customers and for our devs and operations teams internally. I actually wrote a blog post about what our SRE team focuses on.

As for knowledge sharing, we actually do a dev SRE rotation that allows devs to get a peek at what we focus on as SREs. In addition, the SRE team works very closely with our dev teams pumping out features to ensure the features are performant. They are definitely not off in a corner on their own.

Sure, having an SRE team might not be for everyone but in our case, it has been an enormous benefit.

 

My company is betting on the format not recommended in this article: put more people on the wheel so you are on-call only once every ~2 months, but still covering a lot of pieces of the product which are not owned by anyone during business hours. Definitely won't work.

To me, the way to go is to always roll back if possible and fix it properly during office hours. It's not fair to WORK outside office hours, no matter how well paid it is.

 

Great post Molly, all of this feels extremely familiar at my company and we're heading in a similar direction. Really nice to see a positive payoff