Questions To Ask Yourself Before Accepting A Software Engineering Role That Involves On Call Duties

#sre #devops #oncall

Recently, a friend asked me, "You were on call as an engineer, right? How did you feel about it?" They're thinking about accepting a software engineering position that requires on call shifts.

Throughout my 6 years of experience as a software engineer, I've had 2 roles that required on call shifts.

There is a division in most organizations between Software Engineers (SWEs) and Site Reliability Engineers (SREs)/DevOps folks. SRE and DevOps have slightly different meanings, but I'm using them interchangeably here. Typically, SWEs write code, and SREs ensure that code runs smoothly in a production environment. SRE/DevOps folks write code too, of course, although they typically work on different flavors of projects from SWEs.

DevOps is difficult. It's psychologically taxing if people only interact with you when they have a problem. I truly appreciate all the hard work DevOps and SRE folks do to keep the servers running! The psychological toll is exacerbated by the divide that sometimes exists between SWE and SRE folks. "It works on my machine, prod is not my problem" is a problematic but sadly common SWE attitude.

Having SWEs go on call to support the code that they write is a good thing, organizationally speaking. Reliability should be everybody's responsibility. SWE on call bridges the SWE/SRE empathy gap by making everybody invested in operational excellence.

That said, work life balance is also important. If you're considering accepting a SWE job that requires on call duties, here are some questions you can ask yourself to assess the operational health of your potential team.

How many people are on the rotation? On my previous team, we had 7-8 people. Going on call for a week once every two months didn't mess with my life too much. Relatedly, does the manager have a plan for keeping people from burning out if the rotation suddenly shrinks a lot?
What is the state of the services you're supporting? Are they a pile of technical debt, held together with TODOs and bubble gum?
Do people maintain their alerts and adjust thresholds? Nothing sucks more than getting repeatedly woken up for something that is not even a problem due to a poorly tuned alert.
Are the runbooks in good shape? Do they describe in detail how to access the different fleets of servers and run common commands?
Are production readiness reviews conducted before launching new apps and services? Do people actually have good discipline about incident review and remediation? Do the SWEs partner with an SRE/Ops person to iterate towards better reliability practices?
How fast are you expected to respond to pages? I've heard Google.com traffic SREs have something like a 30-second SLA, meaning they have to get coverage to go to the toilet. 5 minutes is a lot more human.
What are escalation paths like if primary on call can't solve the problem at hand?
How good is the tooling for debugging production problems? Does your company pay for services like Honeycomb.io or Datadog? Or are you stuck with some poorly documented homegrown artisanal distributed tracing framework?
What's the deploy cadence for the services you'll be supporting? More frequent deploys make it easier to identify the specific commit causing a problem, as there are fewer commits in each release. Since the solution to many operational woes is rolling back the last deploy, how fast is the rollback process?
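
The rotation-size question at the top of this list comes down to simple arithmetic, so here's a rough sketch of it in Python. It assumes one primary on call at a time, equal one-week shifts, and everybody taking turns in order; the function names are made up for illustration and don't come from any real scheduling tool.

```python
def weeks_between_shifts(rotation_size: int, shift_length_weeks: int = 1) -> int:
    """Weeks from the start of one of your shifts to the start of your next.

    Assumes a simple round-robin with a single primary on call at a time.
    """
    if rotation_size < 1:
        raise ValueError("a rotation needs at least one person")
    return rotation_size * shift_length_weeks


def fraction_of_time_on_call(rotation_size: int) -> float:
    """Share of your calendar spent as primary under the same assumptions."""
    if rotation_size < 1:
        raise ValueError("a rotation needs at least one person")
    return 1 / rotation_size


if __name__ == "__main__":
    # Compare a healthy rotation with what happens when people leave.
    for size in (8, 7, 4, 2):
        gap = weeks_between_shifts(size)
        share = fraction_of_time_on_call(size)
        print(f"{size} people: on call every {gap} weeks ({share:.0%} of the time)")
```

With eight people, that's a shift every eight weeks, which lines up with the roughly two-month cadence I mentioned. If the rotation drops to four people, you're on call one week out of every four, which is why it's worth asking what the manager's plan is before that happens.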

Of course, no team will ever have perfect operational health; at some level all code is technical debt.

If the team is by and large following reliability best practices, the occasional on call shift can be a little novel, even exciting. It's a chance to learn new things and have an impact.

SWEs, have you ever taken a role that involved on call? If so, what did you like or dislike about it?

SRE/DevOps folks, what are some ways you'd like to see SWEs work more effectively with you?