shelby spees (she/her)

Posted on Oct 15, 2020 • Edited on Oct 16, 2020

Have you ever been on-call? What was it like?

#discuss #oncall #ops

Would you do it again? What would you change?

Top comments (16)

Alex Vondrak • Oct 16 '20

It's been complicated. 😅 I was part of a small team for many years, and devs were expected to be put on the rotation sooner or later. Over time, my relationship with being on-call changed in several ways - as did my relationship with coworkers, code bases, alerts, and so on.

I've had pagerduty alerts that led to hours of intensive debugging, escalation to several people, patches across multiple systems, in-depth postmortems, etc. Some of those fires happened in the middle of the day, which is almost no problem at all: everybody's in the office to collaborate and no one gets woken up. But most people think of being on-call as getting paged in the middle of the night or weekend or holiday. That comes with the territory, and of course I have those war stories. Nights when I couldn't get onto the VPN, didn't have access to some system, or didn't know what was going on, so I had to call, roll over to voicemail, and redial until some coworker (or coworkers!) woke up. Some nights the issue "fixed itself" by the time I'd rallied the troops. Most nights were more mundane, though. In a sense, that's what you want: no outages or outages you can handle easily by yourself.

I've been through various attempts to improve the process: training newcomers, writing playbooks, systematizing postmortems, tuning alerts. Upon reflection, I think most of it never really stuck. A lot of it was social. Employee turnover meant that, over time, the core group of developers who'd been there for 4, 5, 6 years were the ones who knew enough not to need training or playbooks. They owned their respective systems and were solely responsible for their respective outages. We all knew each other and got into a collective rhythm.

It certainly was a culture of choke points, though. The indispensable few who could resolve certain systems' outages. But I think it got even worse than that as time went on. I'd say it was directly related to the flavor & quality of alerting. Our alerts were only really keyed off of infrastructure-level metrics: healthy host count, latency at the load balancer, CPU utilization, pingdom checks. An alert on such metrics isn't in itself actionable; you don't know why the host count has surged, or if it's even a problem. Alerts were almost never about something you did as an application developer¹, and yet the developer was put on-call for their application.

So eventually I became conditioned to either (a) ignore alerts² or (b) blindly escalate them to the one guy who was up at odd hours anyway and was better at troubleshooting AWS. So the choke points got even worse as I took myself out of the equation.

It didn't make for a healthy organization. But improving alerts & the resulting culture is a whole other topic unto itself, and I can't really say that I've been through that transformation. 😬 So I guess talking about it just serves as catharsis for my years of pain. 😭

The upshot is that the quality of the on-call experience is directly related to the quality of the alerts. I guess this should come as no surprise. Put simply, if I'm going to be woken up in the middle of the night, it should be for a good reason. If I'm doing my job right, those reasons should be few and far between.

Or at least nothing acute. I remember one wild goose chase where an alert in system A led the CTO to see what looked like high latencies in system B. After hours of theory-crafting & fiddling with various things in the middle of the night, it turned out B's latency had been at that level for ages and was just a red herring. The AWS dashboard had been flashing red (although not wired to a pagerduty alert) because of some arbitrary threshold that was set years ago. If it was an issue, it happened incrementally over time, and no one had been watching. 🤷
After enough insomniac nights of:
- wake up to an indistinct alert
- find my laptop
- type password wrong a couple times
- 2FA into the VPN
- see if the service Works For Me™ (cuz there's no signal beyond "number is below threshold")
- squint at dashboards in vain
- see no further signs of smoke
- resolve the alert
- fail to go back to sleep after funneling all that light into my eyeballs
I think I can be forgiven for disregarding The Boy Who Cried Wolf.

Kevan Carstensen • Oct 16 '20

This post has rekindled my deep sense of gratitude toward our alerts that skew (for the most part) quiet and meaningful.

Robert Myers • Oct 16 '20

Yes, in multiple places, and it can be incredibly annoying, depending on how good the instrumentation is.

One place I talked them out of it, at least for nights. I calculated how much it cost for me to deal with the page, and also calculated how much business could be lost. There was several orders of magnitude difference, hundreds of dollars vs pennies, so the pager stayed off at night. So always make sure there's an ROI.

Far too common, happened just about everywhere I've ever been on call , is getting pages for events you can't do anything about. Either something that self resolved by the time it was looked at, or something I couldn't deal with. Site being down because of a hardware failure that IT was looking at, or someone doing a late deploy. Waste of time and sleep.

I agree, make sure you get paid, and also communicate that your daytime tasks might be affected by interrupted sleep, either by a page or concern of a page.

Alex Vondrak • Oct 16 '20

Far too common, happened just about everywhere I've ever been on call , is getting pages for events you can't do anything about.

Right in the feels. 💔 This is the problem with most alerts I've been through. Either it's just a noisy signal that doesn't indicate an actual issue, or the issue just isn't actionable.

Robert Myers • Oct 16 '20

What's even more annoying is the times you DON'T get paged.

At one place where you'd a page if the wind blew wrong, got a call from Customer Service "Did you know the site was down?" Nary a peep from the pager.

Zohar Peled • Oct 16 '20

We have a rotation, but only for devs that wants it. We get paid a little more, and 1 or 2 times a month we need to be available to support one of our main customers after hours - either on weekdays (Sunday evening through Thursday morning) or weekends. Most of the times we don't get called, so it's nice. When we do get called it can be annoying.

James Bubb • Oct 15 '20

Yeah, once over a festive period. We all had Pager Duty installed on our phones and would get a call if a system alarm got triggered.

I must say, I didn't sleep very well every night I was on call. It was more of the expectation of getting a call and also checking that I hadn't missed one that kept me up.

But we got paid a bit extra so wasn't all bad 😉

Tommy Brunn • Oct 16 '20

I have been on call, and for any system where the impact of being unavailable outside of office hours are significant enough to justify the cost, I would advocate for always having an on-call rotation. There are really only two alternatives, and you can probably guess which one businesses would choose given the options.

The service goes down on Friday night and doesn't come up again until Monday when people are coming back to work.
Some poor developer gets called on their personal phone and guilt-tripped into spending their Friday night doing unpaid work to get the service back up. Most likely they won't be compensated beyond perhaps equal time off and maybe a free lunch.

If you're having an on-call rotation though, it's important that the developers themselves have not just the responsibility for the system health, but also have the resources and mandate to ensure it. For example, if I get a false alarm on a Saturday, on Monday my top priority is going to be improving the alerting to make sure that that doesn't happen again.

It goes without saying that you should also compensate the developers for their time. At a previous workplace, we had what I considered a reasonable compensation where you got paid an hourly amount based on your salary any time you were on call, plus a significantly higher hourly amount any time there was a turnout outside of office hours, and a day off after being on-call for 7 days (a legal requirement in my country). I like this structure because it actually aligns incentives where I don't want to get woken up by alarms in the middle of the night, so I'm going to build my services to be resilient, and the company will encourage me to do that because they would rather not pay me a fairly substantial amount of money if we have constant on-call turnouts.

shelby spees (she/her) • Oct 16 '20

it's important that the developers themselves have not just the responsibility for the system health, but also have the resources and mandate to ensure it.

100% this! I appreciate this post by Charity Majors (full disclosure, she's CTO at my company) on how managers can make on-call more humane by empowering devs to actually fix things.

John Peters • Oct 16 '20

Yes multiple times, different companies.

The worst was carrying a pager all weekend, a virtual prisoner to a device.

Other badness is getting that one call at 1am which takes 3 hours to resolve. Then getting in late next morning feeling like a zombie.

Doaa Mahely • Oct 16 '20

Never ‘officially’, but I get called sometimes when fires happen. There was one stretch where I was the only developer available and must’ve had 4 or 5 fires within a few weeks. It’s always anxiety inducing when I get the call because I would have little idea what it’s about.

I don’t know how I would cope with being on-call officially, whether I’d be able to relax and forget about it or remain a ball of anxiety throughout.

Kevan Carstensen • Oct 15 '20

I've worked at a few places without an official on-call rotation (save for informal "who is available on thanksgiving?" spreadsheets). I felt like I was always on-call, which made it really hard to unplug. Would not do this again.

My current company has a small rotation of engineers who get paged in the middle of the night for critical system-level issues. The pages are infrequent (a few times/quarter) and actionable/important. The people on the rotation have enough general knowledge of our systems to deal with the pages. We have tooling wired into our PagerDuty config so non-engineers can page the right people if they need help or spot something wrong. It works really well. Getting paged is still a downer, but trustworthy monitoring/alerting and clear expectations about who deals with pages is a big help in terms of being able to unplug (I very rarely feel the temptation to check Slack on Saturday afternoon "just in case", for example).

I've heard of some places offering extra pay for people who opt-in to the on-call rotation. I like this idea (it's like paying literal interest to service tech debt), and it's something I'd ask for if I worked somewhere where I was paged multiple times per rotation.

Ben Sinclair • Oct 15 '20

I used to do this as a sysadmin, and these days I sometimes do it as a developer.
It's ok.

Don't let them make you do it on the regular without getting paid for it!

shelby spees (she/her) • Oct 15 '20

Definitely, the expectations need to be set up-front and pay negotiated to reflect the extra work, either as an overall salary increase or as overtime pay for non-exempt workers.