Back when our team was small, we put together a single on-call rotation. Every dev was in the rotation and would go on-call for one week at a time....
Even with the best escalation and call-rotation structures, you can still burn out your engineers if you have engineers that are "know it alls" (whether by design or by circumstance). And it's not merely a case of "well, they should document and train the rest of the team better": some people just naturally remember minutiae and esoteric information about technical solutions - stuff that, even if you did document it, no one would really know how to find (or understand if they did).
At a prior job in the early 2000s, we had a multi-tier, 24/7/365 on-site support staff. Supplementing them was a call-rotation for the senior engineers. Unfortunately, even with all of that, some problems always ended up in my lap because I was the only person who knew the system better than the vendor did. On the one hand, it literally meant an extra car's worth of OT-pay in the span of 12 months. On the other... I can say without hyperbole that 15+ years later, you can still make my wife shudder by playing a sound sample of the old NexTel phone's ringer.
In any case, massive OT-pay opportunities aside (and a year-end spot-bonus), it meant that I pretty much had the option between divorce and finding a new job.
I feel this so much! I was that person with all the domain knowledge that everyone would turn to and it eventually did burn me out. Luckily my coworkers intervened.
I Can't Do It All: My Burnout Story
Molly Struve ・ Feb 27 ・ 3 min read
Fortunately, it was more my wife that got burnt out by the long 3AM/weekend/holiday/vacation phone calls rather than me. I generally just lumped it under "meh: it's time-and-a-half and whether this call lasts 1 minute or 50, I get to bill the entire hour" (which can go a long way towards not getting burnt out when you're still in your early 30s). =)
Burning out your family or yourself is something no pay can compensate for... This industry needs to change its approach, or the high turnover will continue.
I've been in IT for 25 years now. Haven't really seen a change in the tendency to throw more and more work at the people that best demonstrate the ability to get things done.
Nicely written, Molly! Doing on-call is truly the tour-of-duty of our industry. An important, and often thankless, responsibility that has many hidden heroes. It is important to constantly make sure everyone in the org recognizes who is doing a good job of keeping the lights on 99.99% of the time...because everyone knows when there is an outage. Sadly, that is often what users remember the most. Example: the AWS US-EAST-1 Region outage in February of 2017 that was blamed on an employee's error. Ouch.
Hey!
This is a great write-up, thank you for sharing!
We had some similar problems, though at a different scale, at Intercom and used a bunch of similar techniques to improve out-of-hours oncall. We also emphasized ownership, though we basically made the call to have teams oncall for their own stuff during office hours, and a shared oncall team out of hours. We also have a Rails monolith, which makes it a bit easier to share the oncall work :)
I wrote about it here if you are interested: intercom.com/blog/rapid-response-h...
Looking forward to giving your Oncall Nightmares podcast a listen :)
Thanks! Def will give your post a read over break!
Podcast was just released this morning. Hope you enjoy! podomatic.com/podcasts/oncallnight...
Hi Molly, great article!
Let me share my admittedly controversial view. Isn't the mere existence of a dedicated team of SREs a problem?
IMO several companies have been drinking the Google Kool-Aid and "rebranding" Infrastructure / Networking / Support people and eventually Devs with some tooling expertise into "SREs". Companies promise more time for people to work on automation and product optimization, but, in reality, they are still a small number of people mostly putting out 🔥 on projects that they know little about. This is still silo thinking in disguise.
On the other side of the wall, Devs without the required Ops experience are churning out half-baked, overly complex, wasteful and insecure deployments. Great, now Developers "own" it and everyone can be on call stressed, burned-out and panicking. Hooray, DevOps! Quick, let's all send our CVs to the next company! We'll all get a better paying job until the cycle repeats itself.
Alternative solution: Hire more people with Ops experience and an SRE mindset, and place them in real DevOps teams (yes, you heard me: at least one talented Ops person per team). Let people with Ops backgrounds and Dev backgrounds really work together and learn from each other from the beginning. Make sure that SREs (as well as Devs) have a good relationship with their respective Chapters so that standards naturally emerge. Devs learn some tricks from Ops experts, write better infrastructure as code and get to understand / account for what is required to monitor and troubleshoot the product that they are building from the start. Ops actually get the time to work on automation and learn the feature side of things as it's being built (plus they also get proper code reviews and learn a trick or two from experienced Devs. No disrespect, but going through Python / Bash / Ruby scripts written by Ops guys can be a nightmare. As bad as the Terraform / Ansible stuff that Devs put together). With enough time the team finds its pace and agrees on a sane on-call schedule.
SRE guy/girl is a very senior expert with a great overall picture of the product and the ability to optimize things across the board? No worries... Give him/her (and everyone else) the freedom to move across teams. But still make sure that he/she's part of a team that actually delivers features. Make sure that he/she stays in the team just long enough to spread some of their knowledge to the team, as well as learn more of the specifics of what is being built ATM.
I'm not saying that there's no place for dedicated Infrastructure and First Line Support teams with skilled engineers and innovative solutions. Nor am I saying that we can neglect the overall picture and the few Devs that can actually handle it. I'm just saying that Google's SRE model is not for everyone. As it stands, it feels like most companies are getting DevOps wrong (as much as they get the core values of Agile wrong). IMO a team of SREs, huge or small, will always be simultaneously overworked (responsibility-wise) and underutilized (in terms of their actual potential). Even if you are blessed with a few people with the rare combination of domain / infrastructure and development expertise to do the job properly, it still sounds like a huge waste of their time and brainpower.
What do you think? Am I right? Or am I getting the SRE side of things completely wrong?
I can't speak for other companies, but I do know that for us at Kenna our SRE team has been a godsend! Having a team that can focus solely on the reliability and scalability of our system has been a big win for us, and it has greatly improved the quality of our platform for customers and for our dev and operations teams internally. I actually wrote a blog post about what our SRE team focuses on.
What It Means To Be A Site Reliability Engineer
Molly Struve ・ Apr 17 ・ 5 min read
As for knowledge sharing, we actually do a dev SRE rotation that allows devs to get a peek at what we focus on as SREs. In addition, the SRE team works very closely with our dev teams pumping out features to ensure the features are performant. They are definitely not off in a corner on their own.
Sure, having an SRE team might not be for everyone but in our case, it has been an enormous benefit.
Nice article, Molly!
I have a question for you. How do your on call members get compensated?
The companies I worked for (in Germany) typically had a one week rotation, and the person on call got compensated for being on call as well as the times they actually had to work. These companies also never paid for more than one person to be on call at the same time (so no fallback).
So now we still face the problem that the devs are on call maybe once every three months. This means, they don't get the required experience for general ops-stuff as well as being on call.
Thanks!
Being on call is part of our job so we do not make extra money for it, unfortunately.
We are also a team of 5 and each member is on-call for one week.
This works well if everybody is working full time (100%). How about if one or more members want to reduce their activity and go part-time (from 100% to 80% or 60%, working only 3 or 4 days per week instead of 5)? The one-week on-call is no longer possible.
Is one-day on-call duration a solution in this case? Any idea is welcome.
Thanks.
Alex
One day I would think would be too short and require too much context switching. Half weeks might be a good option. Definitely not easy when not everyone is on the same work schedule.
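For anyone who wants to try the half-week idea, here is a minimal Python sketch of what such a schedule could look like. It is only an illustration under stated assumptions: the function name `half_week_shifts`, the member names, and the Monday-Wednesday / Thursday-Sunday split are all made up for the example and are not how any team in this thread actually schedules.

```python
from datetime import date, timedelta
from itertools import cycle


def half_week_shifts(start, weeks, members):
    """Return (first_day, last_day, member) tuples, two shifts per week.

    Assumptions for illustration only: `start` is a Monday, and each week
    is split into a Monday-Wednesday shift and a Thursday-Sunday shift.
    """
    rotation = cycle(members)  # hand out shifts round-robin
    shifts = []
    for week in range(weeks):
        monday = start + timedelta(weeks=week)
        thursday = monday + timedelta(days=3)
        # Both end dates are inclusive.
        shifts.append((monday, thursday - timedelta(days=1), next(rotation)))
        shifts.append((thursday, monday + timedelta(days=6), next(rotation)))
    return shifts


if __name__ == "__main__":
    team = ["ana", "ben", "chris", "dee", "eli"]  # hypothetical names
    for first, last, who in half_week_shifts(date(2024, 1, 1), 5, team):
        print(f"{first} to {last}: {who}")
```

With an odd team size, plain round-robin makes each person alternate between the early and late half of the week. A part-timer who never works certain days would need to be pinned to one half instead, which this sketch does not handle.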
My company is betting on the format this article advises against: put more people in the rotation so that you are only on call about once every 2 months, but while on call you cover a lot of pieces of the product that you don't own during business hours. It definitely won't work.
To me, the way to go is to always roll back if possible and then fix it properly during office hours. It's not fair to WORK outside office hours, no matter how well paid it is.
Great post, Molly. All of this feels extremely familiar at my company, and we're heading in a similar direction. Really nice to see a positive payoff.
Great post!