I still recall the first time I was dragged into a live incident call. It was many years ago, but I remember it as if it were yesterday. That terrifying feeling of being pulled into a live conference full of strangers who will probably expect you to be the superhuman expert you are meant to be, and not a scared person more paralysed than ever by overwhelming impostor syndrome. I mean, are they even friendly at all? Are they pulling you into this call to blame you and the rest of your team? Will you even have a job after today?
I'll be honest. Looking back on that time, I did absolutely everything wrong. It took me ages to understand what the issue was. I spent an awful amount of time trying to debug it. I spent even more time writing a patch and deploying it. And yes, it did work, and it fixed the problem. But it also made very clear that I had many things to understand and improve about incident management.
Let's be clear: no one, absolutely no one, likes being on-call. Who wants to be woken up at 2AM to troubleshoot a live production issue? Nobody. However, the truth is that modern engineering practices ask for engineers to be on-call. And believe me, it is a good thing. In my whole career, there hasn't been a place where I have learnt more about our products, our operations, other teams, and even the source code I write than on incident calls.
So, when your manager announces a new on-call policy that includes the engineering team, that is not bad at all. On the other hand, it is very important to set some rules. There are definitely things that your manager and the company need to guarantee for this to be a good thing rather than a toxic measure.
Let's look into the things the team should demand that an engineering leader meet before getting on-call:
This one is simple and there isn't really much to add. On-call time is work and it must be paid. Let's repeat it: it must be paid. That should hold true even when you don't have to deal with incidents.
Being available for on-call purposes demands that you and your loved ones be ready for interruption at any time. You will need an internet connection and some device, likely a computer, nearby, and of course you have to be ready to sacrifice your leisure time to support your company. The bare minimum ask is to have incident handling time paid, but bear in mind that, although at different rates, most companies will pay employees just for being ready to take incident calls. Finally, incident handling time should usually be paid at a premium compared with regular working hours.
A runbook is essentially a set of clear instructions that help to operate a service. Part of a runbook is usually the process that determines the steps to take in case something goes wrong with that service. With a properly documented runbook, a person not familiar with the service itself still has a good chance of addressing the issues by going through the troubleshooting steps.
Things like "What to do when service foo takes 100% CPU?" or "What are the steps to redeploy service bar?". That's the kind of information you want other people to find in those runbooks, so you don't have to jump into the call and do those things yourself.
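To make this concrete, here is a minimal sketch, in Python, of what a machine-readable runbook could look like. The service names, symptoms, and steps are hypothetical examples, not taken from any real system:

```python
# A minimal sketch of runbook entries as structured data.
# Service names, symptoms, and steps are hypothetical examples.
RUNBOOKS = {
    "foo": {
        "symptom": "Service foo takes 100% CPU",
        "steps": [
            "Check recent deployments in the changelog",
            "Capture a CPU profile if possible",
            "Restart one instance and observe CPU usage",
            "If CPU stays at 100%, roll back to the previous version",
        ],
    },
    "bar": {
        "symptom": "Service bar needs to be redeployed",
        "steps": [
            "Trigger the redeploy pipeline for bar",
            "Watch the health checks until all instances are green",
        ],
    },
}

def troubleshooting_steps(service: str) -> list[str]:
    """Return the documented steps for a service, so a responder
    unfamiliar with it still knows where to start."""
    entry = RUNBOOKS.get(service)
    return entry["steps"] if entry else ["Escalate: no runbook entry found"]
```

Even if your runbooks stay as wiki pages rather than code, the same structure applies: a symptom anyone can recognise, followed by steps anyone can execute.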
Knowing what to do and what not to do while you are trying to troubleshoot a live urgent issue is very important. And spoiler alert, usually the solutions will be either rolling back or redeploying.
Most of the time, services go down due to unexpected change. It could be that a new version of the service has been deployed and it has a bug. Or perhaps a deployment happened 10 hours ago and a resource leak has been slowly degrading performance ever since. Once the change is identified, the most obvious solution is to roll it back. Sometimes you won't be able to roll back, but perhaps there are toggles or config variables that you can turn off before triggering a redeployment to get the issue fixed. Some other times you will need to scale your service up, vertically or horizontally, either to buy time or because it genuinely needs it due to healthy, natural usage growth.
The above solutions have something in common: they are based on understanding changes, not on understanding source code. You don't need to be an expert in the internals of a particular piece of software to operate it. This sort of thing, which will sound totally natural to an ops person, is a fundamental paradigm shift for a software engineer who is used to mentally mapping everything that happens in a service to lines of code.
Learning and being trained on how to successfully operate a service in a simple manner is a fundamental ask that any manager or leader should address and help their team with.
If change is the main cause of failure, then having a clear list of all the changes any system goes through seems like a very basic and fundamental thing to have.
A changelog can be implemented in many ways. It can just be a wiki page, or it can be done via JIRA tickets, or perhaps via GitHub issues, or simply built from notifications dumped into a Slack channel where people can check what has changed. You can go as fancy as you want. What really matters is having a centralised place where anyone can find what has changed, with clear timestamps and, if possible, authors.
Needless to say, those changelog entries shouldn't be just meaningless titles. Whatever relevant information can be added to the changelog might be crucial for others to understand the change and to deal with failure more effectively. I know, it is kind of silly and obvious: add as much information as you can. But it will likely be the difference between being woken up at night to help troubleshoot something nobody understands and having a wonderful night of sleep.
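As a sketch of how simple this can be, here is a hypothetical Python helper that appends structured entries, with timestamps and authors, to a JSON-lines changelog file; the function and field names are illustrative, not a real tool:

```python
import json
from datetime import datetime, timezone

def record_change(log_path: str, author: str, title: str, details: str) -> dict:
    """Append a change entry, with a UTC timestamp and an author,
    to a JSON-lines changelog file that anyone on-call can grep through."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "title": title,
        # The context future responders will thank you for:
        "details": details,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

The exact storage does not matter; what matters is that every deployment, config change, and toggle flip leaves a timestamped, attributable trace in one place.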
Having a large team for incident handling is great. A large team means it will take longer for any individual engineer's turn to come around. For example, an eight-person team doing one-week on-call shifts implies that every engineer will be on-call just one week every two months. That's not too bad. But it also means it is very easy to lose context when you start your on-call shift. What has happened in the last two months? I mean... did anything happen last week?
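The rotation maths above can be sketched in a few lines of Python (the team names are placeholders):

```python
from itertools import cycle

def build_rotation(engineers: list[str], weeks: int) -> list[str]:
    """Assign one engineer per week, cycling through the team in order."""
    rotation = cycle(engineers)
    return [next(rotation) for _ in range(weeks)]

# A hypothetical eight-person team over 16 weeks (roughly four months):
team = [f"engineer-{i}" for i in range(1, 9)]
schedule = build_rotation(team, 16)
# Each engineer appears exactly twice in those 16 weeks,
# i.e. one week on-call roughly every two months.
```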
Doing a weekly handover meeting, where the person exiting the shift explains to the others how the week has gone and what the main problems observed were, will give everyone in the team context about the overall service status and, most importantly, will give the person going on-call some expectations for the next week.
Sometimes things will go south. That's unavoidable. And at those moments, when things are tricky, problems become very hard to solve, and the calls to be made are big, that's usually where escalations and leaders are needed.
Typically, on-call rotations on reasonably large systems are tiered. You will have a first tier with software engineers familiar with the daily service activities and operations, then a second tier, maybe with technical leaders and engineering managers, then a third tier with directors, and so on.
I honestly can empathise with a person feeling that leaders should be on those incident calls. I mean, who wouldn't be pissed off being on a call at 2AM while your boss is sleeping, right? Especially if the incident might be due to leadership choices like postponing work on technical debt.
But at the same time, I do also believe that having a single tier, or having leaders as part of the normal rotation, is not really a good thing. It might be needed when the team is small; a two-person team, for example, will greatly benefit from having an additional person on-call. But when the team is larger, it becomes less relevant to have the engineering leader regularly on-call.
In fact, it is way, way more important that engineering leaders are always reachable for escalation: on-call at any time, but only when needed.
When incidents happen, it is important to do some retrospective thinking on why the problem happened. Some teams choose to write RCA (Root Cause Analysis) documents and then hold an internal session to analyse the incident.
Another technique that has become very popular over time is to run a 5 Whys session, where multiple stakeholders, internal or external, meet and ask candid questions in order to find the reasons for what has happened.
Whatever the format, it is very important to bear in mind that in these sessions, blame is never the goal. Mature organisations know that failure happens every day. The most important goal of any retrospective, whatever the shape, format, or stakeholders involved, is to understand why and to learn how to prevent similar incidents from happening again. That's also why it is important to get clear action items as an outcome of any retrospective and to make sure those items are prioritised over ongoing regular tasks.
An error budget process is key to making sure incidents won't happen again. More often than not, the changes that originate incidents are caused by engineering debt. Debt can be technical, for example code that is causing performance issues or an algorithm that has become too inefficient due to increased usage.
Debt can also happen due to inadequate processes. Examples of process debt are teams letting in code that does not meet quality standards, perhaps because code coverage rules are not enforced, or static analysis is not executed, and so on. Process debt can also include well-known practices for releasing software safely that, for whatever reason, the team hasn't embraced yet. Practices like launching canary releases first, using feature toggles to gradually turn features on (or to turn them off when something goes wrong), or having the capability to change configuration dynamically without redeployments. All these practices can help to avoid incidents, or to react much faster when incidents happen.
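As an illustration, here is a minimal sketch of what a feature-toggle check with a kill switch could look like. In a real system the flags would live in a dynamic configuration store rather than an in-memory dict, and the flag name and percentages here are hypothetical:

```python
# A minimal sketch of a feature toggle with a gradual rollout and a
# kill switch. Flag names and values are hypothetical examples; real
# systems would read them from a dynamic config store, not a dict.
FLAGS = {
    "new-checkout-flow": {"enabled": True, "rollout_percent": 10},
}

def is_enabled(flag: str, user_bucket: int) -> bool:
    """Gradually roll a feature out by bucketing users into 0-99.
    Setting 'enabled' to False acts as an instant kill switch,
    with no redeployment needed."""
    cfg = FLAGS.get(flag)
    if cfg is None or not cfg["enabled"]:
        return False
    return user_bucket < cfg["rollout_percent"]
```

The design choice that matters during an incident is the kill switch: flipping one value turns the feature off everywhere, which is far faster and safer than rolling back a deployment at 2AM.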
In an error budget process, when a certain service or team does not meet the quality thresholds that have been set, that team must commit to stopping development of any functionality that is not related to the incident. It does not matter how important that functionality is or how big the customer asking for it is. The team needs to stop whatever it is doing and start addressing the technical issues. That might take just a couple of days, but it could also take many weeks. Having a process that allows teams to focus on technical debt is healthy, although, caution, it should not be driven by incidents alone.
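The budget arithmetic itself is simple. Here is a sketch, assuming an availability SLO expressed as a fraction and measured over a rolling window of days:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for a given availability SLO.
    For example, a 99.9% SLO over 30 days allows about 43.2 minutes."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

def budget_exhausted(slo: float, window_days: int, downtime_minutes: float) -> bool:
    """True when the budget is spent: the signal to stop feature work
    and start paying down the debt that is causing the incidents."""
    return downtime_minutes > error_budget_minutes(slo, window_days)
```

The exact SLO numbers and windows are assumptions for the example; the point is that "stop feature work" becomes an objective threshold rather than a judgment call made mid-incident.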
We are getting to the end here, and this is probably the luxury ask, certainly something that not every team can afford. But when you can, it works great. When you have met all the points above, it becomes easier to have engineering teams in other timezones covering multiple services. When you have done your homework setting clear rules and processes for handling failure and dealing with incidents, and when everyone is trained and knows what to do, supporting services you are not familiar with is no longer a scary thing.
And when that happens, you don't need the core service team to be the only ones supporting a particular piece of software. It is now possible, provided you have the size and the budget, to have teams in different timezones doing on-call in such a way that engineers can have their night shifts covered and only have to do on-call work during working hours.
Again, this is common in large corporations but really a luxury ask.
Being on-call might seem a scary thing, but it should not be if you have the support of your company and a good leader. Reflecting back, I can only see my on-call time as a level-up experience. For a software engineer, the amount of knowledge and experience that can be gained by escaping our personal source code bubbles and jumping in to solve our customers' pains is unbelievable.
At least that has been my experience. Hopefully you found this essay useful. Best of luck with your on-call!