Back when our team was small, we put together a single on-call rotation. Every dev was in the rotation and would go on-call for one week at a time. When we first started the rotation our team had 5 devs on it. One year after another passed, and despite our team growth we still stuck with the single rotation. Eventually, the team was so big that people were going on-call once every 3-4 months. This may seem like a dream come true, but in reality, it was far from it.
The single on-call rotation was miserable for just about everyone for a variety of reasons.
The large rotation meant that on-call shifts were so infrequent that devs were not able to get the experience and reps they needed to know how to handle on-call issues effectively. In addition, our code base had grown tremendously and there were so many things being developed at once that when a problem arose there was a good chance the on-call dev knew nothing about it or the code that was causing it.
This led to panicked developers often turning to the Site Reliability Engineering (SRE) team for help with issues. Constantly having to jump in and help with on-call issues quickly began to drain a lot of the SRE team's time and resources. Essentially, the team began to act as if they were on-call 24/7. The constant bombardment of questions and requests came very close to burning out the entire team and took away valuable time they needed to work on their own projects.
Besides leaving us with a burned-out and inefficiently used SRE team, the single rotation had another problem: developers felt like they had no ownership over the code they were supporting. One person would write code and another person would be the one debugging it if it broke. The app was so big that there was no way anyone could have a sense of ownership over the production code, since there was just too much of it and they were expected to support all of it.
Due to the size of our engineering organization, we now have 3 separate dev teams. Each team has 5-7 devs on it plus a manager. Each team also is given its own set of projects. However, our main application still is a single monolithic Rails app. All three teams work equally across the entire codebase. Unlike other apps which have very separate backend components owned by individual teams, there are no clear or obvious lines of ownership. Solving this issue would prove to be the hardest task when it came to fixing our on-call system.
We knew we had to break up the rotation if we wanted to continue growing, but the question was how? Despite all of the developers working across a single application with no clearly defined lines of ownership, we devised a plan that broke our single rotation into 3, one for each of our 3 dev teams. This led to shorter rotations, which meant more reps for devs. As backward as it may sound, being on-call more is a benefit because devs have become a lot more comfortable with it and are able to really figure out a strategy that works best for them.
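To make the "more reps" point concrete, here is a rough sketch of the arithmetic. It assumes one-week shifts (as described above) and that every dev in a rotation takes turns in order. The function name is hypothetical, and the 15-dev figure is only an illustrative size for the old single rotation, not a number from our records.

```python
def weeks_between_shifts(rotation_size: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one shift to the start of the same dev's next shift."""
    return rotation_size * shift_weeks


# Illustrative sizes: a 15-dev single rotation versus team rotations of 5 and 7 devs.
for size in (15, 5, 7):
    print(f"{size} devs in rotation: on-call every {weeks_between_shifts(size)} weeks")
```

Under those assumptions, a dev in a 5-7 person rotation is on-call every 5-7 weeks instead of every few months.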
3 rotations allowed the devs to get more reps being on-call, but that still left the biggest problem of all: ownership. No one wants to support something they don't feel like they own. To give devs that sense of ownership, we chose to split up the on-call application ownership amongst the 3 dev teams. It didn't happen overnight, but with a few meetings and a lot of team discussions, we were able to break up everything in our application between the 3 teams.
- We broke up all the background workers, for example:
  - Team 1: Indexing jobs
  - Team 2: Overnight reporting jobs
  - Team 3: Client communication jobs
- We broke up all the individual service alerts, for example:
  - Team 1: Redis alerts, Queue backup alerts
  - Team 2: Elasticsearch alerts, API traffic alerts
  - Team 3: MySQL alerts, User load page alerts
- We broke up the application components, for example:
  - Team 1: Users and Alert models and controllers
  - Team 2: Asset and Vulnerability models and controllers
  - Team 3: Reporting and Emailing models and controllers
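One way to picture the split is as a simple lookup table from each worker, alert, or component to the team that owns it. The sketch below only mirrors the example entries listed above. The `OWNERSHIP` structure and the `owner_of` helper are hypothetical names used for illustration, not how our paging setup is actually implemented.

```python
from typing import Optional

# Example ownership map, mirroring the sample split above (not the full list).
OWNERSHIP = {
    "Team 1": {
        "workers": ["Indexing jobs"],
        "alerts": ["Redis alerts", "Queue backup alerts"],
        "components": ["Users", "Alert"],
    },
    "Team 2": {
        "workers": ["Overnight reporting jobs"],
        "alerts": ["Elasticsearch alerts", "API traffic alerts"],
        "components": ["Asset", "Vulnerability"],
    },
    "Team 3": {
        "workers": ["Client communication jobs"],
        "alerts": ["MySQL alerts", "User load page alerts"],
        "components": ["Reporting", "Emailing"],
    },
}


def owner_of(kind: str, name: str) -> Optional[str]:
    """Return the team that owns the named worker, alert, or component, if any."""
    for team, owned in OWNERSHIP.items():
        if name in owned.get(kind, []):
            return team
    return None  # Nothing matched: the item has no owner yet.


print(owner_of("alerts", "Redis alerts"))
print(owner_of("workers", "Overnight reporting jobs"))
```

Keeping the map in one place also makes it easy to spot anything that has no owner, which matters if pieces need to move between teams later.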
Once the lines had been drawn, we made sure to stress to each of the dev teams that despite doing our best to balance the code equally we might still have to move things around. This showed the devs that we were fully invested in making sure this new on-call rotation was fair and better for everyone.
After the code was split up the SRE team took time to sit down with each dev team to thoroughly review the app components, workers, and alerts they now owned. We went over everything from common issues to exactly what every single piece of code did and how it affected the rest of the application. These sessions have given devs a lot more confidence in their ability to handle on-call situations because they now have a clear picture of what they own and how to handle it. Even though they haven't built some of the code themselves, they have an understanding of exactly how it works and what it is doing.
In addition to educating each team on their section of code, we also took advantage of GitLab's CODEOWNERS file. The CODEOWNERS file allows you to specify which people or teams in your organization own a file. When that file is updated by anyone in a PR, the owner of the file will automatically be tagged for review.
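For readers who have not used one, a CODEOWNERS file is a list of path patterns, each followed by its owners. The fragment below is a minimal sketch, not our actual file: the `@example-org/team-N` group names are made up, and the paths simply assume a conventional Rails layout for the example components mentioned earlier.

```text
# Hypothetical CODEOWNERS sketch. Group names and paths are illustrative assumptions.

# Team 1: Users and Alert models and controllers
/app/models/user.rb                          @example-org/team-1
/app/controllers/users_controller.rb         @example-org/team-1
/app/models/alert.rb                         @example-org/team-1
/app/controllers/alerts_controller.rb        @example-org/team-1

# Team 2: Asset and Vulnerability models and controllers
/app/models/asset.rb                         @example-org/team-2
/app/controllers/assets_controller.rb        @example-org/team-2
/app/models/vulnerability.rb                 @example-org/team-2
/app/controllers/vulnerabilities_controller.rb @example-org/team-2
```

Each line pairs a path with the group that should be pulled into review when that path changes.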
Originally the SRE team was the fallback for the on-call dev. If the on-call dev had questions or needed help they would talk to the SRE team member that was on-call that week. Our SRE team only has 3 members currently so you can see why we got burned out being the constant fallback. With the new system, the 3 on-call devs all act as fallbacks for each other. If any of them get overwhelmed or stuck on an issue they are encouraged to reach out to one of the other on-call devs for help.
In addition to the above changes, we also removed some duties from the on-call devs. Prior to these on-call rotation changes, the on-call devs were responsible for determining whether a status page or any customer messaging was needed during an incident. We have since moved that responsibility to the support team. The support team is the closest to the customer and is therefore the best equipped to communicate any problems. When an incident occurs that affects customers, the support team is notified and is responsible for determining if a status page or any customer communication is needed. Giving this responsibility to the support team allows devs to focus solely on diagnosing and solving the problem at hand.
Originally, the SRE team had set up all the alerting and monitoring tools. However, once we turned the alerts over to each of the dev teams they took them and ran. Because each team felt a renewed sense of ownership over their alerts they started to improve and build on them. Not only did they make more alerts, but they also improved the accuracy of the existing ones.
Even though one team might edit the code that another team supports, there is still a keen sense of ownership for the supporting team. The supporting team acts almost as the domain experts over their section of code. Using the CODEOWNERS file ensures that the supporting team is made aware of, and can sign off on, any changes made by other teams. Because each chunk of code a team supports is small, they can actually learn and support all of it, unlike before, when it was just too much for anyone to handle.
With 3 devs on-call at once and each one of them focusing on a smaller piece of the application, they can spot problems a lot faster. Each team also wants to ensure that when things do go wrong they are caught quickly, which is why many have chosen to tune their alerts to warn them of problems even earlier than the alerts were originally set up to.
Triaging and figuring out the root cause of issues is also a lot faster. Teams are intimately familiar with their alerts and the pieces of code they own which allows them to figure out problems quicker than before.
Having 3 devs on-call at once means that none of them ever feels alone. If things start to fall apart in one section of the application, the dev that owns that part knows there are two others available to help if they need it. Just knowing you have someone else easily accessible can do wonders for your confidence when you are on-call.
As I stated before, each of the 3 dev teams works across the entire application. This means that there are times when a team might work on code that ends up causing an alert for another team to go off.
For example, let's say a high-load Redis alert goes off. There is no guarantee that the team that owns that alert also owns the piece of the application causing the problem. However, because the Redis alert team is experienced with the alert, they know how to triage it quickly and efficiently. Then, the triaging team can easily hand it off to the team that owns the problem component. This cross-team communication has helped teams stay current with each other's work, but they never feel like they are having to fix another team's code.
On-call is something that many people in this industry dread and it shouldn't be that way. If people are dreading on-call then something is broken with your system. Sure, everyone at some point will get that late night or weekend page that is a pain, but that pain shouldn't be the norm. If on-call makes people want to pull their hair out ALL the time, you have to figure out the problem and fix it.