Software at Scale
Manageable On-Call for Companies without Money Printers
Google’s SRE book has become a Bible for software companies that want to learn better ops practices. It’s a well-written and informative book about concepts that previously weren’t talked about much in our industry. It’s also free to read online and is not part of a marketing or sales strategy with shady affiliate links or a “pay to upgrade” bonus edition.
But one chapter in particular (Being On-Call) has advice that seems to apply only to utopian companies with unlimited resources and headcount. So I wanted to frame some of their concepts in a way that’s useful for smaller organizations, explore some even more basic principles around both the technical and organizational aspects of managing sustainable on-call rotations for all sorts of software teams, and hopefully give some practical advice.
Responsibility for Manageable On-Call
First and foremost, the responsibility for a manageable on-call rotation falls squarely on the engineering manager of the team. Engineers alone cannot solve the problem of unsustainable on-call load without the help and support of management, since on-call sustainability involves both technical and organizational pieces. For example, a rotation might be overloaded for technical reasons like unreliable software or flaky alerts, but a small team size or poorly performing team members might also be part of the problem. If the team has an unsustainable on-call rotation, engineers should suggest technical solutions as well as escalate issues to management. If management seems unwilling or unable to solve the problem given reasonable time, whether by prioritizing investment in quality or by other means, engineers should be comfortable voting with their feet.
A common refrain is that leaving a team due to poor on-call will make the experience worse for the rest of the engineers on the team, and I personally have stayed on teams longer than I should have for that reason. Recognizing that dealing with attrition, for any reason, is ultimately management’s responsibility, and that as an engineer, you can only do so much, should help alleviate this concern. The company would not hesitate to let you go if you neglect to perform your job’s responsibilities, so it’s your right to do the same with your team.
Do You Really Need to Be On-Call 24/7?
One guiding assumption in the book is that teams have paging alerts that could go off 24/7. This is often true at Google scale, but teams should decide whether it applies to them from first principles, rather than assuming that more on-call is better. We can start by asking whether a total outage of the team’s product during non-business hours would cause a large loss of functionality for a large set of customers, serious revenue loss, or brand damage. This sounds simple but automatically implies that most internal tools teams and teams with limited product scope do not need a 24/7 on-call. For example, a team that owns an internal, non-revenue-critical data pipeline, the Jenkins cluster, or a non-critical feature on the app’s homepage should default to not having a 24/7 rotation (as long as the rest of the app can handle its failure gracefully).
To be clear, teams could choose to have a rotation that’s 24/7 even if they don’t satisfy these constraints. This trade-off between customer value and on-call sustainability should be explicit and might make sense in some cases (for example a large and quiet rotation).
Establishing an SLA and SLO
Products should have quality metrics to establish what’s important for customers and to drive investments into alerting and infrastructural improvements. These can take the form of SLOs (service level objectives) and external SLAs (service level agreements). The SLA is the lowest possible number that you and your customers will be okay with, not an aspirational number that you’ve heard Amazon uses. Many multi-billion dollar SaaS enterprise businesses have been built on two nines of availability. Even household name companies with a strong consumer presence do okay with three nines. Ideally, you’d only change it based on customer needs, for example, your sales team bringing it up as a problem during prospect conversations. Your external SLA has to be a bit lower than your SLO. For example, if your SLA is 99.8% monthly uptime, your SLO should be 99.9%, so that you don’t fly too close to the sun. This SLO will help drive various other factors of your on-call rotation.
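To make those nines concrete, here’s a quick back-of-the-envelope calculation of monthly downtime budgets (a minimal Python sketch; it assumes a 30-day month, and the availability targets are just illustrative):

```python
# Rough monthly downtime budgets for a few availability targets.
# Assumes a 30-day month; real SLA contracts often define the
# measurement window differently.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for availability in (0.99, 0.998, 0.999, 0.9999):
    budget = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.2%} uptime -> ~{budget:.0f} minutes of downtime per month")

# 99.00% uptime -> ~432 minutes of downtime per month
# 99.80% uptime -> ~86 minutes of downtime per month
# 99.90% uptime -> ~43 minutes of downtime per month
# 99.99% uptime -> ~4 minutes of downtime per month
```

Two nines leaves you over seven hours of slack a month; three nines leaves about 43 minutes, which is where the acknowledgment-time math below starts to bite.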
Acknowledgment Time
The general expectation should be that once a team member has acknowledged an alert, they have started debugging the issue, or they will page someone else for help. The flip side is that if the developer cannot debug the problem at that time, they shouldn’t acknowledge the alert so that it can escalate to a secondary or up the management chain.
It’s important to figure out the time for acknowledging paging alerts for both business and non-business hour rotations. If your team is comfortable setting it to thirty minutes or longer, you don’t need to think too much about it. Anything under that requires some due diligence to justify the level of attentiveness to the pager it demands - otherwise, the team loses the ability to do basic life things like going for a walk without a laptop, which shouldn’t be taken lightly.
To establish pager acknowledgment times, we should first go through the process of establishing an SLO. With this SLO, you can calculate a maximum monthly outage duration - how long your product can be down without breaking its SLO. Then, you have to estimate the average amount of time your team would need to resolve an outage. For example, if you assume that your team will take ten minutes to form a reasonable guess at the problem, the most common mitigation is a rollback, and a rollback takes twenty minutes on average, your time to mitigation is at least thirty minutes. If you can expect two such outages a month, your product will generally be down for an hour a month. If your SLO lets you be down for 120 minutes a month, and you want a buffer of 30 minutes, then you should budget for 90 minutes of outages. To keep downtime within those 90 minutes across an expected two incidents a month (45 minutes each), your pager’s acknowledgment time can be no more than fifteen minutes. This exercise could also reveal that rollbacks take too long for your product, which can help guide a decision towards investing in speeding them up.
In practice, this math implies that an SLO at or above three nines would almost certainly require a five-minute pager acknowledgment time, so be careful about aggressive SLOs.
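Here’s the same arithmetic as a small Python sketch (the inputs are the illustrative numbers from above, not recommendations):

```python
# Back-of-the-envelope pager acknowledgment time, following the reasoning above.
# All inputs are illustrative assumptions.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200; assumes a 30-day month

def max_ack_minutes(downtime_budget, buffer_minutes, incidents_per_month, mitigation_minutes):
    """Largest acknowledgment time that still fits the monthly downtime budget."""
    usable_budget = downtime_budget - buffer_minutes
    per_incident = usable_budget / incidents_per_month
    return per_incident - mitigation_minutes

# The example above: 120 minutes of allowed downtime, a 30-minute buffer,
# two incidents a month, ~30 minutes to diagnose and roll back.
print(max_ack_minutes(120, 30, 2, 30))  # 15.0

# Three nines of monthly uptime leaves only ~43 minutes of budget, so the
# same assumptions don't fit at all -- you'd need much faster mitigation
# and a very aggressive (roughly five-minute) acknowledgment time.
print(max_ack_minutes(MINUTES_PER_MONTH * 0.001, 30, 2, 30))  # negative
```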
Secondary On-Call and Escalations
Generally, if your team has a 24/7 rotation, you absolutely need a secondary on-call. By default, the escalation chain for alerts should then go up the management chain. The leadership hierarchy is then aware of, and has to deal with, the incidents that come their way, which helps them spot signs of problems in their teams early and keeps management incentivized to improve quality and reduce alerts.
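As an illustration, the escalation chain can be thought of as plain data that any paging tool can encode; the levels and timeouts below are hypothetical and reuse the fifteen-minute acknowledgment example from earlier:

```python
# A hypothetical escalation chain expressed as plain data; the levels and
# timeouts are illustrative and not tied to any particular paging tool.
ESCALATION_CHAIN = [
    ("primary on-call", 15),       # minutes to acknowledge before escalating
    ("secondary on-call", 15),
    ("engineering manager", 15),
    ("director", 15),
]

def who_gets_paged(minutes_since_alert: int) -> str:
    """Walk up the chain until we find the level whose window we're still in."""
    elapsed = 0
    for level, ack_timeout in ESCALATION_CHAIN:
        elapsed += ack_timeout
        if minutes_since_alert < elapsed:
            return level
    return ESCALATION_CHAIN[-1][0]  # the top of the chain holds the page

print(who_gets_paged(5))    # primary on-call
print(who_gets_paged(20))   # secondary on-call
print(who_gets_paged(70))   # director
```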
If your team doesn’t have a 24/7 rotation, don’t bother with a secondary. Depending on the expected workload, planning processes should assign less project work to the on-call engineer, and on-call work should be their top priority.
Team Sizes
If your pager acknowledgment time is fifteen minutes or under, then on-call rotations have to be large, and on-call weeks have to be generally uneventful in order to remain sustainable. Otherwise, teams might enter a death spiral when there are eventualities like attrition. For example, a rotation might feel sustainable for several months but would become immediately unsustainable if a team member or two quit after vesting cycles/bonuses, and more team members would then start looking for alternative employment.
The Google SRE book recommends a minimum team size of at least eight, unless rotations can be split across team members in two or more time zones. In practice, this can be challenging for teams to maintain. Headcount is often scarce, and decision-makers have competing priorities to consider while allocating team members. Also, most companies do not have engineering offices in multiple time zones. If it’s difficult to scale up your team, your best bet is to make the rotation extremely quiet via technical investments. A quiet on-call rotation, to me, is one with fewer than four paging alerts every week and zero non-business-hour paging alerts. If an on-call rotation can manage that consistently, then a smaller rotation size (no fewer than six) is acceptable. It’s still disruptive to be on-call, but at least the stress is taken out of the rotation.
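As a quick sanity check on rotation size, here’s what a weekly, single-primary rotation costs each engineer per year (a simple sketch assuming equal shares and no leave; a secondary rotation roughly doubles these numbers):

```python
# On-call burden per engineer for a weekly, single-primary rotation.
# Assumes everyone takes an equal share and nobody is skipped for leave.
WEEKS_PER_YEAR = 52

for team_size in (4, 6, 8, 12):
    weeks_on_call = WEEKS_PER_YEAR / team_size
    print(f"team of {team_size:>2}: on-call ~{weeks_on_call:.0f} weeks/year "
          f"(once every {team_size} weeks)")

# team of  4: on-call ~13 weeks/year (once every 4 weeks)
# team of  6: on-call ~9 weeks/year (once every 6 weeks)
# team of  8: on-call ~6 weeks/year (once every 8 weeks)
# team of 12: on-call ~4 weeks/year (once every 12 weeks)
```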
Hand-offs
On some fixed frequency (most commonly weekly), members of the rotation should meet to discuss alerts, hand off on-call to the next team member, and propose action items. Each action item should have a proposed priority and size (most commonly in story points). The technical leader of the team (or someone delegated by them) should globally order all action items for the next round of planning, with input from the rest of the team if necessary. The engineering manager should ensure at least X story points of on-call work are scheduled every planning cycle and assign work so that action items are evenly distributed across the team. This process ensures consistent momentum of improvement to the rotation without overburdening a few team members with action item work.
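To make the bookkeeping concrete, here’s a sketch of that scheduling step (the field names and the story-point budget are hypothetical); it greedily fills the budget with the highest-priority action items:

```python
# Hypothetical hand-off bookkeeping: globally order action items and fill a
# fixed story-point budget each planning cycle. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ActionItem:
    title: str
    priority: int       # lower number = more urgent
    story_points: int

def schedule_on_call_work(items: list[ActionItem], budget: int) -> list[ActionItem]:
    """Pick the highest-priority action items that fit within the point budget."""
    scheduled, remaining = [], budget
    for item in sorted(items, key=lambda i: i.priority):
        if item.story_points <= remaining:
            scheduled.append(item)
            remaining -= item.story_points
    return scheduled

backlog = [
    ActionItem("Add retries to flaky health check", priority=1, story_points=2),
    ActionItem("Automate rollback runbook step", priority=2, story_points=5),
    ActionItem("Tune noisy latency alert threshold", priority=1, story_points=3),
]
for item in schedule_on_call_work(backlog, budget=8):
    print(item.title)
```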
An action item should be created every time a paging alert goes off more than once. I am explicitly against Low Priority Alerts wherever possible.
Conclusion
These were some ideas on managing on-call rotations sustainably. Each of these ideas has many layers and the corresponding problems could be addressed in multiple ways. For example, since management is ultimately responsible for on-call sustainability, you might have some managers who want to be on the rotation. Hopefully, this post was valuable and gave you something to think about.