When we think of on-call, many software and infrastructure engineers think of late-night calls and disrupted family time or other life events when things go down unexpectedly or stop responding as expected.
The goal for the on-call team, or the person responding to the issue, is to get the site or app back online and working again as quickly as possible. This doesn't mean a permanent fix is always put in place at that point. Often, once things are stable and recovered, engineering teams will work with the on-call team or person, and engineering will ultimately come up with a more permanent fix once the reason(s) for the issue or event are better understood. The root cause does need to be found and communicated to the various stakeholders, but that's an eventual outcome of the blameless post-mortem process.
On-call for products, services, or infrastructure is often handled by a team that rotates weekly, or by a single person, responsible for responding to and resolving operational issues according to an agreed SLA or other support agreement. Operational or support issues are single or multiple events that impact a system negatively. As I said above, it could be that a site or app has suddenly become slow, is no longer responding, is returning errors, or is simply unavailable to end users or customers. Sometimes there are more obscure situations where a portion of the system isn't working correctly, maybe the checkout or payments process, while the rest of the site or app is up and working just fine.
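To make the weekly rotation and SLA idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the roster names, the rotation start date, and the 15-minute response target are hypothetical, and a real team would take these from its own schedule and support agreement.

```python
from datetime import date, datetime, timedelta

# Hypothetical values for illustration only; none of these come from a real agreement.
ROSTER = ["alice", "bob", "chen"]
ROTATION_START = date(2024, 1, 1)      # a Monday; the first week belongs to ROSTER[0]
RESPONSE_SLA = timedelta(minutes=15)   # assumed time allowed to acknowledge an alert


def on_call_engineer(day: date) -> str:
    """Return who is on call on `day`, cycling through ROSTER one week at a time."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ROSTER[weeks_elapsed % len(ROSTER)]


def response_deadline(alert_time: datetime) -> datetime:
    """Return the latest time an alert can be acknowledged and still meet the SLA."""
    return alert_time + RESPONSE_SLA


def sla_met(alert_time: datetime, acknowledged_at: datetime) -> bool:
    """Return True if the alert was acknowledged within the assumed response SLA."""
    return acknowledged_at <= response_deadline(alert_time)
```

For example, `on_call_engineer(date(2024, 1, 8))` falls in the second week of the rotation, so it returns the second name in the roster. The point is only that the rotation and the response target are explicit and checkable, which is what an agreed SLA gives the team.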
Typically we see both engineering and support teams working together during an on-call response to situations or events they are responsible for. This arrangement is called DevOps. To be clear, the DevOps label is often used to cover many different aspects of technology spanning development and operations. In this case, DevOps refers to a single team of engineers who write the code and also operate, maintain, and support it. Software engineers who write good software must understand how that software runs in production, particularly at scale.
To quote Werner Vogels: “You build it, you run it.” The engineers who wrote an application or service, for example, will be best placed not only to address the issue but also to formulate a fix or patch. By contrast, we see IT operations teams put on-call to support or recover applications or sites they didn't write, develop, or update, and without the proper documentation or context. They are often ill-equipped to resolve the issue or event when it goes beyond a restart or rollback, for example.
For the IT Ops on-call team or person, it can be like navigating a building without a map of the layout and without a flashlight, and the building may also be on fire, so time isn't on your side. This is where DevOps, as I have described it above, becomes the better way and the model for how on-call needs to work, particularly with software products and services. In the more traditional enterprise technology model, IT Ops and software development teams sit on opposite sides of a wall. Over the years, we have painfully found this doesn't work well, but there is hope.
Going back to Werner Vogels and “You build it, you run it” for a moment, there is another powerful and beneficial reason for the same DevOps team of engineers to support or run the code they wrote. Those engineers will be motivated, or maybe even annoyed, into fixing whatever issue(s) are keeping them up at night or disrupting their weekends. That is very different from having the IT Ops on-call team or person alerted to situations or events they didn't necessarily cause and may not be able to fix.
One other note on DevOps teams: no two are the same. Engineers come from across the spectrum in terms of on-the-job experience, education, and the organizations and teams they have worked with. Having a more senior or principal engineer working with someone with less experience is invaluable to everyone. Principal engineers are there for other team members to learn from, respect, and even agree to disagree with, so that no opportunity or idea is missed. Principal engineers have the experience. They have been around and have seen their share, and they can look at code, for example, and give us important insights into the what, the where, the how, and the when. A junior engineer, given the same scenario or review, may only give us the what, and maybe the when if we are lucky. Do we see the difference? It's simple: it boils down to experience, being right often, and having developed, tuned knowledge and intuition.
One final thought: having a better understanding of the customer's current requirements, in terms of an SLA or other support agreement, helps the DevOps team build, iterate, and improve upon what's already been built, deployed, and secured.