DEV Community

Page It to the Limit

Building an Incident Response Plan With John Allspaw

John Allspaw joins us this week to talk about incident response, and helping organizations build their own NTSB (the National Transportation Safety Board, a US government agency that investigates transportation accidents).

Introduction

John gives us an overview of what he and the other folks at Adaptive Capacity Labs are working on.

State of the Industry

John talks about the state of the industry around incident response. Learning from incidents is happening, but are organizations supporting it? Are people finding it helpful? Expertise is coming from inside the house, in that software practitioners are getting better at coping with the complexity of their systems. There is still work to do on how teams learn from their incidents and postmortems. Are the artifacts generated by these exercises used after they are created, or do they just become a museum of past incidents?

Who Are the Incident Nerds?

There is an emerging community of folks who are really enthusiastic about learning from other organizations’ incidents, in software and across different kinds of industries. But many incident reports are still written to be filed rather than written to be read, and they leave out important aspects of an incident. Beyond whatever triggered the incident, there is more to learn from, such as how responders weighed potential fixes or found the information they needed.

Thinking about incident reports as a story, what elements make it a good story? What did the team struggle with? What was hard about it?

What are We Learning from Other Industries?

A number of “safety critical” domains have a longer history than software development of dealing with incidents. Some of those domains have different constraints, such as challenges in gathering data and legal ramifications. What will the future look like, and can software development avoid some of those constraints?

How do incidents affect not just the employees on the teams managing them, but also the public and the consumers who experience an outage? Are consumers able to make informed decisions about a company based on how incidents are handled?

As a learning exercise, do your new employees take the opportunity to read past incident reports and then ask questions?

Debunking a Myth

John deflects the question about root cause. You’ll just have to check Twitter.

A common belief that can lead people to an incorrect outcome is that an incident has a single story. In practice, different people with different perspectives see the same incident differently. There is not one true, universal story that everyone will get from reading an incident analysis.

The goal shouldn’t be only for the person doing the analysis to understand what happened, but also to help future readers understand what happened.

What Do You Wish You’d Known Earlier in Your Career?

John talks about the practice of software engineering and the certainty that things have changed and will change. Every assumption about the best way to do something should be open to question.

What Are You Glad We Didn’t Ask?

John talks a bit about incident command frameworks and refers to Laura Maguire’s research on the costs of coordination, noting that the costs associated with robust incident response are easy to forget. Laura’s talk is linked in the Additional Resources below.

Additional Resources

Episode source