Production incidents may be the worst kind of lean IT waste.
Let's stop having them!
Paste this as the content of meeting invites to keep everyone informed about what a Blameless Post Mortem is and why we should always conduct one.
- Dissect the events as we understand them (timeline)
- Discuss actionable steps that can be taken to ensure this error (as we understand it) does not happen again
- List of actionable ideas (stories/epics)
- NOT: “pay attention!” or “be more careful!”
- Follow-up meeting to observe progress
If you have to do something manually more than once, automate it so you never have to do it again.
Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.
Lack of understanding of how the accident occurred all but guarantees that it will repeat, if not with the original engineer, then with another one in the future.
Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
- what actions they took at what time,
- what effects they observed,
- expectations they had,
- assumptions they had made,
- and their understanding of the timeline of events as they occurred,
- and that they can give this detailed account without fear of punishment or retribution.
1. Engineer takes action and contributes to a failure or incident.
2. Engineer is punished, shamed, blamed, or retrained.
3. Trust is reduced between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat.
4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment).
5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure, due to the silence mentioned in #4 above.
6. Errors become more likely and latent conditions can’t be identified, due to #5 above.
7. Repeat from step 1.