Beyond Why: 3 Fallacies of Root Cause Analysis

#devops #observability #incidentmanagement #softwareengineering

One day I was watching a presentation about a generic approach to solving problems. During the presentation, the engineer showed some pseudocode to demonstrate his algorithm for finding the root cause, which went something like this:

// FindRootCause keeps asking "why" until the answer is deemed philosophical.
func FindRootCause(p Problem) Problem {
    if !isPhilosophical(p) {
        return FindRootCause(Why(p))
    }
    return p
}

This truly bothered me, so I took the time to understand why, and how I’d go about it instead. The first thing that stood out to me is the “Is it philosophical?” question as the recursion’s stop condition. There are several problems with this. First, if your problem is indeed caused by following the wrong philosophy, you are going to find yourself in an infinite loop. Is this question well defined? Can you easily tell a philosophical question from a “normal” one? Is it consistent, meaning that the answer is the same no matter who is asking? Another key problem for me is that it rests on the assumption that philosophical questions are useless and shouldn’t be asked. I couldn’t disagree more, but I understand where this comes from: we often see philosophy so detached from reality that it gets perceived that way. I’d suggest abandoning such philosophies, but I guess that’s a topic for another day.

But that was not my only problem with this algorithm. It also ignores how culture can influence this kind of discovery. I couldn’t get out of my mind a scenario in which a company with a toxic, finger-pointing culture goes through this process for a bad deployment. Why did the bad deployment happen? Because an important test was skipped. Why was the test skipped? Because John forgot it. Why did John forget it? Because he’s stupid. Fire John, problem solved, right? I know that this is an oversimplification, but I think it’s a good illustration of how bad this “keep asking why” approach can be.

All of this led me to the obvious questions. Now what? How do we fix this? How do we know when to stop asking why? How do we avoid getting lost in philosophical questions? I want to start by addressing 3 fallacies of this root cause analysis process.

All problems have a single root cause

Ask anyone during this process and they will say “Of course this happened for multiple reasons”. Still, the assumption sneaks in, most commonly by allowing only a single answer to each “Why?”. Allowing each question to have multiple answers broadens the set of possible solutions, but the cost of the investigation grows with it.
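
As a rough illustration of allowing multiple answers per “Why?”, the chain of causes becomes a tree. The sketch below is only one way to model that: the `Problem` struct, the `CollectCauses` function, the `maxDepth` budget, and the second-level causes in `main` are all hypothetical and not part of the original pseudocode.

```go
package main

import "fmt"

// Problem is a hypothetical node in a cause tree: one observation plus
// every answer that was given to "Why?" for it.
type Problem struct {
	Description string
	Causes      []*Problem
}

// CollectCauses walks the tree depth-first, at most maxDepth levels deep,
// and returns every cause it visits. maxDepth is an assumed budget that
// caps how much investigation we are willing to pay for.
func CollectCauses(p *Problem, maxDepth int) []string {
	if p == nil || maxDepth == 0 {
		return nil
	}
	var found []string
	for _, c := range p.Causes {
		found = append(found, c.Description)
		found = append(found, CollectCauses(c, maxDepth-1)...)
	}
	return found
}

func main() {
	// Illustrative data only; the second-level causes are made up.
	incident := &Problem{
		Description: "bad deployment",
		Causes: []*Problem{
			{Description: "an important test was skipped", Causes: []*Problem{
				{Description: "the test is a manual step"},
				{Description: "the release checklist does not mention it"},
			}},
			{Description: "no quality gate blocked the release"},
		},
	}
	for _, cause := range CollectCauses(incident, 2) {
		fmt.Println(cause)
	}
}
```

Every extra answer adds a branch, so the number of causes to investigate can grow quickly with depth. An explicit depth budget is one possible way to keep that cost visible.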

Every cause needs a single action

I think this shows up in two ways. The first is what I often perceive as a “someone has to pay for this” bias. Although I believe that a bias towards action can be beneficial, it doesn’t hurt to take time to reason about tradeoffs. Especially with a broad scope of solutions, it makes sense to ask “Should we solve this?” or “What’s the cost of this?”. A given scenario may need no action at all, and that option should be considered. The second is limiting each cause to one action. In the bad deployment example above, you might need to add extra steps to the code review process, improve release quality gates, and improve developers’ access to training about tests. It’s important to note that these are not alternative solutions: they are concurrent improvements that can all be made to the process.
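
One way to picture this is a cause that carries zero or more actions, each with a cost to weigh. This is a minimal sketch under assumptions: the `Action` and `Cause` types, the `TotalCost` helper, and the cost numbers (arbitrary units) are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Action is a hypothetical follow-up item. Cost is in arbitrary units
// (for example, person-days) and is an assumption of this sketch.
type Action struct {
	Name string
	Cost int
}

// Cause pairs one finding with zero or more actions. An empty Actions
// slice records a deliberate decision to do nothing.
type Cause struct {
	Description string
	Actions     []Action
}

// TotalCost sums the cost of every action across all causes.
func TotalCost(causes []Cause) int {
	total := 0
	for _, c := range causes {
		for _, a := range c.Actions {
			total += a.Cost
		}
	}
	return total
}

func main() {
	// The costs below are made up for the example.
	causes := []Cause{
		{Description: "an important test was skipped", Actions: []Action{
			{Name: "add a step to the code review process", Cost: 1},
			{Name: "improve release quality gates", Cost: 3},
			{Name: "improve access to training about tests", Cost: 5},
		}},
		{Description: "accepted risk, no action needed", Actions: nil},
	}
	fmt.Println("total cost:", TotalCost(causes))
}
```

Keeping the actions in a slice makes “no action” and “several concurrent actions” equally easy to express, and summing the costs puts the tradeoff question on the table.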

Solving the root cause will prevent this problem from happening again

Often, when people suggest doing a root cause analysis, the objective is to prevent the same problem from happening again and again. If we only ever treat symptoms, the problem will indeed keep coming back. But in complex scenarios, the root cause solution is often too specific to prevent it from happening again. In the same bad deployment example, solving the root cause often just means implementing a regression test. This is where most of the friction with other areas, like product or operations, happens: “I thought we wouldn’t have this again” can be quite a frustrating phrase when the symptom reappears for a different reason, a different root cause.

Now that we have named these fallacies, we can try to define a process that addresses them as best we can. I’m not going to provide pseudocode for it, but I’ll share some questions that can be used to improve the process. Hopefully, they are far more interesting than “Why?”.

What existing processes could have prevented or mitigated this? Why didn’t they?
What new processes could we implement to prevent this or reduce its impact? What’s the added cost?
Did we have similar problems recently? Are they related to the same processes?
How was the problem identified? Was that appropriate? Can we do that earlier, reducing the impact?
How do proposed solutions affect other areas of the company, like culture, knowledge sharing, etc.?

All that being said, we have one last thing to address: philosophical questions. To identify what makes these questions special, we need to go back to what we are trying to avoid. We want to prevent the process from going through endless unproductive discussions. So these questions are often described as complex, highly ambiguous, and very time- and energy-intensive to understand properly. I believe that is where leadership comes into play. Not exactly to solve these questions, but to provide the answer we are going for. That answer doesn’t need to be right, but if it is communicated properly, it makes these questions the easiest ones to answer. What happens if the leadership is just wrong? Well, you’ll stumble into this question so often that you’ll need to revisit it. In terms of the pseudocode, philosophical questions should be memoized.
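
To make the memoization idea concrete, here is a minimal sketch of a store for answers that leadership has already given. Everything in it is an assumption made for illustration: the `Decisions` type, its methods, the hit counter used as a “revisit” signal, and the example question are not part of the original pseudocode.

```go
package main

import "fmt"

// Decisions is a hypothetical memo of answers already given to recurring
// "philosophical" questions. hits counts how often each answer was looked
// up, as a rough signal that it may need revisiting.
type Decisions struct {
	answers map[string]string
	hits    map[string]int
}

// NewDecisions returns an empty memo.
func NewDecisions() *Decisions {
	return &Decisions{answers: map[string]string{}, hits: map[string]int{}}
}

// Decide records (or overwrites) the agreed answer and resets its counter.
func (d *Decisions) Decide(question, answer string) {
	d.answers[question] = answer
	d.hits[question] = 0
}

// Lookup returns the memoized answer, if any, and counts the lookup.
func (d *Decisions) Lookup(question string) (string, bool) {
	answer, ok := d.answers[question]
	if ok {
		d.hits[question]++
	}
	return answer, ok
}

// NeedsRevisit reports whether the answer was looked up at least
// threshold times since it was last decided.
func (d *Decisions) NeedsRevisit(question string, threshold int) bool {
	return d.hits[question] >= threshold
}

func main() {
	d := NewDecisions()
	q := "Do we favor release speed over stability?" // made-up question
	d.Decide(q, "stability first")

	for i := 0; i < 3; i++ {
		if answer, ok := d.Lookup(q); ok {
			fmt.Println("memoized answer:", answer)
		}
	}
	fmt.Println("needs revisit:", d.NeedsRevisit(q, 3))
}
```

A lookup ends the discussion immediately instead of reopening it, and the counter is one possible way to notice when the same question keeps coming back and the answer deserves another look.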

If you read all this and thought “Well, that’s just too complicated”, it’s because it is. A generic framework for solving any problem cannot be simple. This reminds me of the quote: “For every complex problem there is an answer that is clear, simple, and wrong.”
