This post is the third in a series about the rules of debugging.
Adopting a methodical strategy for identifying and fixing errors is essential. The first and most important phase of this approach is making the failure occur consistently. Direct your debugging efforts so that each test you run moves you incrementally closer to a solution.
In his seminal book Code Complete, Steve McConnell draws parallels between debugging and the Scientific Method, the process of discovery and demonstration necessary for scientific investigation.
Based on the Scientific Method, an effective debugging process would consist of the following steps:
- Stabilize the error
- Locate the source of the error
- Fix the defect
- Test the fix
- Look for similar errors
As you can see, the first step hinges on repeatability. Diagnosing a defect is far more manageable when it can be stabilized so that it occurs consistently and reliably. A defect that cannot be reproduced reliably is a formidable challenge, and making an intermittent defect predictable ranks among the most demanding tasks in debugging.
Stabilize the Error
An error that manifests unpredictably is typically rooted in either initialization discrepancies or timing irregularities. The process of stabilizing such an error encompasses more than merely identifying a test case that triggers the error. It entails refining the test case to its simplest form while still yielding the error.
There are three primary reasons why you want to make it fail consistently:
- To see it fail, you have to be able to make it fail, and as regularly as possible.
- Knowing under exactly what conditions it fails helps you focus on probable causes.
- Once you think you've fixed the problem, having a surefire way to make it fail gives you a surefire test of whether you actually fixed it.
So how can you induce a failure? A straightforward approach is to use the system under normal conditions and watch for incorrect behavior. Naturally, this is just testing; the crucial part is being able to reproduce the failure repeatedly, beyond the initial occurrence. A well-documented testing procedure helps, but the key mindset is that a single failure is not enough.
Examine your actions and replicate them deliberately, recording each action as you go. Then follow your documented procedure to verify that it consistently triggers the error. In some scenarios, however, causing the failure could do damage or have other undesirable consequences; in those cases it may not be prudent to keep failing in exactly the same way. Adjust the procedure to limit the potential damage while preserving the original system and sequence as closely as possible.
Frequently, the necessary actions are concise and minimal. On occasion, the order of events might be uncomplicated, yet a substantial amount of preparatory work is essential.
Bugs can depend on a complex state of the machine
Due to the potential for bugs to rely on intricate machine states, it is imperative to meticulously observe and document the machine's state before initiating your sequence.
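As a concrete illustration, here is a minimal Python sketch of capturing such a state snapshot before each run. The fields shown (interpreter version, platform, working directory, PATH) are placeholder assumptions; which parts of the machine state actually matter depends on your system:

```python
import datetime
import os
import platform
import sys


def machine_state():
    """Record state the bug might depend on, so failing and passing
    runs can later be compared field by field."""
    return {
        "timestamp": datetime.datetime.now().isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "cwd": os.getcwd(),
        "env_path": os.environ.get("PATH", ""),
    }


# Take the snapshot just before starting the failure sequence.
state = machine_state()
```

Saving one such snapshot per run (for example, as a line of JSON in the debug log) costs little and preserves conditions you did not realize were relevant.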
If the failure sequence involves numerous manual actions, streamlining the process through automation can prove advantageous.
In numerous instances, the failure manifests itself only after a significant number of iterations, making it beneficial to employ an automated testing mechanism throughout the night.
Distinguishing between inducing failure (good) and simulating failure artificially (bad) is crucial. It is acceptable to replicate the circumstances that lead to failure, but avoid artificially reproducing the failure mechanism itself.
In instances involving intermittent bugs, you might speculate that a specific underlying mechanism is responsible for the failure. In response, you could construct a configuration that exercises this mechanism and subsequently observe a higher frequency of failures. Alternatively, if you encounter a bug discovered at a remote location, you might attempt to establish a comparable system in your own environment. In either scenario, your objective is to simulate the failure – essentially, to recreate it – albeit through an alternate approach or on a distinct system.
Don't simulate the failure
When attempting to deduce the failure mechanism, simulations frequently prove ineffective. This is typically due to either an inaccurate assumption or the alteration of conditions during testing. As a result, your simulated setup might exhibit consistent flawless performance or, more problematically, encounter a fresh failure mode that diverts your attention from the original bug you were initially investigating.
You have enough bugs already; don't try to create new ones.
You're already dealing with enough bugs; there's no need to intentionally introduce new ones. Use instrumentation to examine the source of the issue, but refrain from altering the underlying mechanism, since that mechanism is the very cause of the failure.
Attempting to replicate a bug by inducing it on a similar system can be beneficial, but within certain boundaries. If a bug can be reproduced on multiple systems, it signifies a design flaw rather than an isolated system malfunction. Recreating the issue on specific configurations while excluding others aids in narrowing down potential root causes. However, if you encounter difficulty in swiftly reproducing the bug, avoid modifying your simulation to force its occurrence. Doing so would result in generating new configurations rather than examining a copy of the one that failed.
When dealing with a system that experiences regular or intermittent failures, focus your efforts on addressing the problem within that specific system and configuration.
Remember, this does not imply that you should avoid automating or intensifying your testing to trigger the failure. Automation can expedite the occurrence of intermittent issues, while intensification can make subtle problems more evident. Both methods contribute to provoking the failure without artificially simulating the malfunctioning mechanism. Any adjustments made should operate at a higher level, not altering how the system fails, but rather influencing its frequency. However, exercise caution to prevent excessive modifications that could potentially introduce new complications.
Sometimes the failure depends on an uncontrolled condition that occurs only sporadically. The challenge of "Making it Fail" becomes significantly greater when the failure is intermittent. Many intricate problems exhibit intermittent patterns, which occasionally tempts us to abandon this principle because of its inherent difficulty. You might understand precisely the steps that led to the initial failure, yet reproducing it consistently remains elusive: it occurs only once in every five, ten, or even a hundred attempts.
The critical aspect to recognize is that while you possess a clear understanding of the actions that triggered the failure, you lack exhaustive knowledge of all the precise conditions. Unnoticed or uncontrollable factors invariably play a role. Gaining mastery over these diverse conditions equips you to induce the failure consistently. However, circumstances may arise where certain conditions remain beyond your control.
It's crucial to possess the ability to closely examine the occurrence of the failure. In situations where the failure isn't consistent, you must thoroughly analyze it whenever it does happen, disregarding the numerous instances when it doesn't. The pivotal strategy involves gathering comprehensive data during each run, allowing you to conduct an in-depth analysis once the failure has been confirmed. This can be achieved by generating extensive system output while it's operational and archiving this information within a designated "debug log" file.
By scrutinizing the accumulated data, you can readily contrast a failed run with a successful one. If you manage to capture the relevant information, you'll likely discern discernible distinctions between instances of successful execution and those resulting in failure. Pay special attention to the factors unique to the failure cases. These distinctions form the basis of your investigation during the debugging process.
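A small Python sketch of that comparison, using the standard library's difflib; the log lines here are invented placeholders for whatever your real debug logs contain:

```python
import difflib

# Hypothetical log excerpts; in practice these would be read from the
# "debug log" files archived during each run.
good_run = ["init ok", "cache warm", "request handled", "shutdown clean"]
bad_run = ["init ok", "cache cold", "request handled",
           "timeout waiting for lock"]

# Lines that appear only in the failing run are the first things to
# investigate.
diff = difflib.unified_diff(good_run, bad_run, lineterm="",
                            fromfile="success.log", tofile="failure.log")
suspects = [line[1:] for line in diff
            if line.startswith("+") and not line.startswith("+++")]
print(suspects)  # ['cache cold', 'timeout waiting for lock']
```

For noisy logs you would normalize timestamps and other expected variation first, so that the diff surfaces only the differences that matter.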
Even in scenarios where the failure occurs intermittently, this approach enables you to identify and document the occurrences systematically, thereby allowing you to address them as if they were consistently happening.
The second rationale behind inducing failure is to gain insights into its underlying cause. With intermittent issues, you might begin noticing patterns in your actions that appear connected to the failure. While this can be useful, it's important to exercise caution and not become excessively fixated on these patterns.
In cases of random failures, it is often infeasible to obtain a statistically significant number of samples to determine whether seemingly minor actions, such as clicking a button with your left or right hand, truly affect the outcome.
Frequently, coincidences may lead you to believe that one condition increases the likelihood of the problem compared to another. This can lead you down a path of investigating differences between conditions that might not actually be directly responsible for the issue, resulting in wasted effort.
However, this doesn't discount the possibility that the observed coincidental differences are somehow related to the bug. Yet, if these differences don't exert a direct influence, they can easily be overshadowed by other random factors, making it challenging to deduce a clear connection.
By accumulating a substantial volume of information, you can distinguish elements consistently associated with the bug from those never linked to it. These are the aspects you should concentrate on while exploring plausible causes of the problem.
Undeniably, randomness complicates the process of confirming a fix. For instance, if the test reveals a 10% failure rate, and your intervention reduces it to 3% — but you cease testing after 28 attempts — you might believe the issue is resolved, even though it isn't.
While employing statistical testing is beneficial, it's even more advantageous to identify a sequence of events invariably linked to the failure, even if the occurrence of the sequence is intermittent. When the sequence arises, failure is guaranteed. Therefore, after implementing a potential fix, continue testing until the sequence transpires. If the sequence manifests without the associated failure, you've successfully rectified the bug. Don't prematurely conclude after 28 attempts if the sequence hasn't surfaced yet.
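The arithmetic behind this caution is easy to check. Assuming failures are independent across runs with per-run probability p, the chance that a still-broken system happens to pass n runs in a row is (1 - p)^n:

```python
import math


def prob_all_pass(p, n):
    """Chance that a bug failing with probability p per run
    nevertheless passes n independent runs in a row."""
    return (1 - p) ** n


# A 10% failure rate passing 28 runs by pure luck is not that unlikely:
p_lucky = prob_all_pass(0.10, 28)  # roughly 0.05, about a 1-in-19 fluke


def runs_for_confidence(p, confidence=0.99):
    """Clean runs needed before you can be `confidence` sure that a bug
    still present at rate p would have shown itself at least once."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


n_needed = runs_for_confidence(0.10)  # 44 runs at p = 0.10
```

So at a 10% failure rate, 28 clean runs leave roughly a 1-in-19 chance the bug is still there; you need about 44 clean runs for 99% confidence, and even that assumes the failure probability did not change underneath you.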
It is easy to neglect the warning signs and dismiss the people, such as customers and quality assurance engineers, who insist that the bug exists. Your ego can stand in the way of understanding and perceiving the unknown nature of the bug. But you can't simply deny the possibility that it occurs.
Know that "that" can happen
Absence of evidence is not evidence of absence. Remember, people simply denied the existence of black swans until they saw one. The world of software and hardware is a kind of Extremistan, where unexpected events are highly probable. And when you do see a black swan, it's time to ditch your assumptions and devise a completely new strategy for finding more of them.
Occasionally, a testing tool can find application in various debugging scenarios. When designing such a tool, it's crucial to consider its potential for reuse and ensure it remains maintainable and adaptable. Achieving this entails employing sound engineering practices, creating comprehensive documentation, and incorporating it into your source code control system. Integrate it seamlessly into your systems to ensure accessibility in real-world scenarios. Refrain from treating it as a disposable tool; your initial assumption of its limited utility might be inaccurate.
Occasionally, a tool proves so invaluable that it could even be marketable; there are instances where companies have shifted their business focus upon realizing that the tool they developed possesses greater appeal than their original products. A tool's usefulness can extend beyond your initial expectations, presenting possibilities you might not have conceived of.
If you really have no idea what could be wrong, life gets tougher!
Start by ensuring that you can prompt the bug to manifest consistently. It is exasperating to pursue a bug that doesn't appear on every attempt. Invest time in crafting input data and configuration parameters that consistently trigger the issue, then package these steps into a streamlined procedure that can be run easily and automatically. When dealing with a challenging bug, you will find yourself reproducing it repeatedly as you unravel the underlying cause, so streamlining the reproduction process will ultimately save you valuable time.
When confronted with a bug that doesn't exhibit consistent behavior, devote effort to comprehending the reasons behind its intermittent nature.