A few weeks ago, replying to a tweet about an elusive error, I realized that I hadn't seen much online advice about how to deal with this kind of bug.
What kind, you ask? Well, the kind that always stings you in production, but hides when you try to debug your code.
So I thought, should I try to write something about it? And here we are!
This article won't provide the silver bullet that kills all Heisenbugs - unfortunately, there is no such thing. It aims to describe the situations where these shifty errors can arise, so that you can try to avoid them in the first place, and then to discuss a few tips for hunting them down and saving the day.
Strictly speaking, a Heisenbug is that kind of bug that disappears when you are debugging it 😨.
Yeah, I know, that sucks... it's the worst kind of bug.
A looser definition, though, also includes all those bugs that appear and disappear (apparently) at random, even in production.
These are also pretty nasty to solve.
A bug can behave like a Heisenbug for several reasons. When it is only noticeable in production but hard to reproduce while debugging, it's usually the context that makes the difference:
- Timing: if, while debugging, you run instructions line by line, the delay between operations might be orders of magnitude larger than in production; this is especially relevant when the Heisenbug is caused by the interaction of different threads.
- Memory: while debugging, the addresses of variables might change; running code compiled without optimization might also cause some variables to be moved from registers to RAM, and this, in some languages/compilers, can affect the precision used for floating point comparisons.
- Assertions: in production, or whenever your code is compiled or run with optimization options enabled, assertions are usually disabled, while in a local dev build they are likely to be active (I have been burnt by this difference many times, especially with C++ and Python). Evaluating assertions might have side effects (though you should take care that they have none) or simply affect the timing of execution.
- Side effects of debugging: adding logging or print statements changes the likelihood of the "wrong" interleaving between threads; evaluating the expressions in the debugger's watches can have the same effect or, even worse (if you are not careful), have side effects of its own.
- Latency: if you are debugging locally, or with mocked services, the latency of synchronous calls will be orders of magnitude smaller than in production.
- Race conditions: under the looser definition of Heisenbug, concurrent executions are highly non-deterministic (from the developer's point of view), because the execution depends on the operating system and on the resources available at runtime. Many of these bugs only happen when certain edge conditions are met.
- Randomization: if you use randomization in your code, that might also cause inconsistencies across different runs, and the right conditions for your bug to happen might emerge discontinuously - for instance, in sorting a random list, edge cases (like already-sorted input, or empty input) might emerge only in rare cases.
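To make the assertion pitfall concrete, here is a minimal Python sketch. The `take_job` function and its job list are hypothetical, made up for illustration. The built-in `compile()` accepts an `optimize` argument, where `0` keeps assertions and `1` strips them, roughly as running `python -O` would, so both behaviours can be compared in a single process.

```python
# Hypothetical function: the pop() that does the real work is hidden in an assert.
SOURCE = '''
def take_job(jobs):
    assert jobs.pop(0) == "warm-up"   # side effect inside the assertion
    return jobs
'''


def load_take_job(optimize):
    """Compile SOURCE with assertions kept (optimize=0) or stripped (optimize=1)."""
    namespace = {}
    code = compile(SOURCE, "<heisenbug-demo>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace["take_job"]


# Same input, two builds of the same source.
debug_result = load_take_job(0)(["warm-up", "resize", "upload"])
optimized_result = load_take_job(1)(["warm-up", "resize", "upload"])

print(debug_result)      # ['resize', 'upload']
print(optimized_result)  # ['warm-up', 'resize', 'upload']
```

With assertions active the warm-up job is consumed; with assertions stripped it is still in the list, so the bug only shows up in the optimized build. Moving the `pop(0)` onto its own line and asserting on its result removes the difference.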
- Narrow down the portion of code where the bug happens. It's not always easy, because a race condition can cause an error to show up later in the execution, but you should try to focus on the smallest portion of code possible.
- If you are running a data-transformation pipeline, store intermediate results at every step and compare them over multiple executions (or between local and production runs).
- Check your code for random generators: try to test it by mocking the random generators, and track down the random output for both the happy and buggy runs.
- Double check your input: sometimes different results are caused by slightly corrupted inputs - it's kind of a long shot, but by checking upfront you either rule it out early or, in the lucky case that this was the cause, avoid wasting hours debugging your perfectly working code.
- Make sure that your patch fixed the error: with a Heisenbug it might be difficult to ensure that the bug was truly solved and not simply masked.
- Run performance tests to identify bottlenecks. Do this upfront too, before a bug even shows up, to decide which critical areas need to be optimized and whether it's worth parallelizing execution (since concurrency makes your code more fragile, limit it to the critical areas where you really get a significant improvement).
- Run stress tests to identify potential edge cases under extreme conditions. Remember that testing can only show the presence of bugs, never prove their absence: stress tests help lower the chance of unnoticed (heisen)bugs unexpectedly popping up in production.
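Two of the tips above, storing intermediate results and mocking the random generators, combine nicely. The following is only a sketch under stated assumptions: the three-step pipeline, the `FirstItems`/`LastItems` stand-ins for a random generator, and all helper names are hypothetical. Each run records a short hash of every step's output, so two runs can be compared to find the first step where they diverge.

```python
import hashlib
import json
import random


def fingerprint(value):
    """Short, stable hash of a JSON-serializable intermediate result."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_run(data, steps, rng):
    """Run every step in order and record a fingerprint after each one."""
    trace = []
    for name, step in steps:
        data = step(data, rng)
        trace.append((name, fingerprint(data)))
    return trace


def first_divergence(trace_a, trace_b):
    """Name of the first step whose output differs between two runs, or None."""
    for (name, hash_a), (_, hash_b) in zip(trace_a, trace_b):
        if hash_a != hash_b:
            return name
    return None


class FirstItems:
    """Mocked generator: sample() always returns the first k items."""

    def sample(self, population, k):
        return list(population[:k])


class LastItems:
    """Mocked generator: sample() always returns the last k items."""

    def sample(self, population, k):
        return list(population[-k:])


# Hypothetical pipeline; only the "sample" step depends on the generator.
STEPS = [
    ("clean", lambda rows, rng: [r for r in rows if r is not None]),
    ("sample", lambda rows, rng: rng.sample(rows, k=3)),
    ("total", lambda rows, rng: sum(rows)),
]
ROWS = [4, None, 8, 15, 16, None, 23, 42]

# Two runs with the same seed are reproducible...
same_seed = first_divergence(
    record_run(ROWS, STEPS, random.Random(42)),
    record_run(ROWS, STEPS, random.Random(42)),
)
# ...while two runs with different "random" choices diverge at a known step.
mocked = first_divergence(
    record_run(ROWS, STEPS, FirstItems()),
    record_run(ROWS, STEPS, LastItems()),
)

print(same_seed)  # None
print(mocked)     # sample
```

In a real pipeline the traces would probably be written to disk or to a log, so that a production run can be compared with a local one after the fact.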
Use ad-hoc algorithms to detect potential race conditions:
- The lockset algorithm reports a potential race condition when shared memory is accessed by two or more threads without the threads holding a common lock. It might report false positives.
- The "happens-before" algorithm is based on the partial ordering of events (i.e. any instruction, including reads/writes and locks) in distributed systems, within and across threads: if two or more threads access a shared variable, and the accesses are not deterministically ordered by the "happens-before" relationship, then it reports that a race has occurred. This algorithm generates very few false positives, but it's sensitive to the order of execution, so you might need to run it several times before catching a race condition that's causing a Heisenbug.
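To give a feel for the lockset idea, here is a deliberately simplified sketch that works on a pre-recorded trace of accesses rather than on a live program. The trace format, the variable names and the `find_lockset_violations` helper are all assumptions made for illustration, and real detectors track much more state than this. For each shared variable it keeps the set of locks held on every access so far, and flags the variable once that set is empty and more than one thread has touched it.

```python
def find_lockset_violations(accesses):
    """Flag variables accessed by 2+ threads with no lock common to all accesses.

    `accesses` is an iterable of (thread, variable, locks_held) tuples,
    in the order in which the accesses were observed.
    """
    candidate_locks = {}  # variable -> locks held on every access so far
    threads_seen = {}     # variable -> threads that accessed it
    flagged = []
    for thread, variable, locks_held in accesses:
        held = frozenset(locks_held)
        if variable in candidate_locks:
            candidate_locks[variable] &= held  # keep only the common locks
        else:
            candidate_locks[variable] = held
        threads_seen.setdefault(variable, set()).add(thread)
        shared = len(threads_seen[variable]) > 1
        if shared and not candidate_locks[variable] and variable not in flagged:
            flagged.append(variable)
    return flagged


# Hypothetical trace of a small program.
TRACE = [
    ("T1", "balance", {"account_lock"}),
    ("T2", "balance", {"account_lock"}),  # always protected by the same lock
    ("T1", "counter", {"stats_lock"}),
    ("T2", "counter", set()),             # T2 forgot to take stats_lock
    ("T1", "cache", set()),               # unprotected, but single-threaded
    ("main", "config", set()),            # written before the worker starts...
    ("worker", "config", set()),          # ...so safe in practice, yet flagged
]

violations = find_lockset_violations(TRACE)
print(violations)  # ['counter', 'config']
```

The `config` entry illustrates a false positive: if the main thread only writes it before starting the worker, the accesses can never overlap, but a check based purely on locks cannot know that.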
- Reverse debugging: the ability of a debugger to stop after a failure in a program has been observed, and go back into the history of the execution to uncover the reason for the failure.
- Use ad-hoc debugging tools for race conditions. A few examples (there are many more):
- Use a reverse debugging tool:
The term "bug" was popularized by computer pioneer Grace Hopper, who was working on early electromechanical computers, the Mark II and Mark III.
The story goes that, back in the early days of the Mark II, Hopper traced an error in the machine's operation down to a moth trapped in a relay, which was then carefully removed and taped to the log book (see the post's cover image).
That's why, from that first actual bug, today we call errors in code "bugs".