Sigh
I’m opening with a sigh because several months of flipping switches, wiresharking, and scrolling through logs did not teach me much. I have coding experience reaching back to the days of the Nintendo 64, and yet most of it proved useless to me this time. Was I lost in the jungle of cryptic frameworks? Bushwhacking through the brush of spotty documentation? That’s what I kept asking myself. But as I watched the bits and bytes pirouette across my terminal window, I mused on how this intractable exercise left my wheels spinning rather than my brow perspiring.
What I didn’t know was that I had come face-to-face with a bug that would teach me a lot about myself.
Architectural Complexity
Our enterprise application runs a handful of microservices in the cloud, bound together with gRPC, an open-source remote procedure call framework originally developed at Google that lets the services communicate with one another over the network. One of these microservices consumes events produced by the other microservices.
Too many events were spilling out into the void of cyberspace, and there was no clear indication why. All I had to work with was a single mysterious message, repeated over and over again, ad nauseam. It was the same message that would torment me for the next several months no matter what I threw at it:
Upstream producer failed with exception, removing from MergeHub now
This proved to be only a tiny piece of a larger puzzle. I learned a great deal about the frameworks involved, but little of that proved useful in diagnosing and solving this elusive problem.
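That log line comes from MergeHub, the piece of Akka Streams that lets many short-lived producer streams funnel into a single, shared consumer. The sketch below is a heavily simplified picture of that wiring, with invented names rather than our production code: when one of the attached producer streams fails, the hub detaches it, logs a message like the one above, and anything that producer had not yet delivered is lost.

```scala
// A minimal sketch of the wiring (Akka Streams), not our production code; all
// names here are invented for illustration.
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{MergeHub, Sink, Source}

object MergeHubSketch extends App {
  implicit val system: ActorSystem = ActorSystem("sketch")

  // The shared consumer: everything the attached producers emit is merged
  // into this single sink.
  val consumer: Sink[String, NotUsed] =
    MergeHub.source[String](perProducerBufferSize = 16)
      .to(Sink.foreach(event => println(s"consumed: $event")))
      .run()

  // A short-lived producer stream that blows up mid-flight. The hub detaches
  // it, and whatever it had not yet delivered is gone.
  Source(1 to 100)
    .map { i =>
      if (i == 42) throw new RuntimeException("boom") // simulated failure
      s"event-$i"
    }
    .runWith(consumer)
}
```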
Virtual Ghost Town
There wasn’t a great deal of documentation available for the frameworks involved, and what was available was often wrong. Furthermore, community support forums were sparse ghost towns. I was fortunate to get any reply at all. In retrospect, I found that no documentation was more helpful than poor documentation, and there was no substitute for someone willing to listen, understand, and provide guidance.
I found that no documentation was more helpful than poor documentation
I was on my own. Either that, or I was exhausted and disillusioned by googling. I can only think of a handful of times during my career when web searching was no help to me, and this was one of them. It took me several months to solve the problem because of how exhausting and demoralizing it was. I kept having to set it aside and work on something else so that I could experience a win every now and then.
Reproducing the Issue Locally
Running our services locally did not trigger the issue, but I knew that if I were to make any significant progress diagnosing the problem, I would have to reproduce it locally. Eventually, I was faced with a conundrum: continue working with the existing code, tweaking it as necessary in an attempt to reproduce the issue, or invest some time in writing custom producers and consumers based on the original code. The latter offered a great deal more flexibility and control, but I was dissuaded for some time by the opportunity cost of writing two brand-new throw-away components.
Ultimately, I bit the bullet and wrote the custom sub-modules despite feeling hopeless about it and doubtful that it would change anything.
I was wrong. It changed everything.
Not only was I able to reproduce that stupid error message, but I was able to reproduce it reliably. I just needed enough volume to overwhelm the consumer, like water through a burst dam.
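The throw-away producer I ended up writing boiled down to something like the sketch below (again simplified, with invented names and numbers rather than our real code): attach a pile of short-lived producer streams to the shared hub and push enough events through them to mimic the production burst.

```scala
// A rough sketch of a throw-away load generator (Akka Streams); the names and
// numbers are invented for illustration.
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{MergeHub, Sink, Source}

object FloodSketch extends App {
  implicit val system: ActorSystem = ActorSystem("flood")

  // Stand-in for the real consumer; Sink.ignore simply discards the events.
  val consumer: Sink[String, NotUsed] =
    MergeHub.source[String](perProducerBufferSize = 16)
      .to(Sink.ignore)
      .run()

  // Thousands of short-lived producer streams, each emitting a burst of events,
  // to recreate the production volume locally.
  (1 to 5000).foreach { producerId =>
    Source(1 to 1000)
      .map(n => s"producer-$producerId event-$n")
      .runWith(consumer)
  }
}
```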
Despite finally managing to reproduce the issue locally, I still lacked clues. There was nothing obvious in the code, but it was clear that the framework didn’t like something about it.
Alone in the Woods of Static Analysis
One of the most laborious and mentally exhausting exercises is the static analysis of code to understand how it works. This was the next step in diagnosing the root cause of the error message. One of the quickest ways to hang yourself with this approach is to make assumptions. In retrospect, I didn’t even realize how many assumptions I was making while reverse-engineering the third-party framework that generated the error. I had to constantly remind myself that you don’t know what you don’t know, which became a personal mantra for inductive reasoning.
I spent weeks in careful static analysis, often reviewing the same code paths more than once or twice. I had a couple of doubts floating around in my head while I was on this quest—doubt that I would find any reliable answers, and doubt that I understood how the framework actually worked. With time, I enumerated a few potential outcomes (given certain inputs), and one led to a surprising discovery.
Eureka
After several months, I cracked the case, and it was a two-line fix. The consumer kept shutting down these short-lived streams before it had consumed all of the events contained within them. The behavior stemmed from an incorrect assumption made by an engineer who was no longer with us, and I had glossed over his code because I trusted that he had understood it correctly. This was another important lesson I learned: never skip over code written by other engineers and assume that it works as intended. Normally, code review would stop bugs like this in their tracks.
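To make “shutting down the streams too early” concrete, here is an illustration of that class of bug, not our actual code or the actual two-line change: a stream is torn down (here with a KillSwitch) before it has finished draining, so any events still in flight are silently dropped. The generic remedy is to wait on the stream’s completion signal instead of killing it eagerly.

```scala
// Illustrative of the bug class only (Akka Streams); invented names, not the
// actual production fix.
import akka.actor.ActorSystem
import akka.stream.KillSwitches
import akka.stream.scaladsl.{Keep, Sink, Source}

object EarlyShutdownSketch extends App {
  implicit val system: ActorSystem = ActorSystem("sketch")
  import system.dispatcher

  val (killSwitch, done) =
    Source(1 to 1000)
      .viaMat(KillSwitches.single)(Keep.right)
      .toMat(Sink.foreach(n => println(s"handled $n")))(Keep.both)
      .run()

  // The bug class: shutting the stream down eagerly, before every element has
  // been handled; whatever is still in flight never reaches the sink.
  killSwitch.shutdown()

  // The safer pattern: wait for `done` (stream completion) before tearing down.
  done.onComplete(_ => system.terminate())
}
```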
Lessons Learned
To wrap up, here are the top three lessons I learned from this epic saga.
- No documentation is more helpful than poor documentation.
- Sometimes it’s worth biting the bullet and absorbing the opportunity cost of writing throw-away components that reproduce a complex problem in order to troubleshoot it.
- The saying “you don’t know what you don’t know” is a great reminder to be more careful about making assumptions.
These three lessons have one thing in common: incorrect assumptions lead to incorrect conclusions.
Conclusion
I hope my story helps other developers who are stuck on a difficult, intractable, flaky problem that isn’t easy to reproduce. If I were to do it all over again, I would periodically and thoroughly review the assumptions I was making, because if there’s anything I’ve learned from this journey, it’s that incorrect assumptions eventually lead to incorrect conclusions.
Top comments (1)
Great article. Thank you for sharing your experience and insight.
How were the unit tests looking for the problem repo? Were you able to lean on them to help you troubleshoot? Other than documentation improvements, can you think of any other practices that might have avoided that particular issue to begin with, or that would have been helpful?
I'm always looking for general ways to improve and protect code quality. Especially given the complexity and time investment required, your scenario sounds like a good one to look at from that perspective.