Causing a production bug is a rite of passage for new developers. While I swear I do more good than harm, I’ve shipped more defects than I care to admit. Let me tell you about the worst bug I’ve ever caused – and what you can do to not make my mistakes.
As you might expect, it deals with threading.
Let’s rewind to a time when I was a Windows Presentation Foundation (WPF) specialist in an ASP.NET web development shop. I undertook a project to convert a WPF application to Silverlight so more users could run it in-browser and on non-Windows operating systems.
While WPF allowed synchronous web service calls, Silverlight required asynchronous web calls. To be fair, I should have been using async calls to begin with, but tech debt can be hard to shake.
Here’s the thing – I hadn’t done a lot of async code before. I’d done bits and pieces, but taking the entire web service layer and converting it to async communication was new to me at the time.
No big deal, right? As developers we routinely do stuff we’ve never done before. On top of that, we didn’t have all the nice language features we have today for async code.
The problem is that when you switch from sync to async communication, state management behaves differently.
Instead of one thread talking to your data model, you now have potentially multiple threads at once talking to the same objects. That means that either everything needs to be accessed by one thread at a time or these objects need to be designed with concurrency in mind.
In my case, the app would fire off a request or two to the web service, and the responses could now return in a random order or simultaneously, with each response handled by a new thread.
Because I was so new to async web service code, I didn’t know the best practices around multi-threading in .NET, and I didn’t have anyone more experienced in desktop development to turn to, so I did everything the best way I knew how.
Testing everything locally, it all worked fine. Similarly, we didn’t find anything during testing or staging. And so, with great fanfare, we pushed the code out to production.
What could go wrong?
When things hit production the errors started pouring in. Customers were angry, coworkers were irritated, and I was extremely confused as to why the code suddenly didn’t work.
I’ll give you a hint: Our testing environment was located on premises.
The extra latency talking to a remote server caused enough delays to make the async code behave differently in production.
Because of this change in behavior, items were being removed from dictionaries when one call completed, and then other threads tried to find them, leading to NullReferenceExceptions in catastrophic proportions.
As illustrated in the diagram above with a somewhat contrived example, if you’re not planning for calls to return in a potentially random order, your code may behave in ways you’re not expecting, including throwing errors.
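The failure mode was roughly the following, sketched here in Java rather than the original C#/.NET (the class and key names are hypothetical). A plain HashMap shared between response-handler threads lets one handler remove an entry while another still expects it to be there:

```java
import java.util.HashMap;
import java.util.Map;

public class ResponseRace {
    // Shared cache of pending requests -- a plain HashMap, which is
    // NOT thread-safe and assumes single-threaded access.
    static final Map<String, String> pending = new HashMap<>();

    public static void main(String[] args) throws InterruptedException {
        pending.put("order-42", "payload");

        // Handler A: its web-service call completed, so it removes the entry.
        Thread a = new Thread(() -> pending.remove("order-42"));
        a.start();
        a.join(); // force the unlucky interleaving deterministically

        // Handler B: still assumes the entry exists and dereferences it.
        Thread b = new Thread(() -> {
            String payload = pending.get("order-42"); // now null
            try {
                System.out.println(payload.length()); // dereferencing null
            } catch (NullPointerException e) {
                System.out.println("NullPointerException: entry was already removed");
            }
        });
        b.start();
        b.join();
    }
}
```

In production, extra network latency made interleavings like this one common; locally, responses arrived so quickly that the removal and the lookup almost never overlapped.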
Other engineers got involved and we went over the issues together, identifying problematic layers where defects clustered. Once we knew what types of issues were causing problems, we made the necessary changes. This restored service, and the incidents were eventually forgotten by our customers and, in time, by my coworkers.
We didn’t use the 7 basic tools of software quality for that investigation. However, the analytical approach we took of looking at all related defects was similar to that process.
Safe to say, we learned a number of lessons from this episode:
Obviously, my failure taught me a lot about threading.
When you’re working with threads, you need to put a greater degree of care into the design of your code. In particular you need to think about:
- Which threads will work with which objects
- How objects should handle concurrent behavior
- How you will test multi-threaded code
- The overhead needed to create threads
Thread safety is too large of a topic to cover in this short section. That said, you should generally adopt one of the following strategies:
- Use a lock object to synchronize threads and restrict objects to being used by one thread at a time
- Explicitly design key objects to be thread-safe, or use built-in concurrent collections
- Hand off work to a dedicated thread that is the only thread working with the relevant objects
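A minimal sketch of those three strategies, again in Java since I can't easily show runnable C# here (the .NET equivalents would be the lock statement, ConcurrentDictionary, and a dedicated dispatcher thread); the map contents are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SafetyStrategies {
    public static void main(String[] args) throws Exception {
        // Strategy 1 -- lock: every access to the shared map goes through
        // the same monitor, so only one thread touches it at a time.
        Map<String, String> guarded = new HashMap<>();
        Object gate = new Object();
        synchronized (gate) {
            guarded.put("order-42", "payload");
        }

        // Strategy 2 -- thread-safe collection: ConcurrentHashMap tolerates
        // concurrent readers and writers without external locking.
        Map<String, String> concurrent = new ConcurrentHashMap<>();
        concurrent.put("order-42", "payload");
        concurrent.remove("missing-key"); // safe even under concurrent reads

        // Strategy 3 -- dedicated thread: funnel all access through one
        // single-threaded executor so only that thread ever owns the map.
        ExecutorService owner = Executors.newSingleThreadExecutor();
        Map<String, String> owned = new HashMap<>();
        Future<String> result = owner.submit(() -> {
            owned.put("order-42", "payload");
            return owned.get("order-42");
        });
        System.out.println(result.get()); // prints "payload"
        owner.shutdown();
    }
}
```

Each strategy trades something away: locks add contention, concurrent collections only protect individual operations (not multi-step sequences), and a dedicated thread serializes all work on those objects.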
We can’t test our software from every customer’s location or even using their data. That said, there are a few things that can reduce the risk of shipping new code.
I strongly recommend considering a preview environment based on production data before major releases. This can help identify issues specific to different regions or sets of data before deploying features for real. Of course, you’ll still need to convince users to try this environment.
We would have found these problems before release if we had tested against a cloud-hosted server. In short, make your development and testing environments look like production. You’ll find more issues before you release, and your users will appreciate it.
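When you can't fully reproduce production's network, one cheap mitigation is to inject artificial, randomized latency into the service layer of test builds so responses arrive in a different order on every run, much as they would over a WAN. A sketch of that idea (the names and delay values are hypothetical):

```java
import java.util.Random;
import java.util.concurrent.CompletableFuture;

public class LatencyInjector {
    static final Random random = new Random();

    // Wrap a fake web-service call with a random delay so that completion
    // order varies between test runs, surfacing ordering assumptions.
    static CompletableFuture<String> callService(String request) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(random.nextInt(200)); // simulated network latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "response:" + request;
        });
    }

    public static void main(String[] args) {
        // Fire two requests; which completes first is now nondeterministic.
        CompletableFuture<String> a = callService("a");
        CompletableFuture<String> b = callService("b");
        CompletableFuture.allOf(a, b).join();
        System.out.println(a.join() + " " + b.join());
    }
}
```

This is not a substitute for a production-like environment, but it makes race conditions far more likely to show up on a developer's machine instead of in front of customers.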
You need to find someone you can talk through code with; it brings better results long-term. This is valuable even if the other person never gives you a word of feedback: the act of walking another developer through a problem opens up new ways of thinking about the code.
This has the side effect of getting other developers familiar with what you do. This makes it easier for others to help with deadlines or investigate issues.
Involving others also makes it easier to go on vacation without worrying. Your organization needs to function if you’re ever sick or move on, so involve them now.
In my case, I felt that I had found a resolution to my application issues and that the bugs people were encountering were rare and caused by data oddities rather than a larger, systemic problem. Because I didn’t want to believe that my code was bad, I waited longer than I should have to investigate early reports of problems.
In particular, fear confirmation bias when testing your own code. In my case, I’m not sure what I could have done to replicate those issues without an off-site server, but at my core, I wanted and expected my code to work properly, so I might have shrugged off any oddities encountered occasionally in testing.
In my case I didn’t know the full extent of what I didn’t know. I know I encountered enough issues during development that I should have raised a red flag and sought out some technical training on threading.
Leading up to this release I was working at an incredibly fast pace, adding feature after feature to meet the organization’s needs and get to market. While I understood that my pace was unsustainable, I assumed the organization also understood the quality risks we were taking by working that way.
They did not.
It’s tempting to blame this on the organization, but the first responsibility for communicating quality and technical-debt risk lies with the development team. In this case, I assumed others knew or were thinking what I was thinking, and that was wrong.
Even though my code wasn’t necessarily bad, the knowledge gap I encountered mixed with the lack of a proper testing strategy led to a severe influx of bugs.
Beyond that, the bugs manifested primarily as NullReferenceExceptions, which carry a stigma of being associated with beginner developers.
It took years to shed the perception of being a sub-par programmer that this set of mistakes created, and my career suffered for it as well.
When it was all said and done, we released a product that met our users’ needs and filled critical strategic gaps for the organization, helping us continue to expand and grow.
Additionally, the success of the application, while no longer fully my own, is something I am proud of to this day, and my imperfections taught me much about software development, architecture, testing, and communication.
What was your first production bug? What was your worst? No company specifics please, but I’d love to hear how you’ve learned from your mistakes.