🍿What was the most remarkable 🦗 bug you have ever had to fix 👩‍🔧?

Sebastian (lifelongthinker) ・1 min read

Share stories about the most remarkable bug you have ever had to fix.

Could be a really silly one, a very tough one, one that almost cost you your job, one that had incredible ripple effects, one that was never noticed by the user(s), a very funny one, or a downright embarrassing one.

Let's hear some good bug stories 🍿🍿🍿



Back in the agency days we had this client that was a real piece of work. He was very accomplished in his field and his ego was the only thing that rivalled the amount of data he was willing to cram into his personal website.

We had to make lots of unplanned changes due to him becoming suddenly unhappy with features that had been approved earlier, and management wouldn't do much because his account was rather big - and a retainer.

It got to a point where we stopped arguing and would numbly implement his requests, throwing any initial website strategy away and simply acting as voice-powered website implementation machines.

To add insult to injury, one day he asked for an autoplaying music playlist to be added on his website, in every single page, and that's where we get to the bug that I had to fix.

A few weeks after the implementation, he complained that the first song of said playlist wasn't repeating after the end. Now, I can't remember why adding a loop feature was so hard at the time, but instead of doing it I simply uploaded the same song 10 times. The first one would finish, and the next would start again, and so on.

It wasn't perfect, it wasn't particularly good for the users, but halfway through the project we realised that this website would only have one user, the client himself.


Awesome workaround 😁 Eat your own dogfood from the client perspective. Nice story, thank you so much!


Funny, thanks for sharing. I hate those kinds of clients.



I worked at Linden Lab (which runs the virtual world Second Life) for over five years. There were a ton of amazing bugs while I was there, because bugs involving virtual worlds of any kind are almost always hilarious. (Read the patch notes for The Sims if you want other examples.)

This is my favourite Second Life bug story. It happened while I was there, but I wasn't involved in fixing it, I just found out about it the next day. Years later, I tweeted a thread about it, in response to this:

Here's the text of the thread:

I know an even weirder version of this, also from SL. This is from when I worked there, about 7 years ago. It involves skating horses.
Similar to the bunnies, there were some other virtual pets in SL that were Arabian horses. Just like the bunnies, you had to buy them food.
You’d put the food out, they’d find it using some pathfinding code written by the horse creator - before SL had native pathfinding.
Now, when there are new releases of SL, it isn’t fully tested with all the scripts that content creators write. Far too many to do that.
(The QA issues around Second Life could fill a book. Many of those issues could/should have been avoided. Not going into it now.)
One release had a tiny tweak to the physics engine, related to friction of objects moving on the ground.

You may guess what’s coming next.
The horse pathfinding logic was using the old friction rules. As soon as the SL region code updated, horses started sliding past their food.
In some cases, horses living on high-altitude platforms started falling off them. (I imagine them whinnying as they pirouetted into space.)
Now, this all seems really comical, until you realise (a) how many people owned these horses, and (b) how much they’d spent on them.
So, within a couple of hours of the code going live, staff realised that US$X0,000 worth of user possessions were being destroyed by a bug.
(Yes, US dollars. Not Linden dollars. Just in virtual horses. You have no idea how much the SL economy is still worth.)
Once the size of the problem was realised, the code was rolled back. But this still takes several hours over the thousands of servers.
So, during the rollback, several dedicated QA engineers stayed up much of the night, saving virtual horses from starving to death.

This is why virtual world bugs fascinate me. Some of them are just AMAZING. (There was another one that raised the water level 200 meters.)

My other favourite SL bug happened before I joined the company. It had very little to do with the virtual world, but was amazing for completely different reasons:


Hilarious! That's a beautiful one.
My takeaway: Don't invest in digital pets and don't ever buy digital goods for them.


"Don't invest in digital pets"

Reminds me when CryptoKitties became the viral app of Ethereum.


I joined a startup in their "stabilization" stage recently. Suffice to say the code quality is subpar.
I was tasked with consolidating the code. First thing I do is start to build a test suite. The framework for tests was already there (RSpec for a Ruby on Rails app) but there were no tests at all. So I write an example test and attempt to run it.

Turns out, every time I ran the test suite, A WHOLE FOLDER OF CODE WAS BEING DELETED.
This was the first and only case I ever met where the app was literally deleting parts of itself.

The reason in the end was a weird callback that was trying to reset the cache of the application by deleting the cache folder. Worse, it did so with a relative path instead of an absolute one, so calling this method from the wrong place had those unintended consequences. Suffice to say that is the opposite of a best practice and I rewrote this shit immediately.
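For illustration only, here is a minimal Python sketch of the safer pattern (this is not the original Rails code): resolve the cache directory to an absolute path under the application root, and refuse to delete anything that is not strictly inside that root. The name `clear_cache` and its parameters are made up for the example.

```python
import shutil
from pathlib import Path


def clear_cache(app_root: Path, cache_name: str = "cache") -> Path:
    """Delete and recreate <app_root>/<cache_name>, whatever the current working directory is."""
    root = app_root.resolve()
    target = (root / cache_name).resolve()
    # Refuse to touch anything that is not strictly inside the app root
    # (this also rejects "." and ".." as cache names).
    if root not in target.parents:
        raise ValueError(f"refusing to delete {target}: not inside {root}")
    shutil.rmtree(target, ignore_errors=True)
    target.mkdir(parents=True)
    return target
```

Because both paths are resolved first, running the test suite from a different directory can no longer redirect the delete to a folder full of code.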

I still work there though, mostly because my colleagues are fantastic people.


Another true story. I was working on a bug with a colleague, we were able to recreate it at will, but could not find the root cause. Both of us looked over the code a zillion times, until one day we realized that the O should have been a 0.


I see this problem a lot from a UX point of view. Password and license code boxes (and the like) are especially error-prone here. If systems use a sans-serif font in which 0 (zero) vs. O (letter), and 1 (numeral) vs. I (capitalized i) vs. l (non-capitalized L) are not distinguishable, this can lead to problems (especially if you cannot use copy/paste, which in itself is an issue when security is concerned). I especially hate my KeePass version, which makes this very mistake.
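If you control how such codes are generated, one option is to leave the look-alike characters out of the alphabet altogether. A rough Python sketch, with invented names (`SAFE_ALPHABET`, `make_code`) and an arbitrary code format:

```python
import secrets
import string

# Characters that are easy to confuse in many fonts.
AMBIGUOUS = set("0O1Il")

# Uppercase letters and digits, minus the look-alikes (32 characters remain).
SAFE_ALPHABET = "".join(
    ch for ch in string.ascii_uppercase + string.digits if ch not in AMBIGUOUS
)


def make_code(length: int = 16, group: int = 4) -> str:
    """Return a random code such as 'ABCD-EFGH-JKLM-NPQR' built only from unambiguous characters."""
    chars = [secrets.choice(SAFE_ALPHABET) for _ in range(length)]
    groups = ["".join(chars[i:i + group]) for i in range(0, length, group)]
    return "-".join(groups)
```

This does nothing for user-chosen passwords, of course; there a font with a slashed or dotted zero is the only real help.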


Today, the zero on Windows has a diagonal line through it.

You mean in the default fonts? Luckily that's true. They have figured that out even on Windows 🤪

Ya, I hear you. I'm not the Windows fanboy I once was.


When I started programming I spent a good part of a day bashing out a bug I had in Python. As I had just started out, my debugging skills were so undeveloped that I restarted my computer a few times in a rash attempt at fixing the problem.

The actual problem was stupidly simple, but due to me lacking the insight, experience or plain knowledge of understanding what the error was saying, I was second guessing all of my limited knowledge of programming and questioning reality.

I eventually fixed the bug by re-writing all of my code from scratch and it "disappeared". I then spent more of my day trying to figure out what the underlying issue was instead of moving on, as I didn't want to just "move on" if it seemed like the code was magically working.

The bug was that I had mixed tabs with spaces for indentation, which wasn't very noticeable in the editor I was using, which was Notepad.

Morals of the story:

  • use better tooling
  • understanding an error is just as important as fixing it
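As a small example of what "better tooling" can mean here, this Python sketch reports which lines are indented with tabs, which with spaces, and which with a mix of both. The function names are invented for the example; Python 3 itself raises a TabError when the mix is inconsistent.

```python
def indentation_styles(source):
    """Group 1-based line numbers by the kind of leading whitespace they use."""
    styles = {"tabs": [], "spaces": [], "mixed": []}
    for lineno, line in enumerate(source.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if not indent or not line.strip():
            continue  # unindented or blank line
        if set(indent) == {"\t"}:
            styles["tabs"].append(lineno)
        elif set(indent) == {" "}:
            styles["spaces"].append(lineno)
        else:
            styles["mixed"].append(lineno)
    return styles


def has_inconsistent_indentation(source):
    """True if the file uses both tabs and spaces for indentation anywhere."""
    styles = indentation_styles(source)
    return bool(styles["mixed"]) or (bool(styles["tabs"]) and bool(styles["spaces"]))
```

An editor that can show whitespace characters makes the same problem visible at a glance.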

I recently spent two hours trying to figure out why Spyder couldn't import a module I'd just installed, when it ran fine from the command line python3 interface.

Turns out I'd mistyped it in my test script and didn't notice until I made a new one. (These two scripts say EXACTLY THE SAME THING, why does one work and the other one.....oh.)


How did Python react when confronted with a misspelled module name? I'd expect some kind of (helpful?) error message to stderr or, at least, stdout? Or did you just run it interactively?

I was initially trying it in Spyder's integrated IPython console. Once my second try worked, I eventually ran the original file from the command line too; it also failed, which made me realize it was the file, not an environment error.

It did helpfully raise the ModuleNotFoundError, but since at that point, the whole purpose of the file I had written and my running it with nothing in it but an import statement was to verify if I'd successfully a) installed the module to my venv and b) successfully ran Spyder out of the venv to which it was installed.... I thought it was an error in either a or b and was trying to track it down :D

😂 We have all been there.


In the same class of errors, when I moved to Norway 3 years ago naturally I started using the Norwegian keyboard layout. That's the second time I found out about the non-breaking space, the first being the old days when it was common to add a &nbsp; in an otherwise empty DOM element to retain the layout.

With some key sequence that I never figured out, the normal space would be replaced with a non-breaking space. Unlike the human eye, most programming languages make a pretty deep distinction, and the program would just crash upon encountering one outside of a string. Since that would happen rather frequently, I ended up making my own keyboard layout that supports the 3 languages I use, and 2 additional ones for writing names.
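A quick way to hunt these down, sketched in Python with made-up helper names: report the line and column of every U+00A0 in a piece of source text.

```python
NBSP = "\u00a0"  # non-breaking space


def find_nbsp(source):
    """Return (line, column) positions, both 1-based, of every non-breaking space."""
    return [
        (lineno, col)
        for lineno, line in enumerate(source.splitlines(), start=1)
        for col, ch in enumerate(line, start=1)
        if ch == NBSP
    ]


def replace_nbsp(source):
    """Replace every non-breaking space with a regular space."""
    return source.replace(NBSP, " ")
```

Note that `replace_nbsp` is deliberately blunt: it also rewrites non-breaking spaces inside string literals, where they may be intentional, so it is safer to look at the positions first.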


That's a good one. Multi-language/culture use and keyboard layouts remain an issue to this day ☹


Meaningful white space can be a PITA if you are not used to it and/or your IDE doesn't support you with some visual means. On a personal note, I find the mere concept of something so meaningful in something so totally non-iconic as white space quite non-appealing, to say the least.


Many, many moons ago, a panicky customer on the other coast called to report that their system had been completely erased. It didn't happen just once. I got a trip to Baltimore to try to calm them down (and some really nice meals with some really nice people). To skip to the end:

  1. They have the system plugged into an outlet in a cubicle wall.
  2. The modular cubicle walls plug into each other; there's not a single uninterrupted wire.
  3. Someone bumps the wall and there's a brief power glitch.
  4. The system doesn't crash, but the power blip appears on the system bus and is interpreted by the disk controller as something worthy of generating an interrupt over.
  5. We have a recently rewritten disk driver, where the programmer had recoded the interrupt routine to "decrement the pending count, then look for more work to do". But on the idle system, (0 - 1) made it look like a non-zero number, so there's work to do! The driver blithely counted down disk block numbers from a (16-bit) -1, writing garbage.

Result: wiped disk. (A little more complex than this, but close enough.) Oops.
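The wraparound itself is easy to reproduce. This is only a Python emulation of a 16-bit unsigned counter with invented function names, not the actual driver code:

```python
MASK16 = 0xFFFF  # keep results within 16 bits, like an unsigned 16-bit register


def decrement_unchecked(pending):
    """Decrement with wraparound: 0 - 1 becomes 65535, which looks like pending work."""
    return (pending - 1) & MASK16


def decrement_checked(pending):
    """Guarded variant: a spurious interrupt on an idle system leaves the count at 0."""
    return pending - 1 if pending > 0 else 0
```

With the unchecked version, one stray interrupt on an idle system turns "nothing to do" into 65535 pending items.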


One of the most annoying bugs I've worked on involved the following:

A customer on the other side of the world (barely any timezone overlap) reported that the memory usage of one of our components would continue to climb linearly in one of their environments until the process essentially got OOM Killed. Sounds like a memory leak, right? It wasn't, because forcing the Ruby garbage collector to run would manage to free up most of the memory!

Got to spend weeks going through heap dump after heap dump trying to find the culprit. Turns out it was due to a bad ORM call that was loading a collection into server memory to filter instead of doing it in the database. It didn't cause problems for most users because this collection was typically small -- on the order of 10s of elements. In this environment, however, it was on the order of thousands. Very simple fix, but very difficult to hunt down!

Got to learn a lot about how memory is managed in Ruby, though. 😊

My team blogged about it in more detail here: engineering.pivotal.io/post/debugg...


Once, the team pushed out the reports without properly checking the access restrictions. Providers could see each other's salaries! On the bright side, either the providers were too careless to check right away or we were just lucky, but we didn't get any complaints. It was embarrassing and scary, and in fact it could have led to someone being fired in the end.


This would be a HUGE GDPR problem in the EU nowadays. 🙈🙈


I had just taken a position on the Communications team of a large mid-range company. This team was comprised of folks who implemented operating system layer code to allow communications between mainframes, midrange systems and PCs. A heady job for sure, but as the newbie, I was delegated to work the edges instead of the action item work.

As I dug in I found a bug that had been around for at least five years with no resolution. The midrange had at least three layers of code, just call it the top layer, the mid layer and the I/O layer. When this bug showed (about two or three times a year) it was particularly bad because it effected a complete network reboot. This meant thousands of "sessions" were lost. Getting things back to normal took long periods of time, and of course managers were everywhere by then.

After everything calmed down, I studied the code in the area of concern and decided to put in a massive amount of trace points so we could get a flight recording of all the events between the top and middle layer interface.

My manager did not want me to put out the trace, but I convinced him. He found a customer willing to try it and we were on for a recreation and data to review. About a week later he popped into my office to tell me the patch crashed their system. They had so much network activity the trace logs couldn't keep up. They pulled the patch and life went on.

About two years later another person unknown to me attempted a similar patch but with much less tracing. He found the now 7 year old root cause. All glory went to him, and rightfully so.

So close but so far at the same time.


It seemed to be the right call. Empirical evidence in debugging always works better than mere conjecturing, at least in my experience.


Not original but still got to be the best (worst?) bug ever - the 500 mile email!


That was a funny read. The FAQ is nice too.


Once I was called to do an SEO audit for an online shop. They complained that they were performing really badly in search results and wanted me to find the reason. Turned out, for whatever reason a developer had intentionally (!) de-indexed all their category pages. All that was left for search engines to index were general "About us" pages. I never found out why they had done it.


That's quite an accomplishment for a web dev 😂 How did he manage to do that without noticing? Disallow via robots.txt?


Yes indeed :D Every category page had a "noindex,nofollow" tag. My best guess is that they were trying to prevent some sort of duplicate content problem, but executed it really badly and/or didn't understand the basics of what they were doing.
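For anyone wanting to audit this on their own pages, here is a small Python sketch that pulls the robots directives out of a page's meta tags using only the standard library. The class and function names are made up, and it only looks at `<meta name="robots">`, not at robots.txt or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

Running something like this over the category pages of that shop would have flagged "noindex" on every single one.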


Here's a recent one.

I'm working on a Python application with data being passed around different threads. There was this one particular piece of data that made me almost throw my laptop over the balcony.

So in one thread, a database table is selected and is serialized into a comma-separated string. Then, this string will be added to a queue which will then be read by another thread. The other thread will split the string by commas and will access each value by array indexing.

The problem is, the thread that reads the data seemed to be accessing indices beyond the number of columns in the original database table! It turns out that some of the serialized values had commas of their own in them, so the number of fields naturally increased.
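If the string format has to stay, Python's standard csv module handles the quoting that a plain join/split misses. A minimal sketch with invented helper names:

```python
import csv
import io


def serialize_row(values):
    """Turn a list of strings into one CSV line, quoting fields that contain commas or quotes."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue()


def deserialize_row(line):
    """Parse one CSV line back into the original list of strings."""
    return next(csv.reader(io.StringIO(line)))
```

Since both ends are threads in the same process, another option is to skip the string entirely and put the row tuple itself on the queue.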

There's a special place for the people who designed our original application, and it's not pretty.


That's a classic within CSV handling in particular and the "data vs. metadata" department in general.


We had a CD build running that utilized the 1.2.3-b456 naming convention (major.minor.patch-buildNum). One of the devs decided it would be cool to include the name of the artifact in a properties/config file. How did we achieve this?

  1. Run a build.
  2. If the build was successful => update the property in the file with the new name. Commit that change.
  3. Check for new commits - oh a new commit. Repeat step 1.

I caught it early (i.e., after 20 builds) only because their build was dependent on one of my builds and it kept kicking off new builds on my end. Good times.
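One common way to break such a loop is to have the trigger ignore commits the pipeline made itself. A Python sketch of that check; the bot name, the marker string and the commit dictionary shape are all assumptions for the example, although many CI systems support a similar skip marker in commit messages:

```python
BOT_AUTHOR = "ci-bot"      # hypothetical committer identity used by the pipeline
SKIP_MARKER = "[skip ci]"  # hypothetical marker added to automated commits


def should_trigger_build(commit):
    """Return False for commits that the pipeline created itself."""
    if commit.get("author") == BOT_AUTHOR:
        return False
    if SKIP_MARKER in commit.get("message", ""):
        return False
    return True
```

The cleaner fix is probably to not commit build output back into the repository at all, and to inject the artifact name at build time instead.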


Damn, and your bosses must have been so happy you were pushing your CD pipe to the limit with all your increased productivity (as measured in builds accomplished) 🤣🤣


This is the best... A new tape deck unit for saving data was to be supported at the operating system layer. I changed all the code to support the unit based on specifications that were given. We did a ton of testing and shipped the code. About a year later we kept getting reports of tape deck fires that companies were seeing with that new model. In addition, we had heard the data restores were losing data. I had a college call me directly to see if I could get data off of the tapes they sent in because their students' theses were on them.

I spent countless hours going over the code to see if something was wrong and came to the conclusion that there was no way the code could have caused any of these symptoms. The problems continued, until we were flown to a customer's site. We being myself and two hardware engineers with lots of equipment. They told me to create a large batch of data to save while they had probes attached to the hardware. After about two hours they found out that when the hardware buffer ran out of data, it wasn't signaling properly that it was empty! This caused the tape drive to stop, rewind a bit and attempt to lay down a new track of data. After doing this for hours, these tape decks would overheat and catch on fire!

The engineers went back to the office and I never heard another word on the issue. Other than they stopped the fires somehow.


One of those rare cases (at least in my line of work) where software problems could potentially cause life-threatening problems.


Many, many years ago when I was a grad student I was writing a model based on spherical harmonics and I couldn't get the results to come out even close to the expected answer. Eventually, after stripping more and more code out of my model, I came to the conclusion that the pow function shipped with the compiler was wrong, and fortunately, because in those days you could read the compiler's C code, I could prove it was. Someone had decided that the result of raising an integer to the power 0 was 0, not 1 as the rest of the world did. In the end I had to take my big maths book down to system support to get them to believe me.
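For reference, this is roughly what a correct integer power routine looks like, sketched in Python (obviously not the compiler's original C code). The important line is the starting value: the empty product is 1, so anything raised to the power 0 comes out as 1.

```python
def int_pow(base, exponent):
    """Integer power by repeated squaring; exponent must be a non-negative int."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1  # the empty product: base**0 == 1
    while exponent:
        if exponent & 1:
            result *= base
        base *= base
        exponent >>= 1
    return result
```

Like Python's own `**` operator, this sketch also returns 1 for 0 to the power 0, which is a convention rather than something the argument above proves.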


😂 x⁰ = x¹/x¹ = 1 q.e.d.

☝️ for all x <> 0 that is


Once I had a little bug that wound up costing us a lot of customer data, because the JS I was using had setCustomer*Id* and getCustomer*ID*, or possibly the other way around. Not terribly interesting, but another reminder to be consistent about casing.

The more interesting one: some years ago, I was working as the front-end developer at an e-commerce service company, providing smaller online retailers with recommendation services. The way it worked was that we loaded, on each client page, a JS config file containing client-specific customizations, a platform-specific library, and a file containing the core functionality. This also predated JS build tools as a real Thing, so we had to load each file separately rather than a single bundle.

Well, one day I was asked to add a certain feature for a particular client; this was a simple thing, just converting a piece of data from some computer-readable form to something more human-friendly, so I just put the function in the platform file and put the lookup table it referenced in the core, so it'd be available to other platforms if they needed it. It all worked fine, so I uploaded it to our CDN and waited for the change to propagate out.

A few days later I got a call about our code breaking a client's site. Apparently the new platform file had propagated out, but the core file hadn't, so when my new function went looking for its lookup table, it couldn't find it and broke the website of what was actually our largest client.

TLDR: I lost our largest client and almost my job because I didn't really know how CDNs worked.


This one I still haven't been able to fix. It's weird. We have an admin panel for a project at work, made with Angular. As usual, some input fields and buttons are disabled, this works as expected but a colleague who uses the panel told me he could click and write on disabled inputs and buttons... weird.

I asked for more info, and he had the latest version of Chrome and no extensions. He works from a Mac, just like me.

I haven't been able to reproduce the bug, and of course, haven't been able to fix it.

I even asked on Stack Overflow and here on DEV.to, and nobody seems to have had this issue or has any idea what it could be.


What does an inspection of the live DOM in Chrome dev tools reveal? Are the disabled attributes assigned? Or does the app rely on some kind of pseudo or custom classes?


Probably this one. I'm not going to explain it in detail, but it was basically a weird mix of inheritance-like cascading lookups and lexical scoping in a project that uses lots of meta-magic to hack the global environment.

From what I remember it amounted to adding an additional layer of higher-order functions to bind a handler to the new environment for every inherited instance.

The commits leading up to it aren't exactly much better.


Back to the future.

Time is quite an important thing. So you want to set up your servers to be properly synchronized. How you do this can cause some interesting problems. At a client, our enterprise software had regular issues with transaction timeouts. The timeout was set at 5 minutes. Generally you implement timing based on the number of seconds since the epoch, or something similar. It is a number which simply increases, and is not affected by timezones, DST, or leap seconds. The number always goes up. At least, when things are set up correctly.

So our software was suffering from transaction timeouts. Simple actions which would not take even close to 5 minutes were failing. We logged a lot of information. While inspecting the log output I noticed that occasionally the time went back a few minutes. Printed log statements with timestamps can jump ahead or back an hour or so due to DST. But not minutes. So something was going on that caused the system time to jump back a few minutes, and a while later it jumped forward by the same number of minutes.

I did not have full access to that server. As the rebel I am, I wrote a simple shell script which kept monitoring the current time (I was not allowed to do this by the client). It would check every second what the current time was, and if the difference was more than a second it would report it. After running it for a few hours the pattern emerged. The server time was in fact jumping forward and backward quite regularly.
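The original was a shell script that I no longer have; a comparable watcher could be sketched in Python like this. It is a variation on the same idea: instead of only comparing successive readings, it compares the wall clock against the monotonic clock, which cannot be stepped. The function names and the 1 second tolerance are choices made for this example.

```python
import time


def detect_jump(prev_wall, prev_mono, wall, mono, tolerance=1.0):
    """Return how far the wall clock moved relative to the monotonic clock, or 0.0 if within tolerance."""
    drift = (wall - prev_wall) - (mono - prev_mono)
    return drift if abs(drift) > tolerance else 0.0


def watch(interval=1.0, iterations=None):
    """Poll both clocks and report every step of the wall clock; runs forever if iterations is None."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    count = 0
    while iterations is None or count < iterations:
        time.sleep(interval)
        wall, mono = time.time(), time.monotonic()
        jump = detect_jump(prev_wall, prev_mono, wall, mono)
        if jump:
            print(f"{time.ctime(wall)}: wall clock jumped by {jump:+.1f}s")
        prev_wall, prev_mono = wall, mono
        count += 1
```

A negative value means the wall clock was set back, a positive one that it was pushed forward.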

Digging some more I found out that they chose a terrible setup for time synchronization. This was a GNU/Linux server. Instead of installing an NTP daemon they used a cron job to execute ntpdate once in a while. The difference is that ntpdate will hard-set the system time, whereas an NTP daemon will gradually bring the system time in line with the synchronized time by making seconds shorter or longer. The latter can take quite a while for a system to be synchronized, but at least time is still only going forward.

I was able to get the list of NTP servers they were using, and I could manually query them all. It appeared that one of the 4 servers was running a few minutes behind, by the exact number of minutes I had observed. I reported to the client that they had a faulty NTP server in their pool, that the problem was in their infrastructure, and that our software was working correctly but was suffering from the infrastructure issue. The response of our contact: "It is normal that time can go backwards, this is what happens during DST. You need to cope with that in your software." I tried to explain to him that DST does not change system time, just the timezone offset, and that it only affects printed time. But he wouldn't listen, and kept insisting we change the software to cope with it. I refused to even consider it or spend further time on this issue. I escalated to my manager.

This issue dragged on for I think a week until they finally fixed the problem at the client. During this time they were constantly suffering from issues due to transaction timeouts. Every time these issues came to us I pushed them back to the client, referring to the original issue of their faulty infrastructure.

PS, I have had many more encounters with that contact at the client. He was supposed to be a technical person, and was regarded as such by many. But actually he was just a loud-mouthed, incompetent person who could sling technical terms that sounded knowledgeable to people who did not have the real knowledge.


Nice catch. This imposter syndrome seems to be quite ubiquitous, especially in IT where there are few really technically apt people.

Also, I absolutely agree: Never do the obviously wrong thing just to please a client. That doesn't work out in the long run. You're a pro exactly because you withstand such "temptations".


There was a point in time where a particular function in MariaDB treated the value of 0 as invalid. Any other integer, positive or negative (within the bounds of the bit limitation) was fine. But 0 specifically would produce an error. And at that, only in certain circumstances.

This was due to the way MariaDB would compress some data. Values would be stored with length information. The value 0 required 0 bits to store, so length was set to 0, and since there was no length, the field was treated as completely invalid.
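To make the failure mode concrete, here is a toy length-prefixed encoding in Python. It is emphatically not MariaDB's actual storage format, just an invented scheme that has the same trap: the value 0 needs no payload bytes, so a decoder that treats "length 0" as "nothing stored" rejects a perfectly valid zero.

```python
def encode(value):
    """Toy encoding: 1 length byte, then the minimal big-endian payload of a non-negative int."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    length = (value.bit_length() + 7) // 8  # 0 needs 0 payload bytes
    return bytes([length]) + value.to_bytes(length, "big")


def decode_buggy(blob):
    """Mistakes an empty payload for a missing value."""
    length = blob[0]
    if length == 0:
        raise ValueError("invalid field")
    return int.from_bytes(blob[1:1 + length], "big")


def decode_fixed(blob):
    """An empty payload simply decodes to 0."""
    length = blob[0]
    return int.from_bytes(blob[1:1 + length], "big")
```

Every other value round-trips through both decoders; only 0 trips the buggy one.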

I opened a PR for the issue, and then the devs found another addition to the bug... It was possible to store either a signed or unsigned 0 (same value either way). So they had to update the code to support both cases.


Funniest, and one I'm probably most well known for on Twitter?

In my tired coding binge one night, I wrote the following in C++

x = x++;

I just wanted x++, but was too tired to realize what I had done. This is undefined behavior in C (and was in C++ before C++17). It just so happened that the compiler I was using at the time did what I expected. But then when I used the exact same, already "verified to work" code on another compiler, or even a different version of the same compiler, the code would fail.

That was also in the very early days of ARM embedded programming. There was little by way of ability to debug anything on the platform. Each iteration of the code required not only compiling it, but then flashing it to a custom flash drive, and then inserting that flash drive into the desired hardware. Development and testing were painfully slow!


I was working at a conference and had written some code to handle movement of objects on a 2m wide "touchscreen". This was before touchscreens were widely available, and it was basically just a film that was stuck to glass that fed back a series of coordinates, and my code had to interpret those coordinates as movements on the display (which was a projector onto the glass, which had a semi-opaque film applied on top of the touch-sensitive film).

We were told by the touch film company that the film needed at least a few days to set properly on the glass to avoid shift. Of course that didn't happen, the film was only applied the morning of the first day of the conference.

Cut to me every few hours needing to recalibrate the bounding coords of the screen as things shifted. What really made it fun was having to add that calibration functionality on the fly, from the other side of the glass (meaning the screen was completely mirrored backwards) because we didn't think to bring up an extra monitor.
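The original code is long gone, but the calibration boiled down to something like this Python sketch: map raw film coordinates into screen pixels using the current bounding box, and rebuild the mapping whenever the film has shifted. All names and the example numbers are invented.

```python
def make_mapper(raw_bounds, screen_size):
    """Build a function that maps raw film coordinates to screen pixels.

    raw_bounds  = (left, top, right, bottom) as reported by the film
    screen_size = (width, height) in pixels
    """
    left, top, right, bottom = raw_bounds
    width, height = screen_size

    def to_screen(raw_x, raw_y):
        # Normalize to the 0..1 range inside the calibrated bounding box.
        nx = (raw_x - left) / (right - left)
        ny = (raw_y - top) / (bottom - top)
        # Clamp so touches just outside the calibrated area stay on screen.
        nx = min(max(nx, 0.0), 1.0)
        ny = min(max(ny, 0.0), 1.0)
        return nx * (width - 1), ny * (height - 1)

    return to_screen
```

Recalibrating then just means touching two opposite corners, reading the raw values, and calling `make_mapper` again with the new bounds.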


During the start of my 2nd job, there was this one bug at a customer's site where we would generate corrupted SQL load files. We used load files because this was a legacy application (VB.NET I think) and SQL load files were faster in this use case.

This application ran as a Windows service. Turns out the reason why it generated corrupted load files was that it was running two instances of the service and the one process was overwriting the other process's file data. It took 3 business days and a weekend for me to figure this out since I had no experience with this legacy application and couldn't reproduce anything. Pretty proud that I could debug this VB.NET app as a Java Engineer during that part of my life.
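A common guard against this, sketched in Python rather than VB.NET and with invented names: take an exclusive lock file at startup and refuse to run if it already exists.

```python
import os


class AlreadyRunning(Exception):
    """Raised when another instance already holds the lock file."""


def acquire_lock(lock_path):
    """Atomically create lock_path and record our PID; fail if the file already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(f"lock file {lock_path} exists") from None
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


def release_lock(lock_path):
    """Remove the lock file on clean shutdown."""
    os.remove(lock_path)
```

The weak spot of this simple version is a crash that leaves a stale lock file behind, so a real service would also need some way to detect and clear that. Giving each process its own output file name would be another way to stop them trampling each other.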


SAP NetWeaver developer studio, integration of an SAP ABAP endpoint.

There are two screens in this dev environment that help you to integrate the endpoint into your Java code base. These two screens appear to be the same, yet they produced different results. In fact one would produce erroneous results and the other would work.

Took me a while to figure that one out.


Most remarkable bug I had to fix?

My life. 😉

Great read! 💯