Ben Halpern

Posted on Feb 3, 2021

What's the longest you've ever spent debugging a single bug?

#discuss

Top comments (60)

Probably a whole 9-hr work day and some change.

I was tasked with creating some Ansible configs for these build agents. The machines being spun up from them were identical, but spread across 3 different networks: A, B, and C. The big difference was one zip. A and B got it from shared drives, but C pulled it from our Artifactory. I was told that the one in Artifactory was the same as the one on A and B.

A and B were fine but machines on C were failing. I figured it was the zip, and it was...but it took the whole day and 2 30-minute Zoom meetings with different folks.

The problem? Well all 3 zips had the same name: Dir_X.Y.Z_14.0 but

The zips on A and B unzipped to C:\Path\To\Dir_X.Y.Z_14.0
The zip on C unzipped to C:\Path\To\Dir_X.Y.Z-14.0

A single-character typo brought me to my knees lol. Someone renamed the directory to have a hyphen, but the zip they created still had an underscore, lol. Ahh good times.
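A check like the following could have flagged the mismatch early. It is only a minimal sketch: the helper names `top_level_dirs` and `make_zip` are made up for illustration, and the in-memory archives stand in for the real zips, whose contents are not known.

```python
import io
import zipfile


def top_level_dirs(zip_file):
    """Return the set of top-level names found inside a zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        return {name.split("/", 1)[0] for name in archive.namelist() if name}


def make_zip(inner_dir):
    """Build a tiny in-memory zip whose single file lives under inner_dir."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(f"{inner_dir}/readme.txt", "placeholder")
    buffer.seek(0)
    return buffer


# The zip file name promises this directory...
expected = "Dir_X.Y.Z_14.0"

# ...but only one of the two archives actually contains it.
good = top_level_dirs(make_zip("Dir_X.Y.Z_14.0"))
bad = top_level_dirs(make_zip("Dir_X.Y.Z-14.0"))

print(expected in good)  # True
print(expected in bad)   # False
```

Comparing the directory inside the archive with the name the rest of the tooling expects catches the problem before anything is unzipped on a build agent.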

Zack DeRose • Feb 3 '21

If I ever get to 3 hours staring at the same bug, I generally get up and go for a walk or get some other eyes on it, or try to tackle a different task and come back to the bug later.

Maybe not related, but definitely have had long stretches where a certain bug is 'fixed' only to pop up again a week down the line...

Lucia Cerchie • Feb 3 '21

^this

Nicolas Bailly • Feb 3 '21

The most I've worked uninterrupted on the same bug is probably around a week. It was one of the worst bugs I'd faced too: some of our clients' data would get randomly deleted for no reason and no one had any idea what was happening. I spent days trying to debug every single API trying to determine what could do that...

I eventually ended up parsing the mysql binlog searching for every delete statement on that table, searching where it came from in our codebase, and rerunning them one by one...

Turns out someone had forgotten some parentheses in an 'OR' condition months before.
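For anyone who hasn't been bitten by this: in SQL, AND binds tighter than OR, so dropping the parentheses changes which rows match. Here is a minimal sketch using an in-memory SQLite database as a stand-in for MySQL (the precedence rule is the same). The table, columns, and conditions are invented for illustration and are not the original query.

```python
import sqlite3


def remaining_after_delete(where_clause):
    """Run a DELETE with the given WHERE clause and return the surviving ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, client_id INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [
            (1, 1, "expired"),
            (2, 1, "active"),
            (3, 2, "expired"),
            (4, 2, "cancelled"),
        ],
    )
    conn.execute(f"DELETE FROM items WHERE {where_clause}")
    rows = [row[0] for row in conn.execute("SELECT id FROM items ORDER BY id")]
    conn.close()
    return rows


# Intended: only touch client 1's expired or cancelled rows.
intended = remaining_after_delete(
    "client_id = 1 AND (status = 'expired' OR status = 'cancelled')"
)

# Without parentheses this parses as
# (client_id = 1 AND status = 'expired') OR status = 'cancelled',
# so client 2's cancelled row is deleted as well.
buggy = remaining_after_delete(
    "client_id = 1 AND status = 'expired' OR status = 'cancelled'"
)

print(intended)  # [2, 3, 4]
print(buggy)     # [2, 3]
```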

Ashlee (she/her) • Feb 3 '21 • Edited

I spent several weeks trying to figure out why images from Windows Snipping Tool could not paste into Quill WYSIWYG and then a couple more weeks trying to fix it and work with other kinds of text and image pastes. I even wrote an issue for it that's still open! I've changed jobs 3 times since I wrote this and now I'm back to using it again in my current work project.

Cannot paste images from Snipping Tool #2539

ashleemboyer posted on Mar 13, 2019

A paste event is detected, but the images never show when you try to copy and paste images from things like Windows Snipping Tool. Copying and pasting images from Google, for example, has no issues.

It seems like there is a timing issue for reading files with a base64. I have not been able to reproduce a "fix" I discovered in CodePen, but in the actual project I'm using Quill for, extending the Clipboard module and lengthening the timeout duration at the end of the default onPaste function makes pasting from Snipping Tool work. The bigger the image that needs to be pasted, the larger the duration needs to be.

Again, I am not able to reproduce a bug caused by my "fix", but in my project, lengthening the timeout duration causes two "regular" images to be pasted. I'm throwing this part out there in case it comes up for anyone else. It may be something in my project.

Steps for Reproduction

Visit my CodePen
Capture an image with Windows Snipping Tool
Copy the image and try to paste it in the editor
No image is pasted

Expected behavior: All image pasting should behave consistently.

Actual behavior: Cannot paste images from snipping tools.

Platforms: Windows 10 (I have not tested this on others yet) Chrome 72

Version: My project uses 1.3.4, but the issue persists in 1.3.6. The CodePen is using 1.3.4.


Brad • Feb 3 '21

3 months, not non-stop obviously, but I continuously went back and tried multiple things multiple times. Even did a 100% full-on re-install of the operating system.

The issue? Bad vim-airline fonts on my Raspberry Pi.

The solution? Run a command to update the firmware of the Raspberry Pi.

Mark Davies • Feb 3 '21

Can't remember precisely how long, but probably half a month to a month. It was a dotnet "thread starvation" issue, where it was just running out of threads to run operations. Had a lot of false flags and a lot of debugging to find the actual issue. I hope to never see that error again (Vietnam flashback).

ferceg • Feb 4 '21

It was a very, very long time ago, in the late 90s.
I wrote a little game in Watcom C (somewhere between Wolfenstein and Doom, only walls and simple floor, but with not only perpendicular walls). Trigonometric functions were very expensive, so I used a generated sin table. I copy-pasted it into the source, but it looked ugly, so I lined it up with leading zeros. It was a mistake, because numbers written as 0****** (with a leading zero) are octal in C, so very strange things started to happen on the screen. It took a few hours to debug this and after that I was literally banging my head into the desk.
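A small sketch of what the padding does to the values, written in Python since Python 3 rejects literals like 0100 outright. It assumes the table held plain integer entries; `as_c_literal` is a made-up helper that mimics how a C compiler reads an unsuffixed integer literal, and the sample entries are invented.

```python
def as_c_literal(text):
    """Value a C compiler gives a plain integer literal (no suffixes):
    0x prefix means hex, a leading 0 means octal, anything else is decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


# Hypothetical table entries, padded with zeros to line up in the source.
table_entries = ["0100", "0255", "1000"]

for entry in table_entries:
    print(entry, "intended:", int(entry, 10), "C reads:", as_c_literal(entry))
# 0100 intended: 100 C reads: 64
# 0255 intended: 255 C reads: 173
# 1000 intended: 1000 C reads: 1000
```

Only the padded entries change, so most of the table stays correct, which fits the "very strange things on the screen" symptom. A padded entry containing an 8 or 9 would not compile at all in C.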

ItsASine (Kayla) • Feb 3 '21 • Edited

I tend to be the sort to leave something and come back to it in a week, especially for intermittent issues that are hard to reproduce.

Though I did get a fun text the other day "why does this test not work in IE?" from my last job's devs... 1. because I stopped trying to do a hack to fix it and 2. because <linked them to the IEDriver GitHub issue where Selenium said making IE work was so low priority that they didn't care typing was broken>

Technically I probably spent 4 years with those tests working in IE barely half the time because of timing around typing. I solved it by getting a new job :P

Mélanie Lelaure • Feb 3 '21

Unfortunately, I once spent 3 months on a single bug. It took a very long time and I got desperate about it.

It was a teleoperation application with a Universal Robot. The robot demo worked correctly in our office but not when our sales rep did the demo at UR. For some reason, one demo was working but not the other: we didn't seem to have any communication between the robots. Well, the second demo worked for 2 to 3 minutes and then both robots stopped and stayed blocked. It took a very long time to solve this, because we didn't have the setup to reproduce the bug in our office and I was not in good health either.

In the end, the problem came from our robot DLL. There had been an update, which I was not aware of, between the two demos. A colleague decided that a division by 2, which was not documented, was not "supposed" to be necessary. So he removed it from the real-time library without further notice and pushed it to production. The result was that communication between the two robots was not running at the correct speed. It was 2 times slower, so teleoperation was not possible. Several months later, the bug was still in production, because "it was not the problem". Well, it solved mine, actually, and I had figures to prove it.

This happened some time ago; I am not at this company anymore.

Stephen Belovarich • Feb 3 '21

Recently 5 days, off and on between meetings. No stack trace, just a build that kept slowly moving along taking almost 1 hour until I tracked down the culprit: Emotion 10 and how it handles type definitions can slow TypeScript compilation to a crawl. I figured it out by looking for similarities between packages that were slow in a monorepo, then commented out code until I found what caused the slowness and got the build down from 45 minutes to less than 1 minute.

geirawsm • Feb 4 '21

Not specifically a bug, but hear me out.

A colleague and I decided we wanted to make a tool specific to our work. We're both into programming and seem to know our stuff, but it is not our primary work task and we weren't hired for it. Pretty easy stack tbh: sql and php. He did backend, I did frontend.

I set up sql-server locally with all the correct tables and got my colleague's code and started my tweaking.
At first I was having some issues with running the php-site directly through php -S localhost:8000 . and connecting to the database. Having some experience with programming in Python and knowing that a clean environment is the best environment, I thought why not just make a clean virtual machine with Ubuntu Server and XAMPP. Set it up with a NAT network adapter and forwarded ports from localhost to the VM. Then I installed the newest Ubuntu Server and started coding on that instead.

But I experienced the same issue.

Start DBeaver to check out the db, yep seems fine, the db and tables are all there and looks great. I have another go. Same issue.
As a dirty fix, I started coding directly on the staging/prod-server just to make sure that my changes are working as intended. They do, and gradually it crawls to a completion.

It was only after two months and about 200 commits that I realised I had never stopped the local sql-server running on my machine, nor changed the host and credentials for the sql-server.

It was the same database the whole time 🤦‍♂️

Xucong ZHAN • Feb 4 '21

On or off for about a month.

A Quill editor was rendering incorrectly: all the line breaks between paragraphs (p tags) were stripped and multiple paragraphs were stitched together, but only in some specific Vue components (a fact that I should've paid much more attention to). I tried mixing and matching every single configuration (well, probably not, but definitely over a dozen) I could find in their documentation and their GitHub repo. I even dived into the source code, but with little to show for it.

Finally, the cause was stupidly simple. In those components, paragraphs were styled with either display: flex or display: inline. I literally jumped and cheered at the moment, half celebrating my success and half laughing at myself. Nonetheless, a GREAT lesson learned. :)

Matthew Bidewell • Feb 5 '21

Ha! I have a painful one. Took me nearly a good 3 days to find it. (between tackling other stuff when I hit a wall)

Note: I'm in GMT timezone.

The company I work for has an analytics view which takes a deep dive into the analytics of media the company serves. In November 2019, we got a message from a client saying numbers from our Excel download functionality didn't match those of their internal systems.

The numbers started off fine, but then massively increased after an arbitrary date. (clue 1)
The client was on the west coast of America, and we provide all our analytics in UTC time. (clue 2)
The client had multiple occurrences where the analytics were wrong after the arbitrary date. (clue 3)
I didn't have a problem when getting the data. (clue 4)

The problem?
Daylight saving
Without going into specifics.. the problem was going back an hour and then calling .startOfDay() on that date meant we would end up with two days worth of data after daylight savings.

Painful to find...easy to fix.
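One way this kind of bug can arise, as a rough sketch only: the actual code isn't shown above, so `start_of_day` and `report_window` are hypothetical stand-ins, and the one-hour shift is simply modelled as a parameter.

```python
from datetime import datetime, timedelta


def start_of_day(moment):
    """Truncate a datetime to midnight of the same calendar day."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def report_window(day_start, shift_hours):
    """Hypothetical window logic: shift the requested start back by some
    hours (e.g. a daylight-saving adjustment), then snap to start of day."""
    start = start_of_day(day_start - timedelta(hours=shift_hours))
    end = day_start + timedelta(days=1)
    return start, end


requested = datetime(2019, 11, 4)  # midnight of the requested day

for shift in (0, 1):
    start, end = report_window(requested, shift)
    print(f"shift={shift}h window covers {(end - start).days} day(s)")
# shift=0h window covers 1 day(s)
# shift=1h window covers 2 day(s)
```

Going back one hour from midnight lands on 23:00 of the previous day, and snapping that to the start of the day pulls in the whole previous day, so the window doubles.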

Brian Douglas • Feb 3 '21 • Edited

1 year.

TLDR; I wrote a GitHub Action using Docker and Bash without knowing a lot about both. Someone let me know it wasn't working for them so I spent probably an hour a month looking at it for about 6 months before forgetting about it.

Eventually, someone else opened a PR to fix it. Open Source FTW!

and now a link to that PR github.com/bdougie/invite-based-on...

G.L Solaria • Feb 3 '21

I spent a week trying to figure out if I was doing something wrong or if I had found a genuine bug in WCF. My gosh it almost broke my spirit. I don't think it has been resolved yet.
github.com/microsoft/dotnet/issues...

Phil Ashby • Feb 3 '21

Ewww, nasty. I too have spent waaay too long reading the source for WCF when things do not behave as documented/expected! Probably the longest was when investigating session leakage while using the WS-SecureConversation protocol. It seems absolutely nobody else in the world made that decision, and we probably shouldn't have either, but customers were now using it (30k+ of them) so we had no choice but to find & fix the leaks.. all told a rotating team of 3-4 people spent ~1 year (over a period of 6 months) finding all the ways customers could break stuff and patching up the server side...

Just before I retired, we had a plan to emulate the session aspects of the protocol, and I had a POC working which avoided actual server-side sessions, it employed JWTs to carry the security session data back and forth instead. This would have fixed a lot of problems with state management and scalability, I have no idea if it got implemented!
