In sum: three weeks with 2-3 developers. It was a corrupted pointer in a medical device. The issue was really hard to reproduce and even harder to root-cause. In the end we found three threads that tried to release a pointer, and only one of those implementations was broken. And it was a legacy codebase without support from the authors.
Used to do DevOps before it was even called that: Linux, Python, Perl, Java, Docker. For fun and profit. CTO-level generalist working for a mid-sized tech-centric company.
Dresden, Germany
Depends on how you count, probably. Weeks to months, I'd say, for a quite peculiar persistency bug that brought an application server to a screeching halt now and then for no obvious reason. Fixing it was rather trivial once we actually understood what went wrong. 🙂
Front-end developer since 2016. Focused on React with GraphQL while studying software architecture, design patterns, emotional intelligence, and leadership.
A week and a half, trying to override a CSS rule from a .NET Core app that was causing styling issues in a child React-based app. I needed help from another dev for 4 days until we got the fix. Man... that was a challenge at another level for me.
I'm passionate about web development and design. A team player who treasures effective communication. Eager to learn as much as is humanly possible on my road to web-development knighthood (haha).
Location
Nigeria
Education
Bachelor of Engineering, specializing in Electrical and Electronics discipline
Three weeks. This was when I first started out in web development. I fixed a bug that prevented the project from building on Heroku, but I kept pushing to the wrong git branch (I used git push instead of git push origin master). So when I pushed to Heroku again it would fail over and over. I have never made that mistake again.
Over the years, time scales have shrunk for both release cycles and debugging. I started programming on hard real-time systems, using assembly code. Back then we would normally achieve one (occasionally two) releases a year, and the system was expected to be in service for a minimum of 10 years.
I can think of one intermittent fault on an interface between two systems which took me nearly a decade to find. This wasn't continuous effort, but I made at least three attempts at resolving it. By the time I started looking at the bug, the system was already a legacy system, with a replacement contracted through our competitor on its way. I was a junior engineer known for having an aptitude for low-level coding, so I was put to work. Not having access to one of the two systems, I could only review the code and write a report.
Two or three years later our customer dug up my report and agreed we could have access to both systems; the catch was that I only had a single day on site, at the opposite end of the country. The nature of intermittent faults is that while you are debugging, the fault will not occur. True to form, the system ran perfectly for the whole day and no useful information was gained.
As luck would have it, our competitor failed to deliver on their promises, and our legacy system got a life extension and was rehosted on new hardware. I led the software effort, and it went into service, at which point I left the project. Once in service, the original intermittent fault came back with a vengeance, and our customer was not happy. I got seconded back to help fix the issue; we enhanced our simulator to emulate the other system and started debugging. Eventually I found the issue, which we traced to the wrong entry point being used in an error recovery routine in the real-time OS. The programmer, some 22 years earlier, had typed a 5 instead of a 3. The junior engineer who modified the simulator for me was younger than the bug! Having fixed the bug, I was reminded of the report I had written nearly 10 years before, which correctly pointed to the exact error routine at fault.
I graduated in 1990 in Electrical Engineering and have been in university ever since, doing research in the field of DSP. To me, programming is more a tool than a job.
A couple of days; an issue with pointers in a C program.
It began as usual: the program dies with a segmentation fault, you open it in the debugger to check where the fault happened and... the stack is a nonsensical mess. Ouch. That is not a good sign; it stinks of dangling pointers or similar.
In cases like this the actual error can be anywhere, and it can take a veeery long time to find the actual bug. It turned out that the problem was not with just a pointer, but with a pointer to a pointer to a pointer to... three or four levels deep.
I am soooo happy that I now code in Ada and not in C anymore.
Ha! I have a painful one. It took me a good 3 days to find (between tackling other stuff when I hit a wall).
Note: I'm in GMT timezone.
The company I work for has an analytics view which takes a deep dive into the analytics of the media the company serves. In November 2019, we got a message from a client saying the numbers from our Excel download functionality didn't match those of their internal systems.
The numbers started off fine, but then massively increased after an arbitrary date (clue 1).
The client was on the west coast of America; we provide all our analytics in UTC (clue 2).
The client had multiple occurrences where the analytics were wrong after the arbitrary date (clue 3).
I didn't have a problem when getting the data (clue 4).
The problem?
Daylight saving
Without going into specifics: going back an hour and then calling .startOfDay() on that date meant we would end up with two days' worth of data after the daylight-saving change. Painful to find... easy to fix.
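To make the failure mode concrete, here is a small Python sketch of that pattern (the start_of_day helper and the timestamps are illustrative stand-ins, not the production code):

```python
from datetime import datetime, timedelta, timezone

def start_of_day(dt):
    """Truncate to midnight -- a stand-in for the .startOfDay() in the story."""
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)

# A timestamp shortly after midnight UTC, just after the US fall-back
# from daylight saving (Nov 3, 2019 on the west coast).
ts = datetime(2019, 11, 4, 0, 30, tzinfo=timezone.utc)

# Buggy pattern: "compensate" for the one-hour shift first...
buggy_start = start_of_day(ts - timedelta(hours=1))
# ...which lands on the previous calendar day, so a query starting at
# buggy_start covers Nov 3 AND Nov 4 -- two days' worth of data.
clean_start = start_of_day(ts)

print(buggy_start.date())  # 2019-11-03
print(clean_start.date())  # 2019-11-04
```

For most of the year the hour of slack is harmless; only on dates near the transition does subtracting it cross a midnight boundary, which is why the numbers diverged after an "arbitrary" date.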
I don't have a way to know for sure, but I recall one that took around 1 month, though most of that time was spent ignoring the bug.
I had just come onto the project; the bugs had been mentioned, but they were not something I could directly start investigating.
I had to build out my test infrastructure, with mocks of our integrated component. This meant reading third-party API docs and building the correct communication lines.
After all of this was worked out, replicating the bug was easy, and pinning down the cause was just a matter of describing what the code was doing. Being QA, I sent it off for someone else to fix.
I recently had a bug in my message stack which took a few days to isolate, then 2-3 weeks to fix, as I had to build a new message stack. I wrote about it in dev.to/mortoray/high-throughput-ga... My game has encountered several major defects, usually in libraries or the browsers, which required a lot of effort to work around.
I have some nice embedded programming stories for ya:
Two old colleagues of mine spent about one week on a particular issue:
They were working on a SIP stack (for audio connections/sessions) when suddenly it stopped working completely. After one week it turned out that the PBX (a kind of phone/SIP router) had blacklisted their device for too many failed calls...
I spent about 1.5 months on another issue, with a driver for flash memory. TL;DR: some bit in a settings register was not set/reset by our driver, so depending on whether the device had used an older driver before, it would either work perfectly OR shift everything by 1 byte.
The unfortunate part of embedded programming (at least back then) was that:
it took around 2 minutes to compile and flash ANY change you made;
many errors only show up once the linker gets involved, which is at 99% of the compilation process (so after 1 minute and 45 seconds, something like that);
especially in the beginning, none of us knew how to debug/profile embedded software.
Later on we added profilers and proper debugging setups (and a hard-fault handler that printed stack traces. GAMECHANGER!)
Good old arm-none-eabi-gcc days :P
Mainly web development for money. Everything else for fun :) Rust, WebAssembly, Flutter, ML, C64 Assembly, Raspberry, ... a lot of plans, much less free time to work on them.
It was a very, very long time ago, in the late '90s.
I wrote a little game in Watcom C (somewhere between Wolfenstein and Doom: only walls and a simple floor, but not only perpendicular walls). Trigonometric functions were very expensive, so I used a generated sine table. I copy-pasted it into the source, but it looked ugly, so I lined the values up with leading zeros. That was a mistake, because a number with a leading 0 is octal in C, so very strange things started to happen on the screen. It took a few hours to debug, and afterwards I was literally banging my head on the desk.
Web Engineer. Working mostly with PHP, Symfony and Golang.
Enthusiastic about Engineering Best Practices, Continuous Delivery and DevOps.
Sports and FC Porto fan!
Once I was working on a project that used ElasticSearch. I was changing some things in a list page and noticed that the results were pretty random.
After maybe 2 or 3 days of trying to understand what was going on, I discovered that my local ES config had the default cluster name and was open on the network, so it automatically formed a cluster with a colleague's machine and I was seeing his data.
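Older ElasticSearch releases (pre-2.0) shipped with multicast discovery enabled and the default cluster name "elasticsearch", so any node on the same LAN running the defaults would happily join. A config sketch to avoid it (the cluster and node names here are illustrative):

```yaml
# elasticsearch.yml
cluster.name: my-local-dev   # anything but the default "elasticsearch"
node.name: my-machine
# On pre-2.0 releases, also switch off multicast discovery entirely:
discovery.zen.ping.multicast.enabled: false
```
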
I don't think it's the bug I spent the most time on, but it's one I don't forget.
Now that I think about it, it's pretty funny, but it was definitely not at the time. ;)
For me, it was months.
I actually wrote about it:
The Magical Password
Jan Wedel ・ Aug 24 '18 ・ 3 min read
3 days, turned out there was a bug in PHP itself. We had to come up with some creative workarounds for that one.
10 years. I still can't figure out why providing the wrong password on Windows login takes 1 minute to process