DEV Community

Discussion on: What was the worst bug you've ever written?

Evan Plaice

Network communication bug on a custom ping-pong handshake protocol caused by stale state in an array overflow.

For a little perspective: I was writing a graphical multi-touchscreen front-end for an 80s-era commercial flight simulator. The sim host ran an early-80s-era realtime AIX environment. Booting the damn thing took about 15 minutes and involved following a 'bootstrap loader' procedure whereby you'd have to fat-finger the boot instructions into a keypad in hexadecimal. Ethernet networking didn't exist when it was built, so we had to have an Ethernet card custom built. On the software end we were working with an AIX specialist from the UK who designed the protocol, implemented a bare-bones Ethernet driver, and provided a client/server implementation on the host end.

Our AIX guy had done this before on other similar machines, but it was my first journey into network programming. I wouldn't have physical access to the host until integration, so the AIX guy gave me a mini client/server simulator app written in C that I could train my client/server implementation against.

My end was written entirely in C#, where trying to do layer-2 networking is difficult enough. Either way, I found a PCAP wrapper (i.e. SharpPcap) to hack together a networking protocol, translated a floating-point format converter from C (the host didn't use IEEE 754), reverse engineered a raw dump of the symbol table and loaded it into a DB, etc.

Integration finally came and I was a nervous wreck. We had 7 full days of downtime to make the changes, including ripping out the old hardware (i.e. ray-tracing displays + 100+ buttons) and replacing it with the new (3 touch screens). Flight simulators are the backbone of an airline: pilots who can't keep up with the FAA's training requirements are grounded until they can. Therefore, flight simulators typically operate 20 hours/day, 7 days/week. Downtime beyond our scheduled window was not an option.

Months of preparation and death-march sprinting were finally yielding results. Everything was working brilliantly, to the point where we could even start identifying opportunities to make adjustments and performance improvements.

That was, until somebody noticed something strange happening on the interface. Instead of updating to show the current state of the host as expected, a small number of labels were constantly toggling between the correct value and something else. The symptom only occurred on a few pages, and only when the pages were loaded in a specific order.

I racked my brain for hours, chugging coffee and poring over every detail of my networking code, until about 4AM. That's when I finally managed to catch the AIX specialist taking a break from his VT220 terminal. I picked his brain, going back over the specifics of the networking protocol spec. When that failed to yield results, I started picking his brain about how he implemented the protocol on the host. That's when I had a sudden 'lightbulb' moment.

It turns out that, on 80s-era hardware running under real-time constraints, re-initializing the state array on his end for every update (i.e. roughly every couple hundred ms) is a very expensive operation. To avoid the performance penalty, he initially allocated the array to a fixed maximum size, and wrote over the existing values with the new values on each update.

On my end, running C# on modern Windows hardware, array initialization is the cheap and 'safe' approach, so that's exactly what I did. Since I wasn't sending a fixed-length set large enough to blank out values outside of the new set, it was possible for stale values to persist in the overflow of the array.
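To make the mechanism concrete, here's a minimal Python sketch of a fixed-size buffer that gets overwritten in place. It's only an illustration of the idea, not the actual host code: the class name, the capacity of 8, and the label names (ALT, HDG, etc.) are all made up for the example.

```python
MAX_SLOTS = 8  # hypothetical capacity; the real maximum is not given in the post


class HostBuffer:
    """Fixed-size label array that is overwritten in place, never cleared."""

    def __init__(self, capacity=MAX_SLOTS):
        self.slots = [None] * capacity  # allocated once, up front

    def apply_update(self, labels):
        if len(labels) > len(self.slots):
            raise ValueError("update exceeds buffer capacity")
        # Only the first len(labels) slots are touched; the tail keeps old data.
        for i, label in enumerate(labels):
            self.slots[i] = label


host = HostBuffer()
host.apply_update(["ALT", "HDG", "IAS", "VSI"])  # first page: four labels
host.apply_update(["FLAPS", "ALT"])              # next page: only two labels
leftover = host.slots[2:4]                       # stale entries from the first page
```

After the second update the first two slots hold the new page's labels, but slots 2 and 3 still hold the labels from the first page. Nothing ever told the buffer they were gone.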

If the old label existed in the overflow (which wasn't set to update) and the same label was present in the new set of values (which was set to update), the value in the UI would read either the old or the new one at random.

This would only occur under very specific conditions. The set of current labels had to be shorter than the previous set of labels. The same label had to be present in the current set and in the overflow. The overflow would persist as long as the size of a new set of labels was shorter than the set of labels the old label was contained in.
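Those conditions can be sketched as a small check in Python. Everything here is hypothetical (function names, label names, a 6-slot buffer), and the padding helper is just one plausible way to blank out the overflow, not necessarily what the real patch did.

```python
def apply_update(slots, labels):
    """Overwrite slots in place from index 0; return how many are live."""
    for i, label in enumerate(labels):
        slots[i] = label
    return len(labels)


def conflicting_labels(slots, live_count):
    """Labels present both in the live prefix and in the stale tail."""
    live = set(slots[:live_count])
    stale = {s for s in slots[live_count:] if s is not None}
    return live & stale


def pad_to_capacity(labels, capacity, blank=None):
    """Hypothetical fix: always send a full-length set so the tail is blanked."""
    return list(labels) + [blank] * (capacity - len(labels))


slots = [None] * 6
apply_update(slots, ["ALT", "HDG", "IAS", "VSI"])
n = apply_update(slots, ["IAS", "ALT"])   # shorter set, and "IAS" is also in the tail
before_fix = conflicting_labels(slots, n)

n = apply_update(slots, ["HDG"])          # still shorter, so the tail persists
still_stale = slots[2:4]

apply_update(slots, pad_to_capacity(["IAS", "ALT"], len(slots)))
after_fix = conflicting_labels(slots, 2)
```

In this toy model the conflict only shows up when a shorter set follows a longer one and shares a label with the leftover tail, which lines up with the "only when pages were loaded in a specific order" symptom. Padding every update to full length wipes the tail, so the conflict set comes back empty.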

Through sheer luck, I managed to get the code patched and working before I left for the hotel. Up until that point, I had worked mostly on the hardware side. I even planned and installed all the hardware for the update before switching back to code.

It's not like this was my first 'aha' troubleshooting moment, but it was the first time I had a full Boris-esque 'I am invincible!' moment. TBH, I've been hooked ever since.

Max Cerrina

Honest to god, reading that was a hell of a rush. Write more! That was amazing.