
David Truxall

Posted on • Originally published at davidtruxall.com on

Breaking Production

Recently, an intern at HBO Max mistakenly sent a test email to thousands of users in production. Twitter was on fire with memes and jokes, but I found the threads about developers’ own experiences breaking production much more interesting. Most long-time developers have had the gut-wrenching experience of breaking production, and I’m no different. My story is more complicated than I could fit in a Tweet, so I’m telling it here.

The year was 2003. I was working at a company in the Detroit area migrating a Classic ASP application to ASP.NET. The site “just had constant errors” that no one could elaborate on, and it needed a thorough overhaul. The company decided to use the rewrite as an opportunity to move to .NET, since Microsoft was moving away from Classic ASP at the time. Unfortunately for me, there was no logging in the existing system, and all exceptions went unhandled, so my team had no idea what was breaking or when it was happening. To get a grip on the problem, we wrote code that hooked into the global error handler and had the server email everyone on the team the error details, so we could start to understand the problems and their frequency through real-time alerts. There were no Sentry.io/Crashlytics/LogRocket services or the like at that time, so we built our own. We were still testing this feature and had not rolled it out to production yet; it was only in our development environment.
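The handler looked roughly like this. This is a simplified sketch rather than the actual code; the SMTP server and email addresses are placeholders.

```csharp
// Global.asax.cs: a simplified sketch of the error-notification handler.
// The SMTP server and email addresses below are placeholders.
using System;
using System.Web;
using System.Web.Mail;

public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        // Grab the exception that just went unhandled.
        Exception ex = Server.GetLastError();
        if (ex == null) return;

        // Mail the details to the whole team so we get a real-time alert.
        SmtpMail.SmtpServer = "smtp.example.com";
        SmtpMail.Send(
            "webserver@example.com",              // from (placeholder)
            "devteam@example.com",                // to (placeholder)
            "Unhandled exception: " + Request.Path,
            ex.ToString());
    }
}
```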

The day was August 14, 2003. That afternoon, there was a severe blackout across the northeastern US. When it was clear the power was not coming back on, the company sent us home. For us, the blackout lasted days. And this is where my failure begins.

The system I was working on was backed by an Oracle database, which had a dedicated administrator. The web servers had a different person administering them. During the blackout, the data center, located at our facility in Detroit, ran on generators to keep services available to users outside the blackout area. The database administrator decided to shut down the development database server to conserve power, but the web server admin kept the development web server running. Unbeknownst to me, the web server admin was also running a tool that pinged the home page of the development site every 8 seconds to make sure it was still alive. Unfortunately, the home page accessed the database, which was now turned off, so every ping caused an unhandled exception. As I mentioned, the site was not handling any exceptions, so each error fired the global exception handler and sent my team and me an email. Every 8 seconds. For two days, because no one on my team was working during the blackout to see the emails.

We returned to the office when power was restored, but no one in the company could get email. That system was still down. It turns out it was down because of my code: the handler had sent 56,000 emails during the blackout and filled the disk of the Novell GroupWise email server. Back in 2003, disks were nowhere near the size we use today. The email administrator was furious. No administrative tool existed for her to remove all those emails, so my team had to sit for hours and hours and hours deleting emails through the desktop client, which could only select 100 messages at a time. We certainly did our penance.

You might be thinking that the circumstances of the blackout caused the problem, not really a mistake in the code. But the fault was mine. I should have built throttling into the error handler so it stopped sending the same error message repeatedly. There is no value in seeing the same error over and over, especially within a short period of time. This was, of course, the first thing I fixed after deleting all the emails. It’s a hard-earned lesson I’ll never forget.
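The fix amounted to remembering when each distinct error was last mailed and staying quiet if it fired again within a window. Something along these lines, shown here as a simplified sketch in modern C# for brevity rather than the original 2003 code; the 30-minute window is arbitrary.

```csharp
// Simplified sketch of the throttling fix (modern C# for brevity,
// not the original 2003 code). It tracks when each distinct error
// was last mailed and suppresses repeats within a time window.
using System;
using System.Collections.Generic;

public static class ErrorThrottle
{
    private static readonly Dictionary<string, DateTime> _lastSent =
        new Dictionary<string, DateTime>();
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(30);
    private static readonly object _sync = new object();

    // Returns true only if this error has not been mailed recently.
    public static bool ShouldSend(string errorSignature)
    {
        lock (_sync)
        {
            DateTime last;
            if (_lastSent.TryGetValue(errorSignature, out last) &&
                DateTime.UtcNow - last < Window)
            {
                return false;   // same error inside the window: skip it
            }
            _lastSent[errorSignature] = DateTime.UtcNow;
            return true;
        }
    }
}
```

In Application_Error, the handler would call ErrorThrottle.ShouldSend() with something like the exception type and message before sending the mail. With that one check in place, the blackout scenario would have produced a handful of emails instead of 56,000.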

I may have broken production at other times, but nothing as dramatic and difficult to recover from as this. If you are a junior person reading this, I can assure you that someday your code will break something. But you are not alone; it happens to all of us. I hope your breaks are minor and less consequential, but I know what you are feeling. Feel it, then take that lesson to heart, and you’ll become a better developer.

Top comments (1)

Rodrigo 👨‍💻🤙

Funny story, David, thanks for sharing it. Well, I guess it’s funny now that you’re looking at it in retrospect.

It demonstrates that no matter how many years we have worked, anyone can make a mistake.

I also want to highlight your professionalism; it’s remarkable. It was a rare event, but you didn’t use it as an excuse, and that adds so much value to the story.