DEV Community

Discussion on: Tell me about a time you messed up

kd2718

The site I was working on for a real estate company was having issues when users requested all photos for a house. The server would fetch all the photos and add them to a zip for download. This was a bad idea from the start, as there would often be over 100 high-quality images.

The worst part was that this was a synchronous task. The users would stare at a blank browser until the request finished or timed out. They had me make this async with Celery (Python). Even though the whole process was bad, we settled on this solution as a "quick fix". Celery was already used in other parts of the site.

I made the changes and deployed them near the end of the day. It worked fine. The next morning I was woken up by an emergency phone call. Most of the site was no longer working.

I had forgotten to disable the download button when a zip job was in the queue. Apparently people were mashing the button expecting the old behavior, and there was a massive number of jobs backed up in the Celery queue. Anything else using Celery was basically broken on the site.
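For anyone wondering what the missing guard could look like, here is a minimal sketch of the idea on the server side: refuse to queue a second zip job for a listing while one is still pending. This is not the code we actually shipped. `PendingJobs`, `try_enqueue`, `mark_done`, and the `"house-42"` id are all made-up names, and the plain list stands in for the real Celery queue so the example runs on its own.

```python
import threading


class PendingJobs:
    """Tracks which zip jobs are already queued (in-memory, single-process sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()

    def try_enqueue(self, listing_id, submit):
        """Call submit(listing_id) only if no job for this listing is pending.

        Returns True if a new job was queued, False if one was already waiting.
        """
        with self._lock:
            if listing_id in self._pending:
                return False
            self._pending.add(listing_id)
        try:
            submit(listing_id)
        except Exception:
            # Release the slot if the hand-off failed, so the user can retry.
            self.mark_done(listing_id)
            raise
        return True

    def mark_done(self, listing_id):
        """Release the slot once the worker finishes (or fails)."""
        with self._lock:
            self._pending.discard(listing_id)


queued = []  # hypothetical stand-in for the real task queue
jobs = PendingJobs()

# A user mashing the download button five times queues only one job.
results = [jobs.try_enqueue("house-42", queued.append) for _ in range(5)]
```

In a real deployment with several web processes, the pending set would presumably have to live somewhere shared (a cache or database) rather than in memory, and the worker would need to call something like `mark_done` when the zip is ready. Disabling the button in the browser is still worth doing, but a server-side check like this is what keeps the queue from filling up.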

I had to use git to roll the site back to the previous version. I felt horrible. I was chewed out by the owners and told that they lost "millions". I guess it wasn't that bad because they kept using us for a while after that...