An Engineer’s Rite of Passage

Molly Struve on January 12, 2019

It is a rite of passage for every engineer to take down production. Whether it be a full blown 500 page being served to all users or breaking ba...
 

My worst production outage was accidentally adding code which redeployed the application upon boot. On this very website. 😄

I added some code in a Rails initializer file which pinged the Heroku API to change a config variable on boot. I didn't really think through the whole thing because every time you change a config variable, the app redeploys and restarts. The code was written in such a way that it only executed in this way in production, so we had not caught it earlier.

Enter the infinite loop.

Nothing we could do would stop the loop. The app just kept redeploying over and over again and nothing would work to stop it. We couldn't push new code, we couldn't figure anything out.

status.heroku.com showed yellow indicating something was going on with the system. That was because of me.

Eventually we figured out we could stop the problem by revoking my account's privileges within the app on Heroku. But shortly after that, Heroku suspended our whole organization account. dev.to was no longer being served.

We got some people on the phone and got the account restored and back online soon enough after that.

That was a day of learning.
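
For anyone wondering how a loop like this could be avoided, here is a minimal sketch in Python rather than the actual Rails initializer. It only simulates the behaviour described above (changing a config variable triggers a restart) and does not call the Heroku API. The names `sync_config_var`, `simulate_boots`, and `FEATURE_FLAG` are made up for illustration. The idea is to write the value only when it differs from what is already set.

```python
def sync_config_var(current, key, desired, update):
    """Call update() only when the stored value differs from the desired one."""
    if current.get(key) == desired:
        return False  # already in the desired state: no write, no restart
    update(key, desired)
    return True


def simulate_boots(guarded, max_boots=10):
    """Count boots in a toy platform where every config change restarts the app."""
    config = {}
    boots = 0
    restart_pending = True  # the initial deploy

    def update(key, value):
        nonlocal restart_pending
        config[key] = value
        restart_pending = True  # assumption: any config write triggers a restart

    while restart_pending and boots < max_boots:
        boots += 1
        restart_pending = False
        if guarded:
            sync_config_var(config, "FEATURE_FLAG", "on", update)
        else:
            update("FEATURE_FLAG", "on")  # unconditional write on every boot
    return boots


print("guarded boots:", simulate_boots(guarded=True))
print("unguarded boots:", simulate_boots(guarded=False))
```

With the guard, the second boot sees the value already in place and stops. Without it, the simulation only ends because of the `max_boots` cap.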

 

Great story! Thanks for sharing @ben ! That was some innovative problem solving to revoke your account privileges to fix the issue. I always marvel at how innovative our team gets with solutions when our backs are against the wall. Feels like the pressure tends to really make us think outside the box to get things done.

 

I just checked and the most recent site-wide outage I caused was back in March 2018. My Slack message at the time read:

we had a ~3 minute period at 9:30 EST when some users might not have been able to access the app or storefronts. It was caused by a bad deploy and has been rectified

IIRC, it was caused by either a missing application key in the production environment or a badly-formatted YAML. I know I've done both.

I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

 

OMG those pesky YAML files! I have definitely had that happen to me before. I added a cron string to one without quotes. Took down our background workers for a few minutes. I immediately put a test in to validate that YAML file and it hasn't happened since. Plus, that test has actually caught a few errors.
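
A test like that can be quite small. The sketch below is not the actual test from the comment above. It is a deliberately narrow, standard-library-only check that flags values starting with a bare `*`, which YAML reads as an alias rather than as a string. The sample file contents and the name `unquoted_star_lines` are assumptions for illustration. Loading the file with a real YAML parser in a test would be the more thorough option.

```python
import re

# Matches "key: value" lines whose value starts with an unquoted "*".
UNQUOTED_STAR = re.compile(r"^\s*[\w-]+:\s+\*")


def unquoted_star_lines(text):
    """Return the 1-based line numbers whose value starts with a bare '*'."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if UNQUOTED_STAR.match(line)
    ]


# Hypothetical schedule files: one with the cron string quoted, one without.
GOOD = 'nightly_report:\n  cron: "*/5 * * * *"\n'
BAD = "nightly_report:\n  cron: */5 * * * *\n"

print(unquoted_star_lines(GOOD))
print(unquoted_star_lines(BAD))
```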

I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

Could not agree more!

 

Great post. More folks should be sharing these kinds of stories. We are all human and we make mistakes. It is going to happen, but we can always learn from them. I am glad to see your boss reacted so well and helped you through it. It's also great to hear they put blame on themselves as well. That is not common for a lot of people who go through situations like this.
Also, great use of GIFs. Especially that last one. It fits perfectly. :)

 

Thank you! I am very lucky to work in such a supportive environment. Some people don't have that and I am hoping others will share some of their stories so that everyone can realize we are all in this together and downtime is just an occupational hazard.

 

I don't have an interesting story to share, but here are some of my general tips for not breaking production. Hope some of these are helpful. I'm sure there are plenty more, feel free to share yours.

  • It starts with writing good code. This can mean many things depending on the person and language, but my general rules are:
    • Have consistent styling or follow your team's style guide. I find this makes it easier to see when something is out of place during development.
    • Keep things simple and clear in your code. When problems arise, you may not be thinking straight. If your code is too confusing and unclear, this may only compound the problem. Other developers may have a difficult time helping get things online if they cannot decipher the code.
  • Test locally, on a dev server, then on production.
    • Do not just run automated tests or test locally, but test on a development server if available.
    • Once your change is deployed, test it on production.
    • Test small changes too.
  • Have someone review your code before going live with it.
    • Have them test it.
    • Make sure they actually review it and don't just give the go ahead.
    • Create some guidelines around this with your team if none exist.
  • Don't write/run queries directly on production.
    • Write and test them locally or on a dev server. After running them in a testing environment, make sure the updated data looks correct in the final product.
    • If it is an update or delete statement, write a select version of the same query first. This will help ensure you are pulling in the correct data. This will also help in the next step.
    • BACK UP THE DATA. If you are unsure how, this can be a simple select statement, copied to a spreadsheet, and uploaded somewhere (as opposed to leaving _temp tables cluttering the DB).
    • Again, have queries reviewed by someone before running them.
    • If you are new, you should not have production database access on your first day. If you are in this situation, seek out senior members of the team to verify and help run queries with you.
  • Do not push changes towards the end of the day or before the weekend.
    • Save yourself the trouble of having to scramble to fix something during your personal time or letting the problem continue while you are out of office.
    • Push things live in the morning while everyone is in the office.
  • Don't beat yourself up over it.
    • Development is hard and every project has a lot of different things to worry about. It happens.
    • Learn from your mistakes and help future developers avoid them as well.
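
The query tips above (select first, back up, then change) can be sketched roughly like this. It is a toy example using Python's built-in `sqlite3` with an in-memory database. The `users` table, its columns, and the `active = 0` condition are all made up for illustration, and a real backup would go to a file rather than an in-memory buffer.

```python
import csv
import io
import sqlite3

# Toy database standing in for production.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO users (email, active) VALUES (?, ?)",
    [("a@example.com", 1), ("b@example.com", 0), ("c@example.com", 0)],
)
conn.commit()

WHERE = "active = 0"  # the same condition is reused for the SELECT and the DELETE

# 1. Run the select version first to see exactly which rows will be affected.
rows = conn.execute(f"SELECT id, email, active FROM users WHERE {WHERE}").fetchall()

# 2. Back up those rows as CSV before touching them.
backup = io.StringIO()
writer = csv.writer(backup)
writer.writerow(["id", "email", "active"])
writer.writerows(rows)

# 3. Delete inside a transaction; raising inside the block rolls it back.
with conn:
    cursor = conn.execute(f"DELETE FROM users WHERE {WHERE}")
    if cursor.rowcount != len(rows):
        raise RuntimeError("DELETE touched a different number of rows than the SELECT")

remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print("previewed:", len(rows), "remaining:", remaining)
```

Comparing the affected row count with the preview is a cheap sanity check that the destructive statement matched the rows you looked at.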
 

Do not push changes towards the end of the day or before the weekend.

My cutoff for the day is 3:30pm! Unless it's an emergency, I won't merge a PR after that.

 

I was very lucky that in my 10 years, I only once turned off a production server via SSH thinking I was on my computer's terminal. The server had IPMI so it was down for about five minutes. Now I appreciate the usefulness of a clear shell prompt.

It really scares me a lot to not have had more major problems in my career. It makes me feel like I am probably over-confident and that once I do screw up, I will screw up big! In my defence, I read a lot of articles about good practices and ALWAYS ensure I have backups.

 

I think the best thing you can do is not be afraid of when/if you make a mistake and it leads to an outage. Rather than fear it, know in the back of your mind that it is part of the job. When it happens, don't let it define you; let it shape you and learn from it.

Also +1 for best practices!!!!

 

I once wrote innocent-looking code to invalidate the cache on module change,
but who knew that this module was being changed in a loop on API calls, and each cache invalidation was making a non-piped UDP call to Redis.

Long story short it took down the system. It wasn't fun...

 

Oooof, Redis is always tricky! We once had an engineer do flushdb on one of our Redis databases to try and fix a bug. The missing caches in the middle of the day caused our site to be unusable for some of our bigger clients, and it was a scramble to get it fixed. We have since put in place some safety features like read-only consoles and alerts for missing caches. As long as you are learning from these experiences then they are not a waste 😊

 

Yep, so the awesome thing with Redis is the ability to pipe commands,
so the solution to my problem was collecting the cache keys to invalidate and then, in a separate call,
making a pipe command to Redis to invalidate them in bulk.

pipe ftw :D
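
A rough sketch of that "collect the keys, then invalidate in bulk" idea, in Python. This is not the original code. `FakeClient` and `FakePipeline` are stand-ins so the example runs without a Redis server, and the key names are invented. The `pipeline()`, `delete()`, and `execute()` method names mirror the redis-py client, so `invalidate_in_bulk` should work with a real client under that assumption.

```python
class FakePipeline:
    """Stand-in for a Redis pipeline: queues deletes, applies them on execute()."""

    def __init__(self, store, stats):
        self.store = store
        self.stats = stats
        self.queued = []

    def delete(self, key):
        self.queued.append(key)  # nothing is sent yet
        return self

    def execute(self):
        self.stats["round_trips"] += 1  # the whole batch goes out in one call
        results = [1 if self.store.pop(key, None) is not None else 0 for key in self.queued]
        self.queued = []
        return results


class FakeClient:
    """Stand-in for a Redis client backed by a plain dict."""

    def __init__(self, data):
        self.store = dict(data)
        self.stats = {"round_trips": 0}

    def pipeline(self):
        return FakePipeline(self.store, self.stats)


def invalidate_in_bulk(client, keys):
    """Delete all given cache keys through a single pipeline; return how many existed."""
    unique = list(dict.fromkeys(keys))  # drop duplicates, keep order
    if not unique:
        return 0
    pipe = client.pipeline()
    for key in unique:
        pipe.delete(key)
    return sum(pipe.execute())


client = FakeClient({"user:1": "a", "user:2": "b", "user:3": "c"})
dirty = ["user:1", "user:2", "user:1"]  # keys collected while handling API calls
removed = invalidate_in_bulk(client, dirty)
print("removed:", removed, "round trips:", client.stats["round_trips"])
```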

 

On my first dev job, I was working on this sales site, and I made a change to the Thanks E-Mail, which gets automatically sent to the customer once they make a purchase. I broke it, and I didn't notice.

So... for, like, 24 hours the mail didn't get sent and customers got confused and started buying things again and again, believing the transaction didn't work because the mail wasn't delivered.

My team had to code a job to resend the failed E-Mails after the template was fixed, and correct the duplicated purchases customers had made.

One of my teammates got really mad, but I didn't hear anything from my manager at the time. Some reassuring words would have been nice. Now I look back and laugh, but at the time it was really awful.

 

Ooof breaking background workers is always rough bc you usually don't notice it right away and when you finally do, you have a mess to clean up 😝 Been there, done that!!!

 

I take down production occasionally. Even last week. It doesn't help that we're so cash-strapped that we can't afford the usual test/production environment. I often debug in production. This is profoundly sub-optimal. Maybe it helps that we're Australian and so really good at doing a lot with a little. Maybe it also helps that we're ever so slightly crazy.

 

Now that is a hell of a first day!

 
 

Your boss sounds like a nice guy! Lucky to have people like that in management! Thanks for sharing :)

 

Another great one from Twitter!

 

ProTip: If you break production and feel bad about causing extra work for others, beer makes everything better 😃

 

This got some great responses on Twitter!

 