Discussion on: An Engineer’s Rite of Passage

I just checked and the most recent site-wide outage I caused was back in March 2018. My Slack message at the time read:

> we had a ~3 minute period at 9:30 EST when some users might not have been able to access the app or storefronts. It was caused by a bad deploy and has been rectified

IIRC, it was caused by either a missing application key in the production environment or a badly formatted YAML file. I know I've done both.

I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

Molly Struve (she/her) • Jan 14 '19

OMG those pesky YAML files! I have definitely had that happen to me before. I added a cron string to one without quotes, and it took down our background workers for a few minutes. I immediately added a test to validate that YAML file, and it hasn't happened since. Plus, that test has actually caught a few errors.
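For anyone curious what a check like that could look like, here is a minimal sketch in Python using only the standard library. It is not the actual test from this story: the function name, the regexes, and the sample config are all hypothetical, and it is a simple line-based lint rather than a real YAML parser. It relies on the fact that an unquoted value beginning with `*` is read by YAML as an alias, so a cron string such as `* * * * *` breaks parsing unless it is quoted.

```python
from __future__ import annotations

import re

# Simplified "key: value" matcher (an assumption for this sketch, not a YAML parser).
_KEY_VALUE = re.compile(r"^\s*(?:-\s+)?[\w.-]+:\s+(?P<value>\S.*)$")
# A value starting with "*" followed by whitespace or "/" looks like a bare cron string.
_BARE_CRON = re.compile(r"^\*[\s/]")


def unquoted_cron_lines(yaml_text: str) -> list[int]:
    """Return 1-based line numbers whose value looks like an unquoted cron string."""
    bad_lines = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        match = _KEY_VALUE.match(line)
        if match and _BARE_CRON.match(match.group("value")):
            bad_lines.append(number)
    return bad_lines


def test_schedule_has_no_unquoted_cron():
    # Hypothetical config; a real test would read the actual file from disk.
    config = 'jobs:\n  report:\n    cron: "*/5 * * * *"\n'
    assert unquoted_cron_lines(config) == []
```

In a real project it is probably simpler and more thorough to load the file with a proper YAML library in the test and fail on any parse error; the lint above only covers this one specific mistake.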

> I've been a professional developer for 20 years so it's not just "newbies" that do this. In fact, if you're always growing and learning then you're always a newbie at something.

Could not agree more!