Gather 'round and let me tell you about that time my teammates and I took down one of the platform's most critical services.
For the sake of narrative, I’ll name my coworkers S (senior) and M (middle), although I changed the dates and some of the technical details in an effort to keep things as anonymous as possible.
T’was a lovely Thursday afternoon at the end of July. Weather was milder than the days before and my teammates and I had just moved into a new space in the office.
Life was mostly good.
The week before had passed without the presence of our most senior dev,
M and I had managed to complete some tasks on our own.
The changes we introduced had some pretty heavy stuff (new database fields, new endpoints) so we decided to wait for S's review before we merged them.
In the meantime another dev reviewed our changes and mentioned that the operations in our migration scripts could be a lot of work for the database server and could keep it locked for a while. So he suggested some alternatives.
If you don’t know what a database migration is, it’s a series of SQL scripts that — when executed sequentially — create the whole database schema from scratch (tables, columns, default values, constraints, etc.). Usually they run on every deployment, but only if there’s something new to apply––that is, a new migration script that has never been run.
We didn’t want to run our change as a migration because then it would be executed by our continuous integration system which would be completely out of our control if something went wrong.
We had to run it manually.
So, we sat down with a replica of the production database and started playing around with queries trying to find the most efficient alternative.
This was a fun while, the 3 of us trying to guess what would happen when we ran each script line separately. Would it take long? Or run instantly? And why? Where should we set the default value to prevent new entries from breaking it?
Now it was 5 PM and we were finally happy with our query. It did everything in the right order and didn’t really lock down for very long. We sent a ticket to the infrastructure team to run it directly on our servers, since obviously we didn't have credentials for that. You’ll see exactly why in just a moment.
Infra was really diligent about this. We know they’re a busy bunch, and it’s okay if they take a while to do stuff. But not this time, oh no, this time the ticket was closed before I could blink twice.
Everything seemed ok. It was almost the end of the day so we split into our own desks and finished some tasks.
Some minutes later, a member of another team came by and asked me if we were changing stuff on that feature because it was throwing some errors. This hit me like an ice-bucket-challenge:
Oh no, I mean, yes… we f***ed it up.
Can you guess what happened? I know it’s harder to find omissions than to find mistakes, but I think you could be able to know what we needed to do (and didn’t) without knowing the service, the feature, or the queries.
If you do, scroll to the bottom and comment it, I’ll trust you’re not peeping below this photo 👇
Alright, if you have no idea what could have happened, that’s OK too. I feel I should have known, because this happened locally during development like a million times. And I had been starting a very similar feature on my own that day and dealing with the same error.
Ready for it?
We added the column in the database, but we never merged the code that handled this new data. This resulted in the application not knowing WTF to do with those fields, and throwing errors on all GET requests 😱
Now, this might not have happened in some languages, which probably would have discarded the extra, unparsed data and moved on. I shouldn't disclose any more information about this, but I’ll say that we were using a technology that’s not as chill about rogue database columns, it demands that you tell it how to handle everything it receives.
Not merging the code for the new fields wasn’t the problem though. Here’s where our little evil villain comes in: meet
* 😈 malefic, echoing laughter
See, all of our retrieval queries were using the
* (known as an asterisk or star), which means results will include every field on the table. This was the day I learned this is a very very very bad idea.
If we had had a list of the fields to retrieve instead of
* nothing bad would have happened. But we didn't. And all 3 of us knew it. It just slipped our minds 😰
OK but, we’re humans, right? We make mistakes. No-one can keep all of the codebases in a company in their working memory. This is why we have tests. Not because our code is fragile and error prone, but because our minds are. And it’s our minds that create the code.
So the problem wasn’t really that we broke it. The problem was that no alarms went off.
We had production tests, but they weren’t notifying anyone. We had error alarms in the logs, but they weren’t monitoring the thing that caused the most errors. We had end-to-end tests, but they were disabled and marked as “broken”. This is what was truly unacceptable 🍋
Ultimately, though, it was a big opportunity for learning: you never learn as hard as when you f*** up. That’s a lesson that never leaves you.
So, for you children out there, here’s my list of learnings from that day:
- NEVER, I MEAN NEVER use * in a query in your code. Even if you have to write one hundred column names, it's better like that.
- Be scared of database changes, may fear guide you into caution.
- Migrations should be decoupled from deploys, in order to provide better control over them.
- When you introduce a database change, make sure all the code is already in place and that it can handle the previous db state, and the one after the change.
- Have happy-path tests in place running on production for your critical features, working and notifying the right people.
- Test database changes as soon as they are impacted and monitor the logs for a couple of minutes later (depending on the amount of traffic that you get).
- Aaand brush your teeth before you go to sleep 😬
This is a story told by the less experienced and knowledgeable of its characters. If you have any input about it, to add to or correct what I say I would really enjoy reading it, but please be nice about it. 🙂