We learnt it the hard way. 👇
Last Friday afternoon, we bumped the Rails version from 5.2.3 to 5.2.5. We had to do this because one of Rails' dependencies had been yanked very recently, and Rails quickly released a patch update to address the issue.
We meticulously went through the changelog because we didn't want this to break our system, especially on a Friday afternoon. (We have a history of Friday deployments causing outages :D)
The changelog said we were good to go.
We went ahead and prepared for the release.
Things looked alright on our staging (test) environment, so we deployed to prod. Our production deployment pipeline failed. (Eyebrows were raised at this moment.)
Until that point, deployments had been going through without any hassle because Docker had been using a cached copy of the yanked gem. Now that a build had failed, the cache was gone, and we could no longer push any further deployments without fixing this. 🤦‍♂️
Funnily, the release also added a log statement to one of our background processes, so I only checked that the latest code was there on the new background job pod. What I didn't notice was that the application pods were crashing. Our sanity check had effectively been done on the previous release. 🤨
I made a small change to the buggy release and pushed it to staging, and this time we caught the issue there.
Application pods weren't coming up because their health checks were failing. (We did not have any health checks for our background job pods.)
That left us in a bad spot:
- No further deployments could be pushed to production.
- Our staging was down.
Shoot! Almost everyone was blocked in one way or the other.
At first, I thought this was an infra issue. But I soon realised the health check request wasn't even going through. The API was broken. 🤯
We were using a gem called Grape for our APIs, and the health checks were going through that API.
Yes, Grape broke!
Wait, a patch update of Rails that had almost nothing in the changelog broke Grape? YES!!!
The culprit was Rack: it had been bumped from 2.0.7 to 2.2.3. (We missed this because a lot of dependent gems got updated at the same time.)
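A jump like this is easy to miss by eye when many gems move at once, so one option is to compare the lockfile before and after the upgrade and flag every gem whose major or minor version changed. This is only a minimal sketch in plain Ruby, not something we ran at the time. `locked_versions`, `risky_bumps`, and the two lockfile excerpts are hypothetical names and data made up for illustration. Only the Rails and Rack version numbers come from this post.

```ruby
# Extract { gem_name => version } from the text of a Gemfile.lock.
# Resolved gems sit at exactly four spaces of indentation, e.g. "    rack (2.0.7)".
# Their own dependencies are indented six spaces, so the pattern skips them.
def locked_versions(lockfile_text)
  lockfile_text.scan(/^ {4}(\S+) \(([^)]+)\)$/).to_h
end

# Return { gem_name => [old_version, new_version] } for every gem whose
# major or minor component changed between the two lockfiles.
# Simplification: versions are compared as dot-separated strings.
def risky_bumps(old_text, new_text)
  old_versions = locked_versions(old_text)
  locked_versions(new_text).each_with_object({}) do |(name, new_v), out|
    old_v = old_versions[name]
    next if old_v.nil? || old_v == new_v

    changed = old_v.split(".").first(2) != new_v.split(".").first(2)
    out[name] = [old_v, new_v] if changed
  end
end

# Hypothetical, heavily trimmed lockfile excerpts (illustration only).
OLD_LOCK = <<~LOCK
  GEM
    specs:
      rack (2.0.7)
      rails (5.2.3)
LOCK

NEW_LOCK = <<~LOCK
  GEM
    specs:
      rack (2.2.3)
      rails (5.2.5)
LOCK

risky_bumps(OLD_LOCK, NEW_LOCK).each do |name, (from, to)|
  puts "#{name}: #{from} -> #{to}"
end
# prints "rack: 2.0.7 -> 2.2.3"
```

The Rails patch bump itself stays quiet here, while the transitive Rack jump stands out, which is the part we missed.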
Rack is the layer that sits between the web server and the application and hands each request to either Grape or the Rails API. The response it passes along had changed (god knows why) and Grape wasn't yet ready for this. The cascading effect was that all the Grape APIs were failing, including the health checks, and our system was down.
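To make the cascade concrete, here is a simplified sketch of that dispatch in plain Ruby, with no gems involved. It relies on one thing I am sure of: a Rack application is any object that responds to `call(env)` and returns a `[status, headers, body]` triple. Everything else is an assumption for illustration: `GRAPE_API`, `RAILS_APP`, `BROKEN_API`, `dispatch`, `healthy?`, and the `/api/health` path are hypothetical stand-ins, not our real setup or Grape's actual code.

```ruby
# Hypothetical stand-ins for the two applications behind Rack.
GRAPE_API = ->(_env) { [200, { "content-type" => "application/json" }, ['{"status":"ok"}']] }
RAILS_APP = ->(_env) { [200, { "content-type" => "text/html" }, ["<h1>Home</h1>"]] }

# Stand-in for an API layer that blows up on every request after an
# incompatible upgrade.
BROKEN_API = ->(_env) { raise "incompatible response handling" }

# Route by path prefix: /api/* goes to the Grape-style app, the rest to Rails.
def dispatch(env)
  app = env["PATH_INFO"].start_with?("/api/") ? GRAPE_API : RAILS_APP
  app.call(env)
end

# Smoke check: the release counts as healthy only if the health endpoint
# answers 200. Any exception raised by the app counts as unhealthy.
def healthy?(app, path: "/api/health")
  env = { "REQUEST_METHOD" => "GET", "PATH_INFO" => path }
  status, _headers, _body = app.call(env)
  status == 200
rescue StandardError
  false
end

puts healthy?(method(:dispatch)) # prints "true"
puts healthy?(BROKEN_API)        # prints "false"
```

Because the health check travels the same path as every other API request, a break in that layer fails the check too, and the pods never come up.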
We now had no other option but to update grape to the latest version and hope that it fixes this issue!
Thankfully, it fixed the issue.
Had it not fixed the issue, we would have been forced to move all the APIs out of grape to rails API.
Just the thought of this made me claustrophobic, because it would have not only ruined my Friday night but also consumed my weekend!
Lucky escape indeed!
Though it was pretty scary when it happened, I would take this lesson any day.
Lessons learnt ✅ fortunately without any major outage. 🤞
PS: This post was originally tweeted as a thread here.