If you were hanging out here on DEV this past Tuesday night(1/7/2020) you might have run into a couple of minor interruptions during your web browsing experience š¬. Hopefully, not a lot of you since we heavily cache on the edge using Fastly which greatly minimizes the impact when, you know, we accidentally take down a couple of servers...
Wrong #1
It all started Tuesday early afternoon when Ben and I decided that we would rollout a branch that would switch us to using a new SiteConfig setup instead of ENV variables for a bunch of values we wanted to be more configurable. We decided to roll it out at the end of the day around 4:45pm.
As soon as the new code hit the servers I tried to fire up a console to check on things. Much to my dismay, the console couldn't even start up because the app was crashing due to this error You cannot use settings before Rails initialize.
I knew immediately that we needed to rollback so Ben and I quickly issued the rollback command to Heroku and waited for the servers to all rollback. Once the app was back up we got to work trying to figure out what went wrong. The first thing we looked at were the error backtraces and we figured out that the problem was with a SiteConfig value in one of our model validations.
validates :slug,
...
exclusion: { in: ReservedWords.all,
message: "%<value>s is a reserved word. Contact #{SiteConfig.default_site_email} for help registering your organization." }
This led us to Learning number 1 for the night.
Learning #1
Rails loads validations only once and ahead of everything else when it is booting up. For this reason, theĀ Rails docs come with this handy warning:
Note that the validator will be initialized only once for the whole application life cycle, and not on each validation run, so be careful about using instance variables inside it.
Needless to say the SiteConfig would not work in the validation because Rails was not completely loaded at the time the validation was being evaluated. The reason this didn't show up during testing was because it was triggered by having Rails eager loading enabled. This was a pretty easy thing to miss since we donāt use eager loading in our development or test environments, more on this later.
Ben and I did some investigating into the settings gem and found the error that we were seeing as well as the fix. The fix was to make SiteConfig.default_site_email
readonly according to the gem docs. We flipped on eager loading in our development environments and tested out the fix. While the app booted up without an issue, we quickly realized that it would break our ability to update any of our SiteConfig values.
Wrong #2
Given this fact, Ben suggested we go with the easy option and remove the SiteConfig from the organization model validation. That seemed like a no brainer and we were eager to be done with this so I removed the SiteConfig from the validation and we merged the new code.
Once again, the code hit production and we attempted to open up a Heroku console to make sure everything was working. However, just like before, the console wouldnāt boot up because Rails kept crashing with almost the exact same error we had seen before.
Yet again, we immediately rolled back the code and started looking at the error which was basically the same as before, except this time it was coming from one of our mailers which was trying to use the SiteConfig to set the default :from
field. This leads us to learning number 2.
Learning #2
Test even the simplest of fixes. When we were trying to fix the validation issue we tested what would happen when we set the SiteConfig to read-only. The fix was a success and everything worked. However, at the last minute, when we found out the read-only fix prevented us from updating the config variables we switched to what we thought was an easy solution. Because of the simplicity of the solution neither of us thought to attempt to reload the app with eager loading on AFTER we had removed the SiteConfig from the validation.
When you are working to fix a problem a lot of times you will feel the pressure to push a fix out quickly. And it might not even be from other people. When we were trying to fix this issue it was almost 7 pm at night and all Ben and I wanted to do was finish work and log off for the evening. The pressure to just be done coupled with the simplicity of the fix made us skip right over the crucial testing step. No matter how simple your fix is or what kind of pressure you are under to fix it you HAVE to test it. While it may mean a smidge more time now, it will likely save you from another 30 min of deploying and rolling back if your fix was indeed wrong. Ok, continuing on with the story!
3rd Times a Charm
After issuing another rollback we found 3 more places where Rails was attempting to use the SiteConfig variable before Rails was initialized. Our chosen fix this time was to revert to using the old ENV variables and then sort out the long term solution in the morning. You can bet after round two of taking down the site Ben and I were both testing the change this time.
Once we both confirmed it worked with eager_loading turned on we merged the change and anxiously waited. Finally this time the new code went out without a hitch. Afterward, we took a few minutes to debrief on what we had learned and how we could prevent this from happening in the future.
Config issues are easy to miss especially when you run your production app with a slightly different configuration than development and test. I immediately started trying to figure out if there was a way we could boot up the app in production prior to Heroku completing the deploy.
Learning #3
Enter Herokuās release phase! Heroku has a release phase that happens before an app is deployed that allows you to run basically any script you want. If that script succeeds the deploy finishes. If that script fails, Heroku will halt the deploy and alert you of the failure.
We were already using the release phase to run migrations so I thought why not use it to ātest boot upā the app? I ended up writing a simple bash script that would run the migrations and then issue a simple rails runner
command.
#!/bin/bash
STATEMENT_TIMEOUT=180000 bundle exec rails db:migrate && rails runner "puts 'app load success'"
The Rails runner has to load the entire Rails app in order to execute the command so its the perfect way to test that a configuration issue wonāt cause an un-bootable app to get pushed out to production.
Final Thoughts
Before Ben and I logged off for the night we wrote up a quick summary for the rest of the team and dropped it in Slack so those in the morning would understand what had happened and why the site monitoring alerts had gone off.
All in all, it was definitely not one of our proudest moments but I couldnāt help but feel very satisfied afterward. Despite the downtime, we now have a process in place to prevent this kind of issue from taking us down in the future. As an SRE, the most ideal outcome after an incident is coming up with a way to prevent it from happening again. In this regard, our new pre-deploy script is a huge win.
This incident was also a good reminder to slow down. When you are whipping out lots of little code changes here and there it can be easy to get a little too relaxed. An incident like this is a great reminder that no matter how innocuous the code you are pushing out may seem, remain diligent and test it first!
To those who were interrupted Tuesday night, I am sorry. But I wanted to write this post so you would all know that we did not waste this incident but rather we learned a lot from it!
Top comments (11)
Seeing this cover image in my home feed gives me a panic attack
lol I also had a small heart attack when I saw it for the first time š Maybe could have gone with something more subtle...
Nice fix - maybe you could go one step further and run a small E2E test suite in the same phase as the migrations and the new thing? Even just calling a āhealth checkā endpoint that pings the app and its dependencies such as DB, cache instance etc to make sure all is well would probably save you future headaches :)
Nice article, but I think learning #1 should be: don't rollout at 4:45pm :D
That's probably true, but technically DEV is an asynchronous and international organization, so specific times don't always matter.
Searched for āsiteconfig, railsā not disappointed š
hello ma'am question, so from development you push or merge directly to production? no staging environments?
Correct, we do not have a staging environment. In the past, it has not gotten a lot of use so it was hard to keep it up to date. In the future, we may revisit having one.
ohh. We have a staging environment, only for internal team can use it and invited users. before it goes to production it will be push to our staging then after 2 weeks with no error it'll be automatically pushed to production. hehehe. just wanted to share. maybe you can share how you do it next time. thank you maam.
This seems like there is no staging enviroment and you are jumping fromndev to prod
Were you ordering pizza's and calling friends and family that you will be busy for the day? š