Deploying to Production 101: deploying safely and putting out fires

#webdev #devops

What can make deploying to production scary? Perhaps it’s your first time going through the process. Or you’re deploying to a service with high user traffic. Or your pull request introduces potentially risky changes. Check, check, check: all three scenarios applied to my first major deploy. I was also hit by Chaos Monkey during it (more info on chaos engineering).

Things went sideways during my first major deploy to production, but from the experience, I learned how to conduct better deploys and gained a stronger understanding of what and how to monitor.

What is "deploying"?

Your API or frontend service running on localhost:8080 or localhost:3000 can’t be accessed by others who go to that URL from their own computers, because localhost always points to the machine making the request. The service lives on your computer and can be accessed only by you.

Generally speaking, deploying is the process of taking the application running in your local development environment and making it publicly available on a remote host, so that other users with the URL can access it. The deployment process is continuous: as you make updates to your application locally, you deploy the new code to make it publicly available.

How to deploy?

Deployment processes vary.

I deployed my Flatiron School final project, built with Ruby on Rails and React, to Heroku with a few terminal commands.

At work, my deploys are a multi-step process in which I release new code in stages and to different environments, as opposed to releasing it fully to production right away. The environments/steps are:

Staging: A sandbox environment that is nearly identical to production. In this step, the new code is tested and monitored to ensure everything is working in harmony under a production-like environment.
Canary: A single production node. In this step, the new code is released to a small percentage of real users and monitored for any issues that might not have been caught earlier.
More subsets of production: In this step, the new code is released to more production nodes but not yet all of production. And (as you've guessed by now) it is monitored, monitored, monitored before being released to 100% of production.
Production 🎉

The multi-step process is great because it’s a series of gates that catch bugs before the new code is put in front of all users. The deploy doesn’t run on its own, however. At each step, engineers decide whether the new code passes on to the next one, so the success of a deployment process like this depends on understanding how to monitor and on following best practices.
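To make the "series of gates" idea concrete, here is a minimal sketch in Python. The stage names follow the list above; `run_rollout`, `RolloutResult`, and the `deploy` and `approve` callbacks are hypothetical names made up for illustration, not part of any real deployment tool or of my team's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Stage names follow the article; everything else is a hypothetical sketch.
STAGES = ["staging", "canary", "production-subset", "production"]


@dataclass
class RolloutResult:
    completed: List[str]      # stages that were deployed and approved
    halted_at: Optional[str]  # stage where approval was withheld, if any


def run_rollout(
    stages: List[str],
    deploy: Callable[[str], None],
    approve: Callable[[str], bool],
) -> RolloutResult:
    """Deploy stage by stage, stopping at the first stage that is not approved."""
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        # The gate: an engineer monitors this stage and decides whether to continue.
        if not approve(stage):
            return RolloutResult(completed, halted_at=stage)
        completed.append(stage)
    return RolloutResult(completed, halted_at=None)


# Example: an engineer spots a problem on canary and withholds approval.
deployed: List[str] = []
result = run_rollout(
    STAGES,
    deploy=deployed.append,
    approve=lambda stage: stage != "canary",
)
print(result)
```

In this example the new code reaches staging and canary but never touches the rest of production, which is the whole point of the gates.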

Best Practices

From my first major deploy that went awry, I learned the key things to practice in future deploys. Here’s my advice, which hopefully can be applied to your process.

(1) Let the new code “bake” longer in each deployment phase

At work, teams define how long you must wait on staging, canary, etc. before moving on to the next step. We call this wait period “baking”: you watch for any errors caused by, or alarms set off by, the new changes you’ve introduced to the application. Once the bake period has passed and you’ve verified that all is well, you can proceed to the next deployment phase. I like to wait an additional 10-15 minutes in each step for extra monitoring and QAing, so I feel confident that the new changes are safe to roll out to production.
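As a rough sketch, the "bake, then add a buffer" rule could be written as a small check like the one below. The function name `ready_to_promote`, the 30-minute bake period, and the alarm count parameter are assumptions for illustration; your team's bake times and alarm sources will differ.

```python
from datetime import datetime, timedelta

# Assumed values for illustration only; teams define their own bake periods.
BAKE_PERIOD = timedelta(minutes=30)
EXTRA_BUFFER = timedelta(minutes=10)  # the extra 10-15 minutes I like to add


def ready_to_promote(
    deployed_at: datetime,
    now: datetime,
    bake_period: timedelta,
    extra_buffer: timedelta = EXTRA_BUFFER,
    active_alarms: int = 0,
) -> bool:
    """Return True only if the code has baked long enough and no alarms are firing."""
    baked_long_enough = now - deployed_at >= bake_period + extra_buffer
    return baked_long_enough and active_alarms == 0


deployed_at = datetime(2024, 1, 1, 12, 0)

# 30 minutes in: the team's bake period has passed, but not the extra buffer.
print(ready_to_promote(deployed_at, datetime(2024, 1, 1, 12, 30), BAKE_PERIOD))

# 40 minutes in with no alarms: safe to move to the next phase.
print(ready_to_promote(deployed_at, datetime(2024, 1, 1, 12, 40), BAKE_PERIOD))
```

A check like this only covers the waiting part. The verifying part still needs a person looking at dashboards and logs.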

(2) Monitor, monitor, monitor, and then monitor some more

A successful deploy hinges on thorough monitoring, and it starts with understanding how the code you're going to deploy interacts with the rest of the application and what kinds of errors or issues you want to be on the lookout for. It’s also important to get comfortable with your team’s monitoring tools and understand what purpose each one serves. For example, the tools I use during deploys fall into two investigation modes: is something broken, and how is it broken? For the former, I look to Grafana for a big-picture overview of whether the system is healthy and of changes to key stats (e.g. the number of 5xx responses). For investigating how something is broken, I use Kibana to look at individual logs and run queries (e.g. filtering logs by whether they are classified as a warning, error, or critical) to trace where in the application the issue is happening.
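The two investigation modes can be illustrated with plain Python. This is not how Grafana or Kibana work internally and it uses neither tool's API; the function names and the shape of the log entries (dictionaries with `level` and `message` keys) are assumptions for the sake of the example.

```python
from typing import Dict, Iterable, List

# Log levels worth a closer look during a deploy (from the article's example).
PROBLEM_LEVELS = {"warning", "error", "critical"}


def server_error_rate(status_codes: Iterable[int]) -> float:
    """Mode 1, "is something broken?": the share of responses that are 5xx."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


def problem_logs(logs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Mode 2, "how is it broken?": keep only warning, error, and critical logs."""
    return [
        entry for entry in logs
        if entry.get("level", "").lower() in PROBLEM_LEVELS
    ]


# Hypothetical sample data standing in for real metrics and logs.
sample_codes = [200, 200, 500, 503]
sample_logs = [
    {"level": "info", "message": "request served"},
    {"level": "ERROR", "message": "database connection refused"},
    {"level": "warning", "message": "slow response"},
]

print(server_error_rate(sample_codes))
print(problem_logs(sample_logs))
```

The first function gives you a single number to watch for spikes. The second narrows a pile of logs down to the entries that can tell you where the problem is.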

(3) Actively communicate with teammates when there is an issue

Seeing your deploy go on 🔥 can be scary, but everyone has gone through or will go through a problematic deploy. Alert your team if something goes wrong, and include any helpful information in the message, like links to graphs or logs that show what and where the issue is. If you’re unsure what to do next, get confirmation: should you pause your deploy? Roll back? An application is owned by a team, and everyone shares the responsibility to keep it healthy and running.

Anything can happen during a deploy, whether or not it is related to your code changes, which is why I find the process intimidating. But learning how to make the deployment process itself a safeguard has made me feel better equipped to conduct deploys successfully.