It's happened to every back-end developer at some point, especially if you've had to work with deployments directly. You've made some changes to your APIs or updated some functionality in a monolith and you're ready to deploy. All of your unit tests pass, you get through integration testing, and even make it all the way through QA. When it's time to push your changes to production, you don't see any problems, so you kick off the CI/CD process and let it do its thing. Then the worst-case scenario happens.
Somehow this weird, half-built deploy makes it into production and users start having problems. In many cases, our software is the product that drives profit for our companies, so if production is down, the company is losing a lot of money. Now it's up to you and your ragtag team to figure out what the heck happened and how you can get prod back up. This is a little checklist I like to run through when something happens with production and I don't know why. Hopefully it's useful for you. 😊
The most important thing is getting production back up and running. If you have a copy of the build from before your deploy, go ahead and get that back up. Your changes won't be there, but it will be a stable copy of production that you know works. This is why having a CI/CD pipeline is so great. Even if production does go down after a deploy, you can still get it back up with a few clicks.
If for some reason you don't have a recent backup available, you can try deploying from a different branch, assuming you haven't merged everything yet. If that doesn't work, you're getting into even shakier territory. Check whether you or one of the other developers has a local branch that runs in production. Usually someone will have a good copy sitting around, but hopefully it doesn't get to this point. There are other methods you can use to get prod back up, so treat some of these as last-ditch efforts.
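If your deploys are tagged in git, the quickest rollback is often just pointing the pipeline back at the previous tag. Here's a minimal sketch, assuming releases get sortable tag names like `release-1`, `release-2`, and that a hypothetical `trigger-deploy.sh` script kicks off your pipeline — adjust both to your own setup:

```shell
#!/bin/sh
# Roll back to the last known-good release tag.
# Assumes sortable release tags (release-1, release-2, ...) -- adjust
# the tag scheme and the deploy command to match your own pipeline.

# The newest tag is the broken deploy, so grab the second-newest one.
LAST_GOOD=$(git tag --sort=-v:refname | sed -n '2p')

git checkout "$LAST_GOOD"
echo "Rolling back to $LAST_GOOD"
# ./trigger-deploy.sh   # hypothetical: kick off your CI/CD pipeline here
```

Even if your pipeline has a one-click rollback button, it's worth knowing which commit that button actually deploys.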
Once you have production back up, look into what exactly went wrong. Our web applications have a lot of moving parts, so it can take a long time to track down problems blindly. Get users to write reproduction steps, check your logs, and look at performance measurements from around the time of the issue. There are a lot of ways to get to the root of the issue with relative speed.
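When you're digging through logs, narrowing to the incident window first saves a lot of scrolling. A rough sketch, assuming your app writes timestamped lines to a log file — the path and timestamp format here are made up, so swap in your own:

```shell
#!/bin/sh
# Pull error-ish lines from the reported incident window and count
# how often each message shows up. The log path and timestamp format
# are assumptions -- match them to your own logging setup.

LOG=/var/log/myapp.log      # hypothetical log location
WINDOW="2024-05-01T14:3"    # matches 14:30-14:39, the window users reported

grep "$WINDOW" "$LOG" \
  | grep -iE "error|exception|timeout" \
  | cut -d' ' -f2- \
  | sort | uniq -c | sort -rn | head
```

The `cut` strips the timestamp so identical messages collapse together, and the most frequent error usually points at the culprit.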
Focus on what went wrong instead of trying to figure out who to blame. In the heat of the moment it might feel like an important question, but not for the reasons you might think. The only concern right now is figuring out what the problem was, not who caused it. You'll probably track it down to a weird pull request that got through or something unusual going on in the production environment. So keep in mind that the blame game doesn't help with debugging.
All of those little things you notice while you're debugging are important to document. Nothing is worse than seeing something you need to fix later and then forgetting about it. Keep a running list of stuff you encounter. In the moment it doesn't have to be any kind of formal system. You can jot down notes or stick a bunch of sticky notes on your screen. Pick something that won't interrupt your debugging flow very much.
The problems you need to track also include issues you had with getting production back up. Pay attention to how quickly you could get a working version deployed and available to users again. If you don't have a CI/CD pipeline in place, this is your golden opportunity to show why you need one. When you have a record of everything you encountered during this outage, you have more information to build on that will prevent this and other issues from happening. This is going to be one of your best opportunities to get real debugging information from production, so take as much as you can from it.
Depending on how critical the changes you tried to deploy are, you might need to get ready to try and deploy again. This is when you need to go through the most methodical debugging you can. Now you're trying to see if the problem is a difference in the production environment, if the problem is on the front end or the back end, or if the problem has to do with a third-party API. Once you've tracked down the root cause of the issue and fixed it, test the crap out of it.
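If you're not sure which change broke things, `git bisect` is one methodical way to find the exact commit, assuming you can write a quick script (here a hypothetical `smoke-test.sh`) that exits non-zero when it hits the broken behavior:

```shell
#!/bin/sh
# Binary-search the history between the last good release and the broken
# one. The tag names and smoke-test.sh are assumptions -- use whatever
# revisions and check fit your situation.

git bisect start release-bad release-good   # bad revision first, then good
git bisect run ./smoke-test.sh              # git checks out and tests commits for you
git bisect reset                            # return to the branch you started on
```

At the end, git prints which commit is the first bad one, which is usually a much shorter list of suspects than the whole deploy.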
Deploy it to every environment you have available and see if it still works as expected. After you have gone through several rounds of testing involving different people, make sure you all are confident the problem has been resolved. This is the point when you can set a time to try and deploy to production again. You already have a restore process in place from the original problem so you know how to handle any issues.
These are just a few things you can go through when you're dealing with production issues. It's always stressful trying to get things in working order again. It helps when you have a checklist of things to look over whenever an issue like this arises. There's a chance it could be something you can pin down in an hour if you go through a consistent debugging session.
Are there particular things you look for when fixing prod? I'll take a look at stuff like CPU usage and logs to clear some easy things first. Then it's time to jump down the async rabbit hole or something. Do you have your own checklist for debugging prod?
Hey! You should follow me on Twitter because reasons: https://twitter.com/FlippedCoding