In this series I’m going to explain how to set up your workspace to accomplish Infrastructure as Code with Terraform, Kubernetes and Helm. This setup is based on my real-world experience as a DevOps Engineer working with these techniques for over 3 years.
Concepts that this series will cover:
- Disaster Recovery and Infrastructure as Code
- Setting up a remote workspace
- File structure
- Storing secrets
- Setting up a Terraform project
- Deploying applications with Helm
- Backup and restore process
In this episode I’ll tell you some things you need to know about Disaster Recovery Plans and Infrastructure as Code. Disaster Recovery (DR) is the process of bringing your application back online and (at least partly) functional, in any way possible, after a major outage. It is good to have a plan for that. Infrastructure as Code (IaC), on the other hand, ensures that the current state of the infrastructure is written down in code, which helps a lot during a DR event.
A DR Plan and IaC have one thing in common: both reduce the mean time to repair during an outage. Outages may be avoidable in theory, but they still happen, even to the best of us. Therefore you should focus not only on how to prevent an outage, but also on how to reduce the time it takes to repair it or to go back to the previous working state.
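To make the metric concrete, here is a minimal sketch of how mean time to repair could be computed from an incident log. The timestamps and the `mean_time_to_repair` helper are made up for illustration; in practice the numbers would come from your own monitoring or incident tracker.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage detected, service restored).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 45)),     # 45 minutes
    (datetime(2024, 2, 11, 14, 30), datetime(2024, 2, 11, 16, 0)),  # 90 minutes
    (datetime(2024, 3, 2, 22, 15), datetime(2024, 3, 2, 22, 30)),   # 15 minutes
]


def mean_time_to_repair(incidents):
    """Average duration between detection and restoration."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


mttr = mean_time_to_repair(incidents)
print(mttr)  # 0:50:00
```

Tracking this number over time shows whether the DR Plan and IaC are actually paying off.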
In order to reduce the mean time to repair, it is important to have a clear overview of the changes that are made to the infrastructure and to the applications running on it. Next to server overloads, changes are among the leading causes of outages. Hence the phrase: “Version everything!”.
Git is a useful and easy tool to track changes. To use Git for this, you’ll first need to manage your infrastructure as code. There are several tools that are really helpful in accomplishing IaC, like Terraform and Helm. I’ll dig deeper into these tools in one of the next episodes.
I’ve read that some Ops teams use Continuous Deployment (CD) for deploying infrastructure changes. While this sounds like a good idea, there are some drawbacks to it.
The first is that you are not hands-on when things start to break, and the changes being deployed are no longer at the top of your mind. This will eventually increase the mean time to repair the disruption. On top of that, how are you going to do complex maintenance, such as a database migration, through CD?
My personal preference is to always be at the controls when deploying something, so that if it goes wrong you have every option available to resolve it quickly.
Another thing that IaC solves is inconsistency throughout the infrastructure. In my experience, I’ve often seen two servers that should be identical become inconsistent over time.
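As a rough illustration of that kind of inconsistency, the sketch below compares the settings of two servers that are supposed to be identical and reports every difference. The server names, settings and the `find_drift` helper are hypothetical, and this is not how Terraform or Helm work internally; it only shows the idea of comparing what you have against what you expect.

```python
def find_drift(server_a, server_b):
    """Return {setting: (value_a, value_b)} for every setting that differs.

    A setting missing on one side is reported with None for that side.
    """
    keys = server_a.keys() | server_b.keys()
    return {
        key: (server_a.get(key), server_b.get(key))
        for key in sorted(keys)
        if server_a.get(key) != server_b.get(key)
    }


# Hypothetical settings of two web servers that started out identical.
web1 = {"nginx": "1.24.0", "openssl": "3.0.13", "max_connections": 1024}
web2 = {"nginx": "1.24.0", "openssl": "3.0.11", "max_connections": 2048, "debug": True}

drift = find_drift(web1, web2)
for setting, (a, b) in drift.items():
    print(f"{setting}: web1={a!r} web2={b!r}")
```

When both servers are created from the same versioned definition, there is a single source of truth to compare against, and differences like these have nowhere to hide.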
An important thing a DR Plan provides is a clear process for restoring backups. Making backups is usually automated, so you gain experience with it over time whenever it breaks and you have to fix it. The restore process, however, is one you will hopefully never use. Nevertheless, it should be clear how it is done, because when you do have to use it, you won’t have much time to figure it out.
In the end, there is one thing that is really important when doing Ops. In modern infrastructures a lot of changes happen every day, so a mistake that causes a disruption is not rare. Mistakes are inevitable; what matters is how you respond to them. Therefore, focus on the mean time to repair.
In the next episode I'll talk about how to set up your workspace to get started with IaC. Stay tuned!