Everyone has heard of Netflix's famed Chaos Monkey and is familiar with the general concept of "Chaos Engineering". In terms of Site Reliability, Fault Tolerance, and High Availability, Netflix is at the top, with an application and environment that could be considered beyond the reach of "mere mortals". However, there are small steps you can take as part of your DevOps development process to begin your journey towards more resilient applications.
A few weeks ago I accidentally destroyed a development environment. I'd made some manual changes to an AKS Cluster while familiarising myself with a feature. I'd updated the Terraform template to match the manual changes I'd made, then ran it with `-auto-approve`. I watched as it determined that the state didn't match and that resource re-creation was required. I hit `ctrl+c` a second too late, and the AKS Cluster started to delete itself. For those unfamiliar with Kubernetes, there are two parts to orchestrating a Cluster: creating the cluster and underlying infrastructure, and configuring and deploying services to the cluster. The reason this was so immediately upsetting was that I'd done a lot of work getting the `nginx-ingress` deployments working on my cluster, and the prospect of going through the whole process again was daunting. As I stared at the screen, I got a message on Slack: "when will the environment be back up, we need to begin implementing Application Insights."
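In hindsight, the replacement was visible in the plan before anything was deleted. The following is a minimal sketch, not the workflow I was using at the time: it assumes the plan has been saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan`, and it lists every resource whose planned actions include a delete. The resource name `azurerm_kubernetes_cluster.dev` and the trimmed sample plan are made up for illustration.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return the addresses of resources the plan would delete or replace."""
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create.
        if "delete" in actions:
            flagged.append(resource["address"])
    return flagged


# Hypothetical, heavily trimmed example of the JSON plan output.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "azurerm_resource_group.dev",
     "change": {"actions": ["no-op"]}},
    {"address": "azurerm_kubernetes_cluster.dev",
     "change": {"actions": ["delete", "create"]}}
  ]
}
""")

if __name__ == "__main__":
    for address in destructive_changes(sample_plan):
        print(f"WARNING: plan would destroy {address}")
```

A check like this could sit in front of any automated apply and refuse to continue when the list is not empty, which is considerably more reliable than my reflexes.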
Luckily, I've been in the DevOps game for a few years, and while I hadn't committed the Kubernetes configuration to git yet, I'd saved it all as I was working through the process. I responded on Slack that there had been a few issues and that it would be an hour. I rolled up my sleeves and waited for the AKS Cluster to re-create. After twenty minutes, I was looking at a newly formed Kubernetes cluster, with the cluster prompting me to download the `kube-config` file. I connected to the cluster, created the `nginx-ingress` service, and configured `cert-manager`. The namespaces appeared and the pods seemed to be running; so far so good. I deployed the mock middleware server and waited for the pods to deploy, the ingress rules to be created, and the SSL certificate to be issued. After five minutes, I browsed to the address and it was working, SSL and all. Finally, I deployed the application and it just worked! I actually couldn't believe it. The last time I'd deleted something in production, I spent two days rebuilding it (that was about six years ago, go easy on me).
This thankfully short - yet stressful - exercise created a paradigm shift within me: before I deploy anything and hand it off, I need to delete it first. If I can't re-create it, then it isn't ready for use. It's that simple.
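To make the "delete it first" rule concrete, here is a rough sketch of a rebuild drill in Python. It is an illustration under assumptions rather than the exact process I followed: the `k8s/` manifest directory and the health-check URL are hypothetical, and it assumes `kubectl` is already pointed at the re-created cluster (fetching fresh credentials is left out). Each step runs in order and the drill stops at the first failure, so a broken step can't hide behind a later one.

```python
import subprocess
from typing import Callable, Sequence

Step = tuple[str, Sequence[str]]

# Hypothetical drill: the manifest path and URL are placeholders.
DRILL_STEPS: list[Step] = [
    ("destroy", ["terraform", "destroy", "-auto-approve"]),
    ("recreate", ["terraform", "apply", "-auto-approve"]),
    ("configure", ["kubectl", "apply", "-f", "k8s/"]),
    ("smoke-test", ["curl", "--fail", "--silent", "https://dev.example.com/healthz"]),
]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def rebuild_drill(
    steps: Sequence[Step],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> list[str]:
    """Run each step in order; raise on the first non-zero exit code."""
    completed = []
    for name, cmd in steps:
        if runner(cmd) != 0:
            raise RuntimeError(f"rebuild drill failed at step '{name}'")
        completed.append(name)
    return completed
```

Calling `rebuild_drill(DRILL_STEPS)` against a throwaway environment before hand-off is the whole test: if it raises, the environment isn't ready. The `runner` parameter exists only so the drill can be dry-run with a stand-in function instead of real commands.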
One of the biggest problems in the technology industry at large is the culture of mistaking "Proof of Concept" for "Minimum Viable Product". The key word here is viability: how can an application that can easily fail be considered viable? How can unrecoverable, product-destroying actions be considered viable? The unvarnished truth is that they aren't viable, and it's only through sheer luck that many businesses make it through the shaky phase of initial deployments without going out of business. The problem, of course, is that these dodgy deployments become the standard, and then inevitably, years on, they're still as susceptible to critical failures as they were when the first "Hello, World!" was pushed to production.
One of the reasons we get into these situations, where an environment may be unrecoverable, is that the environment is so complex. Years of spaghetti deployments, modifying code on production servers, and critical incidents leave houses of cards that are ready to fall over at the smallest breeze. Where can you possibly begin? The answer: start small. The next time you deploy a new application, make it as automated as possible. Delete and rebuild it before you hand it over to the customer. We don't all have the luxury of green field deployments, but the pace of IT means there is always at least something new. If you support a large application, just try moving one of the services onto automated infrastructure. Just start, and piece by piece your stress will decrease and your environments will get easier to administer.