Why Your Cloud is Broken

#devops #it #software #cloudnative

The Promise of Desired State Configuration

The last decade of IT was marked by the gradual proliferation of Desired State Configuration management practices. Pioneered by Mark Burgess’ CFEngine back in 1993 and further developed by tools such as Chef, Puppet and, more recently, Kubernetes, this once-revolutionary approach allowed IT administrators to automate the management of an ever-growing fleet of servers, both physical and virtual.
All Desired State Configuration systems are based on another now widespread IT management practice called Infrastructure-as-Code (IaC). The idea is that we, the system administrators, describe the desired state of our system in code (usually some kind of DSL, or Domain-Specific Language), and the configuration system makes sure that the actual state of the system reflects the desired state. This automated process of bringing the system to the desired state is called convergence or reconciliation, and it is performed by the configuration system's controllers and agents. The controllers publish and verify state transitions, while the agents take care of actually applying the state.
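To make the convergence loop more concrete, here is a minimal Python sketch of a single reconciliation pass. It assumes a deliberately simplified model in which both the desired and the actual state are flat dictionaries mapping setting names to values; the function names (`diff`, `apply`, `reconcile`) and the sample settings are hypothetical and are not taken from CFEngine, Chef, Puppet or Kubernetes. Real controllers repeat this pass continuously and talk to actual systems rather than to an in-memory dictionary.

```python
from typing import Dict

# Hypothetical, simplified model: a "state" is a mapping of setting name -> value.
State = Dict[str, object]


def diff(desired: State, actual: State) -> State:
    """Controller side: return the settings whose actual value differs from the desired one."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def apply(actual: State, changes: State) -> None:
    """Agent side: apply the required changes to the (simulated) system."""
    actual.update(changes)


def reconcile(desired: State, actual: State) -> State:
    """One convergence pass: compute the drift, apply it, then verify the result."""
    changes = diff(desired, actual)
    if changes:
        apply(actual, changes)
    # Verification step: after applying, nothing should differ any more.
    assert not diff(desired, actual), "system failed to converge"
    return changes


if __name__ == "__main__":
    # Illustrative settings only; note the single, non-HA database replica.
    desired = {"db.replicas": 1, "queue.max_size": 10_000, "service.url": "http://billing"}
    actual = {"db.replicas": 1, "queue.max_size": 500}
    print(reconcile(desired, actual))  # {'queue.max_size': 10000, 'service.url': 'http://billing'}
    print(reconcile(desired, actual))  # {} (already converged)
```

The loop converges on whatever is declared, including a database with a single replica. It has no way of telling whether that declaration was a good idea.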

This is, of course, a very simplified, nutshell description, but I believe it’s sufficient to understand the basic problem inherent in this pattern.

The Actual State

And the problem is this: all these systems assume that the administrators of a system know what its desired state is.
But this couldn’t be further from the truth. In reality, any even moderately complex cloud environment has loads of components that its administrators have only a vague understanding of. These components are often misconfigured or under-optimized, and the configuration blind spots only get discovered when something in the system crashes.

As an example, here are three incidents that occurred over the course of one week at a company we recently started consulting for:

An auto-scaling event in the message provider causes a message queue to explode.
Code is deployed pointing at a dummy instance of a downstream service, and the mistake stays undiscovered for 4 days.
A database that was never configured for HA (or auto-scaling) runs hot for a week without any alerts until it finally goes up in flames.

An attentive reader will notice that all of the systems mentioned were in a desired state defined by the system administrators. It was the administrators who configured the auto-scaling, pointed the code at a dummy service, or decided that an HA configuration for the database wasn’t currently needed. Or maybe they didn’t consciously decide any of this and just went with the default configurations (which are never meant for production, are they?). Why? Because they never had the time to specify what the desired configuration of the system is. The complexity and variety of the system components we have to manage today is too much of a cognitive load for a human, or even a team of humans, to handle.

The Adaptive State

Putting aside the complex modular stacks our IT is composed of, the only desired state of the system we can truly define is this:

“Our System Works and Serves Its Customers”.

It seems like a naive approach at first. But isn’t this the top-level business objective of any information system?

And that’s exactly why Desired State Configuration doesn’t cut it anymore: it focuses on defining the state of infrastructure instead of functional goals. What we really need in the age of complex, flexible information systems (aka Cloud Native IT) is Adaptive Configuration, i.e. smart controllers and agents that configure (and continuously optimize) the components of the system according to business objectives defined by us, aligned with industry best practices, and supported by continuously collected machine data.
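To contrast this with the reconciliation idea, here is a minimal Python sketch of a controller driven by a functional goal rather than a fixed setting. It assumes a single observed metric (p95 request latency) and a single knob (replica count); the `Objective` type, `next_replicas` function and all thresholds are hypothetical illustrations, not an existing product’s API. The proportional scale-up step is the same kind of rule the Kubernetes Horizontal Pod Autoscaler documents. A real adaptive system would weigh many signals and many knobs.

```python
import math
from dataclasses import dataclass


@dataclass
class Objective:
    """A functional goal ("requests are served fast enough"), not an infrastructure setting."""
    max_p95_latency_ms: float
    min_replicas: int = 1
    max_replicas: int = 20


def next_replicas(current: int, observed_p95_ms: float, objective: Objective) -> int:
    """Propose a replica count that moves the system toward its objective.

    Scales up when the objective is violated and scales down gently when
    there is plenty of headroom. The 50% headroom threshold is illustrative only.
    """
    if observed_p95_ms > objective.max_p95_latency_ms:
        # Objective violated: scale proportionally to how badly it is missed.
        ratio = observed_p95_ms / objective.max_p95_latency_ms
        proposal = math.ceil(current * ratio)
    elif observed_p95_ms < 0.5 * objective.max_p95_latency_ms:
        # Plenty of headroom: release one replica at a time.
        proposal = current - 1
    else:
        # Within the objective: leave things as they are.
        proposal = current
    # Never leave the bounds configured in the objective.
    return max(objective.min_replicas, min(objective.max_replicas, proposal))


if __name__ == "__main__":
    objective = Objective(max_p95_latency_ms=200)
    print(next_replicas(2, 500, objective))  # 5: latency is 2.5x the target
    print(next_replicas(4, 50, objective))   # 3: lots of headroom, scale down by one
    print(next_replicas(1, 40, objective))   # 1: never below min_replicas
```

Here the administrator declares what “working” means (a latency target and some bounds), and the controller decides the infrastructure setting from observed data, instead of the administrator having to guess the right replica count up front.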

We’re already seeing the first glimpses of this approach and, quite unsurprisingly, the first conflicts that arise as these newer, smarter techniques clash with the Desired State Configuration patterns that are still in place.

I’m going to outline these approaches and the conflicts we’re seeing in follow-up posts. And of course, I’m very curious to hear about your experience with the Desired State Configuration approach and where you see its pros and cons manifested. Looking forward to your feedback!

Stay tuned, stay adaptive, stay well!
