Change as the Vehicle of Value Creation
The value of a modern information system is defined in large part by the speed with which it can change and adapt.
“Adapt to what?” you may ask.
Well, to a lot of things: the wild and unpredictable fluctuations of market forces, changes in consumer demand, shifting institutional regulations and the sheer cost of underlying infrastructure.
Change is the vehicle of value creation. But somewhat paradoxically - change is also the main source of value disruption - because of the instability it brings to a system.
The last 20 years of IT evolution have been all about enabling higher rates of change while eliminating its disruptive impact.
Enabling Change
Continuous integration (CI) practices were created to identify and address potential disruptions as early as possible in the change lifecycle.
They moved the bottleneck from the release point to the actual integration stages, where arising issues are easier and cheaper to resolve.
The feedback loops got shorter and the situation improved. But pretty soon this too became insufficient. The evolution of online services imposed increasingly tight uptime requirements, and the need to roll out changes to production in a safe and continuous manner brought on the push for Continuous Delivery (CD).
But if only it were that easy…
Facing Instability
Organizations taking the leap towards continuous updates of production environments face numerous challenges. As information systems grow more complex, it becomes increasingly hard (at times impossible) to predict the impact of any individual change on a system’s performance - especially when numerous changes happen simultaneously across various layers of the technological stack. Some are initiated by the system’s operators or users, others stem from integrations with third-party systems, and still others are artefacts of the system’s own evolution.
If the impact of change is unpredictable, then how do we preserve the stability of value delivery? The obvious strategy is to limit the amount of change: disallow parallel updates and evaluate the impact of each change individually until stability and value generation are ensured - only then introduce further changes. This is a great approach for slowly changing, highly observable systems. Not for the dynamic cloud-native applications we’re running and using today.
In today’s business reality there is no option but to move fast and break things.
But still, we don’t want our customers to suffer - and potentially leave us for a competitor - because the service is broken. So we try to compensate for compromised stability with significant investments in monitoring and observability. We also do our best to hire qualified operations personnel for on-call rotations, so they can quickly fix the problems that arise. Hence the proliferation of monitoring software services and the severe shortage of Ops and SRE professionals we’ve witnessed over the last decade.
It’s pretty obvious, of course, that just exposing more metrics and logs and putting more humans on call to resolve incidents is a model that doesn’t scale. The rate of burnout in our industry is already higher than in healthcare!
Have we really built all these smart machines only to waste our lives watching them behave?!
Of course not!
Instead - a new model of smart, continuous incident prediction and remediation is needed.
And if we look around, we’re already starting to see the first glimpses of platforms, tools and, most importantly, humans embracing these ideas.
The 5 Steps of Resilience
This is where I want to stop for a moment to talk about resilience, and specifically resilience engineering. It’s a huge topic - much wider than a paragraph in a blog post could cover. So I’ll just note that the resilience of a system is determined by its adaptive capacity: its ability to bend and morph in response to unexpected environmental events while still preserving the basic required functionality. Resilience engineering, in turn, is concerned with:
- building systems capable of resilience, and
- practices of resilience in system operation.
Resilience basically entails the system’s ability to identify potential destabilization factors as fast as possible, analyze the problem and its impact, enumerate possible ways of tackling it, iterate over candidate solutions until a working one is found, and finally apply that solution - all of this with minimum adverse impact on the expected functionality of the system.
Therefore, in order to enable resilience, we need our system to continuously cycle through 5 main stages of interaction with the problems we want it to withstand (a minimal code sketch of this loop follows the list):
- Identification of the Problem
- Analysis of the Problem
- Identification of Possible Solutions
- Validation of Possible Solutions
- Application of the Most Viable* Solution
*Note: the viability of a solution is defined by organizational policy with regard to cost, time, quality and additional considerations.
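To make the loop concrete, here is a minimal Python sketch of the 5-stage cycle. Every name in it (Problem, Solution, resilience_cycle, the stage callbacks) is a hypothetical placeholder for real monitoring, analysis and deployment tooling - not an existing framework.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Problem:
    description: str
    impact: float  # estimated impact on value delivery, 0.0 to 1.0

@dataclass
class Solution:
    description: str
    cost: float
    apply: Callable[[], None]  # the remediation action itself

def resilience_cycle(
    identify: Callable[[], Iterable[Problem]],           # stage 1: identification
    analyze: Callable[[Problem], Problem],               # stage 2: analysis
    propose: Callable[[Problem], List[Solution]],        # stage 3: possible solutions
    validate: Callable[[Solution], bool],                # stage 4: validation
    most_viable: Callable[[List[Solution]], Solution],   # stage 5: policy choice
) -> None:
    """One pass through the 5-stage resilience loop."""
    for problem in identify():
        problem = analyze(problem)
        candidates = propose(problem)
        validated = [s for s in candidates if validate(s)]
        if validated:
            most_viable(validated).apply()
```

The most_viable callback is where the organizational policy from the note above (cost, time, quality and other considerations) would live.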
Know Thy Enemy
In this post I’d like to focus on the first 2 stages. Without identification and analysis, no corrective action can occur. Moreover, the better we get at these first 2 capabilities, the better our ability to adapt becomes. As Sun Tzu put it:
Know thy enemy and know yourself; in a hundred battles, you will never be defeated.
When you are ignorant of the enemy but know yourself, your chances of winning or losing are equal.
If ignorant both of your enemy and of yourself, you are sure to be defeated in every battle.
In light of our discussion, identification and analysis deal with knowing our enemy.
Identification and Phase Transitions
In thermodynamics and statistical physics there’s a notion of phase transitions. A phase is a condition of a system in which its behaviour is qualitatively different from its behaviour in other conditions. A classic example is solid matter melting into liquid and further evaporating into gas - only to condense into liquid again, of course. A somewhat related phenomenon in the study of fluid dynamics is turbulence. Described by Richard Feynman as “the most important unsolved problem in classical physics”, turbulence is the onset of instability and chaotic patterns in a previously smooth, or laminar, flow. So again - a transition to a different phase where the same system starts to behave differently, even though it consists of the same set of components. A transition is never immediate - it is driven by gradually accumulating energy (kinetic or thermal). The energy keeps building up until the system reaches a critical point: the point at which even a tiny addition of energy can throw the system into the new, unstable behaviour.
During the transition - small potential instabilities start to unfold - islands of chaos in the stable ocean of predictability.
In much the same way, our information systems don’t become unstable all at once. The constant influx of changes (which can be seen as energy) generates chaotic islands of tech debt, security loopholes, circular dependencies and unfortunate misconfigurations, until the system reaches its critical point and collapses into instability.
Identifying Phase Transitions in IT
And this brings us back to the identification stage.
Identifying a problem after it occurs is too often much too late. Going from chaos back to stability is nerve-wracking and very costly. If we want to build a better problem identification capability, we need to measure and identify the phase transition processes in information systems - which is entirely possible in the well-monitored, observable systems of today. In such a system we can measure the current level of instability and adapt the incoming rate of change to the “viscosity” of the infrastructure. Or, inversely, we can make the system’s interactions with the outside world more “viscous”, slowing them down to the point where we can promise greater certainty - sometimes fully blocking changes that could potentially bring the whole system down, and gradually opening the gateway again once we’re further from the critical point.
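As a rough illustration, here is a minimal sketch of such a “viscosity” gate, assuming a hypothetical instability_score() fed by monitoring (0.0 = calm, 1.0 = critical point). The thresholds and names are my own assumptions, not a real product API.

```python
CRITICAL = 0.9  # at or above this level, block all incoming changes
VISCOUS = 0.6   # above this level, slow changes down

def admit_change(instability: float, base_delay_s: float = 0.0):
    """Return seconds to delay a change, or None to block it entirely."""
    if instability >= CRITICAL:
        return None  # gateway fully closed until the system cools down
    if instability >= VISCOUS:
        # the closer to the critical point, the more "viscous" the gateway
        slowdown = (instability - VISCOUS) / (CRITICAL - VISCOUS)
        return base_delay_s + 300.0 * slowdown  # up to 5 extra minutes
    return base_delay_s  # calm system: let changes flow freely

# usage sketch (instability_score is a stand-in for real monitoring):
# delay = admit_change(instability_score())
# if delay is None: queue the change for later; else wait, then deploy
```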
Data analysis and machine learning are of course key to such advanced infrastructure management patterns. But this approach goes beyond the basic anomaly detection that most existing “AIOps” solutions offer. It involves arming our systems with capabilities of continuous self-introspection and self-remediation. It is also about predicting what the next phase of a system may be as the effect of a change we plan to apply - taking into account the amount and, even more importantly, the kind of change.
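To illustrate the idea - and it is only an illustration, not a production “AIOps” pipeline - here is a toy sketch of estimating the destabilization risk of a planned change from its size and kind plus the current instability level. The features, data and numbers are entirely hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per past change: [lines_changed, kind (0=code, 1=infra, 2=config),
# instability at deploy time]; label 1 = the change destabilized the system.
X = np.array([
    [120, 0, 0.20],
    [900, 1, 0.70],
    [ 40, 2, 0.10],
    [600, 1, 0.80],
    [ 80, 0, 0.30],
    [700, 2, 0.75],
])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# Risk that a planned change pushes the system past its critical point:
planned = np.array([[450, 1, 0.65]])
risk = model.predict_proba(planned)[0, 1]
print(f"estimated destabilization risk: {risk:.2f}")
```

In practice the change kind would be one-hot encoded and the model trained on real deployment history, but the shape of the problem - classifying planned changes by their predicted phase impact - stays the same.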
Semantic Change Management (or Not All Changes Were Created Equal)
Semantic change management is the other missing piece of the resilience puzzle. Most current software delivery studies focus on quantitative analysis: deployment rate, lead time, exception count, etc. But practice shows that, overwhelmingly, the question “what was deployed?” is much more important in problem analysis than “how many deploys were made?”. It’s the type of change, not the rate of change, that makes or (more often) breaks a system.
The exact typology of changes varies with the type of system. But for almost all information systems one can broadly separate changes into code, infrastructure and configuration changes. This division can be made more granular: by separating frontend from middleware from backend, by separating cross-system configuration from isolated component config, and so on. Each change type holds its own properties that define its potential impact on the system under change.
The software delivery systems we’re creating now need to allow for the codification and analysis of these organization-wide change semantics. This is the prerequisite for identifying the phase transition states described in the previous section. Such a semantic typology will allow for a granular definition of deployment strategies (such as, for example, continuous canary validation techniques) applied to each and every change - and for analyzing whether the type and size of a change is something the system has the adaptive capacity to absorb in its current state.
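Here is a minimal sketch of what such codified change semantics might look like, assuming the broad code/infrastructure/configuration typology from above; the fields, thresholds and strategy names are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ChangeType(Enum):
    CODE = auto()
    INFRASTRUCTURE = auto()
    CONFIGURATION = auto()

@dataclass
class Change:
    change_type: ChangeType
    blast_radius: int  # number of components the change can affect
    size: int          # e.g. lines of code or resources touched

def deployment_strategy(change: Change, instability: float) -> str:
    """Derive a rollout strategy from the change's semantics and the
    system's current distance from its critical point."""
    if instability > 0.8:
        return "hold"        # too close to the critical point to absorb anything
    if change.change_type is ChangeType.INFRASTRUCTURE:
        return "blue-green"  # riskier layer: swap whole environments
    if change.blast_radius > 5 or change.size > 500:
        return "canary"      # large or wide change: continuous canary validation
    return "rolling"         # small, local change: standard rolling update

print(deployment_strategy(Change(ChangeType.CODE, blast_radius=2, size=80), 0.3))
# -> rolling
```

The point is that the strategy is derived from the change’s semantics and the system’s current state, rather than hand-picked per deployment.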
To Summarize
This article is an attempt to outline two of the most important missing pieces of continuous change management in modern and future cloud- and edge-native IT systems:
- Phase Transition Analysis
- Semantic Change Management
These capabilities (enabled by data analysis and machine learning) are seen as the prerequisites for making a system semi-autonomously capable of resilience (or as Mark Burgess, whose ideas have influenced me tremendously, would put it - immunity).
The mechanisms for enabling these capabilities are being created as we speak. Once they are operational and well-trusted, the vision of continuous deployment - or of “Liquid Software”, as defined by Sadogursky, Landman and Simon - will finally start to become reality. And the face of what we now call DevOps, and of our industry as a whole, will change beyond recognition.
This is the future we want to build.
Thanks to Mark Burgess and Leonid Mirsky for reviewing and providing valuable comments!