Gavin Campbell

Posted on Jan 24, 2020 • Edited on Jul 28, 2020 • Originally published at gavincampbell.dev

Human aspects of Dev / Ops Convergence

#devops #culture

Over the last decade or so, much has been made of the need to "bridge the chasm"¹ between software development teams and IT operations teams. This promises a number of technological and organisational benefits, such as faster delivery cycles, fewer defects, reduced time to market, and greater profitability.

For those companies "born in the cloud", who have never deployed anywhere other than Firebase and think that Helm is how you steer the yacht you bought with the Series C funding, what follows will be largely unfamiliar. These companies have, in effect, NoOps² and no chasm to bridge.

There is a much larger number of organisations that are somewhere on the spectrum between having siloed dev, test, and operations teams, and the NoOps unicorns.

For this second group, there are a number of challenges, many of which have technological solutions, but the most intractable problems, as ever, are the human ones.

The DevOps Centre of Excellence

A common first step on the DevOps Journey is to reorganise the existing menagerie of Security Specialists, Network Architects, and Database Administrators into a DevOps Centre of Excellence, ready and able to respond to the needs of the developer tribe in the blink of a Service Request, within documented and agreed SLAs, subject to approval by the CAB which meets every second Tuesday.

The purchase of a foosball table and a kegerator can be considered optional at this stage on the DevOps journey.

It should be uncontroversial by now that this isn't an improvement on the previous arrangement of Dev and Ops silos³, as it preserves almost all of the downsides with little upside other than the kegerator:

- The motivations of the "DevOps Team" and the "Dev Team(s)" remain misaligned.
  - Failure to deliver a feature on time rarely has any negative consequences for the DevOps team.
  - Delivering features in a manner which is complicated or expensive to deploy and patch rarely has any negative consequences for the developers.
- The traditional antipathy between Dev and Ops remains present.
  - The DevOps team continue to regard the developers as rogue elements with no concern for the security or stability of the system.
  - The developers continue to regard the DevOps team as dinosaurs whose only purpose in life is to slow down development.
  - In many organisations, both of these characterisations are accurate.

I have touched on these issues in a previous article on this site, "Throwing code over a different fence" (Dec 11 '17).

These downsides remain regardless of the number of tools-based initiatives undertaken by the DevOps team.

As long as the CloudFormation and ARM templates, the Ansible playbooks, and the Helm charts remain the responsibility of the central DevOps team, the misalignment of incentives will persist.

This holds even though the tools and techniques (source control, CI/CD servers, Agile methods) might be identical across the two teams. None of these tools can break down the communication barriers between the two teams.

The Distributed DevOps Team

The next step along the journey often involves dismantling the central DevOps team and "embedding" one or more DevOps engineers within each development team. This is often inspired by the notion that "Facebook does it that way"⁴.

However:

- Your company isn't Facebook.
- Facebook's Production Engineers probably write more code than your developers.

For most companies other than Facebook, the main benefit of embedding DevOps engineers within development teams is the reduction of communication overhead between developers and DevOps Engineers.

Since the management structures associated with the software teams are "unable to provide career progression for Ops people", and the software leaders are "unable to manage Ops people", the embedded ops people are often managed separately to the developers with whom they work, and may even be rotated between otherwise stable development teams.

The child who is not embraced by the village will burn it down to feel its warmth

Anon., the Internet

The consequence of this is that whilst communication overhead may be reduced, incentives are still not aligned between the feature developers and the embedded Ops people. Rather than "throwing code over the fence" into the Ops field, we throw code into the Ops "cage" embedded within every team.

Furthermore, tasks remain strictly categorised, and often "tagged", as "Dev tasks" and "Ops tasks". This means that the developers continue to complain about the DevOps people, and the DevOps people are not motivated to help out the developers.

Another effect is the daily standup where two or three groups of people take turns discussing their otherwise unrelated activities, supervised by the "scrummaster slash project manager". Such meetings are a well-rehearsed variation on the theme of "standup as status meeting"⁵.

The changing nature of IT Operations

There's a case to be made that not much has changed in the world of feature development over the fifty years or so of the software industry. After all, we still type code into editors, the code gets compiled or otherwise assembled into something useful, we might do a bit of testing, and we ship the end product to the users.

The same cannot be said of IT Operations. Whilst automation tools have existed since the very beginning - before interactive editors, in fact⁶ - much of the history of IT Operations has involved defining procedures and documenting them in a way that they can be executed by humans.

For many years now, most application software has been installed in a more or less scripted way, and many companies have found automated ways to ensure that their operating system environments are consistently configured.

More recently, we have seen the rise of Infrastructure as a Service and Platform as a Service offerings, with many modern applications reliant on both. This has brought about changes in tooling to the extent that almost any operating environment can be defined by a set of text files to be processed by some tool or other. This is what is meant by "Infrastructure as Code".
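To make the idea concrete, here is a minimal sketch in plain Python of what "some tool or other" does with those text files: it compares the environment we declared with the environment that exists, and works out what to change. The `Resource` type, the `plan` function, and the resource names are all hypothetical and belong to no real tool; real tools such as Terraform do considerably more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical, heavily simplified infrastructure resource."""
    kind: str
    name: str
    size: str


def plan(desired, actual):
    """Return the actions needed to move `actual` to the `desired` state."""
    desired_by_name = {r.name: r for r in desired}
    actual_by_name = {r.name: r for r in actual}
    actions = []
    for name, resource in sorted(desired_by_name.items()):
        if name not in actual_by_name:
            actions.append(("create", name))
        elif actual_by_name[name] != resource:
            actions.append(("update", name))
    for name in sorted(actual_by_name):
        if name not in desired_by_name:
            actions.append(("delete", name))
    return actions


# What the text files in source control say we want (illustrative values).
desired = [
    Resource("vm", "web", "small"),
    Resource("database", "orders", "medium"),
]
# What is actually running today (illustrative values).
actual = [
    Resource("vm", "web", "large"),
    Resource("queue", "legacy", "small"),
]

for action, name in plan(desired, actual):
    print(action, name)
```

The point of the sketch is that the desired state is ordinary data that can be reviewed, versioned, and tested like any other code in the repository.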

The real DevOps story isn't collaboration, it's convergence

Infrastructure as Code turned out to be so convenient that developers were able to provision their own environments without reference to the Operations team. Naturally, this was accompanied by much grumbling about "rogue developers", "shadow IT", and "out-of-control spending". Meanwhile, the operations teams were busy learning to use these new tools to provision more infrastructure, and faster, than ever before.

So, if application development and infrastructure development involve much the same tools and techniques, is there any value in having separate people to do each of these tasks? And given the relatively high degree of coupling between applications and the operating environments they consume, does it make sense to manage the code that defines them separately?

Given that the lines between front-end code, back-end code, database code, and infrastructure code become fainter with every year that passes, isn't it time we just referred to it as "code"? After all, aren't we all supposed to be "full-stack developers" these days?

Should we not be aiming to develop teams of "T-Shaped" people able to maintain any part of our product, from the infrastructure to the CSS?

Convergence within the team

The hardest part of this convergence is how to manage the integration of the people already working in the organisation, in particular the "Ops professionals" with little to no coding background.

In the modern software workplace, many tasks that were previously done by the Ops teams now need to be done with code. This means that it is increasingly unacceptable for a Database Administrator, a Security Analyst, or a Network Specialist to claim to be "not a coder", just as it is unacceptable for a Full-Stack Developer to "only know jQuery".

A more subtle problem is that the Ops teams may be proficient in writing code, in the form of scripts, templates, or playbooks, but less experienced in the other practices that go into software development such as continuous integration, automated testing, and branching strategies.

Our aim is to capture the expertise of the Ops people, as well as the skills of the developers, since both are important to the quality of the end product.

The developers don't want to do operations and the operations people don't want to do development

Overheard in the elevator

I can't think of any technological solutions, not even Pulumi and friends, that are capable, by themselves, of achieving this convergence.

Tools are important, of course, and it has been my experience that the more developer-friendly the tools can be made, the more the developers are going to entertain the idea of "doing operations". In the first decade of the cloud, a lot of infrastructure code was written using vendor-specific DSLs such as CloudFormation and ARM templates. These satisfy the definition of "Infrastructure as Code", but I think the reason for the popularity of Terraform is that it has features that make it more "code like", including easier composability and reduced verbosity.

The history of database development is a similar story; for many years there were people specialised in creating database objects and PL/SQL procedures, yet in most modern applications these have been replaced by abstractions such as Hibernate, Entity Framework, and Dapper.

The promise of Pulumi is that it will offer a similar abstraction for our infrastructure, and I think it will be interesting to see how this turns out.

Rather, the solutions are all working practices, most of which should be very familiar.

Pair Programming

Pair programming has been around for a very long time, but is still a challenge for many teams⁷. In the context of Dev/Ops convergence, it has many benefits.

Pairing promotes collective code ownership, including infrastructure code, which should mean the end of "application issues" and "infrastructure issues".

If the team members come from a range of different backgrounds, all the better. The developers learn to "do infrastructure" whilst the Operations people learn to "do development". In time, it is hard to tell who used to be a developer and who used to be a Senior Infrastructure Analyst, as both are equally comfortable maintaining the codebase.

This is a pattern that is already emerging in software testing; Microsoft, among others, claim to have abandoned the distinction between developers and testers⁸, and I suspect that in the future we will be saying the same thing about infrastructure engineers.

One enabler for the knowledge-sharing benefits of pairing is the practice of pair rotation, in which we encourage each team member to pair with a range of other team members. This can be reinforced by tools such as the pair programming matrix⁹.

In the situation where we are integrating people with non-development backgrounds, this is even more valuable, as it prevents any one developer from being the one who always has to pair with "the guy from ops". It also allows the people with infrastructure backgrounds to become familiar with many different areas of the codebase, and to share their expertise with many other team members.
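A rotation like this can be worked out by hand, but as an illustration, here is a small Python sketch that generates one using the standard round-robin "circle" method, so that everyone pairs with everyone else exactly once. The function name and the team members are made up for the example, and this is only one of many ways a team might choose to rotate.

```python
def pair_rotation(members):
    """Build a round-robin schedule in which each member pairs with every other once."""
    people = list(members)
    if len(people) % 2:
        people.append(None)  # None means that person works solo in that round
    n = len(people)
    rounds = []
    for _ in range(n - 1):
        # Pair the first with the last, the second with the second-to-last, and so on.
        pairs = [(people[i], people[n - 1 - i]) for i in range(n // 2)]
        rounds.append(pairs)
        # Keep the first person fixed and rotate everyone else by one place.
        people = [people[0], people[-1]] + people[1:-1]
    return rounds


# Hypothetical team: two developers and two people from infrastructure backgrounds.
team = ["Asha", "Ben", "Carmen", "Dmitri"]
schedule = pair_rotation(team)

for round_number, pairs in enumerate(schedule, start=1):
    print(round_number, pairs)
```

Filling in a pair programming matrix from a schedule like this makes it easy to see at a glance whether anyone is being left out, or is always landed with the same partner.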

Pair programming and pair rotation are not without challenges, and require dedicated effort to ensure success.

It is worth mentioning the practice of "mob programming"¹⁰, in which the whole team works together to deliver a feature. This can also be an effective way to promote knowledge sharing, remove barriers, and develop the skills of the whole team.

Automated Testing

If a user story is independently valuable, it stands to reason that it must be independently testable. The convergence of applications and infrastructure means that we can test both at the same time.

To do this we need to write tests not only for the functionality of our application, but also for the security, availability, and observability of our entire environment. This by itself is a huge benefit of the converged approach.

Infrastructure testing is integration testing with real dependencies, and this is an area where the contributions of people with operations experience can be especially valuable.
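As a rough sketch of what such tests might assert, the Python below checks one security rule, one availability rule, and one observability rule. Everything in it is an assumption made for illustration: the `environment` dictionary stands in for whatever a real test would read back from the deployed environment, and the field names, the list of admin ports, and the two-instance threshold are invented rather than taken from any particular cloud provider.

```python
# Ports we assume should never be reachable from the whole internet.
ADMIN_PORTS = {22, 3389}


def find_violations(env):
    """Return human-readable problems found in a description of an environment."""
    violations = []
    for rule in env["firewall_rules"]:
        if rule["source"] == "0.0.0.0/0" and rule["port"] in ADMIN_PORTS:
            violations.append(f"security: port {rule['port']} is open to the internet")
    for service in env["services"]:
        if service["instances"] < 2:
            violations.append(f"availability: {service['name']} has a single instance")
        if not service["alerts"]:
            violations.append(f"observability: {service['name']} has no alerts")
    return violations


# Hypothetical snapshot; a real test would query the running environment instead.
environment = {
    "firewall_rules": [
        {"port": 443, "source": "0.0.0.0/0"},
        {"port": 22, "source": "0.0.0.0/0"},
    ],
    "services": [
        {"name": "web", "instances": 3, "alerts": ["error-rate"]},
        {"name": "reports", "instances": 1, "alerts": []},
    ],
}

for problem in find_violations(environment):
    print(problem)
```

Deciding which rules belong in a function like this, and which thresholds are sensible, is exactly the kind of knowledge that people with an operations background bring to the team.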

Thin vertical slices

It is a well-known agile practice to attempt to define user stories as "vertical slices" of functionality¹¹. This means that rather than attempt to deliver a complete database, followed by a complete business tier, followed by an Enterprise Service Bus, followed by a front end, we deliver just enough of each tier to be able to show something of value to the end user.

As well as these tiers, we can also deliver just enough infrastructure to support the feature we want to deliver. This practice might even result in reduced infrastructure costs.

Adding infrastructure and associated tests to every story may make these stories larger and more complex to deliver. The answer, of course, is to make the slices thinner, thin enough that they are deliverable whilst still adding value.

Defining sufficiently thin vertical slices is not always easy. However, investing time in this will make the other practices easier - thin slices are easier to test and easier for a pair of developers from diverse backgrounds to deliver.

Conclusion

Infrastructure as code isn't the future, it's the present. The transition to everything-as-code is likely to be harder for the infrastructure professionals than it is for the developers.

There is a disconcerting remark in the Microsoft article about testing convergence linked above⁸:

It was a painful transition for the company. We worked hard to move [test engineers] to other roles, and some did, but a good number of them didn’t.

You don't want this to be your company and you don't want these to be your infrastructure people. These are difficult problems, but the practices outlined above, along with others designed to detoxify the workplace, can go some way towards mitigating them.

There may be some infrastructure professionals who resist the transition to everything-as-code. For the time being there may be work for them to do in the "infrastructure that supports the infrastructure", meaning the subscriptions, contracts, invoices, and service agreements. These tasks, which are more administrative than technical, represent the future of the "non-coding IT professional".

For the others, convergence creates an opportunity to develop new skills, and to have a more visible impact on the finished product. Across the team, the aim is not an unrealistic homogenisation, where every team member is a "resource", but to benefit from the diversity of experience of the team members.