Be careful what fails your pipeline

#pipeline #continuousdeployment

At this point we're well into the #devops movement and deployment automation, so if you work in a team you likely have an integration/deployment pipeline that runs when you commit changes to your codebase. Maybe it runs in Jenkins, Travis, GitHub Actions, or something else, maybe it's simple or complex, maybe it orchestrates lots of cloud units or relies on uniquely configured machines, but however it works you probably have a pipeline.

What does it do? At a minimum it should run unit-tests, and it probably does something related to deployment (even if just to prepare resources for a manual deploy). You're probably doing a lot more than that, but I expect you built your pipeline to ensure your features can go out to your customers reliably. Like a gatekeeper that guarantees your code is deployable.

This post is about some things I suggest shouldn't go into your pipeline. Maybe that sounds backwards, but the list of things your pipeline can do is infinite and often specific to your domain and workflows. But there are certain suggestions I've heard multiple times that I think are problematic, and worth talking about more broadly.

Example 1: Screenshot-comparing tests

One idea that comes up when discussing UI testing is to capture pages as screenshots and compare them to a known-good baseline. This approach has many sharp edges, though, and they can hurt far more than the approach helps.

The first problem is simply that browsers do not necessarily render things deterministically, and certainly they don't across versions. So naively comparing screenshots will create failures where just a single pixel gets shaded differently, which is maddening to work with. It's easy then to suggest adding a fuzzy threshold before considering the new screenshot different, and that can appear to work but it's very easy to then fall into the issue of tweaking that value trying to find just the right balance between false-positives and actually interesting differences. It's not fun.
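To make the threshold problem concrete, here is a minimal sketch of a fuzzy pixel comparison. It assumes images are plain nested lists of RGB tuples so it needs no imaging library; `changed_fraction` and its `tolerance` parameter are hypothetical names for illustration, not taken from any real tool.

```python
# Hypothetical helper: images are nested lists of (R, G, B) tuples so the
# sketch stays self-contained. A real pipeline would load actual screenshots.
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]


def changed_fraction(baseline: Image, candidate: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels with any channel differing by more than `tolerance`."""
    if len(baseline) != len(candidate) or any(
        len(a) != len(b) for a, b in zip(baseline, candidate)
    ):
        raise ValueError("images must have the same dimensions")
    total = 0
    changed = 0
    for row_a, row_b in zip(baseline, candidate):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if any(abs(a - b) > tolerance for a, b in zip(px_a, px_b)):
                changed += 1
    return changed / total if total else 0.0


# A 2x2 "page" where one pixel is shaded very slightly differently.
baseline = [[(255, 255, 255), (255, 255, 255)], [(0, 0, 0), (0, 0, 0)]]
candidate = [[(255, 255, 255), (254, 255, 255)], [(0, 0, 0), (0, 0, 0)]]

strict = changed_fraction(baseline, candidate)              # exact match required
fuzzy = changed_fraction(baseline, candidate, tolerance=2)  # small shifts ignored
print(strict, fuzzy)  # prints "0.25 0.0"
```

The same pair of images counts as "different" or "identical" depending entirely on the tolerance you pick, which is exactly the knob that ends up being tweaked forever.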

You can even push deeper and use one of the various online services that do advanced AI-powered difference detection. They really are helpful for arranging and visualizing differences, but they still don't address the fundamental problem: The deployability of your application does not depend on how it looks*.

What do I mean by that?

Of course your application has to look a certain way to be deployable, but I assert a human must decide what is passable because it is a constantly evolving target to meet the needs of your customers. Put another way: The appearance of your application is not a measure of correctness; it's a subjective quality concern. I drove myself mad chasing pipeline errors because of color changes and minor movements to elements, and it took me a while to see that what I was doing was fundamentally wrong.

Why is it so wrong? This isn't to disparage screenshot capturing; I think it's wonderful if a pipeline generates screenshots. But they need to be looked at, discussed, shared, and tracked over time, rather than be seen as a measure of correctness. Consider them artefacts to generate discussions and show historical trends to direct your efforts, but not as go/no-go automated gatekeeping.

So diff those screenshots, but show those that differ, don't block the pipeline. If a pull-request can show all the pages affected that's hugely powerful, but if a PR is blocked by an element moving then that is not a net positive.
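One way this could look in practice is sketched below: a step that turns per-page diff ratios into a report for the pull request and always exits successfully. The `visual_report` function, the page paths, and the ratios are all made up for illustration; how the ratios get computed is left to whatever diffing tool you use.

```python
def visual_report(diffs: dict[str, float], threshold: float = 0.0) -> tuple[int, str]:
    """Return (exit_code, report). The exit code is always 0: this step is advisory."""
    changed = sorted(
        (page for page, ratio in diffs.items() if ratio > threshold),
        key=lambda page: diffs[page],
        reverse=True,  # biggest visual change first
    )
    if not changed:
        return 0, "Visual diff: no pages changed."
    lines = [f"Visual diff: {len(changed)} page(s) changed, please review:"]
    lines += [f"  {page}: {diffs[page]:.1%} of pixels differ" for page in changed]
    return 0, "\n".join(lines)


# Hypothetical diff ratios for three pages of the application.
exit_code, report = visual_report({"/home": 0.0, "/checkout": 0.031, "/about": 0.002})
print(report)
```

The report can be posted as a PR comment or attached as a build artefact, so reviewers see every affected page without the build going red.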

Example 2: Fail on low test coverage

(If you don't know, coverage reports are generated by capturing which lines are executed as unit tests run, to build a report that shows which lines of code are covered by the tests)

I love coverage reports, they are key to understanding which parts of a codebase need more testing or even if there are areas suffering from architectural calcification where code is over-tested (if dozens or more overlapping tests all go over the same logic over and over, that area might have an unfortunate coupling). But in talking about coverage I've heard it proposed many times to fail a pipeline on low coverage. So if new code is less than n % covered we fail the pipeline. That way we all have to write quality code and we'll all be happy, right?

No, this leads straight to an anti-pattern: Some day soon someone will urgently need to land a change, and it just won't cut it to be denied because of low coverage. Yes we can all agree code should be tested, and yes we shouldn't be doing urgent things quickly, but real life happens and we need our solutions to adapt and help rather than block. The fundamental problem is once again that coverage is not a matter of correctness*, but it's an important artefact to generate discussions and awareness and track historical trends. Don't block your pipeline on it, but do highlight it as part of your builds.
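As a rough sketch of "highlight, don't block", the step below turns a coverage percentage into a note for humans instead of a pass/fail result. The `coverage_note` function, the 80% target, and the sample numbers are assumptions for illustration; the percentages would come from whatever coverage tool you already run.

```python
def coverage_note(current: float, previous: float, target: float = 80.0) -> str:
    """Describe coverage for humans. Deliberately never fails the build."""
    delta = current - previous
    if abs(delta) < 0.05:
        trend = "unchanged"
    else:
        trend = f"{'up' if delta > 0 else 'down'} {abs(delta):.1f} pts"
    note = f"Coverage {current:.1f}% ({trend} vs previous build)"
    if current < target:
        # Below target is surfaced as a talking point, not an error.
        note += f", below the {target:.0f}% team target: worth a discussion"
    return note


print(coverage_note(current=72.4, previous=75.0))
```

The urgent fix still lands, and the drop in coverage is visible to everyone who looks at the build.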

(As an aside, it's the same argument when talking about code quality, using tools such as SonarQube. I love SonarQube, it's an absolute staple of my workflow and I use it to drive discussions, but failing a pipeline because it identifies a line as low quality is too risky because quality != correctness)

Example 3: Performance testing

Synthetic performance testing in a pipeline is a whole big topic unto itself, full of nuances and complexity just for setting up a platform that can be measured reliably. I'd be very interested to hear from those who've tackled this because so far I've not found a case where I wanted to push towards performance metrics as part of the pipeline… but I've certainly heard the suggestion come up quite often!

I argue against it for reasons of complexity alone, but even if I knew how to reliably collect the metrics before deploying to production I don't see how it should fail the pipeline. Of course performance is a key part of a quality product, and bad performance is a problem that only grows more difficult to handle the longer performance metrics are ignored, but creating a failure because something now runs slower invites endless twiddling of thresholds and is not adaptive to situations where your team might want a simpler, slower solution to test some theories.

Performance should definitely be captured on production, to measure actual, real performance. And if you push into pipeline performance testing, that's great, but surface that data to humans rather than blocking the pipeline.
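If you do collect timings in the pipeline, surfacing them might look something like the sketch below. The `perf_report` function, the scenario names, and the millisecond values are invented for illustration; it assumes you already have a baseline and a current measurement per scenario.

```python
def perf_report(baseline_ms: dict[str, float], current_ms: dict[str, float]) -> list[str]:
    """List timing changes, largest slowdown first. Advisory only, no pass/fail."""
    rows = []
    for name in sorted(baseline_ms.keys() & current_ms.keys()):
        before, after = baseline_ms[name], current_ms[name]
        if before <= 0:
            continue  # skip unusable baselines rather than divide by zero
        change = (after - before) / before * 100
        rows.append((change, name, before, after))
    rows.sort(reverse=True)
    return [
        f"{name}: {before:.0f} ms -> {after:.0f} ms ({change:+.0f}%)"
        for change, name, before, after in rows
    ]


# Hypothetical timings for two scenarios.
for line in perf_report(
    {"search": 120.0, "checkout": 300.0},
    {"search": 180.0, "checkout": 285.0},
):
    print(line)
```

There is no threshold to twiddle here: the team sees that search got slower and decides whether that is an acceptable trade-off for the change at hand.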

Bottom line

The list could go on, but the bottom line is to be careful what you consider a pipeline failure. Definitely fail it where you can guarantee the code isn't correct, for example unit tests are a natural fit that I think everyone does. But also include integration-tests, service-tests, smoke tests, and yes also UI tests in that list, it all just has to speak to the correctness of your application. For example a test that clicks a set of buttons to check out and purchase a product is a great correctness-check, and should absolutely fail the pipeline if it can't be completed. But failing on softer conditions that do not fundamentally speak to correctness means you'll be forcing your project into hard pipeline constraints and that just won't adapt to real-life events.

It tends to be safe to add as many pipeline artefacts as you want to drive rich, useful discussions, so definitely don't hold back on test automation. But only once checks demonstrably show themselves to really speak to the product's fundamental correctness should they be allowed to fail the pipeline.
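The split between correctness checks and softer signals can be sketched as a flag on each check. Everything here is hypothetical (`CheckResult`, `pipeline_outcome`, and the check names); the point is only that the exit code depends on blocking checks alone, while advisory ones produce notes.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # True only for checks that speak to correctness


def pipeline_outcome(results: list[CheckResult]) -> tuple[int, list[str]]:
    """Return (exit_code, notes). Only failed blocking checks give a non-zero exit."""
    failures = [r.name for r in results if r.blocking and not r.passed]
    notes = [f"FAILED: {name}" for name in failures]
    notes += [
        f"advisory: {r.name} needs a look"
        for r in results
        if not r.blocking and not r.passed
    ]
    return (1 if failures else 0), notes


results = [
    CheckResult("unit tests", passed=True, blocking=True),
    CheckResult("checkout UI flow", passed=True, blocking=True),
    CheckResult("screenshot diff", passed=False, blocking=False),
    CheckResult("coverage target", passed=False, blocking=False),
]
exit_code, notes = pipeline_outcome(results)
print(exit_code, notes)
```

Promoting a check from advisory to blocking is then a one-line change, made deliberately once the check has shown it really speaks to correctness.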

* I make these claims without knowing your domain, so it may very well be that you actually do require what I say you don't. Maybe you have legal requirements, special audit rules, or you operate at such scale that these really are inviolable constraints. I'd love to hear about those cases! But hopefully I speak to the general case where the occasional bug is not life-threatening.

DEV Community
