Antoine Coulon

Posted on Jan 19, 2023

Don't target 100% coverage

#productivity #webdev

Don't target 100% coverage... but achieve it anyway!

I recently noticed that a lot of people advise reaching 100% code coverage and present it as a primary indicator of code quality. While I agree that it's an interesting metric, everyone should be aware that 100% coverage says nothing about the quality of the tests. How you reach good coverage matters more than the number itself, and 100% must not be a target in the first place. Let's see why.

Code Coverage

First things first, let's briefly explain what code coverage is for those who are not really familiar with the concept.

Code coverage is a technique that determines which parts of your program are executed by your tests. Basically, the code coverage tool sets up counters that are incremented each time a line of your program is traversed by one of your tests.

Here is a quote from Istanbul, one of the most used code coverage tools:

Istanbul instruments your ES5 and ES2015+ JavaScript code with line counters, so that you can track how well your unit-tests exercise your codebase.

Code coverage tools often collect various metrics such as:

% of statements traversed
% of branches traversed
% of functions traversed
% of lines traversed

Overall, code coverage tools just measure how much production code is traversed by the tests, whether the assertions are relevant or not, which is where the devil resides.
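To make the counter idea concrete, here is a hand-written sketch of what instrumentation amounts to. It is not what Istanbul or c8 actually generate: the `hits` object and the `classify` function are made up for illustration.

```javascript
// Hypothetical hand-instrumented function: each counter only records that
// a piece of code was reached, nothing more.
const hits = { fn: 0, negativeBranch: 0, otherBranch: 0 };

function classify(n) {
  hits.fn++;
  if (n < 0) {
    hits.negativeBranch++;
    return "negative";
  }
  hits.otherBranch++;
  return "zero or positive";
}

// A "test" that calls the code but asserts nothing at all.
classify(-1);
classify(5);

// Every counter is non-zero, so a report built from them shows 100%.
const covered = Object.values(hits).filter((count) => count > 0).length;
const percent = (covered / Object.keys(hits).length) * 100;
console.log(`coverage: ${percent}%`); // coverage: 100%
```

Nothing in this report depends on whether the return values were ever checked.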

The trap 👹

Let's see a trap example with a simple project using c8 as our code coverage tool (istanbul unfortunately does not natively work with ESM).

You can grab the source code here.

index.js

export function add(a, b) {
  return a + b;
}

Now here is one simple test that we would expect to pass:

index.spec.js

import assert from "assert";
import { add } from "./index.js";

it("should return 1", function () {
  assert.equal(add(1, 0), 1);
});

When running the test, we can see that it turns to green and the coverage is automatically generated for us:

As we can see, everything is 100% covered by tests: all the statements, all the functions, all the branches, etc. So everything is fine and we could just ship the feature, right? 🧐

But what if someday someone changes a little detail, pretty confident that it works because the unit test still succeeds after the change?

Thanks to Michaël Azerhad, who provided me with that example a while back!

export function add(a, b) {
-  return a + b;
+  return a - b;
}

The test suite still passes, since subtracting 0 gives the same result as adding it, and the report still shows 100% coverage.

As simple as the example may seem, it demonstrates that coverage only checks that a given chunk of code is traversed at least once. It will never tell you whether a test is valuable or useful in any way. The danger is that it can make you feel (wrongly) safe about your code, just because it is traversed by a test at some point in time.
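A small sketch of why the original assertion is blind to the change, and what a more discriminating input looks like. The name `brokenAdd` is only used here to keep both versions side by side.

```javascript
// The original implementation and the accidental change, side by side.
const add = (a, b) => a + b;
const brokenAdd = (a, b) => a - b;

// With b = 0, both versions return the same value, so an assertion on
// add(1, 0) cannot tell them apart.
console.log(add(1, 0) === brokenAdd(1, 0)); // true

// With a non-zero second operand the two versions diverge, so an
// assertion such as "add(1, 2) equals 3" fails on the broken version.
console.log(add(1, 2));       // 3
console.log(brokenAdd(1, 2)); // -1
```

Both test inputs give exactly the same coverage report, yet only the second one protects the behavior.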

Don't let "100% coverage" be your primary concern 🖐

Don't get me wrong, I'm not saying that coverage is not useful in itself.

However, aiming for 100% after having written the code will most likely produce false-positive tests in a "Test-Last" fashion, just to make the statistics look better. I would even say that Code Coverage Driven Development (CCDD) sometimes inherits the disadvantages of the "Test-Last" approach, which most of the time means writing irrelevant or, even worse, useless tests. Both make you feel safe and confident as the console only shows green lights, but neither will keep you out of trouble.

Mutation testing, an underestimated tool 🔍

Because I'm lazy, I'll once again simply quote the definition from the Stryker documentation:

Bugs, or mutants, are automatically inserted into your production code. Your tests are run for each mutant. If your tests fail then the mutant is killed. If your tests passed, the mutant survived. The higher the percentage of mutants killed, the more effective your tests are.

Basically, it does something fundamentally different from code coverage: it introduces small variations (mutants) into the statements and branches of your program. By mutating your code, the mutation testing tool can detect whether or not your test suite notices the change in behavior produced by each mutation.
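The principle can be sketched by hand in a few lines. This is not how Stryker is implemented, and the list of mutants below is invented for the example rather than taken from Stryker's mutators; a test suite is modelled as a plain function that returns true when all of its checks pass.

```javascript
// Hypothetical mutants of add(a, b): each one swaps the arithmetic operator.
const mutants = {
  "a - b": (a, b) => a - b,
  "a * b": (a, b) => a * b,
  "a / b": (a, b) => a / b,
};

// Two suites: the one from this article, and one with an extra check.
const weakSuite = (add) => add(1, 0) === 1;
const strongerSuite = (add) => add(1, 0) === 1 && add(2, 3) === 5;

// Run a suite against every mutant: a mutant is killed when the suite fails.
function mutationScore(suite) {
  const names = Object.keys(mutants);
  const killed = names.filter((name) => !suite(mutants[name]));
  const survived = names.filter((name) => !killed.includes(name));
  return { killed, survived };
}

console.log(mutationScore(weakSuite));     // "a - b" survives
console.log(mutationScore(strongerSuite)); // no survivor
```

A surviving mutant points at a behavior that no assertion is pinning down, which is information a coverage percentage cannot give.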

Let's try it on our small example using Stryker.

$ npm run mutation-test

Running the test produces the following output:

As you can see, one mutant survived: the one changing the arithmetic operator, which is exactly what we were looking for. Mutation testing therefore goes a step further by letting the tests themselves be tested via mutations of the production code. The more mutants your tests kill, the more relevant they tend to be. You could have 100% coverage while killing only 20 to 30% of the mutants, which is a big smell. Be careful with code coverage!

You might have guessed it, but mutation testing tools can also provide code coverage reports.

Automatically reaching 100% coverage and 100% of mutants killed using Test-Driven Development 🤹‍♀️

With a simple example, we just showed that coverage alone is not enough to ensure test quality. We also saw that mutation testing helps find incomplete or missing tests. What if I told you that good coverage and a high rate of killed mutants can be a positive side effect of a discipline that does not care about coverage in the first place?

Test-Driven Development

This article is not dedicated to explaining what Test-Driven Development (TDD) is, but I'll try to give a short introduction.

Test-Driven Development is a software development discipline in which every piece of code is written to make a failing test ❌ (the prerequisite) pass, i.e. turn GREEN ✅, in very short feedback loops. Test-Driven Development fundamentally helps with designing software, as it allows you to refactor the produced code safely and as often as you want, which is one of its main benefits.

When doing TDD correctly, because each line of code produced must be justified by a failing test in the first place, 100% coverage will be naturally achieved. All the mutants will also be killed if TDD was done correctly, as changing a line of code or one statement will without a doubt break one test, otherwise it means that some code was shamefully produced without having a failing test first!
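As a rough illustration of that loop on the `add` example, here is a compressed sketch. In real TDD each check would be added one at a time, each with its own RED/GREEN step; the `spec` function and the stub are simplifications introduced for this example.

```javascript
// Step 1 (RED): the specification is written first, as executable checks.
const spec = (add) => add(1, 0) === 1 && add(2, 3) === 5 && add(-1, 1) === 0;

// Before any production code exists, the spec has to fail.
const notImplemented = () => undefined;
console.log(spec(notImplemented)); // false

// Step 2 (GREEN): write just enough code to satisfy the spec.
const add = (a, b) => a + b;
console.log(spec(add)); // true

// Step 3 (REFACTOR): any later rewrite is validated by re-running the spec.
const refactoredAdd = (a, b) => [a, b].reduce((sum, n) => sum + n, 0);
console.log(spec(refactoredAdd)); // true
```

Because every line of `add` exists only to satisfy a check that failed first, coverage comes as a by-product instead of being chased afterwards.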

Not only will you achieve 100% code coverage and 100% killed mutants, but you'll also be able to produce code of higher quality, depending on your refactoring skills (TDD can drive you towards well-designed code, but the skill set required to get there has nothing to do with TDD itself).

Of course, Test-Driven Development is not a silver bullet, but it is to date the best way of combining code quality with code coverage, and the best part is that you'll most likely do it unconsciously.

Wrap up

Don't get me wrong, code coverage is useful in many ways, as it shows at a very high level how well a codebase is covered by tests. For instance, a very low coverage percentage means that there are not enough tests, so it's a good metric to start with.

Nevertheless, the opposite, a high percentage of code coverage, does not mean that the tests are relevant. The best way of achieving both high coverage and reliable tests is to not target coverage in the first place; otherwise, it might lead people to write more or less (often less) relevant tests just to make the percentage a little bit better, the worst part being that it makes everyone feel safer.

By also using mutation testing tools and by mastering disciplines such as Test-Driven Development (or at least Test-First), you can end up achieving 100% coverage and 100% mutants killed while caring about these metrics exactly 0% of the time. That is the target.

The sample project can be found here: https://github.com/antoine-coulon/dont-target-100-coverage

Top comments (20)

Charles F. Munat • Jan 24 '23

You should only cover that which needs covering, no more, but that's most things, really.

But it is unwise to choose an arbitrary number, say 70% or 80%, and to require that level of measured coverage. Remember that the tool does not measure many things, so even 100% measured coverage is probably not 100%.

The problem with this approach (arbitrary levels of coverage) is that the coverage report is always filled with reds and yellows. You become accustomed to that and don't think twice about it. It's like getting used to an alarm and so you no longer hear it.

And what is it exactly that you're not covering? Can you tell? Do you know? Do you know why or why not?

So a better approach is to always require 100% coverage as measured by the tool (e.g., Istanbul). Then choose which features, paths, etc. do not need coverage and disable coverage for those features in the code and with an explanation.

Now your coverage report is binary: it is either PASS (all green) or FAIL. Easy. And you can automate it (CI/CD), too, to prevent deploying bad shit.

And you can see with a simple global search what code is not covered and why. It is evident in code reviews, too.
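For readers who have not seen this practice, here is a minimal sketch of an explicit, documented exclusion using Istanbul's `ignore next` hint. The `parsePort` function and the reason given are invented for the example, and the hint syntax differs between tools (c8 has its own `c8 ignore` comments), so check the documentation of the tool you use.

```javascript
function parsePort(value) {
  const port = Number(value);
  // Not covered on purpose: defensive guard for values that the
  // (hypothetical) config loader already rejects before calling us.
  /* istanbul ignore next */
  if (Number.isNaN(port)) {
    throw new TypeError(`invalid port: ${value}`);
  }
  return port;
}

console.log(parsePort("8080")); // 8080
```

Searching the codebase for `istanbul ignore` then lists every exclusion together with its justification.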

And it's not arbitrary! You may find that you really only needed 50% coverage. Or that you'd have been living dangerously with less than 95%. In short, you get exactly the coverage you need, no more, no less, and an easy metric for success.

It astonishes me that this practice is virtually unheard of. I've been using it for probably a decade with great success. And if you write clean, simple, understandable code, you may find that 100% measured is easy to achieve even without ignoring anything.

In fact, if something is difficult to test properly, then it is probably poorly written. Better to fix it.

Charles F. Munat • Jan 24 '23

Struggles with coverage also probably mean you're writing too many unit tests and not enough integration tests. Testing how things work in isolation (unit) is almost worthless. I save it for simple utility functions, if then.

The best tests test the code as it is expected to work in the wild. Preferably in production mode and on production software.

Antoine Coulon • Jan 25 '23

Thanks for sharing your approach @chasm. As you mentioned, there are many interesting ways to use coverage; the only thing is that it requires a good understanding of what code coverage is all about, and disciplines/tools to achieve it without effort/pain.

In fact, if something is difficult to test properly, then it is probably poorly written. Better to fix it.

Exactly, and that's when TDD proves super useful: one must write code that is easily testable, otherwise one can't proceed quickly and easily from RED to GREEN.

The best tests test the code as it is expected to work in the wild. Preferably in production mode and on production software.

Integration tests are also very interesting and yes, ideally we should be executing them in production. Two resources I like about the subject:

Charles F. Munat • Jan 25 '23

Hi, @antoinecoulon. I agree. Thanks for the links. They look like good resources. I'll check them out.

Christian Ledermann • Feb 8 '23

My experience is the other way round:
Struggles with coverage also probably mean you're writing too many integration tests and not enough unit tests.

Having said that, I work mostly on backends and libraries, front-end developers may have different experiences.

Charles F. Munat • Feb 8 '23

That makes no sense to me at all. I wonder if we're talking about the same things? I can write one integration test and test parts of several different units. A couple of integration tests and I might not need any unit tests at all.

By unit tests I mean those that test a function or component in isolation from other functions and components, often by mocking out dependencies. Sometimes called anti-social testing.

By integration tests I mean moving the mocks to the edges of the app -- only mocking external resources (generally API calls) -- then allowing the components to interact the way they will in production.

My feeling is that if you're finding integration tests too difficult to write (they are easier in my experience), then it's time to rethink your architecture and your choices.

I don't see that backends are any different (I've built plenty). Libraries are probably more suited to unit tests, although functions intended to work together should be tested together, IMO.

Can you explain how and why integration tests (if we're talking about the same thing) make coverage a struggle?

O ji nwayo e je • Jan 25 '23

aiming towards reaching 100% after having written the code will most likely produce false positive tests in a "Test-Last" fashion

the "Test-Last" approach, that is most of the time either writing irrelevant or even worse useless test

Do you mind elaborating, please? If it's too broad for you to do justice to here, links are welcome

Antoine Coulon • Jan 25 '23

Thanks for giving me the chance to elaborate on that subject. Before sharing my thoughts, here is one precious and valuable link I can share with you: a wonderful talk by Ian Cooper, explaining which pain points you'll face when writing tests last.

Putting that talk aside, my humble opinion would be:

Test-Last favors writing irrelevant tests because the code is already written, so how can you prove that your test serves a specific purpose? Say you write 150 lines of code and then 3 tests, all of which pass. How can you know that they pass because of your code, given that no test ever failed without that code in the first place? The danger here is that tests serving no purpose wrongly increase confidence in the code produced.
Test-Last is one of the reasons developers might not write tests, as tests become painful and hard to write once all the code is already written. When writing tests afterwards, you might end up dealing with implicit dependencies over which you have no direct control, which takes away the ability to easily test the behavior of the system and makes the tests non-deterministic in some cases. Take for instance a simple use case where some kind of date (Date.now() -> non-deterministic) is involved: how are you supposed to test it afterwards if you didn't realize in the first place that it could be done differently to keep the test simple to write?

function bookHotelRoom(roomId, roomRepository, bookingRepository) {
  const room = roomRepository.findById(roomId);
  // Implicit dependency: the current time is read inside the function
  const bookingDate = Date.now();
  bookingRepository.book(room, bookingDate);
}

Then when writing the test after:

bookHotelRoom(_, _, _);
// Blocked as we don't have any control over the date that will be implicitly provided
const booking = { date: ??? };
expect(bookingRepository.listAll()).toEqual([booking]);

Now that's a simple case, but imagine writing dozens or hundreds of lines of code at once and then trying to figure out how to test that behavior with multiple nested implicit dependencies. Doing Test-Last will just make you feel that writing tests is the worst part of the day.
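For comparison, here is a sketch of the shape a test-first approach tends to push you towards, where the clock is an explicit parameter. The `roomId` and `now` parameters and the in-memory repositories are assumptions made for this example, not code from a real project.

```javascript
// Variant of bookHotelRoom where the clock is an explicit dependency.
// `now` defaults to Date.now, so production callers do not have to change.
function bookHotelRoom(roomId, roomRepository, bookingRepository, now = Date.now) {
  const room = roomRepository.findById(roomId);
  const bookingDate = now();
  bookingRepository.book(room, bookingDate);
}

// Hypothetical in-memory repositories, just enough for the test.
const roomRepository = { findById: (id) => ({ id }) };
const bookings = [];
const bookingRepository = {
  book: (room, date) => bookings.push({ room, date }),
  listAll: () => bookings,
};

// The test controls the date, so the expected booking is fully known.
const fixedDate = 1674604800000; // arbitrary fixed timestamp
bookHotelRoom(42, roomRepository, bookingRepository, () => fixedDate);

console.log(JSON.stringify(bookingRepository.listAll()));
// [{"room":{"id":42},"date":1674604800000}]
```

With the date under the test's control, the `???` from the blocked test above becomes a known value.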

O ji nwayo e je • Jan 26 '23

Watched the talk cuz this is a highly sensitive topic for me. Sadly didn't find a comparison between both paradigms there. But I appreciate you for obliging me.

My thought is that test-last by itself isn't a poor pattern to follow. If your hypothetical developer decides to test his 150 lines shabbily, that's his fault. He is probably not confident about the tests either. But you have the @covers annotation to focus coverage report surface area at a point of interest. This should answer the question of knowing whether the code influences the SUT. It's one thing to influence the SUT, but whether it does so in the intended way can only be answered by mutation testing tools. That's not an exclusive preserve of test-lasters.

Implicit dependencies, again, is a fault of the developer, not the paradigm. People attribute arriving at good design to TDD. Maybe it's just me but I've never encountered an issue with wiring dependencies or the container just to make dependencies available

Now, I suspect this reply may come off dismissive even though that's not the case. I'm open to any argument I can relate to that can convince me that testing last is actually detrimental. If you find any more, I will be here

Antoine Coulon • Jan 26 '23

There is no direct comparison in this talk, but it exposes facts about what is most likely going to happen if you don't follow either a Test-First or a TDD approach.

My thought is that test-last by itself isn't a poor pattern to follow.

Of course, some developers can write reliable tests in a Test-Last fashion. My point (kinda pushed to the extreme, I admit it) is that by doing Test-Last you have more chances of ending up with tests that do nothing relevant, or that just stand as confidence tests, useless in the sense that other tests already cover the use cases. When doing Test-First or TDD, you're forced by design to implement a behavior following a spec described in a test case, while with Test-Last it's much harder (not impossible) and less natural to achieve the same thing. I have mostly been doing Test-Last, and I'm not afraid to admit that I had tons of headaches when trying to write good and reliable unit tests after a few hours of development, writing multiple use cases including each variant, etc. And even after that, I had no guarantee that I had correctly tested the whole flow of the code I produced.

Implicit dependencies, again, is a fault of the developer, not the paradigm. People attribute arriving at good design to TDD.

As I mentioned in the article:

Test-Driven Development fundamentally helps designing software....

Being able to design correctly has nothing really to do with TDD; it's just that the constraint of writing testable code in the first place, and incrementally, helps you find a good design by breaking complexity down piece by piece. If you're good at designing software, then TDD will just allow you to do it even quicker and safer. As you appear to be curious about the subject, I must recommend the book about TDD with C++ amazon.co.uk/Modern-Programming-Te... by Jeff Langr, which is probably the most valuable resource on the subject. It's easy and quick to read, and you'll understand the overall benefits of a First (Before) approach over a Last (After) one. There is no way you won't become a better developer after reading it :)

Thibaut Andrieu • Jan 20 '23

Thank you for reminding that code coverage is a metric, not an objective 😊

I would add that auto-test has never ensured your software works. It ensures it works like before. Tests freeze the behavior and architecture to ensure code modification won't change behavior in an unexpected way.

If you put too many tests, you might freeze your software in a way that makes it impossible to evolve without refactoring lots of tests.

And unfortunately, this generally happens when you set a code coverage objective. You end up doing white-box testing only for the sake of reaching that corner-case line of code that never happens in the real world.

Antoine Coulon • Jan 22 '23

@tandrieu thanks to you for that feedback!

I agree with you that adding tests just to boost metrics will never be a good thing. As you underlined, it might freeze the software and put developers in a situation where refactoring is impossible because tests are too tightly coupled to specific implementations.

IMHO that's where a discipline like TDD (when done well) gets highly effective, in the sense that tests are written as a consequence of implementing a specification and are not really coupled to implementation details. They sit at a high enough level to allow refactoring while still providing good enough "coverage" of the production code.

Christian Ledermann • Jan 25 '23

Goodhart's law: When a measure becomes a target, it ceases to be a good measure.

I could not agree more with you, the way you describe here is pretty much the way I achieved 100% rock solid test coverage for pygeoif.
I used mutmut for mutation testing.

Antoine Coulon • Jan 25 '23

@ldrscke love that Goodhart's law! Interesting to see that something that was initially related to economics also applies to software engineering.

Thanks for sharing that about pygeoif, you're now my official proof that I'm not saying any bullshit here 😎

Vincent Dhennin • Jan 19 '23

What's your opinion about property-based testing?

Antoine Coulon • Jan 19 '23

Property-based testing is such an interesting testing technique that I could have mentioned it, you're completely correct. In fact, mutation testing aims at more or less the same goal, which is finding variants (or mutations) that are not covered. However, we can also consider them complementary: PBT is in general meant to be easily configurable to test a specific set of inputs with custom business rules, whereas mutation testing mutates things automatically under the hood using a list of predefined mutators (for Stryker, here is a list of the supported mutators: stryker-mutator.io/docs/mutation-t...).

They both have use cases, but my personal opinion is that most of the time, if you practice TDD correctly, most of the expected business use cases should be covered. Moreover, testing things that shouldn't happen is often a smell, in the sense that you could just ensure upfront that these things won't happen. Nevertheless, for some parts of the software where the input must be controlled in a critical way, you can always add another safety layer (unit tests coming from TDD + PBT).

Note: I highly recommend github.com/dubzzz/fast-check for PBT in the JavaScript/TypeScript ecosystem.
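To give a feel for the idea without any dependency, here is a hand-rolled approximation applied to `add`. It does not use fast-check: a fixed grid of integers stands in for the random generation and shrinking that a real PBT library provides, and the three properties are chosen for this example only.

```javascript
// Properties that any correct add(a, b) should satisfy.
const properties = {
  commutative: (add, a, b) => add(a, b) === add(b, a),
  zeroIsNeutral: (add, a) => add(a, 0) === a,
  subtractingUndoes: (add, a, b) => add(a, b) - b === a,
};

// Deterministic stand-in for a random generator: a small grid of integers.
const samples = [-3, -1, 0, 1, 2, 7];

// Returns the names of the properties violated by an implementation.
function violatedProperties(add) {
  return Object.keys(properties).filter((name) =>
    samples.some((a) => samples.some((b) => !properties[name](add, a, b)))
  );
}

console.log(violatedProperties((a, b) => a + b)); // no violation
console.log(violatedProperties((a, b) => a - b)); // commutative and subtractingUndoes fail
```

Note that `zeroIsNeutral` alone would not catch the subtraction version, for the same reason the original `add(1, 0)` assertion did not.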

Christian Ledermann • Feb 8 '23

A good introduction to property based testing are videos (mostly tutorials and conference talks) and blog posts about hypothesis. Never mind that it is a Python 🐍️ library, the approach is similar in other programming languages.

Sébastien Vanvelthem • Jun 12 '23

While I agree with the first part, 'don't target 100 coverage', the second part, 'but achieve it anyway!', is less obvious to me. Do you mean that we can run mutation testing on projects with low / nonexistent or less trustworthy initial coverage?

Antoine Coulon • Jun 12 '23

No that's not really what I meant.

What I'm trying to say there is that the goal should be to achieve 100% coverage, but indirectly, by leveraging disciplines such as TDD, because of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
You have many ways to achieve 100% of coverage and the most important thing is not reaching 100%, it's how you managed to achieve 100%. As shown in the little example, by being solely focused on covering each line of code independently, you might feel (wrongly) safe and forget to cover other cases (not detected by coverage but by mutation testing). That's often (not always) the case when doing a Test-Last approach. That's why having a look just on the % of coverage in itself is not enough.

So while having less than 100% clearly indicates that your code misses some tests, the reverse is not true: having 100% coverage does not mean that your code has all the tests it should have, and mutation testing helps you measure that.

Do you mean that we can run mutation testing on projects with low / nonexistent or less trustworthy initial coverage?

Even if it's not what I meant (explained above), there is no relationship between the usefulness of mutation testing and the amount of code covered by tests; it can still be valuable to use mutation testing even if you're not at 100% coverage. Mutation testing evaluates existing tests, even if there are only 2-3 tests throughout the whole codebase. In that case it would at least make these few tests safer, even though you have 1% code coverage.

Christian Ledermann • Nov 12 '23 • Edited

I finally finished my A Tale of two Kitchens post that touches on mutation testing and coverage as well @antoinecoulon
