I read and summarise software engineering papers for fun, and today we’re having a look at What it would take to use mutation testing in industry – A study at Facebook (2021) by Beller and others.
Mutation testing is a way to determine the quality of your test suite. It works by generating a large number of changed versions of the code, which are called mutants. Examples of changes include deletions of method calls, disabling if conditions, and replacing magic constants.
If the test suite is good enough, it should be able to “kill” these mutants by having at least one previously succeeding test fail.
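To make this concrete, here is a minimal sketch of my own (not code from the paper or from Facebook’s code base): a guard that a typical mutant disables, and a JUnit test that kills that mutant.

```java
import org.junit.Test;

public class WithdrawTest {

    // Original: refuse withdrawals that exceed the balance.
    static int withdraw(int balance, int amount) {
        if (amount > balance) {
            throw new IllegalArgumentException("insufficient funds");
        }
        return balance - amount;
    }

    // A typical mutant disables the guard, e.g. it rewrites the condition
    // to `if (false)`. This test passes on the original code but fails on
    // that mutant (no exception is thrown), so the suite "kills" it.
    @Test(expected = IllegalArgumentException.class)
    public void rejectsOverdraft() {
        withdraw(100, 150);
    }
}
```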
Why it matters
The result of mutation testing is a so-called mutation score: the ratio of generated mutants that the test suite manages to kill (if a suite kills 160 of 200 mutants, its score is 0.8). Many researchers and developers argue that mutation scores are superior to traditional code coverage, because they are based on a program’s actual behaviour rather than on which lines happen to be executed.
But mutation testing is not a silver bullet:
Mutants can be generated in an enormous number of ways, and every generated mutant has to be built and tested, which makes naive mutation testing infeasible for anything but the smallest code bases.
It is also not clear to developers what they can do to improve the mutation score, and whether an improved score actually has any practical benefits (other than better-looking metrics).
Can these issues be fixed?
How the study was conducted
The authors of the paper built a tool that they call Mutation Monkey. It consists of two pipelines: a training pipeline and an application pipeline.
Mutation testing is often very costly – not only because generating all the different mutants takes a lot of time and processing power, but also because many of the generated mutants are easily killed (or not even syntactically valid) and thus useless.
The training pipeline solves this problem by semi-automatically learning bug-inducing patterns from three sources:
- Defects4J, a collection of bugs extracted from popular OSS Java projects;
- an internal database of fixes for crashes that happened in the production version of the Facebook app (“reversing” these fixes makes it possible to reintroduce the crashes); and
- commits whose modifications made an originally failing test pass.
This process is only partially automated: experts are still needed to decide which patterns to implement (and how many), and to create the patch-like templates that implement them.
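The paper describes these templates only at a high level, so the sketch below is purely my own illustration of what a patch-like template could look like in code; the class, its fields, and the DROP_NULL_GUARD rule are all hypothetical. The idea is simply that each learned pattern pairs a code shape to match with the mutated shape that replaces it.

```java
// Hypothetical sketch of a patch-like mutation template; the structure and
// the example rule are illustrative, not taken from the paper.
public class MutationTemplate {
    final String name;     // e.g. "NULL_DEREFERENCE", a pattern name from the study
    final String match;    // code shape to look for
    final String replace;  // mutated code shape to insert

    MutationTemplate(String name, String match, String replace) {
        this.name = name;
        this.match = match;
        this.replace = replace;
    }

    // My guess at what such a pattern might do: drop a null guard so the
    // guarded dereference can crash, mirroring a "reversed" production fix.
    static final MutationTemplate DROP_NULL_GUARD = new MutationTemplate(
        "NULL_DEREFERENCE",
        "if ($x != null) { $body }",
        "$body");
}
```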
The application pipeline applies the mutation templates to the production version of the code. To reduce the number of mutants that have to be built and tested (remember, that is expensive!), the pipeline avoids “unprofitable” spots, such as logging calls, and runs a lightweight syntax checker to weed out syntactically invalid mutants.
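As a rough sketch of that filtering step, here is what it could look like; the skip heuristic and the class are my own stand-ins, not the paper’s implementation, and the parse check assumes the third-party JavaParser library rather than whatever checker Facebook used.

```java
import com.github.javaparser.ParseProblemException;
import com.github.javaparser.StaticJavaParser;

public class MutantFilter {

    // Skip "unprofitable" spots: mutating a log statement rarely produces an
    // interesting mutant. This string heuristic is illustrative only.
    static boolean isUnprofitable(String line) {
        return line.contains("Log.") || line.contains("logger.");
    }

    // Cheap validity check before spending a full build on the mutant:
    // try to parse the mutated source and reject it if it is not valid Java.
    static boolean isSyntacticallyValid(String mutatedSource) {
        try {
            StaticJavaParser.parse(mutatedSource);
            return true;
        } catch (ParseProblemException e) {
            return false;
        }
    }
}
```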
The remaining mutants are submitted to the code review system outside of peak (office) hours, which makes scaling easier and is cheaper. Mutants that pass the test suite are then presented to developers. The pipeline also tells developers which tests visited the mutated block of code. This information should make it easier for developers to decide what they want to do.
What discoveries were made
Kill rates were fairly similar across the various mutation patterns. However, some mutations were applied successfully a lot more often than others. For instance, the NULL_DEREFERENCE pattern was applied almost 2,000 times, while REMOVED_SYNCHRONIZED mutations occurred only 143 times within the same period.

Interestingly, REMOVED_SYNCHRONIZED is also the only pattern with a much higher kill rate, which suggests that developers are aware that synchronisation-related bugs are hard to debug and thus spend more time writing tests for them.
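To picture what such a mutant looks like, here is a minimal sketch; it assumes, based on my reading of the pattern name rather than anything the paper spells out, that REMOVED_SYNCHRONIZED simply strips the synchronisation from a method or block.

```java
public class Counter {
    private int count = 0;

    // Original: increments are serialised, so concurrent callers cannot race.
    public synchronized void increment() {
        count++;
    }

    // REMOVED_SYNCHRONIZED-style mutant (my reconstruction): the modifier is
    // gone, so count++ can lose updates under concurrency. Only a test that
    // hammers the counter from several threads and checks the final value
    // would kill this mutant, which may explain the higher kill rate.
    public void incrementWithoutLock() {
        count++;
    }
}
```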
The researchers also conducted interviews with 29 developers to learn more about the effectiveness of Mutation Monkey’s approach.
Most – if not all – developers had not heard of mutation testing prior to the experiment, and needed more information than what was provided by Mutation Monkey.
However, after an explanation from the researchers, about 85% believed that Mutation Monkey is a useful tool that could help them write (better) tests. Virtually everyone was also positive about the test coverage information that was included with the reports.
Still, fewer than half of the developers said they would write a test for the gap that Mutation Monkey had found. When asked why not, developers most often gave the following reasons:
- they want Mutation Monkey to come up with a test;
- the mutated code was of minor importance;
- the mutated code was about to be deprecated;
- the code was still new and likely to undergo iteration before stabilising; and
- the mutated code was in a badly tested part of the code base (😕?!).
In other words, this new approach seems to be better than existing approaches, but still yields too many false positives.