Alex Weisberger
Towards More Effective Software Testing: Equivalence Partitioning and Domain Analysis

Bugs bother me. In 2020, why haven’t we figured out how to release bug-free software? The truth is, a large part of making software is creating, finding, and fixing bugs. In some ways this is unavoidable: I don’t think humanity is in a place where we’re capable of building large-scale software that is guaranteed to be free of all bugs, in the same way that we haven’t solved the majority of the Millennium Prize Problems. Shouldn’t ensuring that a program works as expected be as simple as testing every branch and code path? We’ll look at how code coverage metrics fall short, and why software verification is such a difficult endeavor overall. There is hope, but it requires a shift in thinking.

A Game of Patience

Let’s look at an example to provide a setting for thinking about different testing approaches and their effectiveness. Solitaire (also known as Patience) is a category of playing card games that all follow similar rules. FreeCell is one such game.
The main aspect of the game is moving cards between columns in accordance with certain legality constraints (read: business logic). Columns of cards are commonly referred to as “cascades.” You are free to move cards between cascades as much as you want, provided they are legal moves. There are several other elements to the game, but we’ll keep the discussion to these moves and their legality. Here is what the game layout looks like:

FreeCell card layout

One legality constraint is:

Single cards can be moved between cascades if the source and destination cards have different colors and they are in descending order.

For example, an 8 of diamonds can be moved onto a 9 of spades (as seen in column 3 of the picture) since they are different colors and 8 is one less than 9. Let’s write a test that asserts this constraint (code examples are written in Reason):

let testCascadeMoveLegality = () => {
  let eightOfDiamonds = {suit: Diamonds, rank: 8};
  let nineOfSpades = {suit: Spades, rank: 9};
  let tenOfSpades = {suit: Spades, rank: 10};

  let isEightToNineLegal =
    isLegalMove(eightOfDiamonds, nineOfSpades);

  let isNineToEightLegal =
    isLegalMove(nineOfSpades, eightOfDiamonds);

  let isNineToTenLegal =
    isLegalMove(nineOfSpades, tenOfSpades);

  [
    assertEqual(
      ~expected=true,
      ~actual=isEightToNineLegal,
      "8d -> 9s is legal",
    ),
    assertEqual(
      ~expected=false,
      ~actual=isNineToEightLegal,
      "9s -> 8d is not legal",
    ),
    assertEqual(
      ~expected=false,
      ~actual=isNineToTenLegal,
      "9s -> 10s is not legal",
    ),
  ];
};

I’ll start abbreviating cards as rank followed by the first letter of the suit, e.g. “8d” is the 8 of diamonds. This test asserts that 8d → 9s is a legal move, and also tests two cases where the move is not legal: 9s → 8d and 9s → 10s. Let’s add a corresponding implementation:

type suit =
  | Clubs
  | Diamonds
  | Hearts
  | Spades;

type card = {
  suit: suit,
  rank: int,
};

let areCardsDifferentColor = (c1, c2) => 
  c1.suit != c2.suit;

let areRanksInDescendingOrder = (source, dest) =>
  dest.rank == source.rank + 1;

let isLegalMove = (source, dest) =>
  areCardsDifferentColor(source, dest)
  && areRanksInDescendingOrder(source, dest);

We run the tests, and everything passes. Ship it!

A few hours later, though, a bug report comes in. It turns out there’s a bug in the above implementation. Can you find it? Take a minute and try.

The bug is that the implementation of areCardsDifferentColor only checks that the cards’ suits are different. Since there are 2 suits within each color, cards can have the same color but different suits — e.g. clubs and spades. I ran istanbul on this code and it reports 100% code coverage across all categories. How did the bug still get through?

The Blind Spot of Code Coverage

In Toward a Theory of Test Data Selection[1], the authors analyze the different types of errors that occur in software implementations. The list probably isn’t exhaustive, but it’s useful nonetheless. They identify three error categories, none of which are reliably caught by testing all code paths of a program. For example, a missing control flow path occurs when the implementation fails to test for a condition that it should, and consequently doesn’t have a discrete path for inputs that meet that condition. In other words, there is a scenario in the real world that the code doesn’t recognize as unique or interesting. This isn’t just theory; it’s exactly what happened in the previous bug. To fix it, there need to be additional code paths that map each suit to its color. By definition, code coverage can’t alert us to bugs in this category, because you can’t cover paths that don’t exist.

This is the blind spot of code coverage: it doesn’t account for all of the possible inputs to the program. If the control flow graph of a program is like a network of roads, 100% “code coverage” could be achieved by driving over each road once. But if you’re a postal worker, driving your route to completion doesn’t mean that you delivered mail to all of the correct houses.

In the same way, a program is more than simply a single traversal through all of its code paths. The code paths provide a set of possibilities (a road system), but the full behavior of the program is determined by the set of all possible inputs traversing all of the code paths.

Many inputs can map to the same result. Measuring code coverage does not ensure that every input has the correct result, so do not depend on it entirely for catching bugs.

The Sober Reality of Exhaustive Testing

We now know that 100% code coverage doesn’t reliably catch all bugs, so what if we just generate all inputs to make sure that we exhaustively test our implementation code? Let’s revisit the function signature of isLegalMove along with the card and suit data types.

type suit =
  | Clubs
  | Diamonds
  | Hearts
  | Spades;

type card = {
  suit: suit,
  rank: int
};

let isLegalMove: (card, card) => bool;

What are all of the possible inputs that isLegalMove accepts? The type definitions provide a canvas for thinking about the number of possible values for each type. isLegalMove takes two cards, so let’s first quantify how many possible card values there are. There are only 4 valid suit values. For simplicity’s sake, let’s say that we are only running on 32-bit processors, so there are 2³² possible int values. The number of possible card values is then 4·2³² = 2³⁴ = 17,179,869,184.

Something isn’t right here: we know that there are only 52 cards in a deck. How can this number be so high? int is a very general data type, so what if we impose a stricter constraint on the card type:

type rank = 
  | Ace | Two | Three | Four
  | Five | Six | Seven | Eight
  | Nine | Ten | Jack | Queen | King;

type card = {
  suit: suit,
  rank: rank
};

A card rank can’t actually be any integer. There are only 13 possible card ranks (Ace through King), and we model that with the new rank type. Now there are only 13 rank values, so there are 4·13 = 52 possible card values. Not only does that more accurately model the real world, but it reduces the number of values we have to think about as possible inputs by several orders of magnitude.

Yaron Minsky coined the phrase “make illegal states unrepresentable” for this type of data modeling in Effective ML[2].

Looking back at isLegalMove, we see that it takes 2 cards as input. This means that it takes in all possible 2-card combinations, or more accurately, 2-card partial permutations, since order matters (there is a clear source and destination card). There are 52·51 = 2,652 of these permutations. The number of input combinations explodes combinatorially, even after we reduced the number of possible card values. Is there a more efficient way to test this behavior than running all 2,652 test cases?
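As a quick sanity check on that arithmetic, the counts can be computed directly (a small sketch; cardCount and orderedPairs are names introduced here just for illustration):

```reason
/* 4 suits x 13 ranks = 52 distinct cards. */
let cardCount = 4 * 13;

/* Ordered (source, dest) pairs of distinct cards:
   52 choices for the source times 51 for the destination. */
let orderedPairs = cardCount * (cardCount - 1);

Js.log(orderedPairs); /* 2652 */
```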

For a great read on combinatorics, check out Intuition Behind Permutations and Combinations[3] by Shawn O’Mara.

Domain Analysis: The Art of Optimizing Input Combinations

So code coverage doesn’t even guarantee reliability, and input combinations explode combinatorially. What option do we have then? If we could identify some of the input combinations as redundant, that would allow us to use a far smaller number of them as test inputs. The trick is finding the right level of “redundancy,” and the process is more formally known as equivalence partitioning. Partitioning the full set of combinations into redundant groups (known as equivalence classes) requires thinking about the game rules as they’re stated. As a reminder:

Single cards can be moved between cascades if the source and destination cards have different colors and they are in descending order.

The first equivalence classes that come to mind are rank-pairs that are in descending order after moving, and rank-pairs that are not, e.g.

descendingRanks = { (Ace, Two), (Two, Three)… }
arbitraryRanks = { (Ace, Three), (Ace, Four), … (Two, Four), … }

Elements of these sets should be treated equivalently within isLegalMove, provided that the rank values are sequential. Let’s write a quick test for that property since there’s nothing in the type system that enforces it for the rank type:

let ensureRanksAreSequential = () => {
  module L = Belt.List;

  let allRanks = [
    Ace, Two, Three, Four,
    Five, Six, Seven, Eight,
    Nine, Ten, Jack, Queen, King,
  ];

  let values = L.map(allRanks, rankValue);

  let (allSequential, _) =
    L.reduce(
      values,
      (true, 0),
      ((allSeq, prev), v) =>
        (allSeq && v - prev == 1, v),
    );

  [
    assertEqual(
      ~expected=true,
      ~actual=allSequential,
      "Ranks must be sequential in order to have equivalence",
    ),
  ];
};

This depends on a new rankValue function:

let rankValue = r =>
  switch (r) {
  | Ace => 1
  | Two => 2
  | Three => 3
  | Four => 4
  | Five => 5
  | Six => 6
  | Seven => 7
  | Eight => 8
  | Nine => 9
  | Ten => 10
  | Jack => 11
  | Queen => 12
  | King => 13
  };
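Note that once rank is a variant instead of an int, the original areRanksInDescendingOrder (which did integer arithmetic directly on the rank field) no longer typechecks. A sketch of the adjusted version, going through rankValue:

```reason
/* Ranks are now compared through rankValue rather than
   raw integer arithmetic on the rank field. */
let areRanksInDescendingOrder = (source, dest) =>
  rankValue(dest.rank) == rankValue(source.rank) + 1;
```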

The concept of color is a little trickier. The only equivalence classes we can depend on are cards partitioned by suit. Since any card of a suit can stand in for any other card of that suit, the suit combinations are:

suits = { h, s, d, c }

differentColorSuits = { (h, s), (h, c), (d, c), (d, s) }

sameColorSuits = { (h, h), (h, d), (d,d), (c, c), (c, s), (s, s) }
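These partitions can be double-checked mechanically. A sketch that classifies every ordered suit pair by color (the color type and suitColor helper are illustrative assumptions, not part of the article’s implementation; the suit type is repeated so the sketch is self-contained):

```reason
type suit = Clubs | Diamonds | Hearts | Spades;
type color = Red | Black;

let suitColor = suit =>
  switch (suit) {
  | Hearts | Diamonds => Red
  | Clubs | Spades => Black
  };

let allSuits = [Clubs, Diamonds, Hearts, Spades];

/* Ordered suit pairs whose colors are the same. The article lists
   unordered pairs, so each mixed-suit pair appears here twice. */
let sameColorPairs =
  Belt.List.flatten(
    Belt.List.map(allSuits, s1 =>
      Belt.List.keepMap(allSuits, s2 =>
        suitColor(s1) == suitColor(s2) ? Some((s1, s2)) : None
      )
    )
  );
```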

Zooming out, let’s see how these different partitions interact:

Set diagram of the different subdomains of card moves

The suit pairs fall into two broad categories: Different Color Pairs and Same Color Pairs. For each suit-pair, the ranks of the cards can be either descending or arbitrary (Desc / Arb), resulting in four subdomains of interest:

  • Descending different color pairs
  • Arbitrary different color pairs
  • Descending same color pairs
  • Arbitrary same color pairs

We can now choose card values by selecting values from the equivalence classes that meet both constraints of each subdomain:

Descending different color pairs

(2h, 3c)
(2h, 3s)
(2d, 3s)

Arbitrary different color pairs

(4c, 2d)
(2d, 5s)

Descending same color pairs

(2h, 3h)
(4c, 5s)

Arbitrary same color pairs

(3s, 5s)
(3c, 5s)

Now we write a test that exercises these pairs:

let testDomainAnalysis = () => {
  module L = Belt.List;

  let twoOfHearts = {suit: Hearts, rank: Two};
  let threeOfHearts = {suit: Hearts, rank: Three};  
  let threeOfClubs = {suit: Clubs, rank: Three};
  let fourOfClubs = {suit: Clubs, rank: Four};
  let twoOfDiamonds = {suit: Diamonds, rank: Two};
  let threeOfSpades = {suit: Spades, rank: Three};
  let fiveOfSpades = {suit: Spades, rank: Five};

  let descendingDifferentColor = [
    (twoOfHearts, threeOfClubs, true, "2h -> 3c is legal"),
    (twoOfHearts, threeOfSpades, true, "2h -> 3s is legal"),
    (twoOfDiamonds, threeOfSpades, true, "2d -> 3s is legal"),   
  ];

  let arbitraryDifferentColor = [
    (fourOfClubs, twoOfDiamonds, false, "4c -> 2d is not legal"),
    (twoOfDiamonds, fiveOfSpades, false, "2d -> 5s is not legal"),
  ];

  let descendingSameColor = [
    (twoOfHearts, threeOfHearts, false, "2h -> 3h is not legal"),
    (fourOfClubs, fiveOfSpades, false, "4c -> 5s is not legal"),
  ];

  let arbitrarySameColor = [
    (threeOfSpades, fiveOfSpades, false, "3s -> 5s is not legal"),
    (threeOfClubs, fiveOfSpades, false, "3c -> 5s is not legal"),    
  ];

  let testTable =  L.flatten([
    descendingDifferentColor, 
    arbitraryDifferentColor, 
    descendingSameColor, 
    arbitrarySameColor
  ]);

  L.map(testTable, ((source, dest, expected, msg)) =>
    assertEqual(
      ~expected,
      ~actual=isLegalMove(source, dest),
      msg,
    )
  );
};

Running this set of tests results in one failure: 4c → 5s is not a legal move, but the implementation says it is. Instead of playing whack-a-mole, adding tests as bugs arise in production, we analyzed the full input domain and partitioned it into discrete subdomains. Creating tests from these subdomains caught a bug that we weren’t thinking about, an unknown unknown. The key to this kind of thinking is that the subdomains represent real scenarios of the game itself, independent of any code or implementation. If you were to play a game of FreeCell, you might actually try to move a 4 of clubs onto a 5 of spades, and the code must handle that case correctly. These test cases push the implementation to handle this real-world scenario when code coverage metrics cannot.
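For completeness, here is a minimal sketch of a fix (the author links the actual fix in the comments; the color type and suitColor helper are introduced here for illustration):

```reason
type color = Red | Black;

/* The missing control flow path: map each suit to its color,
   so clubs and spades (both black) compare as the same color. */
let suitColor = suit =>
  switch (suit) {
  | Hearts | Diamonds => Red
  | Clubs | Spades => Black
  };

let areCardsDifferentColor = (c1, c2) =>
  suitColor(c1.suit) != suitColor(c2.suit);
```

With this version, 4c → 5s is rejected because both suits map to Black, while 8d → 9s remains legal.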

Partitioning a large input domain into subdomains of equivalent values can expose unknown bugs without the need for exhaustive testing.

Elephants

Doing this kind of analysis is interesting and powerful, but is it necessary for a simple CRUD form? The answer (like all answers) is: it depends. We should always ask this question of ourselves, and I particularly like how Rex Black thinks about it in Pragmatic Software Testing[4], where he outlines a process of “quality risk analysis” that boils down to:

Test where the odds favor the highest benefit. Finding likely bugs is beneficial. Finding dangerous bugs is beneficial. The more likely or dangerous any particular group of potential bugs is, the more time and money you should invest in looking for them.

If you’re experimenting, discovering product-market fit, or working in a less risky area of the product, by all means, just ship it! I’m a huge proponent of iteration speed myself, so it’s important to weigh the cost of thorough testing. This requires honest reflection about the project and an understanding of what is and is not “important” in the current business context. However, you probably want to thoroughly test the parts of your application that are absolutely essential to your business. You also want to spend more time preemptively finding bugs in areas where failure is very costly, like data security or money transactions. In these cases, the up-front investment is almost certainly worth it.

We are also frequently testing much more than pure functions of business logic. We’re building asynchronous UIs communicating with horizontally scaling servers that in turn communicate with databases. Does domain analysis help with that? These systems are harder to reason about than a game of FreeCell, but the input domains are still there. Starting to think this way can also lead us to design our code so that the system can be reasoned about in these terms.

No matter what, thinking about the possible inputs to a system instead of just existing code paths is a necessary mental shift, as we showed when we exposed a bug in an implementation that had a test suite with 100% code coverage. Effective tests find scenarios where the ultimate goal of an application is not met. When we think about inputs, we think more about this goal and not just the current shape of code.

References

Code examples

[1] Toward a Theory of Test Data Selection (Goodenough, Gerhart)
[2] Effective ML (Minsky)
[3] Intuition Behind Permutations and Combinations (O’Mara)
[4] Pragmatic Software Testing (Black)

Top comments (5)

Yawar Amin

Hi Alex, this is a great write-up. The natural next step is an isLegalMove implementation using case analysis, that shows the power of pattern matching at work. Do you have something in the works?

Alex Weisberger

Thanks for replying.

I chose to leave out the implementation change that fixes the bug because I thought it was pretty simple and wanted to be conscious of the post length. But here it is in the example code repo.

Or were you thinking that pattern matching should be used for something else?

Yawar Amin

I had a little fun implementing the function directly partitioning the inputs by case:

let isLegalMove = (source, dest) => switch (source, dest) {
  | ({suit: Clubs | Spades, rank: sourceRank}, {suit: Hearts | Diamonds, rank: destRank})
  | ({suit: Hearts | Diamonds, rank: sourceRank}, {suit: Clubs | Spades, rank: destRank}) =>
    rankValue(sourceRank) + 1 == rankValue(destRank)
  | _ => false
};

Alex Weisberger

That does show the power of pattern matching. You can capture a lot of permutations in a relatively small amount of code this way.

I'm partial to having descriptive names for operations, so I like how the current isLegalMove reads like English: areCardsDifferentColor && areRanksInDescendingOrder reads exactly like the game rules are stated.

That can remain the same and we can use this technique to reimplement areCardsDifferentColor:

let areCardsDifferentColor = (c1, c2) => 
  switch (c1.suit, c2.suit) {
  | (Clubs | Spades, Hearts | Diamonds) => true
  | (Hearts | Diamonds, Clubs | Spades) => true
  | _ => false
  };

It's definitely fun to think about the best way to express things.

Yawar Amin

Yeah, the way you broke down the implementation into helper functions at a lower level of abstraction–the step method–is probably more maintainable overall!