Kurt

Posted on Aug 1, 2022 • Originally published at getcode.substack.com on Jul 31, 2022

Property-based Testing #6: Random All the Way Down

#programming #python #testing

This is the sixth post in a series about property-based testing. This post describes "random shrinking", the third and last implementation of shrinking I know of. It keeps all the advantages of internal shrinking, which we discussed in part 5, and is much simpler.

The complete code for this post can be found on GitHub - in particular example.py and random_based.py.

Photo by Free Walking Tour Salzburg on Unsplash

The last posts discussed two different approaches to shrinking failing test cases, aiming to make the failing example easier to understand and debug. The first approach, direct shrinking, makes values smaller by directly changing them. The second approach, internal shrinking, instead tries to change the sequence of choices that are made during random generation. These choices are like the DNA of a generated value, and choices can be edited to make the resulting value smaller.

When we discussed the advantages of internal shrinking vs direct shrinking, we also noted that editing the choices can be tricky. Significant engineering effort is needed to make it work well. What if we could have our cake and eat it too - an approach to shrinking that keeps all the advantages of internal shrinking, but is much simpler to implement?

The unreasonable effectiveness of randomness

The idea is simple: instead of coming up with a deliberate algorithm to make values smaller, we just - in technical terms - throw shit to the wall and see what sticks. That is, we're already counting on random generation to find bugs, let's also randomly try to find smaller values that still make our test fail.

Unlike the shrinking strategies we've seen so far, which produce a limited number of smaller values to try, this new approach can keep trying forever. If the random generator is "random enough", a good shrink should show up at some point.

In practice, that can take too long, especially if the test itself is slow. However, we can avoid running a test, or even generating an actual value, by calculating the size of the value that we're about to generate beforehand. We'll keep the size of the smallest value that fails the test so far, and then randomly generate another and estimate its size while we generate. If that size is greater than that of the current smallest value, we're not interested in running the test with it, or creating the rest of the value. Instead we'll make the generator short-circuit - basically abort and move on to the next value. This avoids a lot of useless work.

I learned of this beautiful idea via CsCheck by Anthony Lloyd, and to the best of my knowledge he is also the inventor of it. As we'll see it's simple to implement - the shortest reference implementation of all - and it's effective. As with previous implementations, it's worth noting that CsCheck's approach is slightly more complex than what I'll describe here - I'm just capturing the main idea.

Matters of size

The implementation contains all the same concepts as the previous posts.

We make two modifications to Random. In addition to generating a random value, we also return the size of the generated value. For our purposes, the size is simply represented by a positive integer. The smaller the integer, the smaller the size. Additionally, to enable short-circuiting, the generator takes in the current minimal size, min_size.

Size = int

class SizeExceeded(Exception):
    pass

class Random(Generic[T]):
    def __init__ (self, 
        generator: Callable[[Optional[Size]], Tuple[T, Size]]):
        self._generator = generator

    def generate(self, min_size: Optional[Size] = None) -> Tuple[T, Size]:
        return self._generator(min_size)

If min_size is None, we're in the random generation phase of the test run - meaning we haven't found a failing test yet and so the generator should never short-circuit. If min_size is a Size value, we are shrinking, and min_size is the current smallest value for which the test fails. If a generator exceeds that size, there's no point in going on, and the generator should throw SizeExceeded to short-circuit.

Don't worry if that doesn't make sense yet - examples below.

Beginning with constant, as usual:

def constant(value:T) -> Random[T]:
    return Random(lambda: (value, 0))

A constant value has size 0 - no amount of random generation is going to make the constant value change, so if we have found a value with size 0 we can just stop shrinking.

Our old friend int_between is a bit more tricky.

def int_between(low: int, high: int) -> Random[int]:
    def zig_zag(i: int):
        if i < 0:
            return -2*i - 1
        else:
            return 2*i
    def generator(min_size: Optional[Size]):
        value = random.randint(low, high)
        size = zig_zag(value)
        # short circuit - implementation below
        dec_size(min_size, size)
        return value, size
    return Random(generator)

int_between uses so called ZigZag encoding to calculate the size of an int. ZigZag encoding is also used in Protocol Buffers. The gist of it is that each signed integer gets a positive size, by zigzagging between negative and positive integers. Unlike the implementation here, with a limited type like int32 or int64, you only need a few bit shifts and exclusive or to compute the size, which makes it pretty fast in practice. The idea is that integers with greater absolute value have greater size.

I've also introduced a dec_size method (for "decrease size"), which makes sure we short-circuit generation as soon as possible.

def dec_size(min_size: Optional[Size], 
             decrease: Size)
    -> Optional[Size]:

    if min_size is None:
        return None
    smaller = min_size - decrease
    if smaller < 0:
        raise SizeExceeded()
    return smaller

If min_size is not None, meaning we're currently shrinking, dec_size subtracts the size of the currently generated value from the current minimum size, and checks if that is still positive. If it's not, it throws SizeExceeded to indicate that the value we're about to generate is already greater than the best, minimum size - thereby short-circuiting. The less work we do, the more values we can generate in shorter time, and the more opportunities we have for finding smaller values. We're not yet using the result of dec_size but will do soon.

map is uninteresting - it just maintains the size:

def map(func: Callable[[T], U], 
        gen: Random[T])
    -> Random[U]:

    def generator():
        result, size = gen.generate()
        return func(result), size

    return Random(generator)

The size of mapN is the sum of the sizes of all its inputs:

def mapN(func: Callable[...,T], 
         gens: Iterable[Random[Any]])
    -> Random[T]:

    def generator(min_size: Optional[Size]):
        results: list[Any] = []
        size_acc = 0
        for gen in gens:
            result, size = gen.generate(min_size)
            min_size = dec_size(min_size, size)
            results.append(result)
            size_acc += size
        return func(*results), size_acc

    return Random(generator)

Both map and mapN assume that the size of the output value decreases as the size of the input values decreases. Reasonable, though it's easy to imagine a counterexample. Here's where the result of dec_size is used: as values for the inputs are generated, mapNsubtracts their size from min_size. This ensures that, if say we have 10 inputs, and we already exceed the min_size by the third generator, we short-circuit as soon as possible.

Finally, bind:

def bind(func: Callable[[T], Random[U]], 
         gen: Random[T])
    -> Random[U]:

    def generator(min_size: Optional[Size]):
        result,size_outer = gen.generate(min_size)
        min_size = dec_size(min_size, size_outer)
        result,size_inner = func(result).generate(min_size)
        size = size_inner+size_outer
        return result, size

    return Random(generator)

The size of bind is interpreted as the sum of the size of both inner and outer generators.

And that is pretty much it! We can literally reuse our implementation of Property and for_all, which I've repeated here as a reminder.

@dataclass(frozen=True)
class TestResult:
    is_success: bool
    arguments: Tuple[Any,...]

Property = Gen[TestResult]

def for_all(gen: Gen[T], 
            property: Callable[[T], Union[Property,bool]])
    -> Property:

    def property_wrapper(value: T) -> Property:
        outcome = property(value)
        if isinstance(outcome, bool):
            return constant(
                TestResult(
                    is_success=outcome, 
                    arguments=(value,)
                )
            )
        else:
            return map(
                lambda inner_out: replace(
                    inner_out, 
                    arguments=(value,) + inner_out.arguments
                ),
                outcome
            )
    return bind(property_wrapper, gen)

def test(property: Property):
    def find_smaller(min_result: TestResult, min_size: Size):
        ... # the only new bit

    for test_number in range(100):
        result, size = property.generate()
        if not result.is_success:
            print(f"Fail: at test {test_number} with arguments {result.arguments}.")
            find_smaller(result, size)
            return
    print("Success: 100 tests passed.")

Now, find_smaller is new but is hardly difficult. The idea is to keep generating random values until we find one that is both smaller than the current smallest known size, and that still fails the test:

def find_smaller(min_result: TestResult, min_size: Size):
    skipped, not_shrunk, shrunk = 0, 0, 0
    while skipped + not_shrunk + shrunk <= 100_000 and min_size > 0:
        try:
            result, size = property.generate(min_size)
            if size >= min_size:
                skipped += 1
            elif not result.is_success:
                shrunk += 1
                min_result, min_size = result, size
            else:
                not_shrunk += 1
        except SizeExceeded:
            skipped += 1

    print(f"Shrinking: gave up at {min_result.arguments}")
    print(f"{skipped=} {not_shrunk=} {shrunk=} {min_size=}")

We stop shrinking after trying 100,000 values, or if we somehow found the smallest possible size of 0. For each try, there are three possible outcomes:

"skipped". This means that during or after generation of the value, we detected that its size is greater than our current minimum size.
"shrunk". Successful shrink - we've found a smaller value that still fails the test.
"not_shrunk". Unsuccessful shrink - we've found a smaller value, but it passes the test.

Before we check how that works out in practice, let's take a break and have a healthy snack.

Photo by Robson Melo on Unsplash

Putting it to the test

Now the question is - does this even work? It seems naively simple. Only one way to find out - using our canonical prop_wrong_sort_by_age example. Note once again we didn't have to change generators or test code, and so I won't repeat that code here.

> test(prop_wrong_sort_by_age)
Fail: at test 0 with arguments ([Person(name='pjahgc', age=27), Person(name='gndrlt', age=70), Person(name='qukflk', age=79), Person(name='dknczu', age=2), Person(name='jqtqgr', age=8), Person(name='xlcxhk', age=22), Person(name='wotanl', age=5), Person(name='nxkupy', age=99), Person(name='pxngky', age=31)],).

Shrinking: gave up at arguments ([Person(name='etkdgb', age=8), Person(name='haagjh', age=5)],)

skipped=81616 not_shrunk=18373 shrunk=12 min_size=2502

> test(prop_wrong_sort_by_age)
Fail: at test 0 with arguments ([Person(name='cigepg', age=69), Person(name='lqkqmp', age=100), Person(name='nlgbbl', age=33), Person(name='xrnzfq', age=76), Person(name='ujjnfz', age=34), Person(name='xlcvxf', age=19)],).

Shrinking: gave up at arguments ([Person(name='adcyxh', age=1), Person(name='hrclbb', age=0)],)

skipped=81931 not_shrunk=18055 shrunk=15 min_size=2530

Each of these examples took a couple of seconds to generate. The second example is one of the best results I've seen, the first is more typical. A few notes:

It's alive! I've only tried a dozen or so times, but it always manages to shrink to a list of two Persons, with relatively low ages.
This approach has more problems with shrinking lists of letters - it's not even clear that there is a pseudo-random seed that generates the letters "aaaaaa" and "aaaaaab" in succession, which is what would have to happen for those "minimal" names to show up.
The stats in the last line of each example are typical. skipped happens about 80% of the time, not_shrunk about 20% of the time. The number of shrunk values is almost insignificant! These numbers are similar if I try 1,000,000 times instead of 100,000 times. In fact, the number of successful shrinks remains around 10-15 even with 10x more tries. This indicates that for this case at least, it's unlikely to help much if we keep shrinking for longer.

Picking a winner

This third and last approach to shrinking sounds almost too simple to work. But it does, and pretty well at that. We've removed all human cleverness from the shrinking process, and as a result we gain a straightforward implementation, while keeping an integrated generation and shrinking API. Shrinking works just as well for mutable as for immutable values, and we can reproduce any run with just a single seed value for the pseudo-random number generator (typically a couple int64 values).

Now, I've emphasized that this approach is by far the simplest to implement, and simplicity is its own reward. But does it also buy us anything? Yes it does - besides having all the advantages of internal shrinking, it's also trivial to parallelize. All you need is a pseudo-random number generator that can generate several separate streams of random numbers. Secondly, if you have the code and the random seed, it's easy to pause and resume the shrinking process at a later time.

After all that, you may be left with a question: what's the best shrinking strategy? As usual, I would say "it depends", and rephrase the question: if you're writing a property-based testing library from scratch, which approach should you choose?

The answer to that question is random shrinking, in my view. Faced with such decisions, it's tempting to find the "best" solution, by some measure. Instead, I think it's better to find the solution that gets people something useful, in the least amount of time, and with the highest optionality - meaning, while keeping as many options open as possible.

In this case, random shrinking is certainly going to produce something useful (i.e. it will make counterexamples smaller). It is by far the simplest and thus fastest solution to implement. And finally, we can move to another approach without bothering our users much, which keeps our options open. Also, it's feasible to tack on any of the other shrinking methods after random shrinking: for example, we can try to find the smallest example using random shrinking, and then try and make that value even smaller by any of the two other methods.

Finally

Phew! When I started this series I didn't think I'd get to six posts on property-based testing, and the worst is that I'm not even done - for example there's approaches that allow exhaustive generation of values as input to a property-based test. That said, I'm going to take a break (of currently unknown duration) from writing about property-based testing, because I have a bunch of ideas for other topics I'm excited to write about.

Until next time, and as always, happy to hear from you in the comments or on Twitter @kurt2001.