Kurt

Posted on Jun 10, 2022 • Originally published at getcode.substack.com on May 12, 2022

Property-based testing #2: The Essentials of Vintage QuickCheck

#testing #python #programming

This second post in the series goes through the design and implementation of the original property-based testing library, QuickCheck. The first and introductory post is Property-based testing #1: What is it anyway? Even if you already know what property-based testing is, it's worth familiarizing yourself with the example.

The complete code can be found on GitHub - in particular example.py and vintage.py.

Photo by Timothy Dykes on Unsplash

Last time we looked at why you'd want to write property-based tests, and introduced the basic functionality that a property-based testing library should provide. Now, we'll dive into how a property-based testing library with random generation is implemented - and more in particular the approach set out by the OG, QuickCheck. We'll create a mini reference implementation in just a couple pages of code (including comments and examples)!

At its core, any modern PBT library provides three related modules.

The first allows defining and combining generators - a few functions to create generators for primitive types, ways to skew or restrict generated values, and combine them to generate custom types. For example, to generate the age field of Person objects, you'd want to restrict to positive values up to say 100. To write a generator for Person objects, you'd want to combine the generator for age and the generator for names. I'll call this the generator module.

The second allows defining properties, which means specifying some parametrized test, and specifying which generator you'd like to use for it. It also includes ways to observe and log the distribution of generated values so you can convince yourself the generated values are sensible. I'll call this the property module.

The third and last allows running a series of tests based on some configuration, e.g. the number of test cases to generate, how long the test should be running, and so on. I'll call this the runner module.

As a sneak peek ahead, at the end of this post we'll be able to define generators, and run tests for the sort_by_age property we discussed in the last post. Recall that Person is defined as:

@dataclass(frozen=True)
class Person:
    name: str
    age: int

We'll be able to create a generator for random Person objects as follows:

ages = int_between(0,100)

letters = map(chr, int_between(ord('a'), ord('z')))

simple_names = map("".join, list_of_length(6, letters))

persons = mapN(Person, (simple_names, ages))

lists_of_person = list_of(persons)

These are five random generators for random ages, lowercase letters, simple names consisting of 6 random lowercase letters, person objects and lists of person objects respectively. We'll implement and discuss the int_between, map, mapNand list_of functions below - they're part of the generator module of our library.

We'll also be able to define properties though the for_all function:

def is_valid(persons_in: list[Person], persons_out: list[Person]) -> bool:
    same_length = len(persons_in) == len(persons_out)
    sorted = all(persons_out[i].age <= persons_out[i + 1].age
                for i in range(len(persons_out)-1))
    unchanged = { p.name for p in persons_in } == { p.name for p in persons_out }
    return same_length and sorted and unchanged

prop_sort_by_age = for_all(
    lists_of_person,
    lambda persons_in: is_valid(persons_in, sort_by_age(persons_in))
)

Finally, through the test function, we'll check that the property holds for 100 randomly generated lists:

> test(prop_sort_by_age)
Success: 100 tests passed.

All things random

Let's first focus on how to create a generator API - as we'll see the two other modules are entirely based on it.

The first question is - why bother at all and not just use the built-in random API? Most languages have one, here's Python's random module.

The main reason is abstraction - at this point it may look like premature abstraction, but in the following posts we'll change the implementation of generators quite a few times, to add features and change their behaviour, and we won't have to change the users of the API. This is powerful, both from the user and the maintainer's perspective - the maintainer has considerable freedom to improve, without bothering users too much. One straightforward example is changing to a different pseudo-random number generator that is faster or has better properties.

Another reason is that typical built-in random APIs are quite poor - Python's has more features than some other languages, but still, writing the above Person generator would require quite a bit of plumbing.

Without further ado, let's first take a stab at what the representation of a generator should be. Since a generator is something that when asked, can produce a new random value, the simplest representation is a function without arguments that returns a value. To make the code more readable and safe, we wrap it in a small class:

class Random(Generic[Value]):
    def __init__ (self, generate: Callable[[], Value]):
        self._generate = generate

    def generate(self) -> Value:
        return self._generate()

Pretty straightforward - Random is a class that just wraps a function that generates a value of some generic type. To make what follows visible, let's first implement a simple function to sample the actual values from a generator - even though we don't have any to sample yet!

def sample(gen: Random[T]) -> list[T]:
    return [gen.generate() for _ in range(5)]

sample takes a Random instance and calls generate() on it 5 times, collecting the results in a list.

With that in place, we'll now build up increasingly powerful functions that allow us to create and combine Random generators, so we can ultimately generate a list of Person values - which is what we need to be able to check the properties of sort_by_age.

Let's write the simplest generator possible - one that always generates the same value.

For each new function, I'll first give the implementation with a usage example. Then I'll show the output of sampling the generator.

def constant(value:T) -> Random[T]:
    return Random(lambda: value)

pie = constant(math.pi)

constant takes some value as argument, and creates a Random instance which always generates that value:

> sample(pie)
[3.141592653589793,
 3.141592653589793,
 3.141592653589793,
 3.141592653589793,
 3.141592653589793]

Amazing. Now slightly more exciting.

def int_between(low: int, high: int) -> Random[int]:
    return Random(lambda: random.randint(low, high))

ages = int_between(0,100)

int_between creates generators for integers within a range. It's a straightforward wrapper around Python's random.randint. Let's use it to generate reasonable ages for people:

> sample(ages)
[37, 15, 1, 62, 21]

That's one field done, now how about generating names for Person? Generating random letters seems like a good start. Since we already know how to generate integers, and it's straightforward to turn an integer into a letter, we should be able to build on int_between. It'll be generally useful to be able to convert, or map, generated values to another type. This idea is analogous to using the built-in function map for iterables. So let's add map to our arsenal, and use it to write a lowercase letter generator.

def map(f: Callable[[T], U], gen: Random[T]) -> Random[U]:
    return Random(lambda: f(gen.generate()))

letters = map(chr, int_between(ord('a'), ord('z')))

map takes a function and a Randominstance, and then applies the function to each value of the underlying generator.

> sample(letters)
['t', 'm', 's', 'u', 'c']

That's great, but a single letter doesn't make for a believable name. We could map again and repeat the same letter a number of times, but I think you'll agree that zzzzz is not exactly an interesting name either. What we'd like to have is a method that runs a given generator N times, and combines the result.

The answer: mapN. mapN applies a function to N generators, like map applies a function to a single generator (so it's actually more general than what we need for this particular example). A related function you may be more familiar with is zip, which combines the inputs into a tuple.

def mapN(f: Callable[...,T], gens: Iterable[Random[Any]]) -> Random[T]:
    return Random(lambda: f(*[gen.generate() for gen in gens]))

Given mapN we can write a list_of_length function, which generates a list of a given length, where each element is generated from a given Random generator:

def list_of_length(l: int, gen: Random[T]) -> Random[list[T]]:
    gen_of_list = mapN(lambda *args: list(args), [gen] * l)
    return gen_of_list

With all that we're finally able to write a somewhat reasonable Person generator - thanks to mapN again, now used in its generality:

simple_names = map("".join, list_of_length(6, letters))
persons = mapN(Person, (simple_names, ages))

> sample(persons)
[Person(name='vuaufc', age=38),
 Person(name='kmvmpy', age=34),
 Person(name='wiuoph', age=34),
 Person(name='cvuxbu', age=72),
 Person(name='fykqmn', age=33)]

Pleased to meet you, Mr vuaufc!

Okay, perhaps these are not the most believable names - but they will do for our purposes, which is to eventually test sort_by_age.

As the final piece of the puzzle, we'd like to generate lists of Person. For the moment, we can only generate lists of a given length, using list_of_length. Without breaking abstraction (e.g. by using random directly) we can't yet generate a list of random length with random elements. That's because we have no way to write a generator that's dependent on the random value generated by another generator - we simply have no way to put them together. (Try to write it with only the functions so far to convince yourself, if you like)

This is where the following function helps:

def bind(f: Callable[[T], Random[U]], gen: Random[T]) -> Random[U]:
    return Random(lambda: f(gen.generate()).generate())

Before going into the details of its implementation, here's how you use it:

def list_of(gen: Random[T]) -> Random[list[T]]:
    length = int_between(0, 10)
    return bind(lambda l: list_of_length(l, gen), length)

lists_of_person = list_of(persons)

> sample(lists_of_person)
[[Person(name='rsgiud', age=81), Person(name='ywcelx', age=55)],
 [Person(name='bskmxa', age=99), Person(name='jaonqc', age=55),
  Person(name='igelts', age=67)],
 [Person(name='yuhtkg', age=78)],
 [Person(name='coessh', age=43), Person(name='fpjnqi', age=18),
  Person(name='fjojvw', age=86), Person(name='tbneue', age=97),
  Person(name='ejirnj', age=76)],
 [Person(name='wopfbo', age=22), Person(name='yjgfkf', age=39),
  Person(name='xsslyn', age=34), Person(name='kjpqus', age=12),
  Person(name='cjlldt', age=24), Person(name='bqthxy', age=29),
  Person(name='ttqlhx', age=20), Person(name='wyumfv', age=99),
  Person(name='ucdeqz', age=14), Person(name='eqrxir', age=41)]]

Note there are 5 lists that each have a different, randomly chosen length.

How does bind work? It first runs the given generator, and feeds the resulting value into f. f takes the value as an argument and returns a new generator, which is then run in turn. The fact that the user can write the f function and is given a value from gen, allows expressing the dependence we need - we can now generate a random list length, and then use bind to plumb that into a generator of a random list with fixed length.

Before taking a step back to look at what we've got in our generator module - there's one implementation subtlety in bind that is worth mentioning. It's tempting to think that the implementation can be simplified to:

def bad_bind(f: Callable[[T], Random[U]], gen: Random[T]) -> Random[U]:
    # let's get rid of the lambda
    return Random(f(gen.generate()).generate)

Unfortunately that doesn't work - it only generates from gen and applies f once, which means in the example it would create a generator that chooses a random length once, and then forever returns random lists of that length.

It's maps all the way down

With a small but beautiful arsenal of functions - int_between, constant, map, mapN and bind, we can generate the list of person objects we need. With small additions like choice (which you can find in the source code, implemented in terms of bind) and filter this handful of functions is the sufficient and necessary core of any generator module.

As we touched on in the last post, these functions are sometimes called "combinators" because they are designed to be combined without limit. All of the functions return a Random object, and so can be used again as an argument of a further combinator - another way of saying that is that they are infinitely composable. That makes it a beautiful and powerful approach to API design.

This may have all felt quite familiar if you've used collections APIs before - most of the generator functions have corresponding built-in python functions that work on collections (iterables, more precisely).

int_between is like range
mapis like map
mapNis like zip+ map
bindis like a nested forloop in a list comprehension

This is perhaps not a surprise - generators are infinite streams of random values, and so any API that can work with a container of values of some sort can support similar operations. If Python would allow changing how generator expressions are processed, we'd be able to write random generators like:

# not actually Python
lists_of_person = gen (
    pl
    for length in int_between(0, 10)
    for pl in list_of_length(length, persons)
)

Which can be automatically translated into the version with bind (which actually works!) above - illustrating that for x in generator ... is equivalent to bind(lambda x: ..., generator).

An interesting aspect is that the sequence of functions map, mapN and bind gives the user an increasing amount of expressive power. One way to understand this is that map and mapN can be implemented in terms of bind, but bind can not be implemented in terms of map and mapN. The same is true for map vs mapN - if you have mapN you have map, but not vice versa.

Why not just implement bind then? Indeed we could do that without any loss of functionality or expressivity. The flip side is that bind exposes strictly less information to the implementation and so bind can make fewer assumptions about what's intended compared to map and mapN. This is reflected in the relative complexity of their implementations. As a result typically bind is less performant than mapN which is less performant than map - if you can write it with map, don't use bind to give the implementation more to work with.

This pattern of functions - constant, map, mapN and bind - returns time and time again in various guises in collection libraries, streams, reactive libraries and the like. In languages with expressive type systems like Haskell explicit abstractions or interfaces exist to encapsulate them - Functor, Applicative and Monad respectively. The concepts also have a deep connection to category theory, which is a branch of mathematics that Eric Meijer describes as "the Mathematicians' interpretation of interface-based design".

We can only scratch the surface of that here - hopefully you now have an intuitive idea of what these "API design patterns" entail with some leads to learn more if you're interested. Many libraries besides PBT use these patterns and recognizing them makes it easier to get started. As a library designer, it's definitely useful to know and learn when these patterns can be used - they're a valuable tool in your toolbox.

If you've made it this far, now would be a good time to reward yourself with a slice of pecan pie!

Photo by Levi Guzman on Unsplash

Let me write some tests already

Now that we can write random generators, let's go the final distance and write a few functions to define properties and run some tests.

First, we need a way for the user to write a property. A property is just a combination of a generator and an assertion. We'll run the assertion 100 times with randomly generated values.

The function to connect a generator to an assertion is typically called "for all", because this is reminiscent of the logical for all quantifier. In prose, we want to write "For all lists of person, the result of sort by age is valid." In some sense "for all" is a misnomer, as of course we won't really be testing or proving anything for all values, but let's stick with it anyway.

For simplicity we'll write assertions as functions that return True if the assertion holds, False if not. More traditionally you'd use assert and fail the test on exception, but doing it this way allows us to build up the implementation in an easier to understand way.

Our first attempt at defining a property, is that it's a boolean generator indicating whether the test has failed or not. That means we don't even need a new class - we can just use Random[bool]:

Property1 = Random[bool]

def for_all_1(gen: Random[T], property: Callable[[T], bool]) -> Property1:
    return map(property, gen)

sort_by_age_1 = for_all_1(lists_of_person, lambda persons_in: is_valid(persons_in, sort_by_age(persons_in)))

# wrong_sort_by_age is sort_by_age with a bug,
# to show what a failing test looks like.
wrong_sort_by_age_1 = for_all_1(lists_of_person, lambda persons_in: is_valid(persons_in, wrong_sort_by_age(persons_in)))

Our first attempt at implementing for_all is just a map, applying the assertion to each generated element. Since a property created by for_all is just a Random instance, we can sample it:

> sample(sort_by_age_1)
[True, True, True, True, True]

> sample(wrong_sort_by_age_1)
[False, True, False, False, False]

That's not user-friendly output though. In the actual test function, we generate 100 values - if they're all True, our test has succeeded:

def test_1(property: Property1):
    for test_number in range(100):
        if not property.generate():
            print(f"Fail: at test {test_number}.")
            return
    print("Success: 100 tests passed.")

In action:

> test_1(sort_by_age_1)
Success: 100 tests passed.

> test_1(wrong_sort_by_age_1)
Fail: at test 0.

Okay that seems to work, but it doesn't allow writing properties that draw from two or more generators - it would be nice if we could nest for_all. Here's for_all_2 which allows that:

def for_all_2(gen: Random[T], property: Callable[[T], Union[Property1,bool]]) -> Property1:
    def property_wrapper(value: T) -> Property1:
        outcome = property(value)
        if isinstance(outcome, bool):
            return constant(outcome)
        else:
            return outcome
    return bind(property_wrapper, gen)

sum_of_list_2 = for_all_2(
    list_of(int_between(-10,10)), lambda l:
        for_all_2(int_between(-10,10), lambda i: 
            sum(e+i for e in l) == sum(l) + len(l) * i))

We allow the property function to return either a bool value (in which case it's the innermost for_all) as before, or a Property1 (in which case it's one of the outer for all functions). Because we may now get a Property1 back, which is a Random[bool], it's not enough to use map: we have to upgrade to bind in the implementation of for_all_2.

As a final improvement let's print the generated arguments if the test fails. We'll use the same trick as before by piggy-backing on Random, but instead of bool use a richer type TestResult:

@dataclass(frozen=True)
class TestResult:
    is_success: bool
    arguments: Tuple[Any,...]

Property = Random[TestResult]

def for_all(gen: Random[T], property: Callable[[T], Union[Property,bool]]) -> Property:
    def property_wrapper(value: T) -> Property:
        outcome = property(value)
        if isinstance(outcome, bool):
            return constant(TestResult(is_success=outcome, arguments=(value,)))
        else:
            return map(lambda inner_out: replace(inner_out, arguments=(value,) + inner_out.arguments), outcome)
    return bind(property_wrapper, gen)

def test(property: Property):
    for test_number in range(100):
        result = property.generate()
        if not result.is_success:
            print(f"Fail: at test {test_number} with arguments {result.arguments}.")
            return
    print("Success: 100 tests passed.")

This is more user-friendly when the test fails:

> test(prop_wrong_sort_by_age)
Fail: at test 0 with arguments ([Person(name='qqdkng', age=31), Person(name='lcqyva', age=73)],).

At least it now tells us which arguments are a failing input! With that it's straightforward to reproduce and debug the error.

Note that for mutable values, capturing the arguments in this way can be problematic: TestResult.arguments captures a reference to the generated value, but if the assertion or the function under test changes it, we'll print the value after the change. For a pure functional language like Haskell that can't happen, but for Python and others it's definitely a problem. One neat way to deal with this is to store the pseudo-random number generator's seed instead of the generated values, and then re-generate the value from scratch with that seed. We'll come back to issues around mutability in the next posts.

That's as far as I want to take this initial implementation. Let's take stock of what we have, what's missing, and what we'll cover in the next post.

Finally

The 80/20 rule applies here - this code, which altogether is just a couple pages, represents 80% of the functionality of a property-based testing library. A lot of the work goes into writing and maintaining a wide range of useful generators for commonly used types.

I've definitely simplified and omitted a number of aspects though, only some of which we'll come back to.

The property and test "modules" here only have one function each (for_all and test respectively). Real PBT libraries have more functions to gather statistics and info on generated values, and this is implemented by adding more fields on TestResult to collect them. For example, we might print information like "20% of the generated lists were empty, generating values took 12% of the test time" and so on. Also there will be a bunch more options to configure the test run in various ways.
QuickCheck was written in Haskell which is a pure language and so QuickCheck's Random.generate needs to include the pseudo-random number generator's seed value as input. In the implementation above, that seed is hidden away in the random module, and updated as values are generated.
QuickCheck's generate also includes an integer size argument. This size is used to restrict the "size" of the generated values - which is not unambiguously defined, but it's typically used to make sure that generators for recursive types (like trees) don't end up generating huge values, by e.g. interpreting the given size as the maximum depth of the tree. We will come back to the notion of size when we discuss shrinking in the next few posts.

That's it for this post, a lot of ground covered! The next couple of posts will discuss shrinking - which is an area where most experimentation has happened since QuickCheck was introduced.

If you feel like it, please like and share this post, comment below or discuss on Twitter @kurt2001.

Until next time!