DEV Community

Stefan Lemvig Frederiksen for IT Minds

Posted on

Introduction to Property-Based Testing via Rust

Introduction

I had the pleasure of writing a compression algorithm as part of my master thesis. As a part of that, I wondered how to ensure that the algorithm and associated data structures were correct. My usual approach was to write unit tests to cover enough cases to ensure that I could trust my implementation. But I knew I would have a concern about having missed a possible combination of inputs that would wreck everything, and in turn, invalidating all of my work. Working alone, I wanted to avoid duplicate work whenever possible to optimize my time. Having to continually return to the implementation specifics if I find an error down the road was something I wished to avoid. So how do we tackle a problem like this?

Taking a step back from my use case, let us talk about conventional unit testing.

What is Unit Testing?

Unit testing is a static scenario describing a desired behavior or side effect of the code. You define the scenario with the classical Arrange-Act-Assert pattern. The arrange step initializes the setup, the act step executes the target behavior, and finally, the assert step verifies the desired outcome or side effects. But the concern about possible untested inputs remains. Are enough test cases and variations in the inputs covered to trust the correctness of the implementation? A way to work around this is to check code coverage, but that indicates untested code rather than a measure of how well tested the code is. Having too high code coverage often leads to testing unnecessary scenarios or makes the code too rigid to change.

Introducing Property-Based Testing

This is where Property-Based Testing (PBT) comes into play. Using randomized inputs, it can set up a non-deterministic scenario that can then be tested. But this introduces another problem. With unit testing, we could know that a certain input should give a certain output. But how can we define this output if the input is now randomized? This is where it gets a bit more abstract in that you cannot inherently define the expected output but rather certain properties between the input and output. According to David Thomas and Andrew Hunt in "The Pragmatic Programmer, 20th anniversary edition", these properties can be seen as code contracts or invariants. An invariant is, by definition, "a function, quantity, or property which remains unchanged when a specified transformation is applied." (Definition from Oxford Languages). If you were to implement a sorting algorithm, you would expect there to be as many elements in the input as in the output. This doesn't change after sorting and is thus an invariant. To simplify, assume that we are just sorting a sequence of integers, int[] list. Any index i in this list (constrained by 0 <= i < list.Length - 1) should ensure that list[i] <= list[i + 1]. This behavior should always hold after a list has been sorted and is then a contract of the code. The clever part about defining it this way is that it is not restricted by a specific input but rather fluent enough to work on any given input containing two or more elements (a sequence of one element is always sorted).

While it can always be helpful to find properties, some can prove entirely unhelpful in certain situations. For example, in a language like C# where reference types can be null, a property could be that a function never returns null. But this is a considerable amount of extra work, just to verify that null is never returned. This being said, any additional properties we find on the code will only strengthen our confidence in the correctness of the implementation.

Back to My Use Case

So after looking at what PBT is, how can this help me? I was implementing a compression algorithm called Relative Lempel-Ziv [1], which uses a trick to substitute part of a string with two numbers, where one will be the starting index of a reference string, and the second one is the length of this substring. Without going into details as this would shift the focus away from PBT, finding this substring efficiently, requires either a Suffix Tree or a Suffix Array. In my case, I went with the suffix tree.

This is where I started my work and wanted to use PBT because if my suffix tree didn't find the longest substring, the entire compression wouldn't be as efficient as it could have been. So, without going into the nitty-gritty details of how it works or how it is constructed [2], I had to find either an invariant or a contract of the suffix tree.

Since a suffix tree is essentially a compact method of outlining every possible suffix of a string, it will be bound by the length of the string. Thus, the number of suffixes in a string is exactly the number of characters it contains (+1 if you consider the empty string as a suffix). This is an invariant.

Implementing the Properties

I was implementing all of this in Rust, so I found a PBT framework for Rust and decided to give it a go. I had a Rust library named suffix_tree, containing all the code I needed to create a suffix tree from a given input string. For simplification, I will spare you the implementation details of a suffix tree, but if you are interested, it can be found here. I wrote it quickly and for my specific use-case, so it could use some cleaning up and better designs here and there.

#[quickcheck]
fn quickcheck_amount_of_suffixes_is_len_plus_one(s: String) -> bool {
    SuffixTree::new(&s)
         .suffixes()
         .count()
         == s.len() + 1
}
Enter fullscreen mode Exit fullscreen mode

Assume here that .suffixes() is meant to return an iterator, which yields all the suffixes of the suffix tree [3].

Now for a contract of the suffix tree. We expect all suffixes of a string to be contained in the suffix tree. This can be done quite elegantly in Rust via a range expression, a concept also prominent in other languages such as Python, C#, and more. Here is the entire test.

#[quickcheck]
fn quickcheck_contains_all_suffixes(s: String) -> bool {
let st = SuffixTree::new(&s);
    for i in 0..s.len() {
        if !st.contains_suffix(&s.as_bytes()[i..]) {
            return false;
        }
    }
    true
}
Enter fullscreen mode Exit fullscreen mode

There are a few caveats of this implementation [4], but I think the test here is short, clear, and powerful. I can now trust that my suffix tree will always contain all of the expected suffixes. By combining the invariant and the contract here, I can then deduce that no false-positive suffixes will be returned and that I can trust this implementation to be correct in how I intended it.

So we looked at some properties being checked on the suffix tree, and I hope you learned a bit by reading this post. My implementation has additional property-based tests for the compression algorithm as well. So if you decide to take a look at that, I suggest you first try to work out some properties that might exist on a lossless compression algorithm.

Interested in Trying It Out for Your Favorite Language?

Check out QuickCheck in Every Language, which lists the QuickCheck (the first PBT framework) implementations across a large number of programming languages. While not extensive, it provides a good start if you're lost. Otherwise, Google is your friend.

Footnotes

[1] It is a variant of Lempel-Ziv 77, but instead of taking a sliding window to compress with, it uses a single reference string.
[2] I used Ukkonen's algorithm.
[3] In my implementation, it is not as elegant, as this wasn't a big part of the project, but rather something that had to be made to efficiently implement the compression. It does not implement an iterator called .suffixes() on the suffix tree, but rather takes all the nodes directly and filters out any that are not leaves and then counts those. These are the ones that will be suffixes by design.
[4] A few caveats of this implementation are specific to Rust, like &s.as_bytes()[i..]. Indexing into a String isn't directly possible, as UTF-8 strings are variable-length encoded, and this operation is not O(1) as you might otherwise expect. As a result, the standard library does not expose this functionality directly, as it might behave slower than you would otherwise expect. Instead, you can still access, in constant time, the underlying bytes of the string, which covers the as_bytes() part. The second part of [i..] gets the string from the i'th index until the end, which can also be said to be the i'th suffix of the string.

Resources and Further Reading

The Enterprise Developer from Hell, I highly recommend reading at least this blog, because it really takes its time in introducing the concept and the why
Choosing properties for property-based testing, continuation of above
Generating test cases so you don’t have to, a property-based testing experience from Emil Eriksson, a Spotify engineer
Intro to Property-Based Testing
Step by Step Toward Property Based Testing
What is Property Based Testing?

Book Referenced

The Pragmatic Programmer 20th-anniversary edition, by David Thomas and Andrew Hunt - Chapter 7, topic 42

Top comments (0)