DEV Community

loading...

Proper Sets

entomy profile image Patrick Kelly ・4 min read

A while ago I wrote about how ISet<T> should be considered harmful. While criticisms are absolutely valid, it's even better if I can show a better solution that avoids these problems.

Enter Set<TElement>. I'm still not 100% sure on the public API, and this whole project is still in beta anyways, so if you've got a better idea on how to do something I show off, please let me know so I can look into it.

First off, there is no interface. Rather, this is an abstract type from which you can derive, but has internal derivations within the library that you can't directly access but will wind up working with. I promise this makes sense in practice.

Instead of the typical way I see sets implemented, this isn't a mutable data structure. Rather, what elements it starts with are there forever. Instead, you build up more complex sets through set algebra, which has well defined axioms, and identities that can be utilized. But let's cover the most basic sets first.

readonly Set<Int32> smallPrimes = Set.From(2, 3, 5, 7);
Enter fullscreen mode Exit fullscreen mode

At this point, smallPrimes cannot change, both in that the instance isn't mutable, and the reference is readonly. Testing for existence within a set is simple and obvious.

Assert.True(smallPrimes.Contains(5));
Assert.False(smallPrimes.Contains(6));
Assert.False(smallPrimes.Contains(11));
Enter fullscreen mode Exit fullscreen mode

But there's another type of simple set that we should cover! Consider how, with ISet<T> one would define a set of all even or odd numbers, or even just all numbers! The standard set of even numbers for a 32-bit integer would use up 8,589,934,588 bytes of memory! I don't know about you but I don't have a computer than can use that approach. I'm not Microsoft with millions of dollars spent on server farms and distributed computing technologies to allow for that. Maybe that's why they weren't concerned about it? 🤔

readonly Set<Int32> evens = Set.From((x) => x % 2 == 0);
Enter fullscreen mode Exit fullscreen mode

Congratulations, you now have a set of all even numbers, and it doesn't explode your computer!

Let's also consider the following sets:

readonly Set<Int32> odds = Set.From((x) => x % 2 != 0);
readonly Set<Int32> positive = Set.From(x) => x > 0);
readonly Set<Int32> negative = Set.From(x) => x < 0);
readonly Set<Int32> empty = Set.Empty<Int32>();
readonly Set<Int32> universe = Set.Universe<Int32>();
Enter fullscreen mode Exit fullscreen mode

The empty set is obvious: it's a set with no elements. The universe isn't so obvious, but it's the compliment of the empty set: it's a set with all elements.

That's right, we've defined sets for every single number, every single even number, and every single odd number, all in single lines of code.

Remember how in algebra class there were identities? These were typically 0 or 1, like how x/x = 1 or x-x = 0 or x+0 = x. They are extremely useful for math overall, and the centuries without identities and 0 especially were rife with problems. Well, in set algebra the empty set and universe set are those identities. Trying to do set algebra without these identities is like trying to do abstract algebra (the stuff you learned in school) without 0 and 1. You're gonna have a bad time!

So let's use these in some simple algebraic operations.

readonly Set<Int32> positiveEvens = positive & evens;
Enter fullscreen mode Exit fullscreen mode

Here, we just did the intersection between all positive numbers and all even numbers. Any negative numbers, even or odd, will not test true for existence, even though one of the constituent sets would.

There's a useful way to understand the operators used for this: what operator would you use for testing existence in two arrays? In this case, you loop through the "positive array" and loop through the "even array", and only if it's in both, is it part of the set.

readonly Set<Int32> oddsWithoutSmallPrimes = odds - smallPrimes;
Enter fullscreen mode Exit fullscreen mode

Here, we just did the difference between the set of odd numbers and the set of all small prime numbers. So you don't have to scroll up, smallPrimes was defined as [2, 3, 5, 7]. Taking 2 out of all odd numbers doesn't do anything, because it was never in there to begin with. Taking the rest out makes then test false instead.

Assert.True(oddsWithoutSmallPrimes.Contains(1));
Assert.False(oddsWithoutSmallPrimes.Contains(2));
Assert.False(oddsWithoutSmallPrimes.Contains(3));
Assert.False(oddsWithoutSmallPrimes.Contains(5));
Assert.False(oddsWithoutSmallPrimes.Contains(7));
Assert.True(oddsWithoutSmallPrimes.Contains(9));
Assert.True(oddsWithoutSmallPrimes.Contains(11));
Enter fullscreen mode Exit fullscreen mode

Hopefully this makes sense.

Overall, the following set operations have been defined with the corresponding C# operators.

C# Set Operation
- Difference
! Complement
& Intersection
^ Disjunction
| Union

Again, every operator was chosen to correspond with the equivalent logical operator that would be performed if testing these yourself. Except - of course, where that's the difference just like in abstract algebra.

So how does all this work? Through expression trees. It's worth pointing out that this has nothing to do with the LINQ expression trees. Rather, this is an actual data structure, itself a type of multiway tree. Any literal set is not modified, but set operators build up a tree representing the expression as you wrote it, and return that tree as a new set. This seems expensive at first, but remember how much memory using a delegate to describe the set saves? Expression trees can actually yield even more savings, although only above very simple usage. Because each set is immutable, we can freely reuse them across sets, which means arbitrary composition of sets reuses existing instances, saving even more memory. Fantastic.

Now what about cases where you want a list or tree or whatever, of unique elements. If the element is already in the collection, it shouldn't be added again. Well I got news for you. That's not a set. That's a unique whatever-the-base-collection-was, which is its own concept. Combining the two is a huge part of why the ISet<T> abstraction is so convoluted and irregular.

Discussion

pic
Editor guide