DEV Community

Patrick Kelly
Patrick Kelly

Posted on

Collectathon

Typically, when I talk about work I'm doing, I talk about Stringier because I mostly deal with text processing now, and that set of libraries serves the foundation for a lot of the work I do. Over the two years of developing Stringier, I've been increasingly implementing specialized data structures, or even just data structures that I think should be in the standard library, but aren't. Sometimes, there's even the issue of the implementations just being bad, whether problematic API's or just less efficient or performant than they could be. Whenever you're doing a lot of the same thing, you've got a pattern, and you want to start looking into frameworks that aide with that pattern.

So is there a framework for collections? Not really. .NET has CollectionBase that makes all sorts of assumptions about how your collection works and what features it supports. It's a leaky abstraction that forces your collection to behave how it says, and that's just not realistic. This may be a major factor behind why .NET doesn't have a rich tree or graph API. It may even be behind the issues with ISet<T>. There's also C5, which isn't a bad option, but I think falls into many of the same traps that collections libraries fall into (and I've fallen for in the past, let's learn from my mistakes).

This is where I starting thinking about Collectathon again. Collectathon was originally an attempt at designing a better collections API based on separation of Abstract Data Type and Data Structure, as well as dependency injecting certain features that were implicit whether you wanted or could even use them, like filters. There was a major goal of trying to get as much code sharing as possible, and for a while was successful at this. But then I hit a wall. A hard wall. And this was the wake up I needed, and the eureka on why all collection libraries seem to have the same fundamental issue: They all make assumptions about collections that have nothing to do with that ADT because they all derive from the same core types, which by their very nature make those exact assumptions. See, collections don't form a taxonomy. One only needs to try to implement an unsorted list, self-sorting list, and unique list ("set") from the same base type to see the issue.

It wasn't until deep into the v4.0 audit and rollout of Stringier that the solution hit me, and it hit me like a swift kick in the ass that's been allowing me to quite literally pump out several efficient collection implementations a day, even though they are drastically different ADT's, without leaking implementing details (in fact, as of this writing, IResizable is the only abstraction leak). It turns out, my discovery of real traits in C# was critical for doing this, allowing moving sharable code outside of the types entirely, so that the hierarchy could be squashed, with no actual base type at all. In fact, the only hierarchies at all are within specific ADTs and even then they get broken when appropriate. But none of this is obvious using the trait based API.

What all this means is that Collectathon is two things. First, is a framework for the creation of custom collections. Second, is implementations of various collections using that framework. So, not only is it easier to provide generic collections within Collectathon, but it's also easier to provide specialized collections within Stringier. Fantastic! This, by implication, means that development on Collectathon is development on Stringier, especially since the end goal is to simplify Stringier immensely.

Since the work on Stringier is going to take quite a while, and most of the functions I'd like to talk about are in Core which requires the most work, I really can't talk about them yet. But, what I can do is talk about data structures as I implement them. This isn't just a "hey here's some basic data structures I learned about in a 30 minute data structures video" kinda series though. No. I'm gonna talk about thinks like what partial-unrolling means to a linked-list. How skip-lists can be superior to binary-trees when you know you have ordered data. What a trie or in-tree is. How a proper set is implemented and works. And more!

Discussion (0)