Category class

#textprocessing #csharp #dotnet

I've just finished the first release of this project, so bear with me when it comes to documentation and the like.

Whenever you dive deep into text processing, you wind up needing rich categorization of characters. There are all sorts of reasons for this: validating that a string meets various requirements, like for a password; parsing source code, where only certain characters are allowed in certain tokens, and sometimes only in particular orders; trimming a set of characters from both ends of a string; and so on. "Whitespace" is a category. Only, it's not one of the UnicodeCategory values. It might be tempting to think of it as the combination of all the separator categories, but that's not true: tabs and line feeds, for instance, are classified as control characters. And while you could define the set of characters yourself, this is incredibly tedious and error prone. Seriously, very error prone. Consider this from Microsoft:

Notes to Callers
The .NET Framework 3.5 SP1 and earlier versions maintains an internal list of white-space characters that this method trims if trimChars is null or an empty array. Starting with the .NET Framework 4, if trimChars is null or an empty array, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the IsWhiteSpace(Char) method). Because of this change, the Trim() method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim() method in the .NET Framework 4 and later versions does not remove. In addition, the Trim() method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).

So if there's no "Whitespace" category defined in UnicodeCategory, then what's the deal? How do we declare something to work for any whitespace?

Okay, to be fair, this is a contrived example because IsWhiteSpace() exists. But there are plenty of cases where this doesn't hold up, and there's actually some super neat stuff we can do if categories aren't defined through functions, which I'll describe as this post goes on.

"Whitespace" happens to be a property, not to be confused with a category. Unicode is weird like that. Conceptually for us, they are both categories though. We'd certainly like to declare either in the same way.
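To make the category/property split concrete, here's a small sketch that uses only the base class library, nothing from Stringier. The WhitespaceSurvey class and its member names are made up for illustration. It prints each sample's UnicodeCategory next to whether it sits in one of the three separator categories and whether char.IsWhiteSpace accepts it.

```csharp
using System;
using System.Globalization;

// Illustrative helper, not part of any library.
public static class WhitespaceSurvey
{
    // Samples picked to show where "whitespace" and the separator
    // categories line up and where they don't.
    public static readonly char[] Samples =
    {
        '\u0020', // SPACE
        '\u0009', // CHARACTER TABULATION
        '\u000A', // LINE FEED
        '\u00A0', // NO-BREAK SPACE
        '\u2028', // LINE SEPARATOR
        '\u200B', // ZERO WIDTH SPACE
    };

    // True when the character is in one of the three separator categories.
    public static bool IsSeparator(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.LineSeparator:
            case UnicodeCategory.ParagraphSeparator:
                return true;
            default:
                return false;
        }
    }

    // Writes one row per sample: code point, category, and both answers.
    public static void Print()
    {
        foreach (char c in Samples)
        {
            Console.WriteLine(
                $"U+{(int)c:X4}  {char.GetUnicodeCategory(c),-18} " +
                $"separator={IsSeparator(c),-5} whitespace={char.IsWhiteSpace(c)}");
        }
    }
}
```

Calling WhitespaceSurvey.Print() should show the tab and the line feed reported as Control, so neither is a separator, even though char.IsWhiteSpace accepts both. No union of UnicodeCategory values gives you exactly "whitespace".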

Enter Categories, a relatively small Stringier project designed to address this issue, and more.

The entire system works through a Category type which represents... a category!

Specifically though, it has three "flavors": a StandardCategory<T>, a PropertyCategory<T>, and a RangeCategory<T>. Why are these generic? Well, CRTP, but ignore that. From there, you have multiple usable Category instances, like the standard categories: UppercaseLetter, DecimalDigit, SpaceSeparator, etc.; like the properties: Whitespace, etc.; and even categories defined in neither, like Superscript, Subscript, BoxDrawing, etc. Each of these types represents exactly that collection. But the UnicodeCategory values were also reworked into a tree. Need any letter? There's a category for that; Letter specifically.
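If it helps to picture how three flavors like that could hang together, here's a rough sketch. To be clear, this is not the Stringier source: it drops the generic parameter and the CRTP entirely, and every constructor and member below is my own assumption. Only the type names are borrowed from above.

```csharp
using System;
using System.Globalization;

// Hypothetical base type: a category answers "does this character belong?".
public abstract class Category
{
    public abstract bool Contains(char c);
}

// Backed by one or more UnicodeCategory values, so a parent node such as
// Letter can cover several leaf categories at once.
public class StandardCategory : Category
{
    private readonly UnicodeCategory[] members;

    public StandardCategory(params UnicodeCategory[] members)
    {
        this.members = members;
    }

    public override bool Contains(char c) =>
        Array.IndexOf(members, char.GetUnicodeCategory(c)) >= 0;
}

// Backed by a predicate, for Unicode properties such as White_Space.
public class PropertyCategory : Category
{
    private readonly Func<char, bool> test;

    public PropertyCategory(Func<char, bool> test)
    {
        this.test = test;
    }

    public override bool Contains(char c) => test(c);
}

// Backed by an inclusive range, for sets that are neither a general
// category nor a property.
public class RangeCategory : Category
{
    private readonly char first;
    private readonly char last;

    public RangeCategory(char first, char last)
    {
        this.first = first;
        this.last = last;
    }

    public override bool Contains(char c) => first <= c && c <= last;
}

public sealed class UppercaseLetter : StandardCategory
{
    public UppercaseLetter() : base(UnicodeCategory.UppercaseLetter) { }
}

// A parent node of the tree: any of the five letter categories.
public sealed class Letter : StandardCategory
{
    public Letter() : base(
        UnicodeCategory.UppercaseLetter,
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.ModifierLetter,
        UnicodeCategory.OtherLetter) { }
}

public sealed class Whitespace : PropertyCategory
{
    public Whitespace() : base(c => char.IsWhiteSpace(c)) { }
}

// The Box Drawing block, U+2500 through U+257F.
public sealed class BoxDrawing : RangeCategory
{
    public BoxDrawing() : base('\u2500', '\u257F') { }
}
```

With something like this, new Letter().Contains('a') and new BoxDrawing().Contains('\u2500') both come back true, and the caller never needs to know which flavor is doing the work.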

As part of my work on Core, all the functions defined in there have already been adapted to use this Category API, and will refer to that for categorization purposes. This means the following is possible:

"  Hello world    ".Contains(new Whitespace());

And as simple as that, it doesn't matter what whitespace characters the string has, it'll work with any of them, not just U+0020.
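For the curious, an overload like that could plausibly be written as an extension method. The sketch below is my guess at the shape, not the actual Core implementation. Category and Whitespace here are minimal stand-ins declared just so the example compiles on its own, and CategoryExtensions is a made-up name.

```csharp
using System;

// Minimal hypothetical stand-ins for the library's types.
public abstract class Category
{
    public abstract bool Contains(char c);
}

public sealed class Whitespace : Category
{
    public override bool Contains(char c) => char.IsWhiteSpace(c);
}

public static class CategoryExtensions
{
    // True when at least one character of the text belongs to the category.
    public static bool Contains(this string text, Category category)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (category is null) throw new ArgumentNullException(nameof(category));

        foreach (char c in text)
        {
            if (category.Contains(c)) return true;
        }
        return false;
    }
}
```

Under those assumptions, "Hello\u00A0world".Contains(new Whitespace()) and "Hello\tworld".Contains(new Whitespace()) are both true, while "Helloworld".Contains(new Whitespace()) is false. The built-in string.Contains overloads don't accept a Category, so the compiler falls through to the extension method.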

Now from here, there are numerous possibilities. Some planned, some merely thought of, and some completely undiscovered. This issue describes the possibility of creating category expressions, for arbitrarily combining categories through set theory. The power that would provide is immense. There's also a project already in existence that would massively benefit from being based on Category: extracting the language/script/orthography component from Literary, which would then open the door for using language orthographies as a category, such as whether a text contains elements of a given language. Category will also be adapted into Patterns, replacing its premade patterns that served a similar role, allowing for an extensible and far more easily readable alternative to the \s, \S, \w, \W, etc. nonsense in RegEx.
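Since category expressions don't exist yet, here's only a speculative sketch of what set-theoretic combination might look like, with operators standing in for union, intersection, and complement. Everything in it is hypothetical, including the From factory, the nested Lambda class, and the Expressions holder; the real design may look nothing like this.

```csharp
using System;

// Hypothetical category type with set-style operators.
public abstract class Category
{
    public abstract bool Contains(char c);

    // Wraps any predicate as a category.
    public static Category From(Func<char, bool> test) => new Lambda(test);

    // Union: in either operand.
    public static Category operator |(Category a, Category b) =>
        new Lambda(c => a.Contains(c) || b.Contains(c));

    // Intersection: in both operands.
    public static Category operator &(Category a, Category b) =>
        new Lambda(c => a.Contains(c) && b.Contains(c));

    // Complement: everything not in the operand.
    public static Category operator !(Category a) =>
        new Lambda(c => !a.Contains(c));

    private sealed class Lambda : Category
    {
        private readonly Func<char, bool> test;

        public Lambda(Func<char, bool> test)
        {
            this.test = test;
        }

        public override bool Contains(char c) => test(c);
    }
}

public static class Expressions
{
    public static readonly Category Letter = Category.From(c => char.IsLetter(c));
    public static readonly Category Digit = Category.From(c => char.IsDigit(c));
    public static readonly Category Upper = Category.From(c => char.IsUpper(c));
    public static readonly Category Whitespace = Category.From(c => char.IsWhiteSpace(c));

    // Letters or digits, in the spirit of \w.
    public static readonly Category Alphanumeric = Letter | Digit;

    // Letters that are not uppercase.
    public static readonly Category NonUpperLetter = Letter & !Upper;

    // Anything that is not whitespace, in the spirit of \S.
    public static readonly Category NonWhitespace = !Whitespace;
}
```

The appeal is that Letter & !Upper reads as what it means, where the regex equivalent takes a character class subtraction to express.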

This API has me super excited given its highly declarative nature and immense possibilities. I hope it's got you excited too.
