I'm going to assume you've read about the Rune struct. If you haven't, do so now.
Rune does wonders. It lets us get away from working with encodings and work with UNICODE scalar values instead. But there's still a problem.
"café" == "café"
Well, that's true, right? Nope. No seriously. Try it.
So what is going on?
Time to fall back to the ol' trusty escapes.
"caf\u00E9" == "cafe\u0301"
Obvious now? The problem is that while they look the same to us, they don't look the same to the computer.
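To make the difference visible, here is a minimal sketch that dumps the UTF-16 code units of both strings. It uses only the standard library; the class and the DumpCodeUnits helper are names made up for this illustration.

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX.
    public static string DumpCodeUnits(string text) =>
        string.Join(" ", text.Select(c => $"U+{(int)c:X4}"));

    public static void Main()
    {
        string precomposed = "caf\u00E9";  // é as a single code point
        string decomposed = "cafe\u0301";  // e followed by a combining acute accent

        Console.WriteLine(DumpCodeUnits(precomposed)); // U+0063 U+0061 U+0066 U+00E9
        Console.WriteLine(DumpCodeUnits(decomposed));  // U+0063 U+0061 U+0066 U+0065 U+0301
        Console.WriteLine(precomposed == decomposed);  // False: == compares code units
    }
}
```

Four code units on one side, five on the other. The == operator on strings is an ordinal comparison, so it never gets a chance to notice that both render as the same word.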
So why are there two ways to write the same thing?
Back when UNICODE was being created, there was a big focus on keeping things reasonably backwards compatible. If you want things to be adopted, you have to try to make them easy to adopt. Makes sense, right? I have my own additional thoughts to add on, but this isn't the place for that. Encodings for European languages that used accents, like the acute mark on é, historically would allocate a code point for every single character. So, in its early stages, UNICODE did the same. But as the UNICODE Consortium learned more about the incredible diversity of the world's languages, they learned of a problem: backwards compatibility introduces some really naïve notions, and we actually should have had a system for composing base characters with various other stuff. In our case, that's accents, or diacritics, or whatever you feel like calling them.
See, U+00E9 == U+0065, U+0301 to any human. Hell, even for many computer applications they are the same. But how do we get the computer to see that? After all, the computer doesn't see. Well, okay, OpenCV is a thing, but I think everyone knows that isn't the solution here.
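One answer that already ships with .NET is normalization: rewrite both strings into the same canonical form, then compare. The sketch below assumes a runtime with normal globalization support (not invariant mode); CanonicallyEqual is a hypothetical helper name, not something from Rune or any library discussed here.

```csharp
using System;
using System.Text;

public static class CanonicalCompare
{
    // Hypothetical helper: compares two strings after normalizing both to NFC,
    // so canonically equivalent sequences become identical code units.
    public static bool CanonicallyEqual(string left, string right) =>
        string.Equals(
            left.Normalize(NormalizationForm.FormC),
            right.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);

    public static void Main()
    {
        string precomposed = "caf\u00E9";
        string decomposed = "cafe\u0301";

        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False
        Console.WriteLine(CanonicallyEqual(precomposed, decomposed));                        // True
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC).Length);             // 4
    }
}
```

This works, but it only answers whole-string questions, and you have to remember to normalize everywhere. It says nothing about what "the fourth character" is.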
Rune set a fantastic precedent in .NET. Not only did it make clear that the original types had flaws, but it established the convention of a very similar-feeling API. So...
The problem I was describing earlier is that of recognizing an "extended grapheme cluster" as equivalent to its "precomposed character". The solution for this, in my opinion, is Glyph, which you can get here.
Glyph does a very similar job to Rune, but instead of layering UNICODE scalar values on top of UTF-16 sequences, it layers on UNICODE grapheme clusters. Unlike much of the larger project it is a part of, Stringier, Glyph is part of a semi-FOSS project. It's composed of two libraries, Glyph and Glyph.Tables. Extending or fixing Glyph is very easy because of this model, as the entire type, and all its algorithms, are driven by the FOSS tables. They're so simple you don't even need to know how to program to contribute to them.
So let's revisit that example with Glyph in mind.
Glyph.Equals("café", "café");
Okay, that's fine and dandy, but doesn't that do the same as this:
String.Equals("café", "café", StringComparison.InvariantCulture);
Well, yeah. But that only ever works for Equals(). Consider another example.
"café"[3] == "café"[3]
We know from covering Rune before that this will be false. But what about the Glyph equivalent of what we did then?
"café".GetGlyphAt(3) == "café".GetGlyphAt(3)
And that actually evaluates to true. Neat.
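For comparison, here is a rough sketch of the same three lookups using only what ships with .NET (Core 3.0 or later, for Rune). TextElementAt is a hypothetical stand-in built on StringInfo plus normalization; it is not how Glyph is implemented, just an approximation of the idea of indexing by grapheme cluster.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TextElementDemo
{
    // Hypothetical stand-in for a grapheme indexer: returns the text element at
    // the given position, normalized to NFC. This is NOT Glyph's implementation.
    public static string TextElementAt(string text, int index) =>
        new StringInfo(text)
            .SubstringByTextElements(index, 1)
            .Normalize(NormalizationForm.FormC);

    public static void Main()
    {
        string precomposed = "caf\u00E9";
        string decomposed = "cafe\u0301";

        // char and Rune both see 'é' on one side and a bare 'e' on the other.
        Console.WriteLine(precomposed[3] == decomposed[3]);                                 // False
        Console.WriteLine(Rune.GetRuneAt(precomposed, 3) == Rune.GetRuneAt(decomposed, 3)); // False

        // Text elements group the 'e' with its combining accent before comparing.
        Console.WriteLine(TextElementAt(precomposed, 3) == TextElementAt(decomposed, 3));   // True
    }
}
```

Notice how much ceremony the last line hides: a StringInfo allocation, a substring, and a normalization pass, and what comes back is just another string rather than a dedicated type.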
Glyph is largely meant to unify and simplify a lot of what's already in the .NET CLR, but scattered all over the place and sometimes not implemented well. There's everything from indexers to enumerators to, well, an actual type you can work with. It's meant to feel very much like the experience of working with Rune, which is, of course, meant to feel like the experience of working with Char.
I think everyone can agree: If you can reuse existing knowledge, this shit is a lot easier to learn. Let's keep to reusing roughly the same API, but at different levels of abstraction, instead of inventing entirely different and sometimes vaguely defined concepts. Sure, it's a tower of abstractions. But it's also a tower of understood abstractions.
This being said, Glyph still doesn't get us entirely where we want to be. There are other concepts, like ligatures, that go even beyond this and that Glyph simply doesn't handle.