Catherine Galkina for Typeable

Posted on Aug 20, 2021 • Edited on Sep 20, 2021 • Originally published at blog.typeable.io

The 7 assumptions about strings you probably have

#unicode #ascii #programming #security

Author: Ville Tirronen

How Unicode erases most of our assumptions on How Strings Actually Work

We programmers mostly fly by the seat of our pants when it comes to writing simple stuff. For simple things, we have a strong set of assumptions instead of specific knowledge of how things work. These are assumptions such as knowing that if b = a + 1, then b is greater than a, or that if we malloc some buffer, we now have the requested amount of memory to write to. We don't go and look at the specifications for each and every small thing we do.

We do this because checking everything would slow us down. But if we did check, we'd find out that we're usually wrong in our assumptions. There are numeric overflows, and then a + 1 might be a lot less than a. Sometimes malloc will give us a null instead of a buffer and we're hosed.

We usually have to be bitten by these issues before we update our assumptions even a little bit. And even then, we usually correct them in broad strokes. After having a nasty overflow bug, we might correct our assumptions on integers to "a + 1 is greater than a unless there's a chance that a is a very big number". And we work based on that instead of keeping any precise rules in our minds for how overflows work.

Adjusted assumptions are called experience. They make you faster and correct more often. However, we might relocate some stuff, like proper handling of malloc, entirely from our internal category of 'easy stuff' to our internal category of 'complex stuff'. And then we might actually go and look up how it works.

About Strings

For beginners, Strings are the archetypal example of 'easy stuff'. Firstly, we most likely learned letters and numbers as children and they feel very familiar to us. Secondly, when learning to program most of us have done lots of programming exercises using Strings, because they are about the only interesting pre-built data type in most languages. We feel quite confident about how Strings work when programming with them. Thirdly, we might have a good number of assumptions related to the workings of some simple character set, like ASCII or ISO-8859-1, either because we're that old or because our teachers were that old. Those were character sets of simpler times!

Origin: https://upload.wikimedia.org/wikipedia/commons/thumb/5/5b/UNIVAC_1050-II.jpg/1280px-UNIVAC_1050-II.jpg

Now, in the real world, Strings are a very complicated thing. Contrast them with, for example, the usual Int found in any language. We know and understand the representation (64 bits, two's complement), or we can spend 15 minutes on Wikipedia to learn it, and we understand its semantics (behaves like a number, except if too large or too small). For Strings, we used to know the representation (one byte per character, check the ASCII table for what character it is), but we almost never know the semantics. Our String could contain our customer's name. It could contain a number, a bit of JSON or even an SQL statement.

Strings are the ultimate Any-type, and chances are that if there is no ready-made representation for some item in a program, it will be stored and operated on as a String. Regardless of whether you have dynamic or static types, this throws all type safety to the wind. And, to compound the problem, many of the things we use Strings for are bloody dangerous, like SQL or HTML. For that reason, SQL injections and cross-site scripting lead the vulnerability top lists year after year.

But at least we understand how Strings themselves work, right? We know how to concatenate, change case and so on?

Unicode

Understanding Strings is a lot harder now than it was around the year 2000. We have been transitioning to Unicode for a few decades now, and it's already been a few years since I've heard anyone complain that their characters aren't displayed right. Printing them is another matter. I hope that it will be solved in the 22nd century.

While being otherwise awesome, Unicode effectively erases most of our 'useful' assumptions on how Strings actually work, but we haven't been very vocal about that happening. Unfortunately, many of us are probably still working with outdated assumptions on how Strings work. And, to make it worse, many of us no longer understand the memory representation of Strings either. Admittedly, I don't, really.

Broken assumptions

Next, let's go through some of my old assumptions that I needed to throw out along with the ISO-8859-1 character set. This is surely not an exhaustive list, but hopefully it is enough to kick (Unicode) Strings out of your mental compartment of 'simple things'.

A character is representable by single byte

In the olden days of ASCII, each character fit in seven bits, making it easy to size buffers and scan memory. With Unicode this is a terrible assumption. Let's walk through one arbitrary example to show why.

At some point, WordPress devs were fighting to stop SQL injections from happening. One example issue they were trying to fix was someone adding unwanted single quotes to the user input and messing up their database with it.
Something like this imaginary example:

select 1 from accounts 
where user = '%s' 
    and password = '%s'

↓↓ (User supplies "whocares' or true -- " as password)

select 1 from accounts 
where user = 'Avery' 
    and password = 'whocares' or true -- '
-- And now everyone can log in as Avery!

Now, the simplest imaginable way to solve this is to properly encode the single quote in the user input. (But, that is simple in imagination only. Don't). That is, each single quote ' must be encoded as \', or backslash-single-quote.
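To make the "Don't" concrete, here is a minimal sketch of the usual alternative to escaping by hand: bound parameters. It assumes Python's standard sqlite3 module and an in-memory accounts table invented for this illustration; the attack string uses 1=1 in place of true only so the sketch also runs on older SQLite builds.

```python
import sqlite3

# Hypothetical in-memory table, made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("create table accounts (user text, password text)")
conn.execute("insert into accounts values ('Avery', 'correct horse')")

# Attacker-supplied password, same shape as the example above.
attack = "whocares' or 1=1 -- "

# String formatting splices the input into the statement itself.
spliced = "select 1 from accounts where user = '%s' and password = '%s'" % ("Avery", attack)
spliced_row = conn.execute(spliced).fetchone()

# Placeholders keep the input as data; the quote is never parsed as SQL.
bound_row = conn.execute(
    "select 1 from accounts where user = ? and password = ?",
    ("Avery", attack),
).fetchone()

print("spliced query matched:", spliced_row is not None)
print("bound query matched:", bound_row is not None)
```

The spliced query matches a row even though the password is wrong, while the bound query does not. No escaping function is involved in the second case, so there is nothing to get wrong byte by byte.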

PHP devs then wrote the addslashes function and everything was well for a while. The only problem was that it did the escaping byte by byte and not character by character. The devs were also blind to the problem, as they only worked with characters that fit in a single byte (mostly old ASCII). Then someone figured out that if you fed the system a String like "뼧 or true -- " you'd get the SQL injection again.

To understand why, let's look at how these characters are encoded (as big-endian UTF-16, where the code point value and the two bytes coincide):

code	character
`0xbf27`	`뼧`
`0xbf5c`	`뽜`
`0x27`	`'`
`0x5c`	`\`

What addslashes actually did was replace every byte of value 27 with the bytes 5c 27. So "뼧 or true -- " turned into "뽜' or true -- " and, again, there were injections.
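The byte arithmetic can be checked with a small sketch. It assumes big-endian UTF-16 (following the addendum's hint) and uses naive_addslashes, a hypothetical stand-in written for this illustration; it is not PHP's actual implementation.

```python
def naive_addslashes(raw: bytes) -> bytes:
    """Hypothetical byte-wise escaper: prefix every 0x27 byte with 0x5c."""
    return raw.replace(b"'", b"\\'")

# U+BF27 encodes to the two bytes bf 27 in big-endian UTF-16.
payload = "\ubf27".encode("utf-16-be")

# The escaper sees the trailing 0x27 as a quote and inserts 0x5c before it.
escaped = naive_addslashes(payload)

# The first two bytes now form a different character, U+BF5C ...
first_char = escaped[:2].decode("utf-16-be")
# ... and the original 0x27 byte is left over as a bare single quote.
leftover = escaped[2:]

print(payload.hex(), "->", escaped.hex())
print(hex(ord(first_char)), leftover)
```

The inserted backslash byte is swallowed into the preceding character, which is exactly why escaping has to work on decoded characters and not on raw bytes.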

It is not hard to imagine other similar disasters.

String lengths are somewhat stable

In ASCII, many of the common String processing operations left the length of the String unchanged. This is not so with Unicode. And though this property is probably relevant only if you're manually allocating buffers or trying to size up graphics, let's look at a few cases where String lengths change unexpectedly.

Firstly, to pick a common String operation as an example, does length(x) = length(toUpper(x)) hold for Unicode x? No, since Unicode has, among other things, ligature characters such as ﬁ, which expands two-fold to FI.

The second example concerns normalization. Since there are multiple code points for the same character, Unicode forces you to normalize so that two users don't, for example, end up with identical screen names. One would guess that normalization, the process of picking a canonical representation for some set of characters, would not affect the number of characters, but it indeed does: under compatibility normalization (NFKC or NFKD) the single character ﷺ expands 18-fold into صلى الله عليه وسلم.

So, it is probably better not to assume anything about lengths of Strings after any operation.
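Both length changes can be observed directly. This is a minimal sketch assuming Python 3 and its standard unicodedata module; the characters are written as escapes so the code points are unambiguous.

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
honorific = "\ufdfa"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

# Uppercasing uses the full case mapping, which may emit several characters.
upper = ligature.upper()

# Canonical normalization (NFC) leaves the honorific alone;
# compatibility normalization (NFKC) spells it out.
nfc = unicodedata.normalize("NFC", honorific)
nfkc = unicodedata.normalize("NFKC", honorific)

print(len(ligature), "->", len(upper))
print(len(honorific), "->", len(nfc), "(NFC),", len(nfkc), "(NFKC)")
```

Which normalization form you pick therefore changes not only equality but also how much room the result needs.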

Upper and lowercase are somehow linked

Those of us who lived with variants of ASCII tend to make a lot of use of upper- and lowercasing operations. Besides now being able to change the lengths of Strings, these operations have some additional sharp edges. Most importantly, the old assumption that upper- and lowercase letters are in one-to-one correspondence is lost.

With Unicode, changing the case of a string can lose more information than just what case the characters were in. For example, if you lowercase the Kelvin sign K (U+212A), you get an ordinary lowercase k back, with no way of converting it back. This has a surprising amount of relevance when doing case-insensitive comparisons, since toLower('K') == toLower('k') but toUpper('K') != toUpper('k') when K is the Kelvin sign.
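The asymmetry is easy to reproduce. A minimal sketch, assuming Python 3 string methods; the Kelvin sign is written as an escape because it is visually indistinguishable from the Latin capital K.

```python
KELVIN = "\u212a"  # KELVIN SIGN, not the Latin letter K

# Lowercasing maps the Kelvin sign onto the ordinary letter k ...
lower_equal = KELVIN.lower() == "k".lower()

# ... but uppercasing k yields Latin K (U+004B), not the Kelvin sign.
upper_equal = KELVIN.upper() == "k".upper()

# Case folding is the operation intended for caseless comparison.
folded_equal = KELVIN.casefold() == "k".casefold()

print(lower_equal, upper_equal, folded_equal)
```

A comparison that uppercases both sides and one that lowercases both sides can therefore disagree on the same input, so it pays to pick one folding operation and use it everywhere.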

Origin: https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Upper_case_and_lower_case_types.jpg/800px-Upper_case_and_lower_case_types.jpg

Space is 0x20

This assumption is still true. The code point U+0020 (the byte 0x20 in ASCII and UTF-8) still represents a space in Unicode. But so do U+2000, U+2001, U+2002 and many others, and on top of those there are invisible characters such as the zero-width no-break space U+FEFF. Whitespace is special. We can't allow screen names like "TheAlex" and "TheAlex " at the same time, because HTML will not show that whitespace and other users couldn't tell the difference. So we must remove leading and trailing whitespace before processing.

And now, Unicode makes it possible to screw up royally here. All it takes is one spot in the code where someone forgets about the multitude of whitespace characters, and we end up with unnormalized data in our database. And things start to fail here and there.
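As an illustration of how easy it is to miss a case, here is a sketch in Python 3. The built-in str.strip removes characters that Unicode classifies as whitespace, such as U+2000, but not invisible format characters such as U+FEFF. trim_invisible is a hypothetical helper written for this example, and treating every format character (category Cf) as trimmable is an assumption, not an established rule.

```python
import unicodedata

def is_invisible(ch: str) -> bool:
    """Whitespace, or a format character (general category Cf) such as U+FEFF."""
    return ch.isspace() or unicodedata.category(ch) == "Cf"

def trim_invisible(name: str) -> str:
    """Hypothetical helper: drop leading and trailing invisible characters."""
    start, end = 0, len(name)
    while start < end and is_invisible(name[start]):
        start += 1
    while end > start and is_invisible(name[end - 1]):
        end -= 1
    return name[start:end]

en_quad = "TheAlex\u2000"  # trailing EN QUAD, a real whitespace character
bom = "TheAlex\ufeff"      # trailing ZERO WIDTH NO-BREAK SPACE

print(en_quad.strip() == "TheAlex")   # strip() knows about U+2000
print(bom.strip() == "TheAlex")       # but leaves U+FEFF in place
print(trim_invisible(bom) == "TheAlex")
```

Whatever rule you settle on, it needs to live in one function that every code path calls.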

Characters look different

Unlike ASCII, Unicode has multiple code points for the same character and multiple characters that look nearly, or completely, identical without being the same character. As a concrete example, paste "tyрeablе" == "typeable" into your favourite REPL. repl.it is handy if you have none at hand.

Got False? That is because the p is not a p but the Cyrillic letter for the 'er' sound.
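One rough way to spot this kind of lookalike is to check which scripts a string mixes. The sketch below assumes Python 3 and derives the script from the first word of each character's Unicode name, which is a crude heuristic and not a proper script lookup. The spoofed string is built with explicit escapes, and which letters are swapped in it is my choice for the illustration.

```python
import unicodedata

genuine = "typeable"
spoof = "ty\u0440eabl\u0435"  # CYRILLIC SMALL LETTER ER and IE swapped in

def scripts(text: str) -> set:
    """Crude heuristic: first word of each character's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text}

print(genuine == spoof)
print(scripts(genuine))
print(scripts(spoof))

# Normalization does not help: these are distinct letters, not variants.
print(unicodedata.normalize("NFKC", spoof) == spoof)
```

A name that mixes LATIN and CYRILLIC is not necessarily malicious, but it is a cheap signal to flag for review.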

As to why this is a problem, let's take this bit of our database schema as an example:

"uniq_address" UNIQUE CONSTRAINT, btree (country, city, address)
"uniq_name" UNIQUE CONSTRAINT, btree (name)

I would posit that in the Unicode era, these constraints make no sense at all. Since these fields are user input, the user is free to mimic whatever address or name they want. This allows the user to attempt all kinds of heists by, say, having the same screen name as someone else. Also, things like addresses don't stay digital. Sooner or later, an address is going to be read or printed, and then the difference, which the database was keen to notice, will be gone. Is there anything analog in your process that could be exploited by pretending to be another user?

This problem certainly preceded Unicode, especially in some character sets like ISO-8859-5, but Unicode makes this much worse and more widely applicable. Getting down to it, you can assume almost nothing about what the string is going to l̷o̵o̷k̵ ̶l̴i̴k̵e̷.

Text goes from left to right

Quickly, what happens if I paste this into my terminal?

‮rm -rf your_home_directory # dlrow olleh ohce

I dare you to try it yourself. If you care about your home directory, paste it into any reasonably dumb text field instead of your terminal.

Some languages are not written from left to right, and to accommodate them, Unicode has these 'flip the direction of writing' codes. The underlying characters stay in the same order even though they are displayed from right to left, so your terminal probably would have tried to wipe your files if you had tried my example.
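A simple defence is to look for the direction-control characters before trusting how text is displayed. The following is a minimal sketch in Python 3; find_bidi_controls is a hypothetical helper, and the set lists the explicit bidirectional formatting characters I am aware of, so treat its completeness as an assumption.

```python
# Explicit bidirectional marks, embeddings, overrides and isolates.
BIDI_CONTROLS = {
    "\u061c", "\u200e", "\u200f",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def find_bidi_controls(text: str) -> list:
    """Return (index, code point) pairs for every bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

# Same shape as the example above: a right-to-left override up front.
pasted = "\u202erm -rf your_home_directory # dlrow olleh ohce"

print(find_bidi_controls(pasted))
print(find_bidi_controls("echo hello world"))
```

Rejecting or visibly escaping such characters in places like URLs, file names and screen names removes the trick; legitimate right-to-left text usually does not need an explicit override.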

Besides me messing with my colleagues on Teams, this bidirectional writing has been used for quite a few hoaxes, the longtime favourite being flipping long URLs backwards so they look innocent.

Strings have the same decoding

One of the things we happily assumed with ASCII (and variants) was that the decoding was trivial and unlikely to go wrong. Some of my University colleagues can read ASCII fluently from hex dumps! This meant that the only problem when transmitting data as Strings was to correctly parse the contents of the String.

Unicode, with its multibyte encodings, adds another step. You must first decode the bytes into a String before you can get started on the content.

Now, parsing is one of the problem areas that is known to cause security issues. One of the key problems is that the same String may get parsed differently in one program than in another. A good contemporary example of this is having an HTML sanitizer (the thing that stops XSS) speak a slightly different dialect of HTML than the browser that the user is using. If these two disagree on the interpretation of some String, the sanitizer might judge it to be free of scripts and other malicious items, while the browser could interpret things slightly differently and start executing bits of the input as scripts. Using the same channel for control and content must be worth more than the billion-dollar mistake of including null in programming languages!

Now, this is exacerbated by Unicode, since not all Unicode decoders agree on all sequences of bytes. Mostly, it is the illegal sequences that get handled differently. For example, "e3 80 22" is an invalid UTF-8 sequence, and one decoder might judge it to be one illegal character while another could be more lax and interpret it as three: ã, \x80 and ". Now, to put this into a web context, the last of the three could be a problem, since it would allow XSS through attribute values.
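The disagreement can be reproduced with a single decoder by switching its settings. A minimal sketch, assuming Python 3's built-in codecs; the Latin-1 decode stands in for a 'lax' component that accepts any byte, which is my assumption about what such a component might do.

```python
raw = bytes.fromhex("e38022")  # truncated three-byte sequence, then a quote byte

# A strict UTF-8 decoder refuses the input outright.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "rejected"

# A lax component that treats every byte as a character sees three of them.
lax = raw.decode("latin-1")

# A decoder that substitutes replacement characters keeps going as well.
replaced = raw.decode("utf-8", errors="replace")

print(strict_result)
print(len(lax), lax.endswith('"'))
print(replaced.endswith('"'))
```

If a sanitizer follows one of these behaviours and the consumer follows another, they no longer agree on whether a quote character is present in the data.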

Concluding thoughts

As a software engineer, I find that Unicode puts a lot of complexity on my table, much of which I really wouldn't need. The individual gotchas listed above are not so hard to handle by themselves, but the effect their presence has on the whole system can be significant. Now you need to decide what kind of strings you allow in your system, and you need to figure out how to properly normalize them, how to eliminate homoglyphs, and how to strip and trim whitespace.

The problem with this is that all such things must happen uniformly. If you normalize a String in a certain way in one bit of your program and some other bit does it differently, you have an inconsistency at best, or a security issue at worst. And because mistakes happen, you also have to record precisely what has been done to each String so you can take that into account when using it.

And, unfortunately, no, you cannot just 'fix your strings' at every use point. Some string operations are only safe to do once, or you lose information or worse. You need to know and track the semantics of Strings to know what steps you need to take, and what steps you can't take, in the context you are working in.
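One way to track what has already been done to a String is to record it in the type. The sketch below assumes Python 3 and uses HTML escaping as the 'only safe once' operation; EscapedHtml and escape_once are hypothetical names invented for this illustration.

```python
import html
from dataclasses import dataclass

@dataclass(frozen=True)
class EscapedHtml:
    """Marker type: the wrapped text has been HTML-escaped exactly once."""
    text: str

def escape_once(value) -> EscapedHtml:
    """Escape plain strings; pass already-escaped values through untouched."""
    if isinstance(value, EscapedHtml):
        return value
    return EscapedHtml(html.escape(value))

raw = "Tom & Jerry"

# Escaping a bare str twice corrupts the text ...
twice = html.escape(html.escape(raw))

# ... while the wrapper makes the second call a no-op.
tracked = escape_once(escape_once(raw))

print(twice)
print(tracked.text)
```

The same idea applies to normalization, trimming and case folding: a distinct type for 'normalized screen name' keeps raw and processed values from being mixed up.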

Addendum: I'm a bit of a sloppy writer, so I feel I need to reiterate the original point so I don't come off as some ASCII fan.

Some people read this as an argument against Unicode, which it is not. I don't want to go back to ISO-8859-1, because that sucks. Also, I'm ready to deal with a lot of complexity to allow people to write their names properly. What I tried to argue here was that working with Unicode is necessarily more complex than working with ASCII was, and that I see people going about with lots of beliefs about string processing that belong to the ASCII era and do not work with Unicode.

Some of the examples are low level and some are dated, but some, like homoglyphs, are ubiquitous. Whether they are relevant for you depends on what kind of work you do and which language you use.

(Also, in the first example, consider UTF-16 and PHP not having null-terminated strings.)

Top comments (3)

Florian Pigorsch • Aug 21 '21

The interesting thing is - as @rdentato points out in his recent post - that the UTF-8 encoding plays nicely with plain old C, as long as you don't need any special properties of Unicode.
E.g. UTF-8 never introduces a null byte, so strlen works; UTF-8 never introduces ASCII bytes (all bytes used for encoding are > 0x7F), so searching for ASCII characters in a UTF-8 encoded string still works by iterating over the strings bytes; etc.

Remo Dentato • Aug 23 '21

While I do agree that Unicode is complicated, I have to disagree about it ruining the code. It is much better than what we had before: a bunch of incompatible characters sets. Surely there are things that could have done differently, but it's easy for us to say it in hindsight.
Reality is that you very rarely need to face the full complexity of Unicode. If you have to fight with collating, maybe the complexity of combining characters may be spared to you.
Human writing systems are complex and irregular, setting a single standard (Unicode for now) is a huge effort, practically impossible to capture everything that we humans can write. Unicode provide us with a way, for how much complicated it can be, to deal with most of those writing systems.
It makes our code better, not worse!

𒎏Wii 🏳️‍⚧️ • Aug 21 '21

Just a small tip: If you're gonna write about unicode, a good starting point might be knowing the difference between Unicode and UTF-8 ;)