Unicode
Today the world speaks Unicode. More precisely, today the world speaks UTF-8 encoded Unicode. So long to ISO-8859-x, KOI8 and Shift-JIS; and also to UCS-2 and other multibyte encodings!
7-bit ASCII is heartily welcome to stay, but let's do our best to promote the use of UTF-8 everywhere. It is worthwhile!
For us C programmers, the price to pay is to get rid of the char type we all know and love: with its assumption that a character fits into a single byte, it is no longer adequate.
C90 introduced wchar_t and a bunch of functions to help deal with non-ASCII encodings, but I always found them unnecessarily complex and confusing.
Of course, when dealing with Unicode strings you can grab one of the Unicode libraries available for C (like ICU or libunistring) and go with it, but they are very complex and, maybe, you don't need all the features they offer.
Thanks to the beauty of UTF-8 (pure genius), most likely you need very little or no code at all to handle UTF-8 encoded strings, depending on what you have to do with them.
Let's delve deeper to understand when a full-fledged library is needed and when you can just use the tools you already have at your disposal.
The Encoding
Let's just remind ourselves how UTF-8 works. Actually, it's very simple: given a Unicode codepoint (let's call it a character, even if we know that's not 100% accurate), its bits are spread across multiple bytes according to the following table:
| Range          | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|----------------|----------|----------|----------|----------|
| 0000 - 007F    | 0xxxxxxx |          |          |          |
| 0080 - 07FF    | 110xxxxx | 10xxxxxx |          |          |
| 0800 - FFFF    | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 10000 - 10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Additional rules for a valid UTF-8 encoding:
- it must be minimal (it must use the smallest possible number of bytes);
- codepoints U+D800 to U+DFFF (known as UTF-16 surrogates) are invalid and, hence, their encoding is invalid.
I'll deal with validating the encoding in a future post, for now let's see what UTF-8 allows us to do by simply ignoring the fact that the string is, indeed, UTF-8 encoded!
Useful properties
The UTF-8 encoding has many useful properties. Most notably:
- The first 128 characters occupy just one byte and their encoding is the same in both ASCII and UTF-8.
- The two most significant bits of the first byte of a multibyte encoding are 11 (i.e. if (b & 0xC0) == 0xC0, the byte b is the first byte of a multibyte encoding).
- The two most significant bits of the following bytes of a multibyte encoding are 10 (i.e. if (b & 0xC0) == 0x80, the byte b is a continuation byte of a multibyte encoding).
- No NUL character ('\0') is introduced as a byproduct of the encoding, meaning that our convention that a string is 0-terminated is safe.
- UTF-8 preserves ordering: the relative order of two encoded characters is the same as their unencoded order.
The fact that any ASCII text is also valid UTF-8 encoded text greatly simplifies some tasks. For example, if you have to work with CSV (comma separated values) files, and you are not interested in the content of the fields, you can completely ignore the fact that the file is UTF-8 encoded, since the separators are most likely ASCII characters (',', ';', ...).
Also note that being able to easily identify the first byte of an encoded character makes it possible to move to the next or previous character in the string starting from any point, even from a byte in the middle of a multibyte encoding. This is a very desirable property for an encoding: it means one can quickly re-sync if something goes wrong while decoding.
Nothing (or very little) to do here
As a result of the above mentioned properties, many functions in the C standard library continue to work (possibly with some caveats):
- strcpy(), strcmp(), strstr(), fgets(), and any other function that relies on ASCII terminators ('\0', '\n', '\t', ...) are completely unaffected.
- strtok(), strspn(), strchr() will work as long as their other argument is within the ASCII range.
- For strlen(), strncpy(), and other size-limited functions, the n parameter expresses the size (in bytes) of the buffer the string is in, not the number of characters in the string.
In general, for any function you want to use, ask yourself whether it makes any difference that the characters are UTF-8 encoded, and write just the minimal code you need.
You may take advantage of the UTF-8 encoding to write simple functions like this:
// Returns the number of characters in an UTF-8 encoded string.
// (Does not check for encoding validity)
int u8strlen(const char *s)
{
int len=0;
while (*s) {
if ((*s & 0xC0) != 0x80) len++ ;
s++;
}
return len;
}
Or something more complex (but still not so complicated):
// Avoids truncating a multibyte UTF-8 encoding at the end.
char *u8strncpy(char *dest, const char *src, size_t n)
{
  int i;
  if (n) {
    int k = n-1;
    dest[k] = 0;
    strncpy(dest,src,n);
    if (dest[k] & 0x80) { // Last byte has been overwritten by a non-ASCII byte
      // Walk back (at most 3 bytes) to the lead byte of the last character
      for (i=k; (i>0) && ((k-i) < 3) && ((dest[i] & 0xC0) == 0x80); i--) ;
      // If the bytes from i to k don't form a complete character, drop them
      switch(k-i) {
        case 0: dest[i] = '\0'; break; // lone lead byte
        case 1: if ( (dest[i] & 0xE0) != 0xC0) dest[i] = '\0'; break; // not a complete 2-byte char
        case 2: if ( (dest[i] & 0xF0) != 0xE0) dest[i] = '\0'; break; // not a complete 3-byte char
        case 3: if ( (dest[i] & 0xF8) != 0xF0) dest[i] = '\0'; break; // not a complete 4-byte char
      }
    }
  }
  return dest;
}
Conclusion
As a rule of thumb:
When you're asked to deal with UTF-8 encoded strings in C, ask yourself what aspect of the encoding really impacts your work. You may discover that being UTF-8 encoded is immaterial for the work you have to do!
Next steps
This post focused on the easy part to avoid scaring you away, but there are two major aspects that need to be discussed:
- validation: how to determine if a sequence of bytes really is a UTF-8 encoded character;
- folding: transforming characters between their uppercase and lowercase forms (if any). That's a very complex point, and most relevant for case-insensitive comparison, which is a very common task.
I'll address them in the next posts on this topic.
Please let me know if I missed something or if it wasn't clear enough. Your feedback is what makes these posts worth writing.
Top comments (4)
Very interesting series on Unicode in C. I wonder if properties of UTF-8, most importantly: never introduce a NULL in the encoding, were added specifically to allow for interoperability with C's NULL-terminated strings.
I think this is the case, as UTF-8 is the brainchild of Ken Thompson and Rob Pike, two protagonists of the Unix world since the beginning.
Ah, I wasn't aware of the origin story - now it makes perfect sense...