Remo Dentato

Posted on Aug 4, 2021 • Edited on Oct 15, 2022

UTF-8 strings in C (2/3)

#c #utf8 #encoding

Validation

An ironclad rule we should always follow is that:

any UTF-8 string must be validated before being used.

I won't delve into the reasons why but they are very important.

So, when in the previous post I said that "you may handle a UTF-8 encoded string as if it was not encoded" (i.e. skipping validation), what I really meant is that you can do it only if you are 100% sure that the string has already been checked or it will be at a later time before being used in any meaningful way (including storing it in a DB).

If in doubt, you have to validate the string yourself.

Looking at table on the previous post (or on Wikipedia), it's clear that not every sequence of bytes is a valid UTF-8 encoding.

For example the sequence of bytes C2 41 is clearly invalid because the second byte of a multibyte UTF-8 encoding must be greater that 0x7F.

There are many possible ways to validate UTF-8 encodings.

There are extremely fast, parallel algorithms but our objective here is to understand how how validation works and let's get ready to deal with it.

Get characters

The most basic operation we do on a sequence of characters is to get the next one.

What we need is a function that takes a string, checks if the first bytes are a valid UTF-8 encoding and, in case, returns the corresponding character plus the number of bytes the encoding spans.

We also need to decide what to do with invalid encodings. The Unicode standard reccomends that any invalid encoding should be replaced by the character U+FFFD: � (a losenge containing a question mark) but, as C programmers, it seems odd to me to return something that is completely unrelated with the string or the error.

Let's decide that if we find an invalid sequence of bytes, we'll return the first one (i.e. a length of 1) and will set errno to EINVAL to signal the error.

Character representation

For the internal representation of characters, we have two choices:

decode the codepoint and use the Unicode value in our program
directly use the encoded form

The first one seems more logical but if you think about it, it's often a waste of time. Consider comparing two strings, for example. Comparing the encoded form is exactly the same as comparing the codepoint values but it saves the decoding step for the two strings.

On the other hand, the second one is less user-friendly. If you want to use constants, you have to manually encode the codepoint value and use it.

However, I claim that referring to "ジ" as 30B8 (the Unicode codepoint) or as E382B8 (its UTF-8 encoding) makes not such a big difference.

Any valid UTF-8 encoding will fit into 4 bytes so let's define a new type u8chr_t:

typedef uint32_t u8chr_t;

Encoding Length

To determine the length of the encoding looking at its first byte we can define an array and a macro that uses the upper 4 bits of the byte as an index in the array:

static uint8_t const u8_length[] = {
// 0 1 2 3 4 5 6 7 8 9 A B C D E F
   1,1,1,1,1,1,1,1,0,0,0,0,2,2,3,4
} ;

#define u8length(s) u8_length[(((uint8_t *)(s))[0] & 0xFF) >> 4];

The macro will tell us the length of the encoding of the first character.
Here are some examples:

u8length("abc")    -> 1  ('a' is "\x41")
u8length("èfg")    -> 2  ('è' is "\xC3\xA8")
u8length("\x82HZ") -> 0  (invalid)
u8length("\xC3")   -> 2  (invalid, but we don't know yet)

Note that the last two ones are invalid encoding. In the first case we can immediately say that it's a stray byte. In the second one we will have to look at the next bytes to recognize that the encoding is invalid.

Loading the encoding

Now that we have the expected length, we can load the encoding one byte at the time (txt is the pointer to the string):

   u8chr_t encoding = 0
   int len = u8length(txt);

   for (int i=0; i<len && txt[i] != '\0'; i++) {
      encoding = (encoding << 8) | txt[i];
   }

Note that the code does not read past the end of the string if there are less byte than expected. If that happens, we exit from the loop and the encoding variable will keep what has been found so far.

Validate!

And now the core part: let's define a function to check if the loaded encoding is a valid one:

int u8chrisvalid(u8chr_t c)
{
  if (c <= 0x7F) return 1;                    // [1]

  if (0xC280 <= c && c <= 0xDFBF)             // [2]
     return ((c & 0xE0C0) == 0xC080);

  if (0xEDA080 <= c && c <= 0xEDBFBF)         // [3]
     return 0; // Reject UTF-16 surrogates

  if (0xE0A080 <= c && c <= 0xEFBFBF)         // [4]
     return ((c & 0xF0C0C0) == 0xE08080);

  if (0xF0908080 <= c && c <= 0xF48FBFBF)     // [5]
     return ((c & 0xF8C0C0C0) == 0xF0808080);

  return 0;
}

The function has just five if's:

[1] Check for ASCII values.
[3] Exclude the UTF-16 surrogates.
the other three look at both the values and the bit pattern to ensure it's a valid encoding.

Let's see in detail the check [2] (the other works in the same way).

This covers the range U+0080 - U+07FF (eleven bits) which is represented with two bytes: 110xxxxx 10xxxxxx.

Just replacing the bits, it's easy to see that the encoding for U+0080 is 0xC280 (1100 0010 1000 0000) and the encoding for U+07FF is 0xDFBF (1101 1111 1011 1111). Any encoding lower than 0xC280 must be invalid (it comes from a non minimal encoding!), any encoding higher than 0xDFBF will be further verified by the other if's.

However, not all the values in that range are valid: consider the value '0xCAF3' which is 1100 1010 1111 0011 in binary; the second byte is not in the form 10xx xxxx.
Masking the encoding with '0xE0C0' will reveal if the bits are what they are supposed to be: '0xC080'.

If it seems confusing, just try with a couple of codepoints. I'm sure it will become clearer in no time.

All the other if's work in the same way.

I've written a more detalaid description on Stack Overflow as a self response. I wanted to be sure that this method was correct before using it and I tapped on the SO collective wisdom.

All togheter now

Now we can complete our initial function that will return the length of the next character in a string:

int u8next(char *txt, u8chr_t *ch)
{
   int len;
   u8chr_t encoding = 0;

   len = u8length(txt);

   for (int i=0; i<len && txt[i] != '\0'; i++) {
      encoding = (encoding << 8) | txt[i];
   }

   errno = 0;
   if (len == 0 || !u8chrisvalid(encoding)) {
      encoding = txt[0];
      len = 1;
      errno = EINVAL;
   }

   if (ch) *ch = encoding;

   return encoding ? len : 0 ;
}

Note that at the end we added another check: if the character is '\0' the length is also 0. In this way you can add the return value to the current pointer into the string and be sure that you will not past the end of the string.

Decode/Encode

Decoding and Encoding characters is just a matter of masking the bits of interest and shifting them in an integer.

To understand what's going on in this functions, you should keep the table with the structure of UTF-8 encoding in front of you.

// from UTF-8 encoding to Unicode Codepoint
uint32_t u8decode(u8chr_t c)
{
  uint32_t mask;

  if (c > 0x7F) {
    mask = (c <= 0x00EFBFBF)? 0x000F0000 : 0x003F0000 ;
    c = ((c & 0x07000000) >> 6) |
        ((c & mask )      >> 4) |
        ((c & 0x00003F00) >> 2) |
         (c & 0x0000003F);
  }

  return c;
}

// From Unicode Codepoint to UTF-8 encoding
u8chr_t u8encode(uint32_t codepoint)
{
  u8chr_t c = codepoint;

  if (codepoint > 0x7F) {
    c =  (codepoint & 0x000003F) 
       | (codepoint & 0x0000FC0) << 2 
       | (codepoint & 0x003F000) << 4
       | (codepoint & 0x01C0000) << 6;

    if      (codepoint < 0x0000800) c |= 0x0000C080;
    else if (codepoint < 0x0010000) c |= 0x00E08080;
    else                            c |= 0xF0808080;
  }
  return c;
}

Conclusion

Encoding, Decoding and Validation may seem more complicated than they really are; some bit manipulation can do the trick very easily. Nothing really new.

What is a little bit more complicated is to extend the usual functions on characters. How can we get functions euivalent to isdigit() or tolower()?

Except for the most trivial tasks we need those functions and this will the the topic of next post.

Top comments (2)

Ross Mohn • Jul 3 '23

Thanks for this article. I'm now using parts of this code in my a4 terminal multiplexer project. One note, in the "Loading the encoding" for() loop I had to OR each character with 0xFF. Not sure if this is universal or just my code.

for (int i=0; i<len && txt[i] != '\0'; i++) {
    encoding = (encoding << 8) | (txt[i] | 0xFF);
}

Remo Dentato • Jul 20 '23 • Edited

Thanks Ross. Happy to know it has been useful!

If txt is a char or unsigned char there should be no need for the AND (I saw your code has, correclty, a &), but there are architecture where a char is larger than 8 bits, so doing it will be on the safe side.

DEV Community

UTF-8 strings in C (2/3)

Validation

Get characters

Character representation

Encoding Length

Loading the encoding

Validate!

All togheter now

Decode/Encode

Conclusion

Top comments (2)

Read next

chatGPT - C programming Linux Windows cross-platform - code review request

C debugger Not able to Print or take Input when in debugger mode is On

CS50's Introduction to Computer Science - Week 1 / Notes

C and C++ are really so fast?