Maroun Maroun

Posted on Jan 10, 2020 • Edited on Jan 12, 2020

Unicode and Character Sets

#unicode #tutorial #encoding

Computers have their own language. They don't understand human languages (maybe they will if you're weird and speak binary), and they don't know anything but binary. How do we communicate with them?

As I'm typing now, the computer is not aware of any of the characters you're seeing. Let's consider the "M" character. At the lowest level, "M" and 77 are stored using the exact same sequence of 0s and 1s: 01001101.
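To see this for yourself, here is a minimal Python sketch (the variable names are just for illustration). It reads the same bit pattern once as a number and once as a character:

```python
# The same bit pattern can be read as a number or as a character.
bits = "01001101"

number = int(bits, 2)      # interpret the bits as an unsigned integer
character = chr(number)    # interpret that integer as a character

print(number)                    # 77
print(character)                 # M
print(format(ord("M"), "08b"))   # 01001101
```

The bits never change; only the interpretation does.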

Before we proceed, let's remember two fundamental definitions:

Bit - smallest unit of storage, can only store 0 or 1
Byte - 1 Byte = 8 bits. That's it

Unicode

The Unicode Standard:

The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages.

In simple words, Unicode assigns every character a unique number (called a code point), regardless of the platform, program, or anything else.
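Python exposes code points directly through the built-in ord and chr functions. The small helper below (code_point_label is a name made up for this sketch) prints a character's code point in the usual U+XXXX notation:

```python
def code_point_label(ch):
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "δ", "語", "♥"]:
    print(ch, ord(ch), code_point_label(ch))

# chr goes the other way: from a code point back to the character.
print(chr(0x03B4))  # δ
```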

Character Set

A character set is a fixed collection of symbols. For example, "أ" to "ي" is a character set representing the Arabic alphabet.

Another example is the famous ASCII table, a 7-bit character code where every sequence represents a unique character. ASCII can represent 2^7 (= 128) characters (including non-printable ones), but sadly, it can't represent love ♥ 😔, the Hebrew, Russian, or Arabic alphabets, or many other useful characters. But why?

Since any file has to go through encoding/decoding in order to be properly stored, your computer needs to know how to translate the character set of your language's writing system into sequences of 0s and 1s. This process is called Character Encoding. You can think about it as a simple lookup table. To give you an intuition about what "table" means, take one entry of the ASCII table:

The "A" character is represented by the decimal value 65 (which is 1000001 in 7-bit binary).
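The table idea and the 128-entry limit can be sketched in a few lines of Python. The ascii_entry function is a hypothetical helper written for this post, not part of any library:

```python
def ascii_entry(ch):
    """Return (decimal value, 7-bit binary string) for an ASCII character."""
    value = ord(ch)
    if value >= 2 ** 7:
        # Only values 0..127 fit into 7 bits.
        raise ValueError(f"{ch!r} is outside the 128-entry ASCII table")
    return value, format(value, "07b")

print(ascii_entry("A"))  # (65, '1000001')

try:
    ascii_entry("♥")
except ValueError as err:
    print(err)
```

Anything with a value of 128 or more simply has no row in the table, which is why ♥ can't be written in ASCII.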

So now the question is, how do we represent characters that are out of this range?

Encoding Systems

It's very important to distinguish between a character set and an encoding system. The former is simply the set of characters you can use, while the latter is the way these characters are stored in memory (as a stream of bytes). That's why there can be more than one encoding for a given character set.
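As a quick illustration, the Python sketch below takes one character with one code point and stores it with three different encodings. The "-be" codec names fix the byte order so that no extra marker bytes are added:

```python
text = "δ"
print(f"code point: U+{ord(text):04X}")  # code point: U+03B4

# One character, three different byte sequences.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, data in encoded.items():
    print(name, data.hex(" "))
```

The character is the same every time; only the bytes on disk differ.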

Besides ASCII, there are many other encoding systems:

UTF-8
UTF-16
UTF-32
EUC

In this post, we'll talk about the UTF-X systems.

UTF-32

This scheme requires 32 bits (4 bytes) to encode any character. For example, in order to represent the code point of the "A" character using this scheme, we need to write 65 as a 32-bit binary number:

00000000 00000000 00000000 01000001

If you take a closer look, you'll note that the rightmost 7 bits are exactly the same as in the ASCII scheme. But since UTF-32 is a fixed-width scheme, we must attach three additional bytes. This means that if we have two files that contain only the "A" character, one ASCII-encoded and the other UTF-32-encoded, their sizes will be 1 byte and 4 bytes respectively.

This scheme wasn't good for English speakers, because a file that contains only ASCII characters and takes up X bytes turns into a 4X-byte monster (a huge waste of memory).
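The 4X blow-up is easy to measure. Here is a small Python sketch (the sample text "Hello" is my own choice) that uses the "utf-32-be" codec so that only the character bytes are counted:

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
utf32_bytes = text.encode("utf-32-be")  # explicit byte order, no extra marker bytes

print(len(ascii_bytes), len(utf32_bytes))  # 5 20

# The four bytes of "A": three zero bytes followed by the ASCII value.
print(" ".join(format(b, "08b") for b in "A".encode("utf-32-be")))
# 00000000 00000000 00000000 01000001
```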

UTF-16 (+ LE and BE)

Another solution came up in the form of UTF-16. Many people think that just as UTF-32 uses a fixed width of 32 bits to represent a code point, UTF-16 uses a fixed width of 16 bits. WRONG!

In UTF-16, a code point may be represented in either 16 bits or 32 bits, so this scheme is a variable-length encoding system. What is the advantage over UTF-32? For ASCII text, files won't be 4 times the original size, but they'll still be twice as large, so we're still not backward compatible with ASCII.

Since 7 bits are enough to represent the "A" character, we can now use 2 bytes instead of the 4 that UTF-32 needs.

00000000 01000001
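To check the variable-length claim, the sketch below counts how many 16-bit units a character needs in UTF-16. The utf16_units helper is a made-up name for this example, and the sample characters are my own picks:

```python
def utf16_units(ch):
    """Number of 16-bit units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-be")) // 2

for ch in ("A", "δ", "😔"):
    print(ch, utf16_units(ch) * 16, "bits")
```

"A" and "δ" fit into a single 16-bit unit, while "😔" needs two of them, which makes 32 bits.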

Some of you might be wondering now: "Why did we add the extra byte at the beginning and not at the end?". Since we know that 2 bytes are required, why can't we flip the representation and interpret the result from right to left:

01000001 00000000

Well, we can. Some companies actually use this encoding. Let's try to imitate the computer when it reads 16 bits. Let's write a simple C snippet that allocates 8 integers in a sequence for us:

#include <stdlib.h>

int *p;
p = (int *) malloc(8 * sizeof(int));  /* room for 8 consecutive ints */

The OS will return the address of the first byte of the block. So "p" points to the first place in memory, and the following bytes sit at increasing addresses. If we store the data in this form (the two boxes below represent two bytes):

+---+---+
| 0 | A |
+---+---+

we'll have to move one byte to the right before we reach the "A", while in this layout it is read immediately:

+---+---+
| A | 0 |
+---+---+

The first format is called Big Endian (the most significant byte is stored first), while the second is called Little Endian (the least significant byte is stored first).
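Python ships both byte orders as separate codecs, "utf-16-be" and "utf-16-le", so the two layouts can be compared directly. This is only a sketch; the variable names are illustrative:

```python
big = "A".encode("utf-16-be")     # most significant byte first
little = "A".encode("utf-16-le")  # least significant byte first

print(big.hex(" "))     # 00 41
print(little.hex(" "))  # 41 00

# Reading the bytes with the wrong byte order yields a different code point.
misread = big.decode("utf-16-le")
print(f"U+{ord(misread):04X}")  # U+4100
```

This is why the reader of a UTF-16 file must know which byte order the writer used.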

UTF-8

You guessed right. In UTF-8, a code point may be represented using 8, 16, 24, or 32 bits. Like UTF-16, it is a variable-length encoding system.

Finally, we can represent "A" in the same way we represent it using the ASCII encoding system:

01000001
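Here is a short Python sketch that shows all four possible lengths. The sample characters are my own picks, one for each length:

```python
samples = ["A", "δ", "語", "😔"]

for ch in samples:
    data = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in data)
    print(ch, len(data) * 8, "bits:", bits)

# For ASCII characters, UTF-8 and ASCII produce identical bytes.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```

The four characters take 8, 16, 24, and 32 bits respectively.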

Playground

Open your favorite text editor (Vim) and create a file that contains the "A" character. Let's see its encoding:

$ xxd -b test.txt
0000000: 01000001 00001010

The first "0000000" is not important for our analysis; it denotes the offset. The byte after it is the binary representation of the "A" character in ASCII encoding (7 bits, padded with a leading 0 to fill the byte). Finally, the last byte is the ASCII encoding of the line feed (validate by checking the ASCII table: look for the decimal value of this number). Now let's check the file size:

$ du -b test.txt | cut -f1
2

Great! 2 bytes as expected (remember the line feed that's automatically added).
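If xxd or du aren't available on your machine, the same experiment can be approximated in Python. This is a sketch under a few assumptions: it writes the file into a temporary directory and adds the line feed explicitly, since Python won't add one for you the way Vim does:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")

    # newline="\n" keeps the line feed as a single byte on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("A\n")

    with open(path, "rb") as f:
        data = f.read()
    size = os.path.getsize(path)

dump = " ".join(format(b, "08b") for b in data)
print(dump)  # 01000001 00001010
print(size)  # 2
```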

Let's continue playing - let's remove the "A" and insert the "δ" character:

$ file test.txt
test.txt: UTF-8 Unicode text

Now let's check its size:

$ du -b test.txt | cut -f1
3

Why 3? Let's see the binary dump of the file:

$ xxd -b test.txt
0000000: 11001110 10110100 00001010

Let's ignore the last byte that represents the line-feed and focus on the actual encoding of the δ character:

11001110 10110100

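In UTF-8, a first byte that starts with the 3 bits 110 means the character is encoded using two bytes, and the second byte always starts with 10. The remaining bits (01110 and 110100) carry the code point itself: δ is U+03B4, which is 948 in decimal or 01110110100 in binary. So the UTF-8 encoding of δ is: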

11001110 10110100

Which is exactly what we got.
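The bit shuffling for the two-byte case can be written out by hand. The function below is a sketch that covers only two-byte sequences (code points U+0080 through U+07FF), and encode_two_byte_utf8 is a name invented for this post:

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers U+0080 through U+07FF only")
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: the top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: the low 6 bits
    return bytes([first, second])

manual = encode_two_byte_utf8(ord("δ"))
print(" ".join(format(b, "08b") for b in manual))  # 11001110 10110100
print(manual == "δ".encode("utf-8"))               # True
```

The result matches both the xxd dump and Python's built-in encoder.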

UTF-8 vs UTF-16

Both UTF-8 and UTF-16 are variable-length encodings. A character encoded in UTF-8 occupies a minimum of 8 bits, while UTF-16 requires a minimum of 16 bits.

For basic ASCII characters, UTF-8 uses only one byte per character (which makes it backward compatible with ASCII), while UTF-16 uses two bytes.
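For plain ASCII text the difference is easy to measure. A minimal sketch, with a sample string of my own choosing:

```python
text = "Unicode"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # explicit byte order, no extra marker bytes

print(len(utf8), len(utf16))         # 7 14
print(utf8 == text.encode("ascii"))  # True

# An ASCII-only decoder can read the UTF-8 bytes unchanged.
print(utf8.decode("ascii"))  # Unicode
```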

Now let's talk about cases where UTF-8 encoding takes more bytes than UTF-16. Consider the Chinese character "語" - its UTF-8 encoding is: