Andrey Frolov

# Unicode and UTF-8

### Long story short

In the 1960s, there were teleprinters and simple devices: you typed a key, the machine sent a number over the wire, and the same letter came out on the other side. But every vendor had its own scheme, so in the mid-1960s, America settled on the American Standard Code for Information Interchange (ASCII).

It's a 7-bit binary system: any character you type gets converted into 7 binary digits and sent.

In a nutshell, it means you can have numbers from `0` to `127`.

```
(64) (32) (16) (8) (4) (2) (1)
  0    0    0   0   0   0   0  = 0
  1    1    1   1   1   1   1  = 127
```
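As a quick sanity check, here's the 7-bit range in Python (a small illustration, not part of the original standard text):

```python
# Seven bits cover the values 0 through 127
print(0b0000000)   # 0
print(0b1111111)   # 127
print(2 ** 7 - 1)  # 127, the largest 7-bit number
```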

An interesting point: they made a clever choice here. `A` in this system is `65`, which in binary is `1000001`:

```
1000000 = 64
0000001 = 1

A = 64 + 1 = 1000001
```

Let's look at `B` and `C`:

```
B = 1000010
C = 1000011
```

And here's the hack: you can just knock off the first two bits, and what's left is the letter's position in the alphabet. For lowercase, they placed the letters `32` positions later, which means for `a`:

```
a = 97 = 1100001
```
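This trick is easy to verify in Python (an illustrative snippet of mine):

```python
# Knock off the top two bits of an ASCII letter and you get
# its position in the alphabet; lowercase = uppercase + 32.
print(ord('A'))             # 65
print(ord('A') & 0b11111)   # 1  (A is the 1st letter)
print(ord('C') & 0b11111)   # 3  (C is the 3rd letter)
print(ord('a') & 0b11111)   # 1  (the same trick works for lowercase)
print(ord('a') - ord('A'))  # 32
```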

And it became the standard for the English-speaking world.

### New day new problems

What about languages that don't use the Latin alphabet at all? They each came up with their own encoding. And with a new day come new computers: we moved to 8-bit machines, so every 7-bit ASCII character now carried a whole extra bit at the start!

But no one settled on a common standard at the time. Japan went and created its own multibyte encodings, with more letters and more bytes per character. From that point on, everything became massively incompatible!

But most of the time you didn't have such problems: you just printed a document and faxed it. Then the World Wide Web hit, and suddenly documents were being sent all over the world. Enter the `Unicode Consortium`.

### Unicode to the rescue

Unicode now has a list of more than a hundred thousand characters, which covers everything you could possibly want to write in any language (even emoji). As a result, we have the `Unicode Consortium` assigning each of those 100,000+ characters its own number. They don't define any binary representation; they just say: `hey, that Japanese character is number 5700, and this Cyrillic character is 1000-something`.

So in `Unicode`, we operate with the next terms:

`Abstract character` - is a unit of information used for the organization, control, or representation of textual data.

Unicode deals with characters as abstract terms. Every abstract character has an associated name, e.g. LATIN SMALL LETTER A. The rendered form (glyph) of this character is `a`.

`Code point` - is a number assigned to a single character.
Code points are numbers in the range from `U+0000` to `U+10FFFF`.

`U+<hex>` is the format of code points, where U+ is a prefix meaning Unicode and `<hex>` is a number in hexadecimal. For example, `U+0041` and `U+2603` are code points.

Remember that a code point is a simple number. And that's how you should think about it. A code point is a kind of index of an element in an array.

The magic happens because Unicode associates a code point with a character. For example, `U+0041` corresponds to the character named LATIN CAPITAL LETTER A (rendered as `A`), and `U+2603` corresponds to the character named SNOWMAN (rendered as ☃).

Not all code points have associated characters. `1,114,112` code points are available (the range `U+0000` to `U+10FFFF`), but only `137,929` (as of May 2019) have assigned characters.
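Python exposes this mapping directly; a small sketch using the standard `unicodedata` module:

```python
import unicodedata

# A code point is just a number; Unicode ties it to a named character.
print(hex(ord('A')))                  # 0x41
print(unicodedata.name('A'))          # LATIN CAPITAL LETTER A
print(chr(0x2603))                    # renders the snowman glyph
print(unicodedata.name(chr(0x2603)))  # SNOWMAN
```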

`Code unit` - is a bit sequence used to encode each character within a given encoding form.

The character encoding is what transforms abstract code points into physical bits: code units. In other words, the character encoding translates the Unicode code points to unique code unit sequences.

### What is UTF

As we've seen, `Unicode` first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A, 66 stands for B, and 9,731 stands for ☃" (seriously, it does). How these code points are actually encoded into bits is a different topic: `UTF encoding`.
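You can see the split between code points and bytes in Python; the encodings below are standard ones, the two-character string is just my example:

```python
s = 'A' + chr(0x2603)  # LATIN CAPITAL LETTER A + SNOWMAN

# Same code points...
print([hex(ord(ch)) for ch in s])  # ['0x41', '0x2603']

# ...different byte sequences, depending on the encoding form.
print(s.encode('utf-8'))      # b'A\xe2\x98\x83' (1 byte + 3 bytes)
print(s.encode('utf-16-le'))  # 2 bytes per character here
print(s.encode('utf-32-le'))  # always 4 bytes per character
```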

### What problems UTF solves

To encode 100,000+ characters we need at least 17 binary digits (2^17 ≈ 131,000), but the English alphabet should stay exactly the same (for backward compatibility): `A` should still be `65`. So if you use a fixed 32 bits per character, a string of plain English text wastes 25 leading zeros on every character, with only a few meaningful bits. This is incredibly wasteful: every English text file takes four times the space on disk.

To summarise:

• Problem 1. Get rid of all the wasted zeros in English text.
• Problem 2. There are a lot of old computers that interpret 8 zeros in a row as a NULL, i.e. as a `this is the end of the string` character. If you send 8 zeros in a row, they just stop listening. So you can't ever have 8 zeros in a row.
• Problem 3. It has to be backward compatible. If you send a `UTF`-encoded string to a system that only supports `ASCII`, you should still get valid English text.
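Problem 1 is easy to measure in Python: a fixed 32-bit encoding (`UTF-32`) really does take four times the space of `UTF-8` for plain English text (the sample string is mine):

```python
text = 'Hello, world!'  # plain English, 13 characters

print(len(text.encode('utf-32-le')))  # 52 bytes: 4 bytes per character
print(len(text.encode('utf-8')))      # 13 bytes: 1 byte per character
```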

### How UTF solves such problems

To get started, it just uses `ASCII`: anything under `128` can be expressed in `7` digits, so in `UTF-8`, `A` is encoded the same:

```
A = 01000001 = 65
```
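Backward compatibility is easy to check (a tiny illustration):

```python
# ASCII bytes are valid UTF-8, unchanged.
print('A'.encode('ascii'))                         # b'A'
print('A'.encode('ascii') == 'A'.encode('utf-8'))  # True
print(bytes([65]).decode('utf-8'))                 # A
```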

So it's still valid `UTF` and valid `ASCII`. Now let's go above `127`, and remember, it shouldn't break `ASCII`-only systems. For this, we use the following headers:

`110` - the start-of-new-character header; two ones mean two bytes (a byte being `8` bits)

`10` - means a continuation byte

So let's take a look at an example:

```
 __________________________ ______________________________________
|                          |                                      |
 110           x x x x x    10                     x x x x x x
(the starter)  (5 bits)     (continuation header)  (6 bits)
```

So now you can just take all the bits, excluding the headers, and you get:

```
x x x x x   = 5 bits
x x x x x x = 6 bits

0 0 1 1 0 <> 1 1 0 0 1 0 = 434
```
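Here's the same two-byte decoding done by hand in Python (the bit masks follow the headers above; the concrete bytes are my example):

```python
# 110xxxxx 10xxxxxx: a starter byte plus one continuation byte
b = bytes([0b11000110, 0b10110010])

# Strip the headers and glue the payload bits together: 00110 ++ 110010
code_point = ((b[0] & 0b00011111) << 6) | (b[1] & 0b00111111)
print(code_point)              # 434
print(ord(b.decode('utf-8')))  # 434, Python's decoder agrees
```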

If you need more room, you use a `1110`-started header, which means you have `3` bytes: one starter and two continuation bytes:

```
 _________________ __________________ ________________
|                 |                  |                |
 1110 x x x x      10 x x x x x x     10 x x x x x x
```

And you can go even higher: the specification goes up to `1111110x`. So this hack avoids waste, it's backward compatible, and it never sends 8 zeros in a row.
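The whole scheme above can be sketched as a small encoder. This is a simplified illustration of mine (a real encoder also rejects surrogates and overlong forms):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 (simplified sketch)."""
    if code_point < 0x80:     # 1 byte: plain ASCII, 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:    # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b00111111)])
    if code_point < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b00111111),
                      0b10000000 | (code_point & 0b00111111)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b00111111),
                  0b10000000 | ((code_point >> 6) & 0b00111111),
                  0b10000000 | (code_point & 0b00111111)])

# Matches Python's built-in encoder for a few sample characters.
for cp in (0x41, 0x2603, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode('utf-8')
```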

### The bottom line

Thanks for reading the post and for your time. If there are any questions, feel free to write a comment below. I know I added a lot of simplifications, but I'm happy to fix them.

Feel free to ask questions, to express any opinion, and to discuss this from your point of view. Make code, not war. ❤️