In the 1960s, there were teleprinters and similar simple devices: you press a key, the machine sends a sequence of numbers, and the same letter comes out on the other side. But every vendor had its own nonstandard scheme, so in the mid-1960s America settled on the American Standard Code for Information Interchange (ASCII).
It's a 7-bit binary system: every character you type gets converted into 7 binary digits and sent.
In a nutshell, it means you can have numbers from 0 to 127:

```
(64) (32) (16) (8) (4) (2) (1)
  0    0    0   0   0   0   0  = 0
  1    1    1   1   1   1   1  = 127
```
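A quick way to sanity-check that range, sketched in Python (my choice of language for illustration, not anything the standard prescribes):

```python
# Seven bits, all set, give the largest 7-bit value.
highest = 0b1111111   # 64 + 32 + 16 + 8 + 4 + 2 + 1
print(highest)        # 127
print(2 ** 7)         # 128 distinct values in total: 0..127
```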
An interesting point here is that they did a clever thing.
A in this system is 65, which in binary is:

```
1000000 = 64
0000001 = 1
A = 64 + 1 = 1000001
```
Let's look at the next letters:

```
B = 1000010
C = 1000011
```
And here's the hack: you can just knock off the first two bits, and what's left is the letter's position in the alphabet. For lowercase, they placed each letter 32 numbers later, which means:

```
a = 97 = 1100001
```
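Both tricks are easy to verify with a couple of bit operations. A minimal Python sketch (the function names are mine, not part of ASCII):

```python
def alphabet_position(letter):
    # Knock off everything above the low five bits:
    # 'A' = 1000001 -> 00001 -> 1, 'c' = 1100011 -> 00011 -> 3.
    return ord(letter) & 0b11111

def toggle_case(letter):
    # Lowercase sits exactly 32 numbers after uppercase,
    # so flipping the 32s bit switches the case.
    return chr(ord(letter) ^ 0b100000)

print(alphabet_position("A"))  # 1
print(alphabet_position("c"))  # 3
print(toggle_case("A"))        # a
print(toggle_case("z"))        # Z
```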
And it became the standard for the English-speaking world.
But what about languages that don't use this alphabet, or don't have an alphabet at all? They all came up with their own encodings. Meanwhile, with a new day come new computers: we move to 8-bit machines, so now every 7-bit character carries a whole extra bit at the start!
But no one settled on a single standard at the time. Japan, for example, created its own multibyte encodings with more characters and more bytes per character. From this point on, everything became massively incompatible!
Most of the time this wasn't a problem: you just printed a document and faxed it. But then the World Wide Web hit, and suddenly documents were being sent all over the world. This is where Unicode comes in.
Unicode now has a list of more than a hundred thousand characters that covers everything you could possibly want to write in any language (even if it's emoji language 😃). As a result, we have the Unicode Consortium assigning 100,000+ characters to 100,000+ numbers. They don't define any binary representation; they just say:
hey, that Japanese character is number 5700, and this Cyrillic character is 1000-something.
In Unicode, we operate with the following terms:
Abstract character - a unit of information used for the organization, control, or representation of textual data.

Unicode deals with characters as abstract terms. Every abstract character has an associated name, e.g. LATIN SMALL LETTER A. The rendered form (glyph) of this character is "a".
Code point - a number assigned to a single character.

Code points are numbers in the range from U+0000 to U+10FFFF. U+<hex> is the format of code points, where U+ is a prefix meaning Unicode and <hex> is a number in hexadecimal. For example, U+0041 and U+2603 are code points.
Remember that a code point is a simple number. And that’s how you should think about it. The code point is a kind of index of an element in an array.
The magic happens because Unicode associates a code point with a character. For example:

- U+0041 corresponds to the character named LATIN CAPITAL LETTER A (rendered as A).
- U+2603 corresponds to the character named SNOWMAN (rendered as ☃).
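In Python, for instance, ord() and chr() are direct lookups into this table, which makes the "code point is just an index" idea concrete (the language choice here is mine, purely for illustration):

```python
print(ord("A"))      # 65, i.e. U+0041
print(ord("☃"))      # 9731, i.e. U+2603
print(chr(0x2603))   # ☃
print("\u0041")      # A - the escape form of a code point in a string literal
```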
Not all code points have associated characters.
1,114,112 code points are available (the range from U+0000 to U+10FFFF), but only 137,929 (as of May 2019) have assigned characters.
Code unit - a bit sequence used to encode each character within a given encoding form.
The character encoding is what transforms abstract code points into physical bits: code units. In other words, the character encoding translates the Unicode code points to unique code unit sequences.
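For example, the same code point becomes different code units under different encoding forms. A quick Python illustration (the three encodings are my choice, picked for contrast):

```python
snowman = "\u2603"  # SNOWMAN, a single code point
# Three 8-bit code units in UTF-8:
print(snowman.encode("utf-8"))      # b'\xe2\x98\x83'
# One 16-bit code unit in UTF-16 (big-endian, no BOM):
print(snowman.encode("utf-16-be"))  # b'&\x03'
# One 32-bit code unit in UTF-32:
print(snowman.encode("utf-32-be"))  # b'\x00\x00&\x03'
```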
As far as we know,
Unicode first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A, 66 stands for B and 9,731 stands for ☃" (seriously, it does). How these code points are actually encoded into bits is a different topic related to character encodings such as UTF-8.
To encode 100,000+ characters we need at least 17 binary digits (2^17 = 131,072), but the English alphabet should be encoded exactly the same way (for backward compatibility): A should still be 65. So if you pad every character to a convenient fixed size, say 32 bits, a string of plain English text costs 32 bits per character: mostly zeros, with only a handful of meaningful bits. This is incredibly wasteful: every English text file takes four times the space on disk. That gives us a list of problems to solve:
- Problem 1. You need to get rid of all those zeros in English text.
- Problem 2. There are a lot of old computers that interpret 8 zeros in a row as NULL, i.e. the end-of-string character. If they see 8 zeros in a row, they just stop listening, so you can't have 8 zeros in a row anywhere.
- Problem 3. It has to be backward compatible. If you send a UTF-encoded string to a system that only supports ASCII, you should still get valid English text.
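The first two problems are easy to see in practice. A small Python demonstration, using UTF-32 as a stand-in for "pad every character to 32 bits":

```python
data = "Hi".encode("utf-32-be")
print(data)       # b'\x00\x00\x00H\x00\x00\x00i' - padded with zero bytes
print(len(data))  # 8 bytes for two characters
# UTF-8 stores the same English text in a quarter of the space:
print(len("Hi".encode("utf-8")))  # 2
```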
To get started, UTF-8 just uses ASCII: if a code point is under 128, it can be expressed in 7 digits and fits in a single byte. So A is encoded the same way:

```
A = 01000001 = 65
```
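You can confirm this compatibility directly in Python: for anything under 128, the UTF-8 bytes and the ASCII bytes are identical.

```python
print("A".encode("utf-8"))   # b'A'
print("A".encode("ascii"))   # b'A' - the exact same byte
print(f"{ord('A'):08b}")     # 01000001
```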
So it's still valid ASCII. Now let's go above that, and as you remember, the result should still be safe for ASCII systems. For this we use the following headers:

- 110 - the start-of-new-character header; two ones mean the character takes two bytes.
- 10 - a byte beginning with 10 is a continuation byte.
So let's take a look at an example:
```
110 x x x x x      10 x x x x x x
(starter header)   (continuation header)
(5 payload bits)   (6 payload bits)
```

So now you can just take all the bits excluding the headers, put them together, and you get the code point:

```
110 00110  10 110010
    00110     110010  ->  00110110010 = 434
```
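Here's the same split done with bit operations in Python, a sketch of the rule above checked against the built-in encoder:

```python
cp = 0b00110_110010                     # code point 434
first = 0b110_00000 | (cp >> 6)         # 110 header + top 5 payload bits
second = 0b10_000000 | (cp & 0b111111)  # 10 header + low 6 payload bits
print(bytes([first, second]))           # b'\xc6\xb2'
print(chr(cp).encode("utf-8"))          # b'\xc6\xb2' - Python agrees
```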
But what about above that?
A header starting with 1110 means that the character takes 3 bytes: one starter and two continuation bytes:

```
1110 x x x x    10 x x x x x x    10 x x x x x x
```
And you can go even higher: the specification goes up to the 1111110x starter (a six-byte sequence in the original design). This hack avoids the waste, it's backward compatible, and it never sends 8 zeros in a row.
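The 1-, 2- and 3-byte forms above can be sketched as one small encoder. This is a simplified illustration (the function name is mine, and the 4-byte form for code points above U+FFFF is omitted), verified against Python's built-in encoder:

```python
def utf8_encode(cp):
    if cp < 0x80:       # fits in 7 bits: a plain ASCII byte
        return bytes([cp])
    if cp < 0x800:      # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:    # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    raise ValueError("code points above U+FFFF need the 4-byte form")

for ch in "A", "ф", "☃":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
print(utf8_encode(0x2603))  # b'\xe2\x98\x83' - the SNOWMAN from earlier
```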
Thanks for reading the post and for your time. If there are any questions, feel free to write a comment below. I know I added a lot of simplifications, but I'm ready to fix them.
Feel free to ask questions, express any opinion, and discuss this from your point of view. Make code, not war. ❤️