UTF-8

Nirmal Patel
Always learning
・2 min read

UTF-8 is a multi-byte, variable-width character encoding for storing Unicode code points, which cover almost all characters used in the world's writing systems.
UTF-8 uses a single byte to store code points 0-127, so English text looks exactly the same in UTF-8 as it does in ASCII.

ASCII

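ASCII represents every character using a number between 0 and 127. The printable characters occupy 32 to 126: space is 32, the letter “A” is 65, etc. The values below 32, plus 127, are control characters. All of this can conveniently be stored in 7 bits.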

Using 7 bits gives 128 possible values from 0000000 to 1111111, so ASCII has enough room for all lower case and upper case Latin letters, along with each numerical digit, common punctuation marks, spaces, tabs and other control characters.
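As a small illustration of the numbers above, here is a sketch in Python that looks up a few ASCII values and prints them as 7-bit binary. The sample characters and the variable names are chosen only for this example.

```python
# Look up the ASCII number for a few characters and show it as 7 binary digits.
samples = [" ", "A", "a", "0", "\t"]
ascii_table = {ch: (ord(ch), format(ord(ch), "07b")) for ch in samples}

for ch, (number, bits) in ascii_table.items():
    print(repr(ch), number, bits)

# Every ASCII value is below 2**7 = 128, so 7 bits are always enough.
all_fit_in_7_bits = all(number < 2**7 for number, _ in ascii_table.values())
print(all_fit_in_7_bits)
```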

ANSI

The so-called ANSI character sets agreed with ASCII below 128, but there were lots of different ways to handle the characters from 128 and up, depending on where you lived. These different systems were called code pages. For example, in Israel DOS used a code page called 862, while Greek users used 737.
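To see why code pages were a problem, the following sketch decodes the same two bytes with Python's built-in `cp862` and `cp737` codecs. The byte values are picked just for illustration: 0x41 is in the shared ASCII range, 0x80 is not.

```python
# The same raw bytes, interpreted under two different DOS code pages.
raw = bytes([0x41, 0x80])

as_hebrew = raw.decode("cp862")  # code page 862 (Hebrew)
as_greek = raw.decode("cp737")   # code page 737 (Greek)

# Byte 0x41 is "A" in both, but byte 0x80 turns into a different letter.
print(as_hebrew, as_greek)
print(as_hebrew[0] == as_greek[0], as_hebrew[1] == as_greek[1])
```

The first byte decodes to the same character everywhere, while the second depends entirely on which code page the reader assumes.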

Unicode

Unicode is a single character set that aims to include every reasonable writing system on the planet.

Characters are represented as code points

Every platonic letter in every alphabet is assigned a magic number by the Unicode Consortium, which is written like this: U+0639.
This magic number is called a code point.
The U+ means “Unicode” and the digits that follow are hexadecimal.
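A quick way to explore code points is Python's `ord` together with the standard `unicodedata` module. The sketch below takes the character behind U+0639 and formats its code point in the usual U+ notation; the variable names are only for this example.

```python
import unicodedata

# The character whose code point is written U+0639.
char = "\u0639"

code_point = ord(char)            # the code point as an integer
label = f"U+{code_point:04X}"     # hexadecimal, padded to at least 4 digits
name = unicodedata.name(char)     # the official Unicode character name

print(char, code_point, label, name)
```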

UTF-8

(8-bit Unicode Transformation Format)

UTF-8 is a system for storing strings of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes.

In UTF-8, every code point from 0-127 is stored in a single byte. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII.
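The ASCII compatibility described above can be checked directly. This is a minimal sketch that encodes an arbitrary English sample string both ways and compares the resulting bytes.

```python
# Encode the same English text as ASCII and as UTF-8.
text = "Hello, UTF-8"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For code points 0-127 the two encodings produce identical bytes.
same_bytes = ascii_bytes == utf8_bytes
print(same_bytes, len(utf8_bytes) == len(text))
```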

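Code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed sequences of up to 6 bytes, but the current standard, RFC 3629, limits UTF-8 to 4.)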

UTF-8 is therefore a multi-byte variable-width encoding. It is multi-byte because a single character like Я takes more than one byte to specify.
It is variable-width because some characters like H take only 1 byte and some take up to 4.
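To make the variable width concrete, here is a sketch of a hand-written encoder for a single code point, compared against Python's built-in `str.encode`. The function name `utf8_encode_code_point` is hypothetical and exists only for this example; in real code you would simply call `encode("utf-8")`.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Byte counts for characters of increasing code point value.
widths = {ch: len(utf8_encode_code_point(ord(ch))) for ch in "HЯ€😀"}
print(widths)

# The sketch should agree with the built-in encoder for each sample.
matches_builtin = all(
    utf8_encode_code_point(ord(ch)) == ch.encode("utf-8") for ch in "HЯ€😀"
)
print(matches_builtin)
```

The leading bits of the first byte tell a decoder how many bytes the character occupies, which is what lets 1-byte and multi-byte characters sit side by side in the same string.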

UTF-8 is universal: it can encode every Unicode code point, so it covers Latin characters as well as Cyrillic, Arabic, Japanese, and more.

References used:

  1. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

  2. https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/
