Hi everyone. There are many file encodings in use and many character sets to choose from. Ideally everyone would be using UTF-8, because it's compact and can represent just about every character out there, but for legacy reasons other formats are still in use.
I'm going to try to explain what each of these character formats looks like in this post, and the strengths and weaknesses of each. I will also give examples in Python, PHP and Ruby. Don't worry, no programming experience required to understand this.
Some character encodings you might be familiar with are:
- UTF-8 (most people's default format)
- Windows-1252 aka CP1252
- Latin-1 (Also known as ISO-8859-1)
- GB2312 (Chinese character set)
- Shift JIS (Japanese)
- Latin-9 aka ISO-8859-15 aka Latin-0
- us-ascii (rarely used)
If we use an analogy to compressing then this topic becomes easier to understand (Since you know that compression makes the file smaller, right?)
When you encode something, it gets "compressed" (well, not actually) into a stream of human-unreadable bytes, just like real compression does. The characters you see on the screen are identified by Unicode code points, which are abstract numbers and can't be written to a file as they are. The process of turning them into a pattern of bytes that can be written to a file is called encoding.
It turns out there are many ways to encode a sequence of characters, just like there are many ways to compress a file (.zip, .tar.gz, .tar.bz2, etc etc). So it is vital that the sequence of bytes is decoded using the same pattern it was encoded with in order to produce the same text we had in the beginning. If we use any other encoding (pattern), it will produce garbage text, commonly called mojibake (Japanese for "character transformation"). Also, if a character you're trying to encode is not in the character set you are encoding with, the encoder cannot represent it: depending on the tool, it either reports an error or writes a substitute such as a question mark.
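Here is a minimal Python sketch of both situations, assuming Python 3. The sample word and the variable names are my own, not from any library. It decodes the same bytes with the right and the wrong encoding, and then tries to encode a character that the target character set does not contain. In Python, at least, that last case raises an error by default.

```python
# Encode a short text as UTF-8, then decode it two ways.
text = "café"
data = text.encode("utf-8")            # the bytes that would be written to a file

same_codec = data.decode("utf-8")      # round-trips to the original text
wrong_codec = data.decode("latin-1")   # each byte becomes one character: mojibake

print(same_codec)
print(wrong_codec)

# A character missing from the target character set cannot be encoded at all.
try:
    "€".encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```

The two-byte UTF-8 sequence for é turns into two separate latin-1 characters, which is the typical shape of mojibake.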
And the process described in the above paragraph is called decoding. It basically turns the byte stream into a bunch of characters. The characters themselves are represented in a program's memory as an array of Unicode code points, unless a library handles this conveniently for the program (which is the recommended way to use Unicode anyway: use an already existing library that can handle it).
Unicode is just a bunch of code points, not an encoding. It is not a way to compress your characters. Unicode is just the list of code points available for the entire internet to type, and you cannot pretend a code point is a byte sequence and expect to get a character from them. The difference between a code point and character is vital.
The encoding, also called a character set, is the list of characters organized by hexadecimal numbers. When actually typed though, the characters become bytes. Fundamentally, characters are abstract. Ints are abstract. Even pointers to stuff are abstract. Anything that's stored in memory and used in a CPU must be bytes and all the abstract stuff is broken down by the compiler or interpreter.
A code point in Unicode is any of the numerical values that make up the Unicode standard. Each assigned code point maps to exactly one character, but not all of them are printable characters; some are formatting marks.
Examples of formatting marks you might have heard of:
- No break space
- Right-to-left mark
- Arabic shaping text
- Soft hyphens
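If you want to look at a few of these marks yourself, here is a small sketch using Python's standard `unicodedata` module. The three code points are my own picks from the list above, and `marks` and `info` are just illustrative names. The general category `Cf` means "format", while `Zs` means "space separator".

```python
import unicodedata

# Three of the formatting marks listed above, as code points.
marks = [0x00A0, 0x200F, 0x00AD]

info = {}
for cp in marks:
    ch = chr(cp)
    # Look up the official character name and its general category.
    info[cp] = (unicodedata.name(ch), unicodedata.category(ch))
    print(f"U+{cp:04X} {info[cp][0]} ({info[cp][1]})")
```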
A character is shorthand for the Unicode character a code point represents.
If you think of a hash table or map or dictionary where the code points are keys and the characters are values, that's Unicode.
Now, if you think of a hash table where byte sequences are keys and the characters are values, you basically have encodings aka character sets.
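The two hash tables can be sketched literally in Python. This is only an illustration of the analogy: the dictionary names are hypothetical and I picked just three characters, whereas the real tables are far larger.

```python
# Unicode as a mapping: code point (a number) -> character.
unicode_map = {cp: chr(cp) for cp in (0x41, 0xE9, 0x20AC)}

# An encoding as a mapping: byte sequence -> character.
# Windows-1252 uses one byte per character; UTF-8 uses one to four.
cp1252_map = {bytes([b]): bytes([b]).decode("cp1252") for b in (0x41, 0xE9, 0x80)}
utf8_map = {ch.encode("utf-8"): ch for ch in unicode_map.values()}

print(unicode_map)
print(cp1252_map)
print(utf8_map)
```

Notice that the euro sign has one code point (U+20AC) but different byte sequences depending on which encoding table you look it up in.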
Why do we need encodings anyway? Why can't we just write the code point in the file and treat them like characters? Even though code points are just numbers, a file full of code points will be too large, twice as large as it should be assuming two-byte characters.
So if the above paragraph was written in Japanese and you try to write all the above code points separated by some other number, the text will be twice as large. That would be very bad for sending across the internet.
This is basically what the UTF-16 encoding does: every character is represented by two bytes or, in rarer cases, four bytes, including the characters on an English keyboard. For mostly-English text that is a tremendous waste of space and bandwidth, and fortunately this encoding never caught on for files and the web (almost, see next section).
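To put numbers on that, here is a quick Python sketch. The sample strings are my own, and I use the `utf-16-le` codec only because it does not add a byte order mark, which keeps the counts clean.

```python
english = "Hello, world"
emoji = "😕"   # U+1F615, outside the 16-bit range

utf8_size = len(english.encode("utf-8"))
utf16_size = len(english.encode("utf-16-le"))      # "-le" avoids adding a BOM
emoji_utf16_size = len(emoji.encode("utf-16-le"))  # needs a surrogate pair

print(utf8_size, utf16_size, emoji_utf16_size)
```

The English text doubles in size in UTF-16, and the emoji is one of the cases that takes four bytes.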
And, it gets even worse for UTF-16: there are actually two ways to read a UTF-16 character*. You could read the highest byte first, or the lowest byte first (the huge big endian vs. little endian debate). In Windows, which uses UTF-16 internally sometimes, the UTF-16 is stored lowest byte first (little endian).
So instead of writing the code points themselves, each code point is transformed into a sequence of one, two, three or four bytes, the transformation process being called encoding. The most well-known encoding is called UTF-8 and uses the same transformation described here. As of this writing, 94.5% of the entire web is using UTF-8 (wikipedia source).
Now if I type a bunch of English text like this, or anything else on an English keyboard, all of those characters will only take up one byte each, because that's how UTF-8 was designed to transform English. Because the vast majority of content uses characters found on an English keyboard, we end up sending far fewer megabytes across the internet per month than we would if English characters took two, three or four bytes.
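Here is a short Python sketch of the one-to-four byte lengths. The four sample characters are my own choice, one for each length; `samples` and `lengths` are illustrative names.

```python
samples = ["A", "é", "€", "😕"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} {ch!r} -> {n} byte(s): {ch.encode('utf-8').hex(' ')}")
```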
This is also the way characters are shuttled within your computer. An operating system can only transfer characters if they are a sequence of bytes. (In the end, everything must become a sequence of bytes, right?) So they are transformed into bytes, very often UTF-8 bytes, when they are stored in memory, passed to a function call, etc.
This is why it's wrong to talk about UTF-8 characters, they're not characters, they're bytes. No encoding transformation makes characters.
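*Addendum that you're not required to learn: U+FEFF is the code point for the byte order mark (BOM for short). In UTF-16 it is written as the first two bytes of a file or byte stream. Just remember:
0xFE 0xFF means big endian and
0xFF 0xFE means little endian. If you have trouble remembering which one is which, like me, maybe this will help you: big endian writes the high byte (FE) first, so the bytes appear in the same order as in the name U+FEFF, while little endian writes them reversed.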
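You can check the byte order of the BOM in Python, whose standard `codecs` module ships both variants as constants. In this sketch, `sniff_utf16` is a hypothetical helper of mine, not a library function, and it only looks at the first two bytes.

```python
import codecs

# The BOM is the single code point U+FEFF; only its byte order differs.
bom_be = "\ufeff".encode("utf-16-be")   # high byte first
bom_le = "\ufeff".encode("utf-16-le")   # low byte first

def sniff_utf16(data: bytes) -> str:
    """Guess the codec name from a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return "unknown"

print(bom_be.hex(" "), bom_le.hex(" "))
print(sniff_utf16(b"\xff\xfeh\x00i\x00"))
```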
So it would all be very great if everyone used UTF-8, right? True, except billions of computers running one particular operating system are not.
Microsoft has a bunch of legacy encodings for Windows, one per language group, whose names start with windows-125, because they predate the widespread adoption of Unicode (we're talking about the 1980s and early 1990s here). If you are reading this on an English copy of Windows then chances are your operating system is using the Windows-1252 encoding for legacy programs. That's why files sometimes look crazy to Windows users, and perhaps the reason why they get boxes of question marks if they try to read Japanese Wikipedia.
NOTE: I should make it clear that Windows actually uses two encodings at the same time: UTF-16 for text entered in Windows itself as well as Unicode-enabled programs and windows-1252 for all the legacy programs that aren't using the Unicode functions. And if you're using a non-English copy of Windows it might be using some other windows-125something variant instead of windows-1252.
ISO is a standards body for making, well, standards. ISO 8859 is its family of standards for 8-bit encodings, and it contains a bunch of encodings under that moniker, such as ISO-8859-1 (latin1), ISO-8859-15 (latin9) and character sets for many other languages. They're all pretty old standards and Unicode should be used instead.
All ISO 8859 character sets (and the windows-125x character sets made by Microsoft) use a single byte per character, so they have at most 256 characters.
This is the Windows-1252 character set table. The decimal number under each character is the Unicode code point, for cross-referencing.
Here is the ISO 8859-1 aka latin1 character set table for comparison (I just thought some people would find it useful). Note that the ISO 8859-1 standard itself does not define any control characters, only the printable ones; its printable characters in the range 0x20 to 0x7E are the same as in us-ascii:
No latin9 table here because hardly anyone uses it.
(All of these tables were copy-pasted from Wikipedia)
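If you would rather compute the differences between the two character sets than eyeball them, here is a Python sketch that decodes every possible byte with both codecs. It relies on Python's built-in `cp1252` and `latin-1` codecs, so the result reflects Python's mapping tables; the variable names are mine.

```python
# Compare how every possible byte decodes in the two character sets.
differences = {}
undefined = []
for b in range(256):
    raw = bytes([b])
    latin1_char = raw.decode("latin-1")   # latin-1 decodes every byte
    try:
        cp1252_char = raw.decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(b)               # byte has no meaning in cp1252
        continue
    if cp1252_char != latin1_char:
        differences[b] = cp1252_char

print(len(differences), "bytes decode differently")
print("".join(differences.values()))
print("undefined in cp1252:", [hex(b) for b in undefined])
```

All of the differences fall in the 0x80 to 0x9F range, which is where Windows-1252 puts the euro sign and the curly quotes.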
- The big letters and symbols in the table are Unicode characters
- the small numbers below the letters are code points
- Decoding can only be done on byte streams
- Encoding can only be done on characters
- Unicode is not an encoding
If you see any function that claims to decode characters, then it's wrong, it's probably encoding them in another character set instead.
Most of the time, you don't need to worry about encoding intricacies like the ones in this post when typing Unicode characters in an app. And if the app is not displaying the characters you type (or copy and paste from a character map) correctly, the app is the one that's not decoding the byte stream properly. It's not that you picked characters from the wrong category in your character map app, because ultimately everything in the clipboard is encoded in UTF-8 (or whatever encoding your operating system is using).
US-ASCII was made in 1963 (it was revised several times afterwards, most recently in 1986, and that is the one you know now) for the purpose of transmitting English text across telegraphs, which probably explains why there are so few characters. It has been superseded by the many other encodings listed above.
You will rarely need an encoder or decoder specifically for this character set, as the internet and all the major operating systems are using either UTF-8 or Windows-1252, and both of those encode the ASCII characters with the same single bytes.
I said I would provide examples in Python, PHP and Ruby. Now it's time to make good that promise, although there isn't really much to demonstrate here besides the process of encoding and decoding text so don't expect to see snippets of a fancy webapp here. The examples are limited to re-encoding byte streams.
When decoding in software development, you must decode using the same encoding you used to encode. Otherwise, as explained above, it won't decode properly. Hence, don't do:
    # This is Python code
    # You are looking at Unicode characters
    some_string = "ŠpŠóŠb Python €€šÏ¥Î¬JŠ³ªº"

    # This will produce mojibake
    some_string.encode('utf-8').decode('latin1')
    # 'Å\xa0pÅ\xa0Ã³Å\xa0b Python â\x82¬â\x82¬Å¡Ã\x8fÂ¥Ã\x8eÂ¬JÅ\xa0Â³ÂªÂº'

    some_string.encode('utf-8').decode('windows-1252')
    # Traceback (most recent call last):
    #   File "<stdin>", line 1, in <module>
    #   File "/usr/lib/python3.8/encodings/cp1252.py", line 15, in decode
    #     return codecs.charmap_decode(input,errors,decoding_table)
    # UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position
    # 27: character maps to <undefined>
In Python, you cannot decode a character string, because it's not a sequence of bytes! Only byte strings can be decoded. Similarly, byte strings cannot be encoded because only character strings can be encoded. Remember that not all code points are characters, some are formatting marks. (In the operating system, the file is stored as bytes, and that can't be re-encoded)
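Python 3 enforces this rule in its types, which a short sketch can show. The sample word and the variable names are my own: `str` objects simply have no `decode` method and `bytes` objects have no `encode` method.

```python
text = "naïve"                 # str: a sequence of code points
data = text.encode("utf-8")    # bytes: what actually gets written to a file

str_can_decode = hasattr(text, "decode")    # str has nothing to decode
bytes_can_encode = hasattr(data, "encode")  # bytes are already encoded

print(type(text).__name__, len(text), str_can_decode)
print(type(data).__name__, len(data), bytes_can_encode)
```

The lengths differ because `len()` counts code points for `str` and bytes for `bytes`.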
Do this instead:
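    byte_stream = some_string.encode('utf-8')
    # ... many files later...
    # latin1 has no € or Š, so errors='replace' writes '?' for them
    # instead of raising UnicodeEncodeError.
    latin1_bytes = byte_stream.decode('utf-8').encode('latin1', errors='replace')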
PHP's manual has the following to say about its utf8_encode() and utf8_decode() functions:
Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers will interpret ISO-8859-1 web pages as Windows-1252. Windows-1252 features additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), instead of certain ISO-8859-1 control characters. This function will not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.
So your web browser will never decode with a latin1 encoding but will use windows-1252 for decoding the bytestream instead!
Also, the names utf8_encode() and utf8_decode() are misnomers. utf8_encode() converts a latin1 string to UTF-8 and utf8_decode() converts a UTF-8 string to latin1 (a reminder: latin1 is ISO-8859-1).
This is how you re-encode stuff in PHP. Since the string has already been encoded in another character set, it must be decoded by the same character set and then encoded again in the target character set:
    <?php
    /* Convert ISO-8859-1 to UTF-8 */
    $utf8_text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    ?>
mb_convert_encoding() will write a question mark whenever it encounters an illegal character for the target encoding. If instead of writing a question mark you want it to skip the character entirely, put mbstring.substitute_character = "none" in your php.ini file, or set it at runtime:
<?php ini_set('mbstring.substitute_character', "none"); ?>
That's about all you get from the PHP builtin functions, I'm afraid.
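For comparison, Python offers the same three behaviours through the `errors` argument of `str.encode()`. This is a sketch with a sample sentence of my own, not a PHP equivalent line for line: `replace` mirrors the question mark substitution and `ignore` mirrors the "none" setting.

```python
text = "10 € for a café"

# '€' does not exist in ISO-8859-1, so each handler treats it differently.
replaced = text.encode("latin-1", errors="replace")   # substitutes '?'
skipped = text.encode("latin-1", errors="ignore")     # drops the character

try:
    text.encode("latin-1")                            # default is errors="strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced)
print(skipped)
print(strict_failed)
```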
In Ruby you use the encode() method to re-encode a character string (remember that encode means converting to a byte stream!), which amounts to decoding the string and then encoding it again using another character set.
There is also a force_encoding() method, which does not convert anything: it keeps the bytes exactly as they are and only changes which encoding Ruby assumes they are in. Use it only when a string carries the wrong label. The below example demonstrates both, including the gibberish you get if you try to print the result of encode() on a terminal that expects UTF-8:

    # These are Unicode characters...
    x = "łał"
    # That are using the UTF-8 encoding by default.
    puts x.encoding          # => UTF-8
    puts x                   # => łał
    puts x.bytes.inspect     # => [197, 130, 97, 197, 130]

    # Now it's using the UTF-16 encoding (the bytes change)...
    utf16 = x.encode "UTF-16"
    # Contents: U+FEFF (the BOM), then ł, a, ł.
    # A UTF-8 terminal will show this as gibberish because it is
    # a UTF-16 byte stream masquerading as a UTF-8 one.
    puts utf16
    puts utf16.bytes.inspect # => [254, 255, 1, 66, 0, 97, 1, 66]

    # force_encoding only relabels the bytes, in place.
    # ASCII-8BIT means raw binary, not us-ascii.
    z = x.force_encoding "ASCII-8BIT"
    puts z.bytes.inspect     # => [197, 130, 97, 197, 130]
    p z                      # => "\xC5\x82a\xC5\x82"
    # Putting the correct label back restores the text.
    z = x.force_encoding "UTF-8"
    puts z                   # => łał
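The same distinction between converting bytes and merely reading them under another label can be sketched in Python, which is handy for cross-checking the byte values above. The variable names are mine. I prepend the big-endian BOM by hand because Python's plain `utf-16` codec picks the byte order of the machine it runs on.

```python
import codecs

text = "łał"
utf8_bytes = text.encode("utf-8")

# Transcoding: decode, then encode again. The bytes change, the text does not.
utf16_bytes = codecs.BOM_UTF16_BE + utf8_bytes.decode("utf-8").encode("utf-16-be")

# Relabelling: the same bytes read under a different encoding. The text changes.
relabelled = utf8_bytes.decode("latin-1")

print(list(utf8_bytes))
print(list(utf16_bytes))
print(repr(relabelled))
```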
OK I have to admit, I groveled to this one because there are a bunch of JS devs here and a lot of JS developers are going to read this, so of course I have to make something comprehensible for them right?
These examples will be based off the node.js runtime.
Web JavaScript historically didn't have much in the way of encoding functions. The closest you get are escape() and unescape(), which don't actually encode any UTF-8 at all: escape() returns a string of %-escapes instead of bytes, and unescape() turns that back into characters. (Modern browsers also provide TextEncoder and TextDecoder.) But if you are using Node.js, you are able to use a Buffer object to convert between encodings.
The Buffer object has a method called Buffer.from(string[, encoding]). This encodes, and it actually encodes the string, not convert it to another character set like some of the above functions do (looking at you PHP), so anyway - it encodes the character string into a byte stream.
The second argument is the encoding that string is already in. You can even use 'hex' which will interpret the string as a bunch of characters in hexadecimal form (although this is not a proper encoding).
Thanks to the wonders of the Node.js documentation and Github, I was able to quickly view the source code part which contained the supported encodings. The encoding strings that you can pass as the second argument are:
- 'utf-8' (obviously)
- 'base64' (this is not a character set, see Wikipedia)
- 'hex' (this is not a character set, see Wikipedia)
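Node.js also accepts 'utf16le' (alias 'ucs2'), 'latin1' (alias 'binary') and 'ascii'. Encodings outside this built-in set, such as Windows-1252 or Shift JIS, are not supported by Buffer.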
It's used like this:
    // Buffer.from() encodes the byte stream.
    const buf1 = Buffer.from('this is a tést');
    const buf2 = Buffer.from('7468697320697320612074c3a97374', 'hex');

    // Take buf1 and buf2 and send them across the internet somewhere
    // ...

    // toString() decodes the byte stream
    console.log(buf1.toString());
    // Prints: this is a tést
    console.log(buf2.toString());
    // Prints: this is a tést

    // Note that é cannot be encoded in the ascii character set.
    console.log(buf1.toString('ascii'));
    // Prints: this is a tC)st
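If you want to double-check that hex string without Node.js, here is a rough Python counterpart. It is a sketch, not a port of Buffer: the variable names mirror the ones above, and the last line emulates what Node's documentation describes for 'ascii' decoding, namely clearing the highest bit of each byte.

```python
buf1 = "this is a tést".encode("utf-8")
buf2 = bytes.fromhex("7468697320697320612074c3a97374")

decoded1 = buf1.decode("utf-8")
decoded2 = buf2.decode("utf-8")

# Node's 'ascii' decoding clears the high bit of every byte; emulate that here.
ascii_view = bytes(b & 0x7F for b in buf1).decode("ascii")

print(decoded1)
print(decoded2)
print(ascii_view)
```

The two bytes of é (0xC3 0xA9) lose their high bit and come out as `C` and `)`, which is where the odd `tC)st` comes from.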
- Windows: UTF-16 and Windows-1252
- MacOS: UTF-8 (see below for how to check)
- Linux: Probably UTF-8 (I might be wrong, see below)
- Linux or MacOS: Open a terminal and type locale, and inspect the value of LANG. It looks something like en_US.UTF-8. The text after the dot is the character set in use. You can also type printenv LANG which will give you the locale all of your programs are using. And to change the default locale you can put export LANG=insert new locale here in your .bashrc (or the startup script in case your shell is not bash).
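You can also ask Python what it picked up from the operating system. This sketch only uses standard library calls, but the values depend entirely on your machine and settings, so I'm not showing any expected output.

```python
import locale
import sys

# Encoding Python derives from the locale (normally the default for text files).
preferred = locale.getpreferredencoding(False)

# Encoding used to convert file names to and from bytes.
filesystem = sys.getfilesystemencoding()

# Encoding of the terminal or pipe that output is written to (may be None).
stdout_encoding = sys.stdout.encoding

print("preferred:", preferred)
print("filesystem:", filesystem)
print("stdout:", stdout_encoding)
```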
It was probably written in Microsoft Word. When you type a quote in Word, it gets replaced by the smart quote character (’). In UTF-8 that character is a three-byte sequence, and those three bytes, read one at a time as windows-1252, amount to â, € and ™. So when the text is read by an email client or browser that decodes it as windows-1252 or one of the ISO-8859 character sets instead of UTF-8, it prints those three characters in place of the smart quote.
Solution: Tell the sender to follow these instructions to turn off smart quotes.
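The garbling is easy to reproduce in Python. This is a minimal sketch with variable names of my own. The last step shows that the damage can sometimes be undone by reversing the wrong decode, assuming no bytes were lost along the way.

```python
smart_quote = "\u2019"                      # RIGHT SINGLE QUOTATION MARK
utf8_bytes = smart_quote.encode("utf-8")    # three bytes in UTF-8

# Decoding those UTF-8 bytes as Windows-1252 gives three characters.
garbled = utf8_bytes.decode("cp1252")

# If no bytes were lost, reversing the wrong step recovers the original.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(" "))
print(garbled)
print(repaired)
```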
Any Unicode discussion is going to be long though. 😕
- Decoding can only be done on byte streams
- Encoding can only be done on characters
- Unicode is not an encoding
- Unicode is code points
- UTF-8 and friends are encodings
- UTF-8 and friends are byte streams not characters
I hope you've learned more about how encoding works after reading this! In fact, I learned a lot just by writing this post.
Now here are some resources you should NOT read, because if I did a good job demystifying this topic, why would you need to read another article? 😉
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - I am pleased to announce that I did not read this article during the writing of this post. 😉
- What every programmer absolutely, positively needs to know about encodings and character sets to work with text
- How Unicode Works: What every developer needs to know about strings and 🦄 - Had to Inspect Element this one to get the unicorn emoji.
- Unicode and You
If you find any errors, let me know so I can correct them.
See you next time.