UTF-8 Byte Order Mark

#unicode #utf #raku

In previous post of this series I explained that UTF is a multi byte encoding that also has few variants: UTF-8, UTF-16 and UTF-32. To make things more complicated in UTF-16 and UTF-32 there are two ways to send bytes of single code point - in big endian or little endian order.

BTW: Endianness term is not related to Indians. It comes form Gulliver's Travels book. There was a law in Lilliputians world that forced citizens to break boiled eggs from little end. Those who rebelled and were breaking eggs from big end were called "big endians".

What is Byte Order Mark?

To notify which byte order is in processed file or data stream a special sequence of bytes at the beginning was introduced, called Byte Order Mark. Or BOM for short.

For example UTF-16 can start with 0xFE 0xFF for big endian and 0xFF 0xFE for little endian order. And UTF-32 can start with 0x00 0x00 0xFE 0xFF for big endian and 0xFF 0xFE 0x00 0x00 for little one.

Impact on UTF-8

Here things gets weird. UTF-8 is constructed in such a way, that it has only one meaningful byte order, because first byte describes how many bytes will follow to get code point value.

However BOM specification has magic sequence for UTF-8, which is 0xEF 0xBB 0xBF. It only indicates encoding type, therefore has no big endian / little endian variants.

Implications

BOM idea may sound weird today, because UTF-8 became prevalent and dominant. But remember that we are talking about year 2000, when things were not that obvious.

Spec claims that if a protocol always uses UTF-8 or has some other way to indicate what encoding is being used, then it should not use BOM. So for example BOM should not appear in *.xml files:

<?xml version="1.0" encoding="UTF-8"?>
<tag>...

Or in MIME *.eml files:

--3e6ea2aa592cb31d47cefca38727f872
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="UTF-8"

Because those specify encoding internally. Unfortunately this is sometimes ignored, so if something broke your parser and you cannot find obvious error - look if file has UTF-8 BOM:

$ raku -e 'say "file.txt".IO.open( :bin ).read( 3 ) ~~ Buf.new(0xEF, 0xBB, 0xBF)'

True

Security issues

But what if BOM is not aligned with internal/assumed encoding? Let's create following file:

$ raku -e '
spurt "file.txt",
    Buf.new( 0xFE, 0xFF, 0x3c, 0x73, 0x63, 0x72, 0x69, 0x70, 0x74, 0x3e )
'

Now you upload this to some service. This service has validator that respects BOM and should strip all HTML tags. Validator sees nonsense but perfectly legal content that passes validation:

Later this service opens and displays uploaded file, but it ignores BOM and assumes UTF-8:

Oooops! If you trusted validator and displayed this file without proper HTML escaping then you have JavaScript injection. This happened because 㱳捲楰琾 in UTF-16 suggested by BOM has the same byte sequence as <script> in assumed UTF-8.

Conclusions

You should still be aware of existence of Byte Order Mark, even if it makes zero sense in UTF-8 dominated world today.

Coming up next: UTF-8 in MySQL.

DEV Community

UTF-8 Byte Order Mark

Top comments (0)

Read next

AI Language Models Struggle to Read Scrambled Words, Unlike Humans

Swiss Legal Translation Benchmark Tests AI Models Across 4 Languages in Official Documents

SampleMix: New Data Strategy Boosts Language Models with 50% Less Training Data

AI Creates Natural Human Animations from Text Using Advanced Neural Networks