I realised today that I hadn't published some notes I made about how:
- Python, and
- its "os" module
handle unspecified character encodings.
This was something I had to tackle when getting my program Foldatry to handle old file systems.
Note that this was only about file and folder names; I've not yet checked how it goes for file contents.
My decisions were:
- to let the Python os module get names from the file system and put them into a Python Unicode string - i.e. that I wouldn't write code to tell it how to interpret the encoding of the names until after it had stored them in Python variables;
- to write the following function for use in displaying path-file-names (to the terminal or to screen widgets) - this was vital because otherwise the program would simply crash.
def pfn4print( p_pfn ):
    return p_pfn.encode('utf-8', 'surrogateescape').decode('cp1252', 'backslashreplace')
Note that cp1252 is the Windows-1252 superset of the ISO-8859-1 encoding standard.
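As a quick illustration of why the function matters, here is a sketch of it in use. The byte string is a stand-in for what the os module would hand back for the filename in Example 1 further down - it is not read from a real file system:
>>> def pfn4print( p_pfn ):
...     return p_pfn.encode('utf-8', 'surrogateescape').decode('cp1252', 'backslashreplace')
...
>>> bad_name = b'f\xfcr'.decode('utf-8', 'surrogateescape')   # mimics what os.listdir() would give us
>>> bad_name
'f\udcfcr'
>>> print( pfn4print( bad_name ) )   # printing bad_name directly would typically raise UnicodeEncodeError on a UTF-8 terminal
für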
The rationale for that code is:
- the Python os module will by-default have put the bytes it got from the file system into a Python Unicode string;
- we then want to turn that back into the original byte sequence, which is achieved by encode('utf-8', 'surrogateescape')
- this works because the documentation for the os module says that's how the Python Unicode string was made in the first place, i.e. it used 'surrogateescape'
Then for display it is using the Windows-1252 codec for two reasons:
- it was the most common encoding for a long time: a longer run than the earlier MS-DOS code pages had, and than the various other language-specific sets;
- because it has representations for nearly all the 256 characters of the 8-bit byte patterns, so it should usually show something even if it's not what was originally intended.
The alternative to this would have been to always handle all the path and filenames as bytestrings, and that seemed too much work.
Now if you know lots about Unicode, you may know that not all bytes sequences are valid in its standard encodings.
So what does the Python os module do when it encounters these (i.e. byte sequences that it wasn't told how to decode)?
Well, this is where 'surrogateescape' comes in, as this effectively has Python store, for each bad byte, both:
- the fact that it wasn't valid; and
- what its value was.
Which is why encode('utf-8', 'surrogateescape') gives us back the original bytes.
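Here's a minimal sketch of that round trip, using a single byte (0xFC) that isn't valid UTF-8 on its own:
>>> raw = b'\xfc'                                 # a byte that is not valid UTF-8 by itself
>>> s = raw.decode('utf-8', 'surrogateescape')    # what the os module effectively does for us
>>> s
'\udcfc'
>>> s.encode('utf-8', 'surrogateescape') == raw   # re-encoding the same way restores the original byte
True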
Now I do grant that this seems a bit "magic". You certainly could look deeper into how:
- Python handles Unicode internally;
- as well as all the different encodings it knows about;
- and all the options available as it encodes and decodes;
but my guess is that you can mostly get by without going that deep, and just trust "surrogateescape" to do its trick.
Note that this is all just "the program doesn't crash" handling. Ultimately, correct handling of an unspecified character encoding is a matter of working out which encoding was/is valid for it. There are tools which will take a good go at doing that, but being sure of it requires human judgement - essentially because it got there by human misadventure.
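As an aside, one such tool is the third-party chardet package. This is a sketch only, and it's an assumption on my part that it suits your data (charset-normalizer is a similar alternative):
import chardet                       # third-party: pip install chardet

raw_name = b'f\xfcr Elise'           # bytes of unknown encoding (a made-up example)
guess = chardet.detect( raw_name )   # returns a dict with 'encoding' and 'confidence' keys
print( guess['encoding'], guess['confidence'] )   # the guess may be weak, or even None, for short inputs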
p.s. here's a function that converts a string to a list of its Unicode ordinals (i.e. the sequence of code points). Handy for checking what Unicode string Python really thinks it has.
def string_as_list_of_code_points( p_str ):
    return list( ord(a_char) for a_char in p_str )
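For example, with that function pasted into a session:
>>> string_as_list_of_code_points( "für" )
[102, 252, 114]
>>> string_as_list_of_code_points( b'f\xfcr'.decode('utf-8', 'surrogateescape') )
[102, 56572, 114]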
Example 1
This comes from a CD-ROM that I'd made back in the Windows 98 era. As a consequence, the filenames definitely were not done as Unicode.
Our example filename here is:
- für
where that letter in the middle is the "LATIN SMALL LETTER U WITH DIAERESIS"
Encodings
On the CD-ROM the encoding is the sequence of these three bytes:
- 0x66 0xFC 0x72
Read into Python 3 - via the os module - it becomes this sequence of Unicode code points:
- U+0066 U+FFFD!!FC U+0072
If we succeed with conversion - see below - then Python holds this sequence of Unicode "code points":
- U+0066 U+00FC U+0072
If Python writes that out to UTF-8 it becomes these four bytes:
- 0x66 0xC3 0xBC 0x72
being: a one-byte character, a two-byte character, and a one-byte character.
My understanding is that Windows NT internally uses "UCS-2", which in bytes would be:
- 0x00 0x66 0x00 0xFC 0x00 0x72
being three two-byte characters.
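If you want to check those byte sequences for yourself, Python's codecs reproduce them. Here I'm using utf-16-be as a stand-in for UCS-2, which works for characters like these that sit in the Basic Multilingual Plane (it's shown big-endian to match the table above; Windows itself stores the bytes little-endian):
>>> "für".encode( "utf-8" ).hex()
'66c3bc72'
>>> "für".encode( "utf-16-be" ).hex()
'006600fc0072'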
See:
- LATIN SMALL LETTER U WITH DIAERESIS
- from the Unicode group: Latin-1 Supplement
Python sequence
Okay, let's see how that works in a Python interactive session. The following was clipped over from a terminal window.
$ python3
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s_0 = "für"
>>> print( list( ord(a_char) for a_char in list(s_0) ) )
[102, 252, 114]
>>> ba_1 = s_0.encode( "utf-8", "surrogateescape")
>>> print( list( ba_1 ) )
[102, 195, 188, 114]
>>> ba_2 = s_0.encode("cp1252", "backslashreplace")
>>> print( list( ba_2 ) )
[102, 252, 114]
>>> s_1 = ba_2.decode("utf-8", "surrogateescape")
>>> print( list( ord(a_char) for a_char in list(s_1) ) )
[102, 56572, 114]
>>> ba_3 = s_1.encode( "utf-8", "surrogateescape")
>>> print( list( ba_3 ) )
[102, 252, 114]
>>> s_2 = ba_3.decode("cp1252")
>>> print( list( ord(a_char) for a_char in list(s_2) ) )
[102, 252, 114]
>>> print( s_2)
für
To recap that relative to the earlier commentary:
- ba_2 is the way this string would have been on the CD-ROM
- s_1 is what happens when we ask the Python os module to read ba_2 without specifying an encoding - so it presumes UTF-8 but then has to cope with the fact that 252 is not valid as the first byte of a UTF-8 character, so os stores it as a surrogate - we can see how it is properly encoded as UTF-8 in ba_1
Depending on your familiarity with code points and encoding, there might be two surprises in that:
- that UTF-8 encoding of three characters required four bytes - essentially because the "ü" required two bytes
- the matter of "things that wouldn't be valid in UTF-8" then being handled in a strange way, but that can be reversed
Similarly, something that might not catch your attention is:
- that this example is still simple because the archaic Windows-1252 byte encoding for "ü" matches its Unicode code point number: 252
Oh, and:
- I don't actually remember where I got the notation U+FFFD!!FC from - I might have even made it up, as a hybrid way to say "will show as the Unicode Replacement Character but knows it came from hex FC"
- If you look at the Python session, you'll see the code point in the middle of s_1 is decimal 56572, which is 0xDCFC in hexadecimal - see the check just below
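That 0xDCxx pattern is no coincidence: 'surrogateescape' stores an undecodable byte as the low surrogate U+DC00 plus the byte's value (as per PEP 383), which you can check directly:
>>> hex( 56572 )
'0xdcfc'
>>> 0xDC00 + 0xFC
56572
>>> b'\xfc'.decode( 'utf-8', 'surrogateescape' ) == chr( 0xDC00 + 0xFC )
True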
Example 2
So now let's look at an example where the Windows-1252 encoding byte for a character does not match its Unicode code point number:
- the euro symbol, as in the string €30
- in Windows-1252 this is encoded as 0x80
- in Unicode this is code point U+20AC (decimal 8364)
- (in case you're wondering, this difference is because Microsoft decided to use places in the "C1 control codes" range, that Unicode chose to leave alone)
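Both of those numbers are easy to confirm in a Python session:
>>> "€".encode( "cp1252" )
b'\x80'
>>> hex( ord( "€" ) )
'0x20ac'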
Python sequence
Here is the same sequence of actions as we did for Example 1 but this time with a short string starting with the Euro symbol.
>>> s_0 = "€30"
>>> print( list( ord(a_char) for a_char in list(s_0) ) )
[8364, 51, 48]
>>> ba_1 = s_0.encode( "utf-8", "surrogateescape")
>>> print( list( ba_1 ) )
[226, 130, 172, 51, 48]
>>> ba_2 = s_0.encode("cp1252", "backslashreplace")
>>> print( list( ba_2 ) )
[128, 51, 48]
>>> s_1 = ba_2.decode("utf-8", "surrogateescape")
>>> print( list( ord(a_char) for a_char in list(s_1) ) )
[56448, 51, 48]
>>> ba_3 = s_1.encode( "utf-8", "surrogateescape")
>>> print( list( ba_3 ) )
[128, 51, 48]
>>> s_2 = ba_3.decode("cp1252")
>>> print( list( ord(a_char) for a_char in list(s_2) ) )
[8364, 51, 48]
>>> print( s_2)
€30
Things to note this time:
- ba_1 is the UTF-8 encoding - note that the euro symbol requires three bytes
- ba_2 is the way this string would have been in Windows 98 et al
- s_1 is what happens when we ask the Python os module to read ba_2 without specifying an encoding - so it presumes UTF-8 but then has to cope with what it gets - we still have the matter of "things that wouldn't be valid in UTF-8" being handled in a strange way, but that can be reversed
- you'll see that this time the code point at the front of s_1 is decimal 56448, which is 0xDC80 in hexadecimal
End note
I should perhaps add that I'm not claiming the methods shown here are either complete or bulletproof. This is merely what I added to my program to cope with what it was encountering. I do plan to revisit all this at some later date, but there are a lot of other important features that will come first. So, not "real soon now" at all.