I realised today that I hadn't published some notes I made about how:
- Python, and
- its "os" module
handle unspecified character encodings.
This was something I had to tackle when getting my program Foldatry to handle old file systems.
Note that this was only about file and folder names; I've not yet checked how it goes for file contents.
My decisions were:
- to let the Python os module get names from the file system and put them into a Python Unicode string - i.e. that I wouldn't write code to tell it how to interpret the encoding of the names until after it had stored them in Python variables;
- to write the following function for use in displaying path-file-names (to the terminal or to screen widgets) - this was vital because otherwise the program would simply crash.
def pfn4print( p_pfn ):
    return p_pfn.encode('utf-8', 'surrogateescape').decode('cp1252', 'backslashreplace')
Note that cp1252 is the Windows-1252 superset of the ISO-8859-1 encoding standard.
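As a quick illustration of why the function matters, here is a sketch of it in use. The byte string is a stand-in for what the os module would hand back for the filename in Example 1 further down - it is not read from a real file system:
>>> def pfn4print( p_pfn ):
...     return p_pfn.encode('utf-8', 'surrogateescape').decode('cp1252', 'backslashreplace')
...
>>> bad_name = b'f\xfcr'.decode('utf-8', 'surrogateescape')   # mimics what os.listdir() would give us
>>> bad_name
'f\udcfcr'
>>> print( pfn4print( bad_name ) )   # printing bad_name directly would typically raise UnicodeEncodeError on a UTF-8 terminal
für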
The rationale for that code is:
- the Python os module will by-default have put the bytes it got from the file system into a Python Unicode string;
- we then want to turn that back into the original byte sequence, which is achieved by encode('utf-8', 'surrogateescape')
- this works because the documentation for the os module says that's how the Python Unicode string was made in the first place, i.e. it used 'surrogateescape'
Then for display it is using the Windows-1252 codec for two reasons:
- it was the most common encoding for a long time: a longer run than the earlier MS-DOS code pages had, and than the various other language-specific sets;
- because it has representations for nearly all the 256 characters of the 8-bit byte patterns, so it should usually show something even if it's not what was originally intended.
The alternative to this would have been to always handle all the path and filenames as bytestrings, and that seemed too much work.
Now if you know lots about Unicode, you may know that not all bytes sequences are valid in its standard encodings.
So what does the Python os module do when it encounters these (i.e. byte sequences that it wasn't told how to decode)?
Well, this is where 'surrogateescape' comes in, as this effectively has Python store, for each bad byte, both:
- the fact that it wasn't valid; and
- what its value was.
Which is why encode('utf-8', 'surrogateescape') gives us back the original bytes.
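Here's a minimal sketch of that round trip, using a single byte (0xFC) that isn't valid UTF-8 on its own:
>>> raw = b'\xfc'                                 # a byte that is not valid UTF-8 by itself
>>> s = raw.decode('utf-8', 'surrogateescape')    # what the os module effectively does for us
>>> s
'\udcfc'
>>> s.encode('utf-8', 'surrogateescape') == raw   # re-encoding the same way restores the original byte
True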
Now I do grant that this seems a bit "magic". You certainly could look deeper into how:
- Python handles Unicode internally;
- as well as all the different encodings it knows about;
- and all the options available as it encodes and decodes;
but my guess is that you can mostly get by without going that deep, and just trust "surrogateescape" to do its trick.
Note that this is all just "the program doesn't crash" handling. Ultimately, correct handling of an unspecified character encoding is a matter of working out which encoding was/is valid for it. There are tools which will take a good go at doing that, but being sure of it requires human judgement - essentially because it got there by human misadventure.
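As an aside, one such tool is the third-party chardet package. This is a sketch only, and it's an assumption on my part that it suits your data (charset-normalizer is a similar alternative):
import chardet                       # third-party: pip install chardet

raw_name = b'f\xfcr Elise'           # bytes of unknown encoding (a made-up example)
guess = chardet.detect( raw_name )   # returns a dict with 'encoding' and 'confidence' keys
print( guess['encoding'], guess['confidence'] )   # the guess may be weak, or even None, for short inputs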
p.s. here's a function that converts a string to a list of its Unicode ordinals (i.e. the sequence of code points). Handy for checking what Unicode string Python really thinks it has.
def string_as_list_of_code_points( p_str ):
    return list( ord(a_char) for a_char in p_str )
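For example, with that function pasted into a session:
>>> string_as_list_of_code_points( "für" )
[102, 252, 114]
>>> string_as_list_of_code_points( b'f\xfcr'.decode('utf-8', 'surrogateescape') )
[102, 56572, 114]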
Example 1
This comes from a CD-ROM that I'd made back in the Windows 98 era. As a consequence, the filenames definitely were not done as Unicode.
Our example filename here is:
- für
where that letter in the middle is the "LATIN SMALL LETTER U WITH DIAERESIS"
Encodings
On the CD-ROM the encoding is the sequence of these three bytes:
- 0x66 0xFC 0x72
Read into Python 3 - via the os module - it becomes this sequence of Unicode code points:
- U+0066 U+FFFD!!FC U+0072
If we succeed with conversion - see below - then Python holds this sequence of Unicode "code points":
- U+0066 U+00FC U+0072
If Python writes that out to UTF-8 it becomes these four bytes:
- 0x66 0xC3 0xBC 0x72
being: a one-byte character, a two-byte character, and a one-byte character.
My understanding is that Windows NT internally uses "UCS-2", which in bytes would be:
- 0x00 0x66 0x00 0xFC 0x00 0x72
being three two-byte characters.
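If you want to check those byte sequences for yourself, Python's codecs reproduce them. Here I'm using utf-16-be as a stand-in for UCS-2, which works for characters like these that sit in the Basic Multilingual Plane (it's shown big-endian to match the table above; Windows itself stores the bytes little-endian):
>>> "für".encode( "utf-8" ).hex()
'66c3bc72'
>>> "für".encode( "utf-16-be" ).hex()
'006600fc0072'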
See:
- LATIN SMALL LETTER U WITH DIAERESIS
- from the Unicode group: Latin-1 Supplement
Python sequence
Okay, let's see how that works in a Python interactive session. The following was clipped over from a terminal window.
$ python3
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s_0 = "für"
>>> print( list( ord(a_char) for a_char in list(s_0) ) )
[102, 252, 114]
>>> ba_1 = s_0.encode( "utf-8", "surrogateescape")
>>> print( list( ba_1 ) )
[102, 195, 188, 114]
>>> ba_2 = s_0.encode("cp1252", "backslashreplace")
>>> print( list( ba_2 ) )
[102, 252, 114]
>>> s_1 = ba_2.decode("utf-8", "surrogateescape")
>>> print( list( ord(a_char) for a_char in list(s_1) ) )
[102, 56572, 114]
>>> ba_3 = s_1.encode( "utf-8", "surrogateescape")
>>> print( list( ba_3 ) )
[102, 252, 114]
>>> s_2 = ba_3.decode("cp1252")
>>> print( list( ord(a_char) for a_char in list(s_2) ) )
[102, 252, 114]
>>> print( s_2)
für
To recap that relative to the earlier commentary:
- ba_2 is the way this string would have been on the CD-ROM
- s_1 is what happens when we ask the Python os module to read ba_2 without specifying an encoding - so it presumes UTF-8 but then has to cope with the fact that 252 is not valid as the first byte of a UTF-8 character, so os stores it as a surrogate - we can see how it is properly encoded as UTF-8 in ba_1
Depending on your familiarity with code points and encoding, there might be two surprises in that:
- that UTF-8 encoding of three characters required four bytes - essentially because the "ü" required two bytes
- the matter of "things that wouldn't be valid in UTF-8" then being handled in a strange way, but that can be reversed
Similarly, something that might not catch your attention is:
- that this example is still simple because the archaic Windows-1252 byte encoding for "ü" matches its Unicode code point number: 252
Oh, and:
- I don't actually remember where I got the notation U+FFFD!!FC from - I might have even made it up, as a hybrid way to say "will show as the Unicode Replacement Character but knows it came from hex FC"
- If you look at the Python session, you'll see the code point in the middle of s_1 is decimal 56572, which is 0xDCFC in hexadecimal - see the check just below
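That 0xDCxx pattern is no coincidence: 'surrogateescape' stores an undecodable byte as the low surrogate U+DC00 plus the byte's value (as per PEP 383), which you can check directly:
>>> hex( 56572 )
'0xdcfc'
>>> 0xDC00 + 0xFC
56572
>>> b'\xfc'.decode( 'utf-8', 'surrogateescape' ) == chr( 0xDC00 + 0xFC )
True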
Example 2
So now let's look at an example where the Windows-1252 encoding byte for a character does not match its Unicode code point number:
- the euro symbol, as in the string €30
- in Windows-1252 this is encoded as 0x80
- in Unicode this is code point U+20AC (decimal 8364)
- (in case you're wondering, this difference is because Microsoft decided to use places in the "C1 control codes" range, that Unicode chose to leave alone)
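Both of those numbers are easy to confirm in a Python session:
>>> "€".encode( "cp1252" )
b'\x80'
>>> hex( ord( "€" ) )
'0x20ac'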
Python sequence
Here is the same sequence of actions as we did for Example 1 but this time with a short string starting with the Euro symbol.
>>> s_0 = "€30"
>>> print( list( ord(a_char) for a_char in list(s_0) ) )
[8364, 51, 48]
>>> ba_1 = s_0.encode( "utf-8", "surrogateescape")
>>> print( list( ba_1 ) )
[226, 130, 172, 51, 48]
>>> ba_2 = s_0.encode("cp1252", "backslashreplace")
>>> print( list( ba_2 ) )
[128, 51, 48]
>>> s_1 = ba_2.decode("utf-8", "surrogateescape")
>>> print( list( ord(a_char) for a_char in list(s_1) ) )
[56448, 51, 48]
>>> ba_3 = s_1.encode( "utf-8", "surrogateescape")
>>> print( list( ba_3 ) )
[128, 51, 48]
>>> s_2 = ba_3.decode("cp1252")
>>> print( list( ord(a_char) for a_char in list(s_2) ) )
[8364, 51, 48]
>>> print( s_2)
€30
Things to note this time:
- ba_1 is the UTF-8 encoding - note that the euro symbol requires three bytes
- ba_2 is the way this string would have been in Windows 98 et al
- s_1 is what happens when we ask the Python os module to read ba_2 without specifying an encoding - so it presumes UTF-8 but then has to cope with what it gets - we still have the matter of "things that wouldn't be valid in UTF-8" being handled in a strange way, but that can be reversed
- you'll see that this time the code point at the front of s_1 is decimal 56448, which is 0xDC80 in hexadecimal
End note
I should perhaps add that I'm not claiming the methods shown here are either complete or bulletproof. This is merely what I added to my program to cope with what it was encountering. I do plan to revisit all this at some later date, but there are a lot of other important features that will come first. So, not "real soon now" at all.