DEV Community

Mansour Moufid


Unicode bug on macOS

A diacritical mark is a mark on a letter, such as an accent, that conveys meaning like a change in pronunciation. In Unicode, a letter with a diacritical mark can have more than one encoding.

For example, the letter c with a cedilla (ç) can be encoded as a single character:

Character  UTF-8 encoding  Name
ç          0xc3 0xa7       LATIN SMALL LETTER C WITH CEDILLA

or as two:

Character  UTF-8 encoding  Name
c          0x63            LATIN SMALL LETTER C
◌̧          0xcc 0xa7       COMBINING CEDILLA

The first is the composed form, known as "Normalization Form C" (NFC). The second is the decomposed form, known as "Normalization Form D" (NFD): it is the canonical decomposition of the same character.
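The difference between the two forms can be checked directly in Python. This is a minimal sketch using only the standard `unicodedata` module; the variable names `composed` and `decomposed` are just illustrative.

```python
import unicodedata

composed = "\u00e7"      # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # LATIN SMALL LETTER C + COMBINING CEDILLA

# The two strings look the same on screen but are different sequences.
print(composed == decomposed)                # False
print(composed.encode("utf-8").hex(" "))     # c3 a7
print(decomposed.encode("utf-8").hex(" "))   # 63 cc a7

# Normalizing converts one form into the other.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Note that `bytes.hex()` accepts a separator only on Python 3.8 and later.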

Something interesting happens when you save a file with a name in NFC: macOS will not open it.

For example:

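import unicodedata

# Force the name into NFC, whatever form the editor saved this literal in.
filename = unicodedata.normalize('NFC', 'français.txt')
with open(filename, 'wt') as f:
    print('Bonjour!', file=f)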

Try to double-click this file in Finder. TextEdit will appear in the Dock, but nothing else will happen.

Now close TextEdit, delete the file, and change the file name to its canonical decomposition (NFD):

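import os
import unicodedata

filename = 'français.txt'
# Decompose the name: 'ç' becomes 'c' followed by COMBINING CEDILLA.
filename = unicodedata.normalize('NFD', filename)

# Remove any leftover copy before recreating the file.
try:
    os.remove(filename)
except FileNotFoundError:
    pass
with open(filename, 'wt') as f:
    print('Bonjour!', file=f)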

TextEdit will open this file just fine.

If you're wondering why a bunch of your files stopped opening after the latest upgrade to macOS (13.3.1), this is why...
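To see which names in a folder might be affected, one option is to compare each name with its NFD form. The following is only a sketch: `non_nfd_names` is a hypothetical helper written for this post, and scanning the current directory is an assumption you should adjust.

```python
import os
import unicodedata


def non_nfd_names(names):
    """Return the names that are not already in NFD."""
    return [n for n in names if unicodedata.normalize("NFD", n) != n]


if __name__ == "__main__":
    # Assumption: check the current directory; change the path as needed.
    for name in non_nfd_names(os.listdir(".")):
        print(repr(name))
```

Plain ASCII names are identical in both forms, so only names containing composed characters are reported.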
