Discussion on: Perl, Unicode, and Bytes

A very difficult topic to cover; well done, and I 100% agree with the conclusions.

Minor nit: I would describe Perl's internal upgraded encoding as "approximately UTF-8". It follows the same structure as UTF-8, so all valid UTF-8 is valid in Perl's internal encoding, but the reverse is not necessarily true: Perl's internal encoding places no restrictions on noncharacters, surrogates, or code points above U+10FFFF. Indeed, it allows storing any ordinal, because Perl strings don't necessarily represent Unicode characters until they're used as such.
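A rough sketch of that difference, assuming a reasonably modern Perl with the core Encode module. The helper name `encodes_strictly` is made up for illustration, and which warnings appear without the `no warnings 'utf8'` line may vary by Perl version. It stores two ordinals that are not valid Unicode scalar values, then asks Encode's strict "UTF-8" encoding whether it will accept them:

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence surrogate / non-Unicode code point warnings
use Encode ();

# Ordinals that are not valid Unicode scalar values.
my $beyond    = chr(0x110000);   # one past U+10FFFF
my $surrogate = chr(0xD800);     # a UTF-16 high surrogate

# Perl stores each as a one-character string and round-trips the ordinal.
printf "length=%d ord=0x%X\n", length($_), ord($_) for $beyond, $surrogate;

# Hypothetical helper: true if strict UTF-8 encoding succeeds,
# false if Encode croaks. Works on a copy and leaves the source untouched.
sub encodes_strictly {
    my ($string) = @_;
    my $ok = eval {
        Encode::encode('UTF-8', $string,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

printf "strict UTF-8 accepts 0x%X? %s\n",
    ord($_), encodes_strictly($_) ? 'yes' : 'no'
    for $beyond, $surrogate, "\x{263A}";
```

The strings exist happily inside Perl; it is only at the point of treating them as Unicode for interchange that the strict encoder objects.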

And more importantly, unless you are writing XS code, you should not depend on it being UTF-8-adjacent anyway. Perl could switch its internal string encoding to UTF-16LE and correctly written pure-Perl code would work the same.
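One way to see why pure-Perl code need not care is to compare the same string in both internal representations. This is only a sketch: `utf8::upgrade` and `utf8::is_utf8` are built-in functions, while the variable names are illustrative. The comparison sticks to `eq`, `length`, and `ord`, which operate on characters rather than on the storage format:

```perl
use strict;
use warnings;

my $native   = "caf\xE9";    # 4 characters, all below 256
my $upgraded = $native;
utf8::upgrade($upgraded);    # changes the internal storage only

# The internal flag differs between the two copies...
printf "native flag=%d upgraded flag=%d\n",
    utf8::is_utf8($native) ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0;

# ...but character-level operations see the same string.
printf "equal=%d lengths=%d/%d last ord=0x%X/0x%X\n",
    ($native eq $upgraded) ? 1 : 0,
    length($native), length($upgraded),
    ord(substr($native, -1)), ord(substr($upgraded, -1));
```

Code that inspects the flag, or assumes a particular byte layout, is the kind that would break if the internal encoding ever changed.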

Felipe Gasper • Feb 9 '21

Thank you! I updated the post a bit to address these points.