
A brief guide to perl character encoding

David Cantrell on January 31, 2022

Credits: I originally wrote this at work, after my team spent far too many days yelling at the computer because of Mojibake. Thanks to my...
Felipe Gasper

Nice job. :) A few points in response:

  1. Yes, use utf8 is IMO a mistake. It’s a premature hack around the more fundamental problem of Perl strings’ not storing their encoded/decoded state. If Perl could adopt a Unicode model that stores that, this all would be dramatically simpler.

  2. Note that most Perl built-ins that talk to the OS leak PV internals, so if your string stores code points [0x80 0x81] in upgraded/wide/UTF8 format internally, you’ll send 4 bytes to the kernel, even though filehandle ops will (correctly) send 2. My Sys::Binmode CPAN module (metacpan.org/pod/Sys::Binmode) fixes this; AFAICT all new Perl code should use it. (There’s a short sketch of this after the list.)

  3. I suspect a great many Perl applications have no need of decoding/encoding and can safely just regard their inputs & outputs as byte streams. This is our approach at $work, which keeps things simple for us. (Perl’s PV-leak bugs notwithstanding!)
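
A minimal sketch of the leak described in point 2 (my illustration, assuming a Linux-ish filesystem and hypothetical directory names; see Sys::Binmode's documentation for the authoritative description):

use strict;
use warnings;

# Two code points in the 0x80-0xFF range, stored downgraded (one byte each).
my $name = "\x{80}\x{81}";

# The same two code points, now stored upgraded (UTF-8 internally, four bytes).
my $upgraded = $name;
utf8::upgrade($upgraded);

# Without Sys::Binmode, built-ins such as mkdir pass the internal PV straight
# to the kernel, so these two calls can create two differently named
# directories (2 raw bytes vs. 4 UTF-8 bytes), even though the strings
# compare equal with eq.
mkdir "dir_$name"     or warn "mkdir (downgraded): $!";
mkdir "dir_$upgraded" or warn "mkdir (upgraded): $!";

# With `use Sys::Binmode;` in effect, both calls should refer to the same
# two-byte name, because arguments are downgraded before reaching the OS.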

Felipe Gasper

David: when do you find that knowledge of Perl’s internal encoding (Devel::Peek, UTF8 flag) is useful?

I’ve come to think that these are never, in fact, useful unless you’re doing something wrong already. Even for XS, I can’t think of a case where that’s needed.

David Cantrell

If the code I'm working on has got confused and I need to patch it to unconfuse matters, then checking exactly what is in the variable is, I find, the quickest way to figure out what needs doing.
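
(For concreteness, here is a minimal sketch of that kind of check, using Devel::Peek to show a string's internal state; this example is mine, not from the thread.)

use strict;
use warnings;
use Devel::Peek;

my $bytes = "\xe9";        # one code point, stored downgraded (a single byte)
my $chars = "\xe9";
utf8::upgrade($chars);     # the same code point, stored upgraded (UTF8 flag set)

Dump($bytes);   # FLAGS line has no UTF8; PV shows the raw byte \351
Dump($chars);   # FLAGS line includes UTF8; PV shows \303\251 [UTF8 "\x{e9}"]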

Felipe Gasper • Edited

How do the Perl internals make a difference, though? Is this some area where the abstraction leaks?

Thus far, my perception is that the only areas where string internals leak are:

  • the various built-ins (which Sys::Binmode fixes)
  • assignments to $0 (PR awaiting approval)
  • buggy XS modules

Are there more places than those?

David Cantrell

Looking at the internals is the easiest way to understand what state someone else has managed to get the data into.

Felipe Gasper

I’m still wondering why it would matter whether someone got the data into internal-UTF8 or internal-bytes. Are you unable to use unicode_strings?
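
(For context, a quick sketch of what unicode_strings changes; this is my illustration, not part of the original exchange. Without the feature, case-mapping of characters in the 0x80-0xFF range depends on the string's internal representation; with it, both representations behave the same. Requires Perl 5.12+.)

use strict;
use warnings;

my $down = "\xe9";          # é, stored downgraded
my $up   = "\xe9";
utf8::upgrade($up);         # the same é, stored upgraded

{
    no feature 'unicode_strings';
    printf "%02X\n", ord uc $down;   # E9 - no case mapping under byte semantics
    printf "%02X\n", ord uc $up;     # C9 - Unicode semantics give É
}
{
    use feature 'unicode_strings';
    printf "%02X\n", ord uc $down;   # C9 - the same answer regardless of internals
    printf "%02X\n", ord uc $up;     # C9
}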

David Cantrell

unicode_strings won't help when the problem is "print does weird stuff" because the data is already broken by the time my code gets it.
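
To illustrate what "already broken by the time my code gets it" can look like (a sketch of mine, not David's code): if UTF-8 bytes were mis-decoded as Latin-1 somewhere upstream, the string already holds the wrong code points, and no output-layer tweaking downstream will repair that.

use strict;
use warnings;
use Encode qw(encode decode);

my $original   = "\x{e9}";                           # the é the user typed
my $utf8_bytes = encode('UTF-8', $original);         # "\xc3\xa9" on the wire
my $broken     = decode('ISO-8859-1', $utf8_bytes);  # wrong decode upstream

# By the time this value reaches the code being debugged it contains
# U+00C3 U+00A9 ("Ã©") instead of the single U+00E9, so print shows mojibake
# no matter which layers are on the filehandle.
printf "%vX\n", $broken;    # C3.A9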

Felipe Gasper • Edited

But print doesn’t care what the string internals are …

> perl -C -e'my $foo = "\xe9"; print "$foo\n"'
é

> perl -C -e'my $foo = "\xe9"; utf8::upgrade($foo); print "$foo\n"'
é

This article refers folks to Perl internals but doesn’t describe when it is (and isn’t) useful to look at them. In a lot of cases it’s a red herring that can reinforce incorrect mental models about how all of this works.

Mark Gardner

That “LAMDA” spelling stuck out to me, but it’s correct in the context of Unicode. This is “motivated by the pre-existing names in ISO 8859-7 […] as well as by preferences expressed by the Greek National Body.” unicode.org/mail-arch/unicode-ml/y...

David Cantrell • Edited

Yeah, it's wrong, but at least it's standardly wrong :-) My compose key mapping will accept both versions: github.com/DrHyde/configurations/b...

And FWIW when I was at school the confusion between Hebrew alef and Arabic alef wouldn't have existed either. They were "aleph" (Hebrew) and "alif" (Arabic).

pillbox hat

I wouldn't say it's wrong, standardly or not standardly :) "Lamda" is the letter's name in the modern Greek alphabet, hence it was correctly named thus in ISO 8859-7, which encoded modern Greek, and that got copied to the "Greek and Coptic" Unicode block, which is also intended to encode modern Greek. The ancient-Greek-only letters are in the "Greek Extended" block, where if a λ appeared it would be called by the classic Greek spelling "Lambda", but of course it's not there as it's essentially the same letter.

David Cantrell

The OED prefers "lambda". For "lamda" it says "see lambda". The version with a b has been more common in English since time immemorial.

Letter names in Unicode seem to be, as far as is practical, spelled in English, and how modern Greeks prefer to spell it in Greek isn't very important. See also LATIN SMALL LETTER SHARP S, which is spelled in English, and not in German as ESZETT or SCHARFES S.

pillbox hat

I was not saying "lamda" is the correct spelling for the "Greek letter λ" in general. I was saying it is the correct spelling for the very specific use of the character in the context of naming the modern Greek character set. That is a different meaning from "lambda", which is a word you can indeed find in an English dictionary and which is used internationally in reference to classics and math (or maths if you are British) and more.
You are not supposed to look up a definition for a character name; none of the myriad characters of various scripts have one. There is, AFAIK, some sort of committee process which tries to get the best English names/transliterations.
I would certainly not consider "LAMBDA" to be "wrong" or "less right" if it had been used instead, but, technically, "LAMDA" is not "wrong" either. The inverse would be wrong, of course: you can't use "lamda" for anything else (there's no lamda calculus).
And it's not important in any case, as long as nobody changes it and breaks old code :D