
David Cantrell

A brief guide to perl character encoding

Credits

I originally wrote this at work, after my team spent far too many days yelling at the computer because of Mojibake. Thanks to my employer for allowing me to publish it, and the several colleagues who provided helpful feedback. Any errors are, naturally, not their fault.

Table of Contents

  1. 12:45. Re-state my assumptions
  2. The Royal Road
  3. The Encode module
  4. Debugging
  5. The many ways of writing a character

12:45. Re-state my assumptions

We will normally want to read and write UTF-8 encoded data. Therefore you should make sure that your terminal can handle it. While we will occasionally have to deal with other encodings, and will often want to look at the byte sequences that we are reading and writing and not just the characters they represent, your life will still be much easier if you have a UTF-8 capable terminal. You can test your terminal thus:

$ perl -E 'binmode(STDOUT, ":encoding(UTF-8)"); say "\N{GREEK SMALL LETTER LAMDA}"'

That should print λ, a letter that looks a bit like a lower-case y mirrored through the horizontal axis.

And if you pipe the output from that into hexdump -C you should see the byte sequence 0xce 0xbb 0x0a.

The Royal Road

Ideally, your code will only have to care about any of this at the edges - that is, where data enters and leaves the application. That could be when reading or writing a file, sending/receiving data across the network, making system calls, or talking to a database. And in many of these cases - especially talking to a database - you will be using a library which already handles everything for you. In a brand new code-base which doesn’t have to deal with any legacy baggage you should, in theory, only have to read this first section of this document.

Alas, most real programming is a habitation of devils, who will beset you from all around and make you have to care about the rest of it.

Characters, representations, and strings

Perl can work with strings containing any character in Unicode. Characters are written in source code either as a literal character such as "m" or in several other ways. These are all equivalent:

"m"
chr(0x6d) # or chr(109), of course
"\x{6d}"
"\N{U+6d}"
"\N{LATIN SMALL LETTER M}"

As are these:

chr(0x3bb)
"\x{3bb}"
"\N{U+3bb}"
"\N{GREEK SMALL LETTER LAMDA}"

Non-ASCII characters can also appear as literals in your code, for example "λ", but this is not recommended - see the discussion of the utf8 pragma below. You can also use octal - "\155" - but this too is not recommended as hexadecimal encodings are marginally more familiar and easier to read.

Internally, characters have a representation, a sequence of bytes that is unique for a particular combination of character and encoding. Most modern languages default to using UTF-8 for that representation, but perl is old enough to pre-date UTF-8 - and indeed to pre-date any concern for most character sets. For backward-compatibility reasons, and for compatibility with the many C libraries for which perl bindings exist, it was decided when perl sprouted its Unicode tentacle that the default representation should be ISO-Latin-1. This is a single-byte character set that covers most characters used in most modern Western European languages, and is a strict superset of ASCII.

Any string consisting solely of characters in ISO-Latin-1 will by default be represented internally in ISO-Latin-1. Consider these strings:

Release the raccoon! - consists solely of ASCII characters. ASCII is a subset of ISO-Latin-1, so the string’s internal representation is an ISO-Latin-1-encoded string of bytes.

Libérez le raton laveur! - consists solely of characters that exist in ISO-Latin-1, so the string’s internal representation is an ISO-Latin-1-encoded string of bytes. The "é" character has code point 0xe9 and is represented as the byte 0xe9 internally.

Rhyddhewch y racŵn! - the "ŵ" does not exist in ISO-Latin-1. But it does exist in Unicode, with code point 0x175. As soon as perl sees a non-ISO-Latin-1 character in a string, it switches to using something UTF-8-ish, so code point 0x175 is represented by byte sequence 0xc5 0xb5. Note that while valid characters’ internal representations are valid UTF-8 byte sequences, this can also encode invalid characters.

Libérez le raton laveur! Rhyddhewch y racŵn! - this contains both an "é" (which is in ISO-Latin-1) and a "ŵ" (which is not), so the whole string is UTF-8 encoded. The "ŵ" is as before encoded as byte sequence 0xc5 0xb5, but the "é" must also be UTF-8 encoded instead of ISO-Latin-1-encoded, so becomes byte sequence 0xc3 0xa9.

But notice that ISO-Latin-1 not only contains ASCII, and characters like "é" (at code point 0xe9, remember), it also contains characters "Ã" (capital A with a tilde, code point 0xc3) and "©" (copyright symbol, code point 0xa9). So how do we tell the difference between the ISO-Latin-1 byte sequence 0xc3 0xa9 representing "Ã©" and the UTF-8 byte sequence 0xc3 0xa9 representing "é"? Remember that a representation is "a sequence of bytes that is unique for a particular combination of character and encoding". So perl stores the encoding as well as the byte sequence. It is stored as a single bit flag. If the flag is unset then the sequence is ISO-Latin-1, if it is set then it is UTF-8.
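To make the flag a bit more concrete, here is a minimal sketch using the core utf8::is_utf8() and utf8::upgrade() functions, which are available without loading any module. The variable names are mine, and peeking at the flag like this is for illustration only, not something your application code should branch on.

```perl
use strict;
use warnings;

my $latin1 = "Lib\x{e9}rez";   # every character fits in ISO-Latin-1
my $wide   = "rac\x{175}n";    # U+0175 does not, so perl switches representation

# utf8::is_utf8() reports the internal flag: unset for ISO-Latin-1, set for UTF-8-ish
my $latin1_flag = utf8::is_utf8($latin1) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# Force the UTF-8-ish representation onto a copy of the first string.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

# Same characters, different bytes underneath: the strings still compare equal.
my $still_equal = ($latin1 eq $upgraded) ? 1 : 0;

print "flags: latin1=$latin1_flag wide=$wide_flag upgraded=$upgraded_flag\n";
print "equal: $still_equal, lengths: ", length($latin1), " and ", length($upgraded), "\n";
```

The point to take away is that the flag describes the storage, not the characters: both copies of the French word are seven characters long and compare equal.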

Source code encoding, the utf8 pragma, and why you shouldn’t use it

It is possible to put non-ASCII characters into your source code. For example, consider this file:

my $string = "é";

print "$string contains ".length($string)." characters\n";

from which some problems arise. First, if the file is encoded in UTF-8, how can perl tell when it comes across the byte sequence 0xc3 0xa9 what encoding that is? Is it ISO-Latin-1? Well, it could be. Is it UTF-8? Again, it could be. In general, it isn’t possible to tell from a sequence of bytes what encoding is in use. For backward-compatibility reasons, perl assumes ISO-Latin-1.

If you save that file encoded in UTF-8, and have a UTF-8-savvy terminal, that code will output:

é contains 2 characters

which is quite clearly wrong. It interpreted the 0xc3 0xa9 as two characters, but then when it spat those two characters out your terminal treated them as one.

We can tell perl that the file contains UTF-8-encoded source code by adding a use utf8. We also need to fix the output encoding - use utf8 doesn’t do that for you, it only asserts that the source file is UTF-8 encoded:

use utf8;
binmode(STDOUT, ":encoding(UTF-8)");

my $string = "é";

print "$string contains ".length($string)." character\n";

(For more on output encoding see the next section)

And now we get this:

é contains 1 character

Hurrah!

At this point a second problem arises. Some editors aren’t very clever about encodings and even if they correctly read a file that is encoded in UTF-8, they will save it in ISO-Latin-1. VSCode for example is known to do this at least some of the time. If that happens, you’re still asserting via use utf8 that the file is UTF-8, but the "é" in the sample file will be encoded as byte 0xe9, and the following double-quote and semicolon as 0x22 0x3b. This results in a fatal error:

Malformed UTF-8 character: \xe9\x22\x3b (unexpected non-continuation byte 0x22,
immediately after start byte 0xe9; need 3 bytes, got 1) at ...

So given that you’re basically screwed if you have non-ASCII source code no matter whether you use utf8 or not, I recommend that you just don’t do it. If you need a non-ASCII character in your code, use any of the many other ways of specifying it, and if necessary put a comment nearby so that whoever next has to fiddle with the code knows what it is:

chr(0xe9);   # e-acute

Input and output

Strings aren’t the only things that have encodings. File handles do too. Just like how perl defaults to assuming that your source code is encoded in ISO-Latin-1, it assumes unless told otherwise that file handles similarly are ISO-Latin-1, and so if you try to print "é" to a handle, what actually gets written is the byte 0xe9.

Even if your source code has the use utf8 pragma, and your code contains the byte sequence 0xc3 0xa9, which will internally be decoded as the character "é", your handles are still ISO-Latin-1 and you'll get a single byte for that character. For how this happens see "PerlIO layers" below.

Things get a bit more interesting if you try to send a non-ISO-Latin-1 character to an ISO-Latin-1 handle. Perl does the best it can and sends the internal representation - which is UTF-8, remember - to the handle and emits a warning "Wide character in print". Pay attention to the warnings!

This behaviour is another common source of bugs. If you send the two strings "Libérez le raton laveur!" followed by "Rhyddhewch y racŵn!" to an ISO-Latin-1 handle, then the first one will sail through, correctly encoded as ISO-Latin-1, but the second will also go through, as UTF-8 bytes. You’ve now got two different character encodings in your output stream and no matter what encoding is expected at the other end you’ll get mojibake.
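You can watch the mixed output happen with an in-memory file handle. This is only a sketch: the buffer and handle names are mine, and I've switched off the "Wide character" warning purely to keep the demonstration quiet, which is exactly what you should not do in real code.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence "Wide character in print" for this demonstration only

my $buffer = '';
open(my $fh, '>', \$buffer) or die "Can't open in-memory handle: $!";
print {$fh} "\x{e9}";     # in ISO-Latin-1: written as the single byte 0xe9
print {$fh} "\x{175}";    # not in ISO-Latin-1: the internal bytes 0xc5 0xb5 leak out
close($fh);

# Show the bytes that actually reached the handle, as hex.
my $hex = unpack('H*', $buffer);
print "$hex\n";
```

The first character arrives as one ISO-Latin-1 byte and the second as two UTF-8 bytes, so no single encoding describes the whole stream.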

PerlIO layers

We’ve seen how by default input and output is assumed to be in ISO-Latin-1. But that can be changed. Perl has supported different encodings for I/O since the dawn of time - since at least perl 3.016. That’s when it started to automatically convert "\n" into "\r\n" and vice versa on MSDOS, and the binmode() function was introduced in case you wanted to open a file on DOS without any translation.

These days this is implemented via PerlIO layers, which allows you to open a file with all kinds of translation layers, including those which you write yourself or grab from the CPAN (see for example File::BOM). You can also add and remove layers from an already open handle.

In general these days, you always want to read/write UTF-8 or raw binary, so will open files something like this:

open(my $log_fh, ">:encoding(UTF-8)", "some.log") or die "Can't write some.log: $!";

open(my $img_fh, "<:raw", "image.jpg") or die "Can't read image.jpg: $!";

or to change the encoding of an already open handle:

binmode(STDOUT, ":encoding(UTF-8)")

(NB that encodings applied to bare-word file handles such as STDOUT have global effect!)

Provided that we don’t have to worry about Windows, we generally will only ever have one layer doing anything significant on a handle (on Windows the :crlf layer is useful in addition to any others, to cope with Windows’s endearing backward-compatibility with CP/M), but it's possible to have more. In general, when a handle is opened for reading, encodings are applied to data in the order that they are specified in the open() function call, from left to right. When writing, they are applied from right to left.

If you ever think you need more than one layer, or want a layer other than those in the examples above, see PerlIO.
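As a rough sketch of a single encoding layer doing its job in both directions, this writes a string through :encoding(UTF-8) and reads it back. I'm using an in-memory buffer instead of a real file so that the example is self-contained; the variable names are illustrative.

```perl
use strict;
use warnings;

my $buffer = '';

# Writing: characters go in, UTF-8 bytes come out.
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "Can't open for writing: $!";
print {$out} "Lib\x{e9}rez! rac\x{175}n!\n";
close($out);
my $hex = unpack('H*', $buffer);

# Reading: UTF-8 bytes go in, characters come out.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "Can't open for reading: $!";
my $line = <$in>;
close($in);
chomp($line);

print "bytes written: $hex\n";
print "characters read back: ", length($line), "\n";
```

Both the "é" and the "ŵ" are written as two bytes each, with no warning, and the string that comes back is character-for-character what went in.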

The Encode module

The above explains the "royal road", where you are in complete control of how data gets into and out of your code. In that situation, you should never need to re-encode data, as it will always be Just A Bunch Of Characters whose underlying representation you don’t care about. That is, however, often not the case in the real world where we are beset by demons. We sometimes have to deal with libraries that do their own encoding/decoding and expect us to supply them with a byte stream (XML::LibXML, for example), or which have had incorrect or partial bug fixes applied for any of the problems mentioned above and for which we can’t easily provide a proper fix because of other code now relying on the buggy behaviour (by for example having work-arounds to correct badly-encoded data).

Encode::encode

The Encode::encode() function takes a string of characters and returns a string of bytes that represent that string in your desired encoding. For example:

use Encode;

my $string = "Lib\x{e9}rez le raton laveur!";   # \x{e9} is e-acute
encode("UTF-8", $string, Encode::FB_CROAK|Encode::LEAVE_SRC);

will return a string where the character "é" has been replaced by the two bytes 0xc3 0xa9. If the original string was encoded in UTF-8 then the underlying representation of the input and output strings will be the same, but their encodings (as stored in the single bit flag we mentioned earlier) will be different, and the output will be reported as being one character longer by the length() function.

Encode::encode can sometimes for Complicated Internals Optimisation Reasons modify its input. To avoid this set the Encode::LEAVE_SRC bit in its third argument.

If you are encoding to anything other than UTF-8 or your string may contain characters outside of Unicode then you should consider telling encode() to be strict about characters that it can't encode, such as if you try to encode "ŵ" into an ISO-Latin-1 byte sequence. That's what the Encode::FB_CROAK bit is about in the example - in real code the encode should be in a try/catch block to deal with the exception that may arise. Encode's documentation has a whole section on handling malformed data.
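Here is a minimal sketch of strict encoding with the failure handled. I've used a plain eval block where your code-base might prefer a try/catch module, and the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $french = "Lib\x{e9}rez le raton laveur!";
my $welsh  = "Rhyddhewch y rac\x{175}n!";
my $check  = Encode::FB_CROAK | Encode::LEAVE_SRC;

# Every character can be encoded as UTF-8; the e-acute takes two bytes.
my $utf8_bytes = encode("UTF-8", $french, $check);

# U+0175 has no ISO-Latin-1 encoding, so FB_CROAK makes encode() die.
my $latin1_bytes = eval { encode("ISO-8859-1", $welsh, $check) };
my $error = $@;

print "characters: ", length($french), ", UTF-8 bytes: ", length($utf8_bytes), "\n";
print defined $latin1_bytes ? "encoded\n" : "could not encode: $error";
```

What you do when the encode fails (substitute a placeholder, reject the input, log and carry on) is an application decision; the important part is that you find out instead of silently emitting garbage.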

Encode::decode

It is quite common for us to receive data, either from a network connection or from a library, which is a UTF-8-encoded byte stream. Naively treating this as ISO-Latin-1 characters will lead to doom and disaster, as the byte sequence 0xc3 0xa9 will, as already explained, be interpreted as the characters "Ã" and "©". Encode::decode() takes a bunch of bytes and turns them into characters assuming that they are in a specified encoding. For example, this will return a "é" character:

decode("UTF-8", chr(0xc3).chr(0xa9), Encode::FB_CROAK)

You should consider how to handle a byte stream that turns out to not be valid in your desired encoding and again I recommend use of Encode::FB_CROAK.
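A sketch of what that looks like in practice, with one valid and one invalid byte stream. The invalid one is the ISO-Latin-1 "é" followed by a double-quote and a semicolon from earlier in this document; the variable names and the use of eval for exception handling are my choices.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

my $good_bytes = "\xc3\xa9";        # UTF-8 encoding of e-acute
my $bad_bytes  = "\xe9\x22\x3b";    # ISO-Latin-1 e-acute, double-quote, semicolon

# Two bytes in, one character out.
my $chars = decode("UTF-8", $good_bytes, $check);

# 0xe9 starts a multi-byte sequence but 0x22 can't continue it, so this dies.
my $broken = eval { decode("UTF-8", $bad_bytes, $check) };
my $error  = $@;

printf "decoded %d byte(s) into %d character(s)\n", length($good_bytes), length($chars);
print defined $broken ? "decoded\n" : "could not decode: $error";
```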

Encode:: everything else

The "Encode" module provides some other functions that, on the surface, look useful. They are, mostly, not.

Remember how waaaay back I briefly mentioned that perl’s internal representation for non-ISO-Latin-1 characters was UTF-8-ish and how they could contain invalid characters? That’s why you shouldn’t use encode_utf8 or decode_utf8. You may be tempted to use Encode::is_utf8() to check a string's encoding. Don't, for the same reason.

You will generally be calling encode() with a variable, not a string literal, as its input. Even so, any errors like "Modification of a read-only value attempted" are your fault: you should have told it to Encode::LEAVE_SRC.

Don't even think about using the _utf8_on and _utf8_off functions. They are only useful for deliberately breaking things at a lower level than you should care about.

Debugging

The UTF8 flag

The UTF8 flag is a reliable indicator that the underlying representation uses multiple bytes per non-ASCII character, but that’s about it. It is not a reliable indicator whether a string’s underlying representation is valid UTF-8 or that the string is valid Unicode.

The result of this:

Encode::encode("UTF-8", chr(0xe9), Encode::LEAVE_SRC)

is a string whose underlying representation is valid UTF-8 but the flag is off.

This, on the other hand has the flag on but the underlying representation is not valid UTF-8 because the character is out of range:

chr(2097153)

This is an invalid character in Unicode, but perl encodes it (it has to encode it so it can store it) and turns the UTF8 flag on (so that it knows how the underlying representation is encoded):

chr(0xfff8)

And finally, this variable that someone else’s broken code might pass to you contains an invalid encoding of a valid character:

my $str = chr(0xf0).chr(0x82).chr(0x82).chr(0x1c);
Encode::_utf8_on($str);

Devel::Peek

This is a very useful module for looking at the internals of perl variables, in particular for looking at what perl thinks the characters are and what their underlying representation is. It exports a Dump() function, which prints details about its argument’s internal structure to STDERR. For example:

$ perl -MDevel::Peek -E 'Dump(chr(0xe9))'
SV = PV(0x7fa98980b690) at 0x7fa98a00bf90
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,PROTECT,pPOK)
  PV = 0x7fa989408170 "\351"\0
  CUR = 1
  LEN = 10


For the purposes of debugging character encoding issues, the two important things to look at are the lines beginning with FLAGS = and PV =. Note that there is no UTF8 flag set, indicating that the string uses the single-byte ISO-Latin-1 encoding. And the string’s underlying representation is shown (in octal, annoyingly), as "\351".

And here’s what it looks like when the string contains code points outside ISO-Latin-1, or has been decoded from a byte stream into UTF-8:

$ perl -MDevel::Peek -E 'Dump(chr(0x3bb))'
SV = PV(0x7ff37e80b090) at 0x7ff388012390
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,PROTECT,pPOK,UTF8)
  PV = 0x7ff37f907350 "\316\273"\0 [UTF8 "\x{3bb}"]
  CUR = 2
  LEN = 10

Notice that the UTF8 flag has appeared, and that we are shown both the underlying representation as two octal bytes "\316\273" and the characters (in hexadecimal if necessary - mmm, consistency) that those bytes represent.
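If all you want to know is which characters perl thinks are in a string, and you don't care about the underlying bytes, something lighter than Dump() may do. The code_points() helper below is hypothetical, something I've made up for this example and not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: list a string's characters as U+XXXX code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

my $decoded   = "\x{3bb}";     # one character, GREEK SMALL LETTER LAMDA
my $undecoded = "\xce\xbb";    # its UTF-8 bytes, never decoded: two characters

print code_points($decoded), "\n";
print code_points($undecoded), "\n";
```

If you were expecting a lamda and see two code points in the U+0080 to U+00FF range instead, someone has forgotten to decode their input.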

hexdump

For debugging input and output I recommend the external hexdump utility. Feed it a file and it will show you the bytes therein, avoiding any clever UTF-8-decoding that your terminal might do if you were to simply cat the file:

$ cat greek
αβγ
$ hexdump -C greek
00000000  ce b1 ce b2 ce b3 0a                              |.......|
00000007

It can of course also read from STDIN.

PerlIO::get_layers

Once you’re sure that your code isn’t doing anything perverse, but your data is still getting screwed up on input/output you can see what encoding layers are in use on a handle with the PerlIO::get_layers function. PerlIO is a Special built-in namespace, you don’t need to use it. Indeed, if you do try to use it you will fail, as it doesn’t exist as a module. Layers are returned in an array, in the order that you would tell open() about them.

Layers can apply to any handle, not just file handles. If you’re dealing with a socket then remember that they have both an input side and an output side which may have different layers - see the PerlIO manpage for details. And also see the doco if you care about the difference between :utf8 and :encoding(UTF-8) - although if you diligently follow the sage advice in this document you won’t care, because you won’t use :utf8.
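For example, here is a small sketch that opens a handle with an encoding layer and asks what layers ended up on it. I'm using an in-memory handle so that it runs anywhere; a handle on a real file will report different lower-level layers, so treat the exact list as platform-dependent.

```perl
use strict;
use warnings;

my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "Can't open in-memory handle: $!";

# No "use PerlIO" needed: the function is always available.
my @layers = PerlIO::get_layers($fh);

# Count the layers that do character encoding.
my $has_encoding = grep { /^encoding\(/ } @layers;

print "layers: @layers\n";
close($fh);
```

For a handle with separate input and output sides, such as a socket, the PerlIO documentation describes an output => 1 argument that you can pass after the handle to ask about the output side.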

The many ways of writing a character

There are numerous different ways of representing a character in your code.

String literals

"m"

For the reasons outlined above please only use this for ASCII characters.

The chr function

This function takes a number as its argument and returns the character with the corresponding codepoint. For example, chr(0x3bb) returns λ.

Octal

You can use up to three octal digits "\155" for ISO-Latin-1 characters only but please don’t. It’s a less familiar encoding than hexadecimal so hex is marginally easier to read, and it also suffers from the “how long is this number” problem described below.

Hexadecimal

"\x{e9}"


You can put any number of hexadecimal digits between the braces. There is also a version of this which doesn’t use braces: "\xe9". It can only take one or two hexadecimal digits and so is only valid for ISO-Latin-1 characters.

The lack of delimiters can lead to confusion and error. Consider "\xa9". Brace-less \x can take one or two hex digits, so is that \xa (a line-feed character) followed by the digit 9, or is it \xa9, the copyright symbol? Brace-less \x is greedy, so if it looks like there are two hex digits it will assume that there are. Only if the first digit is followed by the end-of-string or by a non-hex-digit will it assume that you meant to use the single digit form. This means that \xap, for example, is a single hex digit, so is equivalent to \x{0a}p, a new line followed by the letter p. I think you will agree that use of braces makes things much clearer, so please treat the brace-less variant as deprecated.
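The greedy behaviour of brace-less \x is easy to check for yourself. A minimal sketch, with variable names of my own choosing:

```perl
use strict;
use warnings;

my $copyright = "\xa9";      # two hex digits: the single character U+00A9
my $newline_p = "\xap";      # "p" is not a hex digit, so this is \x{0a} then "p"
my $braced    = "\x{0a}p";   # the unambiguous spelling of the same thing

printf "\\xa9 is %d character(s), code point 0x%x\n", length($copyright), ord($copyright);
printf "\\xap is %d character(s)\n", length($newline_p);
print "brace-less and braced forms match\n" if $newline_p eq $braced;
```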

By codepoint name

"\N{GREEK SMALL LETTER LAMDA}"

This may sometimes be preferable to providing the (hexa)decimal codepoint with an associated comment, but it gets awful wordy awful fast. By default the name must correspond exactly to that in the Unicode standard. Shorter aliases are available if you ask for them, via the charnames pragma. The documentation only mentions this for the Greek and Cyrillic scripts, but they are available for all scripts which have letters. For example, these are equivalent:

"\x{5d0}"

"\N{HEBREW LETTER ALEF}"

use charnames qw(hebrew);
"\N{ALEF}"                  # א

Be careful if you ask for character-set-specific aliases as there may be name clashes. Both Arabic and Hebrew have a letter called "alef", for example:

use charnames qw(arabic);
"\N{ALEF}"                  # ا

use charnames qw(arabic hebrew);
"\N{ALEF}"                  # Always Hebrew, no matter the order of the imports!

A happy medium ground is to ask for :short aliases:

use charnames qw(:short);
"\N{ALEF}"                           # error
"\N{hebrew:alef} \N{arabic:alef}"    # does what it says on the tin
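The charnames module can also translate between names and code points at run time, which is handy when you're staring at a code point and wondering what it is. This sketch uses its viacode() and vianame() functions with full names only; the variable names are mine.

```perl
use strict;
use warnings;
use charnames ':full';

my $lamda = "\N{GREEK SMALL LETTER LAMDA}";

# charnames::viacode() maps a code point to its official name...
my $name = charnames::viacode(ord($lamda));

# ...and charnames::vianame() maps a full name back to a code point.
my $code_point = charnames::vianame("HEBREW LETTER ALEF");

printf "U+%04X is %s\n", ord($lamda), $name;
printf "HEBREW LETTER ALEF is U+%04X\n", $code_point;
```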

Other hexadecimal

"\N{U+3bb}"

This notation looks a little bit more like the U-ish hexadecimal notations used in other languages while also being a bit like the \N{...} notation for codepoint names. Unless you want to mix hexadecimal along with codepoint names you should probably not use this, and prefer \x{...} which is more familiar to perl programmers.

In regular expressions

You can use any of the \x and \N{...} variants in regular expressions. You may also see \p, \P, and \X as well. See perlunicode and perlrebackslash. You should consider use of the /a modifier as that does things like force \d to only match ASCII and not, say, "৪", which looks like 8 but is actually BENGALI DIGIT FOUR.
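A quick sketch of the difference that /a makes, using BENGALI DIGIT FOUR (code point 0x9ea). The variable names are illustrative, and the /a modifier needs perl 5.14 or later.

```perl
use strict;
use warnings;

my $bengali_four = "\x{9ea}";   # BENGALI DIGIT FOUR
my $ascii_four   = "4";

# Without /a, \d matches any Unicode decimal digit.
my $unicode_match = ($bengali_four =~ /^\d$/)  ? 1 : 0;

# With /a, \d only matches 0 to 9.
my $ascii_only    = ($bengali_four =~ /^\d$/a) ? 1 : 0;
my $plain_digit   = ($ascii_four   =~ /^\d$/a) ? 1 : 0;

print "Bengali four without /a: $unicode_match\n";
print "Bengali four with /a:    $ascii_only\n";
print "ASCII four with /a:      $plain_digit\n";
```

This matters most when a "digit" that has passed your validation is then handed to something that does arithmetic on it.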

ASCII-encoded JSON strings in your code

You may need to embed JSON strings in your code, especially in tests. I recommend that JSON should always be ASCII-encoded as this minimises the chances of it getting mangled anywhere. This introduces yet another annoying way of embedding a bunch of hex digits into text. This example:

use JSON;

to_json(chr(0x3c0), { ascii => 1 });

will produce the string "\u03c0". That’s the sequence of eight characters " \ u 0 3 c 0 ". The double quotes are how JSON says “this is a string”, and the two characters \ and u are how JSON says “here comes a hexadecimal code point”. If you want to put ASCII-encoded JSON in your code then you need to be careful about quoting and escaping.

Perl will treat the character sequence \ u as a real back-slash followed by the letter u when it is single-quoted, but in general it is always good practice to escape a back-slash that you want to be a real back-slash, to avoid confusion to the reader who may not have been paying attention to whether you’re single- or double-quoting, or in case you later change the code to use double-quotes and interpolate some variable:

my $json = '"I like \\u03c0, especially Greek pie"';

# or double-quoted with interpolation
my $json = qq{"I like \\u03c0, especially $nationality pie"};
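To check that the escaping has come out right, you can round-trip the string through a JSON codec. This sketch uses JSON::PP, which ships with perl, in place of the JSON module used above (an assumption on my part to keep the example dependency-free), with allow_nonref switched on so that it will handle a bare string.

```perl
use strict;
use warnings;
use JSON::PP;

my $codec = JSON::PP->new->ascii->allow_nonref;

# Encoding: the one-character string becomes eight ASCII characters.
my $json = $codec->encode("\x{3c0}");

# Decoding: the escaped back-slash in the single-quoted literal is a real
# back-slash by the time the JSON parser sees it.
my $embedded = '"I like \\u03c0, especially Greek pie"';
my $decoded  = $codec->decode($embedded);

print "encoded: $json (", length($json), " characters)\n";
printf "decoded character 7 is U+%04X\n", ord(substr($decoded, 7, 1));
```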

Accented character vs character + combining accent

For many characters there are two different valid ways of representing them. chr(0xe9) is LATIN SMALL LETTER E WITH ACUTE. The same character can be obtained with the two codepoints "e".chr(0x301) - that is LATIN SMALL LETTER E and COMBINING ACUTE ACCENT.

Whether those should sort the same, compare the same, or one should be converted to t’other will vary depending on your application, so the best I can do is point you at Unicode::Normalize.
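As a starting point, here is a sketch of the two representations being compared before and after normalisation, using the NFC and NFD functions that Unicode::Normalize exports. Which normalisation form is right for you, if any, is the application-specific part; the variable names are mine.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{e9}";      # LATIN SMALL LETTER E WITH ACUTE
my $combining   = "e\x{301}";    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# As far as eq is concerned these are different strings...
my $raw_equal = ($precomposed eq $combining) ? 1 : 0;

# ...but they are the same once both are in composed form (NFC).
my $nfc_equal = (NFC($precomposed) eq NFC($combining)) ? 1 : 0;

# Decomposing (NFD) turns the single code point into two.
my $nfd_length = length(NFD($precomposed));

print "raw equal: $raw_equal, NFC equal: $nfc_equal, NFD length: $nfd_length\n";
```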

Top comments (13)

Felipe Gasper

Nice job. :) A few points in response:

  1. Yes, use utf8 is IMO a mistake. It’s a premature hack around the more fundamental problem of Perl strings’ not storing their encoded/decoded state. If Perl could adopt a Unicode model that stores that, this all would be dramatically simpler.

  2. Note that most Perl built-ins that talk to the OS leak PV internals, so if your string stores code points [0x80 0x81] in upgraded/wide/UTF8 format internally, you’ll send 4 bytes to the kernel, even though filehandle ops will (correctly) send 2. My Sys::Binmode CPAN module (metacpan.org/pod/Sys::Binmode) fixes this; AFAICT all new Perl code should use it.

  3. I suspect a great many Perl applications have no need of decoding/encoding and can safely just regard their inputs & outputs as byte streams. This is our approach at $work, which keeps things simple for us. (Perl’s PV-leak bugs notwithstanding!)

Felipe Gasper

David: when do you find that knowledge of Perl’s internal encoding (Devel::Peek, UTF8 flag) is useful?

I’ve come to think that these are never, in fact, useful unless you’re doing something wrong already. Even for XS, I can’t think of a case where that’s needed.

David Cantrell

If the code I'm working on has got confused and I need to patch it to unconfuse matters, then checking exactly what is in the variable is, I find, the quickest way to figure out what needs doing.

Felipe Gasper • Edited on

How do the Perl internals make a difference, though? Is this some area where the abstraction leaks?

Thus far, my perception is that the only areas where string internals leak are:

  • the various built-ins (which Sys::Binmode fixes)
  • Assignments to $0 (PR awaiting approval)
  • buggy XS modules

Are there more places than those?

David Cantrell

Looking at the internals is the easiest way to understand what state someone else has managed to get the data into.

Felipe Gasper

I’m still wondering why it would matter whether someone got the data into internal-UTF8 or internal-bytes. Are you unable to use unicode_strings?

David Cantrell

unicode_strings won't help when the problem is "print does weird stuff" because the data is already broken by the time my code gets it.

Felipe Gasper • Edited on

But print doesn’t care what the string internals are …

> perl -C -e'my $foo = "\xe9"; print "$foo\n"'
é

> perl -C -e'my $foo = "\xe9"; utf8::upgrade($foo); print "$foo\n"'
é

This article refers folks to Perl internals but doesn’t describe when it is (and isn’t) useful to look at them. In a lot of cases it’s a red herring that can reinforce incorrect mental models about how all of this works.

Mark Gardner

That “LAMDA” spelling stuck out to me, but it’s correct in the context of Unicode. This is “motivated by the pre-existing names in ISO 8859-7 […] as well as by preferences expressed by the Greek National Body.” unicode.org/mail-arch/unicode-ml/y...

David Cantrell • Edited on

Yeah, it's wrong, but at least it's standardly wrong :-) My compose key mapping will accept both versions: github.com/DrHyde/configurations/b...

And FWIW when I was at school the confusion between Hebrew alef and Arabic alef wouldn't have existed either. They were "aleph" (Hebrew) and "alif" (Arabic).

pillbox hat

I wouldn't say it's wrong, standardly or not standardly :) "Lamda" is the letter in the modern Greek alphabet, hence it was correctly named as thus in ISO 8859-7 which encoded modern Greek and that got copied to the "Greek and Coptic" unicode block, which is also intended to encode modern Greek. The ancient-Greek only letters are in the "Greek Extended" block, where if a λ appeared it would be called with the classic Greek spelling "Lambda", but of course it's not there as it's essentially the same letter.

David Cantrell

The OED prefers "lambda". For "lamda" it says "see lambda". The version with a b has been more common in English since time immemorial.

Letter names in Unicode seem to be, as far as is practical, spelled in English, and how modern Greeks prefer to spell it in Greek isn't very important. See also LATIN SMALL LETTER SHARP S, which is spelled in English, and not in German as ESZETT or SCHARFES S.

pillbox hat

I was not saying "lamda" is the correct spelling for the "Greek letter λ" in general. I was saying, it is the correct spelling for the very specific usage of the character in the context of naming the modern greek character set. It is a different meaning from "lambda", which is a word you can indeed find in an English dictionary and it is used internationally in reference to classics and math (or maths if you are British) and more.
You are not supposed to look into a definition for a character name, none of the myriad characters of various scripts have one, there is AFAIK some sort of committee process which tries to get the best english names/transliterations.
I would certainly not consider "LAMBDA" to be "wrong" or "less right" if it had been used instead, but, technically, "LAMDA" is not "wrong" either. The inverse would be wrong of course, you can't use "lamda" for anything else (there's no lamda calculus).
And it's not important in any case, as long as nobody changes it and breaks old code :D
