DEV Community

Discussion on: A brief guide to perl character encoding

Felipe Gasper • Edited

How do the Perl internals make a difference, though? Is this some area where the abstraction leaks?

Thus far, my perception is that the only areas where string internals leak are:

  • the various built-ins (which Sys::Binmode fixes)
  • assignments to $0 (PR awaiting approval)
  • buggy XS modules

Are there more places than those?

David Cantrell

Looking at the internals is the easiest way to understand what state someone else has managed to get the data into.

Felipe Gasper

I’m still wondering why it would matter whether someone got the data into internal-UTF8 or internal-bytes form. Are you unable to use the unicode_strings feature?

David Cantrell

unicode_strings won't help when the problem is "print does weird stuff" because the data is already broken by the time my code gets it.

Felipe Gasper • Edited

But print doesn’t care what the string internals are …

> perl -C -e'my $foo = "\xe9"; print "$foo\n"'
é

> perl -C -e'my $foo = "\xe9"; utf8::upgrade($foo); print "$foo\n"'
é
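
The same point can be checked without relying on a terminal. The following is a minimal sketch, not taken from the article: it assumes a hypothetical helper, printed_bytes, that prints to an in-memory filehandle and returns the bytes written, and it uses an :encoding(UTF-8) layer as a rough stand-in for what -C sets up on STDOUT. The two strings differ only in their internal representation.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the article): print $string to an
# in-memory filehandle opened with the given PerlIO layer, then
# return the bytes that were actually written.
sub printed_bytes {
    my ($string, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return $buffer;
}

# Two copies of the same one-character string, U+00E9.
my $downgraded = "\xe9";
my $upgraded   = "\xe9";
utf8::upgrade($upgraded);    # changes the internal storage only

# The internal flag differs between the two copies ...
printf "is_utf8: downgraded=%d upgraded=%d\n",
    (utf8::is_utf8($downgraded) ? 1 : 0),
    (utf8::is_utf8($upgraded)   ? 1 : 0);

# ... but the strings are equal as far as Perl code can tell,
printf "eq: %s, lengths: %d and %d\n",
    ($downgraded eq $upgraded ? 'yes' : 'no'),
    length($downgraded), length($upgraded);

# and print writes identical bytes for both, with or without a layer.
for my $layer ('', ':encoding(UTF-8)') {
    my $same = printed_bytes($downgraded, $layer)
            eq printed_bytes($upgraded,   $layer);
    printf "layer %-18s same output bytes: %s\n",
        ($layer || '(none)'), ($same ? 'yes' : 'no');
}
```

If the two outputs ever differed here, that would point to a bug in the I/O path rather than to a property of the string itself.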

This article refers folks to Perl internals but doesn’t describe when it is (and isn’t) useful to look at them. In a lot of cases it’s a red herring that can reinforce incorrect mental models about how all of this works.