DEV Community

Discussion on: A brief guide to perl character encoding

Felipe Gasper

Nice job. :) A few points in response:

  1. Yes, `use utf8` is IMO a mistake. It’s a premature hack around the more fundamental problem of Perl strings’ not storing their encoded/decoded state. If Perl could adopt a Unicode model that stores that state, all of this would be dramatically simpler.

  2. Note that most Perl built-ins that talk to the OS leak PV internals, so if your string stores code points [0x80 0x81] in upgraded/wide/UTF8 format internally, you’ll send 4 bytes to the kernel, even though filehandle ops will (correctly) send 2. My [Sys::Binmode](https://metacpan.org/pod/Sys::Binmode) CPAN module fixes this; AFAICT all new Perl code should use it.

  3. I suspect a great many Perl applications have no need of decoding/encoding and can safely just regard their inputs & outputs as byte streams. This is our approach at $work, which keeps things simple for us. (Perl’s PV-leak bugs notwithstanding!)
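
To make point 2 concrete, here is a minimal sketch of the 2-versus-4 byte difference. It uses only core modules; the variable names are illustrative, and `bytes::length` is there purely to peek at the internal buffer (not something to use in production code). It does not call any OS-facing built-in itself; it only shows the internal representation that such built-ins would leak.

```perl
use strict;
use warnings;
use bytes ();                 # loads bytes::length without enabling the pragma
use File::Temp qw(tempfile);

# Two code points below 0x100; Perl can store them in either internal format.
my $downgraded = "\x80\x81";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);     # changes internal storage only, not the string's value

# At the Perl level the two strings are indistinguishable.
my $equal = ($downgraded eq $upgraded) ? 1 : 0;
my $chars = length $upgraded;                           # 2

# bytes::length() exposes the size of the internal buffer (debugging use only).
my $internal_downgraded = bytes::length($downgraded);   # 2
my $internal_upgraded   = bytes::length($upgraded);     # 4

# A filehandle without an encoding layer writes one byte per code point.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $upgraded;
close $fh or die "close failed: $!";
my $bytes_on_disk = -s $path;                           # 2

print "equal=$equal chars=$chars ",
      "internal=$internal_downgraded/$internal_upgraded ",
      "on_disk=$bytes_on_disk\n";
```

The 4-byte internal form is what leaks when an upgraded string is handed to a built-in that talks to the OS. As I understand its documentation, putting `use Sys::Binmode;` in scope makes those built-ins behave like the filehandle write above, i.e. 2 bytes.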