If you’ve read my Perl, Unicode, and Bytes or Sys::Binmode posts, you know about the complexities of character encoding in Perl. A bit after I wrote that first post I had a little epiphany I thought worth sharing.
One day I noticed that URI::XSEscape was mangling its output: I’d pass in épée and get out %C3%83%C2%A9p%C3%83%C2%A9e. I recognized this as an extra UTF-8 encode: rather than URI-encoding my 6 bytes of épée, it was first UTF-8 encoding them (so now 10 bytes), then URI-encoding the result.
I pulled out Devel::Peek and saw that something prior to the URI-encoding step had “upgraded” my string’s internal storage: Perl itself stored my string as 10 bytes, even though the Perl scalar still consisted of 6 characters. Ordinarily this is nothing of importance since Perl code doesn’t need to care how Perl itself stores its strings.
… until it does need to care, that is.
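The mangling can be reproduced in pure Perl, without URI::XSEscape. What follows is a minimal sketch, not the module’s actual code: escape_bytes is a hypothetical helper that percent-encodes whatever bytes it is handed, and utf8::encode on a copy stands in for reading an upgraded string’s internal buffer from C.

```perl
use strict;
use warnings;

# The 6 UTF-8 bytes of "épée", written as escapes so that this file's
# own encoding doesn't matter.
my $bytes = "\xc3\xa9p\xc3\xa9e";

# Hypothetical helper (NOT URI::XSEscape's code): percent-encode every
# byte that isn't an RFC 3986 "unreserved" character.
sub escape_bytes {
    my ($octets) = @_;
    return join '', map {
        /\A[A-Za-z0-9._~-]\z/ ? $_ : sprintf('%%%02X', ord $_)
    } split //, $octets;
}

# Upgrade a copy: still the same 6 characters, but stored as 10 bytes.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Stand-in for what C code finds in the upgraded string's internal buffer.
my $internal = $upgraded;
utf8::encode($internal);

print escape_bytes($bytes), "\n";      # %C3%A9p%C3%A9e
print escape_bytes($internal), "\n";   # %C3%83%C2%A9p%C3%83%C2%A9e
print length($upgraded), " characters, ", length($internal), " internal bytes\n";
```

The second line of output is the mangled result from above: the same 6 characters, escaped from their 10-byte internal form.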
What is SvPV?
Perl’s C API (the set of macros and functions available to work with Perl from C) is a classic C API: lots of different ways to do almost the same thing. To translate a Perl scalar to a C signed integer, for example, you can use SvIV, SvIV_nomg, SvIVX, or SvIVx. (IV here signifies an “integer value”.) A similar set of macros exists for unsigned integers (UVs).
Converting a Perl scalar to a C string is similar. There are many tools available, but the 3 “fundamental” ones are:
- SvPVbyte: Takes the code points of your Perl string and gives back a C buffer whose bytes match those code points. Thus, any code point that exceeds 255 doesn’t work, and an exception is thrown.
- SvPVutf8: Like SvPVbyte but gives the UTF-8-encoded bytes for your Perl string’s code points. This works for any code point that Perl can store, but for code points 128-255 it’ll give different results from SvPVbyte. (cf. perldoc perlunicode)
- SvPV: Gives you the Perl string’s internal buffer, aka its “PV” (“pointer value”). It could be bytes, or it could be UTF-8. It’s like a C analogue to Perl’s use bytes.
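The three macros can be approximated from Perl space. This is only a rough sketch: the like_* subs are hypothetical names of my own, built on the core utf8:: functions, and they return copies rather than pointers into the scalar as the real macros do.

```perl
use strict;
use warnings;

# SvPVbyte analogue: one byte per code point; dies on any code point > 255.
sub like_SvPVbyte {
    my ($copy) = @_;
    utf8::downgrade($copy);
    return $copy;
}

# SvPVutf8 analogue: the UTF-8 encoding of the string's code points.
sub like_SvPVutf8 {
    my ($copy) = @_;
    utf8::encode($copy);
    return $copy;
}

# SvPV analogue: whatever Perl happens to be storing internally.
# utf8::is_utf8 plays the role of SvUTF8 here.
sub like_SvPV {
    my ($copy) = @_;
    utf8::encode($copy) if utf8::is_utf8($copy);
    return $copy;
}

my $down = "\xe9p\xe9e";    # "épée" as 4 code points
utf8::downgrade($down);     # stored internally as 4 bytes
my $up = $down;
utf8::upgrade($up);         # same 4 characters, stored internally as 6 bytes

for my $str ($down, $up) {
    printf "%d %d %d\n",
        length(like_SvPVbyte($str)),
        length(like_SvPVutf8($str)),
        length(like_SvPV($str));
}
# prints "4 6 4", then "4 6 6"
```

Only the last column changes between the two lines: the SvPV analogue is the one whose result depends on how Perl stores the string.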
SvPV, of course, is what URI::XSEscape was using.
For SvPV to be meaningful it has to be used in tandem with SvUTF8, a macro that tells you which form the PV is: bytes, or UTF-8. So if SvUTF8 is true, then SvPV’s output is UTF-8; otherwise SvPV’s output is bytes. But URI::XSEscape wasn’t checking SvUTF8; it was just URI-encoding SvPV’s output directly.
The big problem with SvPV is that the number of contexts other than Perl where it’s sensible to have a C string that could be bytes or UTF-8 is … small. Nevertheless, uses of this macro (and its variants) to interact with contexts outside Perl are all over CPAN.
URI::XSEscape, like its pure-Perl counterpart, presents interfaces appropriate for both “byte-oriented” and “character-oriented” Perl code (cf. Perl, Unicode, and Bytes). Since the byte-oriented interface is what I was using, switching URI::XSEscape from SvPV to SvPVbyte was the simple fix to this problem.
In essence, C code like URI::XSEscape should approach Perl strings the same way that pure-Perl code does, without caring about Perl’s internal string storage. Most C code should thus avoid SvPV for the same reason that most Perl should not use bytes.
The plot thickens …
A quick scan through some popular XS modules showed more occurrences of this problem:
- DBD::SQLite
- Net::Curl
- DNS::Unbound (mea culpa!)
- DNS::LDNS
- YAML::Syck
- HTTP::Parser::XS
- Socket (a core module!)
These offer a non-default mode that auto-encodes to UTF-8, but their default setup has the same bug:
There are likely many more; those are just ones I’ve found.
How did this come to be?
I suspect it’s that:
- SvPV is the shortest of the above-named methods for converting a Perl scalar to a C string. Thus, it’s easier to type and looks less “intimidating”.
- Historically, Perl’s documentation favoured SvPV in its examples of scalar-to-string conversion; the other two were seldom discussed. I fixed this recently, but it’ll be years before everyone’s local perldoc reflects that change.
- Perl’s default XS typemap uses SvPV (without consulting SvUTF8) to convert a scalar to a string. Thus, the following XSUB, called as printstr($mystr):
void
printstr (const char *str)
    CODE:
        /* Explicit "%s" so the Perl string is never treated as a format string. */
        fprintf(stdout, "%s", str);
… prints Perl’s internals, which a Perl caller isn’t supposed to care about. Ideally language defaults like this would be the “safe” ways to do things, but this particular one is nonsensical.
Does this problem affect your code?
A simple way to test for this problem is to utf8::upgrade your strings before you give them to the tested code, ensuring, of course, that you’re testing with some code points in the 128-255 range. Your test should verify that your program’s behaviour is the same with utf8::upgraded strings as with non-upgraded strings.
You wouldn’t normally upgrade strings manually in production (since it makes your Perl code think about Perl’s internals, which it shouldn’t do), but for testing it’s fine and useful.
For example, I found the URI::XSEscape problem by doing:
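use URI::XSEscape;

my $foo = "épée";       # 6 bytes if this file is saved as UTF-8 without "use utf8"
utf8::upgrade($foo);    # same 6 characters, now stored internally as UTF-8
print URI::XSEscape::uri_escape($foo);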
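The same check can be wrapped in a small reusable helper. This is a sketch under stated assumptions: storage_agnostic is a hypothetical name, and $good and $bad are pure-Perl stand-ins for whatever function you’re actually testing, so that the example runs with core Perl alone.

```perl
use strict;
use warnings;

# Hypothetical test helper: true if $code returns the same result for the
# downgraded and the upgraded form of the same string.
sub storage_agnostic {
    my ($code, $input) = @_;
    my $down = $input; utf8::downgrade($down);
    my $up   = $input; utf8::upgrade($up);
    return $code->($down) eq $code->($up);
}

# Stand-ins for the code under test.
my $good = sub { return length $_[0] };    # counts characters
my $bad  = sub {                           # counts internal bytes, like unchecked SvPV
    my ($copy) = @_;
    utf8::encode($copy) if utf8::is_utf8($copy);
    return length $copy;
};

my $input = "\xe9p\xe9e";    # includes code points in the 128-255 range

for my $case ([good => $good], [bad => $bad]) {
    my ($name, $code) = @$case;
    printf "%s: %s\n", $name,
        storage_agnostic($code, $input) ? "consistent" : "depends on internal storage";
}
```

In a real test suite the comparison would go through Test::More’s is() or similar; the important part is feeding the same characters in both storage forms.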
Not just any old bug …
The worst part of all this is that modules like CDB_File can’t replace SvPV without breaking existing applications that may depend on that use bytes-ish behaviour. So there’s not much to do except build new, corrected interfaces, deprecating the old ones … which of course will eventually necessitate changes to existing code. For Perl “gurus” that may be simple, but for everyone else changing existing code could be expensive, painful, and even harmful to Perl’s reputation as a language that prizes backward compatibility.
But that’s not all …
XS code isn’t the only place where this bug appears; Perl itself has it, too! Read all about it at “use Sys::Binmode;”.
How can we fix this?
I think most code that uses SvPV to convert a Perl string to a C string intends for Perl code points to correspond to bytes in the C string; thus, such code should actually use SvPVbyte or one of its variants. (UTF-8-aware C code, of course, would use SvPVutf8.) Toward that end, we MUST discourage further use of SvPV. I propose to the Perl community, then, a few changes: some that don’t break anything, and others that will probably break some things.
Fixing this: The easy parts!
1) Rename SvPV and friends. We can’t remove them, but we can create longer, “scarier-looking” aliases for them and use those names in the documentation. I propose SvPVinternal, SvPVinternal_const, etc.
2) Make xsubpp warn when it sees SvPV or variants in a typemap.
3) Use Sys::Binmode in all new code to fix Perl’s own buggy behaviour.
4) Submit bug reports! Audit the XS modules that you use, and if you find different behaviour between upgraded and downgraded strings, let the maintainers know—ideally by sending them patches!
Fixing this: The hard part …
You can’t make an omelet without breaking some eggs, and you often can’t fix things like this without breaking some current applications. Nevertheless …
5) Make char * and const char * in Perl’s default typemap use SvPVbyte. (Actually SvPVbyte_nolen, but hey.) For the vast majority of XS modules this probably would be just a bug fix, though for apps that depend on a use bytes-ish status quo there would be breakage. Thankfully, though: a) the most widely-used XS modules (e.g., MIME::Base64, JSON::XS) where this could be a problem don’t appear to be vulnerable, and b) any breakage would be easy to fix: module authors merely have to adopt SvPVutf8 if that’s what they want, optionally creating separate functions if support for both is desired.
6) Make Sys::Binmode’s behaviour Perl’s own behaviour. This is more contentious because it sidesteps the much larger problem of Perl’s lacklustre support for Windows filesystems; still, Sys::Binmode-type behaviour is no worse than Perl’s status quo, and it fixes a significant leak in Perl’s string abstraction.
Fixing this: The moon-shot …
7) Perl needs to differentiate byte sequences from text strings. This would fix a plethora of “shin-bumpers” that afflict users of the language. This is a fairly difficult problem to solve, but I don’t think it’s insurmountable.
In the meantime …
Absent fixes like the above, we just have to avoid this issue. You’ll always have consistent behaviour if you send encoded strings to the operating system and downgrade them prior to output; this way Perl doesn’t store any strings as UTF-8, so SvPV and SvPVbyte give the same result.
IMPORTANT: If you don’t decode your strings, then by definition they’re already encoded, so in this case don’t encode them manually, or you’ll mangle your output.
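Here is a minimal sketch of that workaround, assuming your text is already decoded (i.e., a character string) and that UTF-8 is the encoding the receiving side expects. It uses only the core Encode module and utf8:: functions; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $text = "\xe9p\xe9e";    # decoded text: 4 characters
utf8::upgrade($text);       # worst case: Perl stores it internally as UTF-8

# Encode explicitly, then make sure the result is stored as plain bytes.
my $octets = Encode::encode('UTF-8', $text);    # 6 bytes
utf8::downgrade($octets);

binmode STDOUT;             # raw bytes out; no I/O layer re-encodes them
print $octets, "\n";

# The mistake the note above warns about: encoding what's already encoded.
my $mangled = Encode::encode('UTF-8', $octets); # 10 bytes
printf "%d bytes encoded once, %d bytes encoded twice\n",
    length($octets), length($mangled);
```

Because $octets is downgraded, its internal buffer and its code points are the same sequence of bytes, so even code that reads it through SvPV sees what you intended.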