Discussion on: 100 Languages Speedrun: Episode 47: Raku (Perl 6) Regular Expressions


If I got this correctly, you're implying that \d should only match ASCII digits, right? We should use <:Nd> to match any Unicode digit, and not \d. The massive bug is making \d == <:Nd>.

Tomasz Wegrzanowski • Jan 5 '22

Yes, it is a massive bug. It causes a lot of programs to match a lot more than they expect, very likely including a lot of security validations. Everyone, including the people who wrote those docs, assumes \d matches ASCII digits only, and that is what you need for basically any parsing of either machine formats or human text.
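To make the point above concrete, here is a minimal Raku sketch of a digits-only check written two ways. The helper names `is-unicode-digits` and `is-ascii-digits` and the sample strings are assumptions made up for illustration; they do not come from the post or from any library.

```raku
# Hypothetical helpers, named here only for illustration.

# \d matches any Unicode decimal digit (general category Nd).
sub is-unicode-digits(Str $s --> Bool) { so $s ~~ /^ \d+ $/ }

# An explicit character class restricts the match to ASCII 0-9.
sub is-ascii-digits(Str $s --> Bool) { so $s ~~ /^ <[0..9]>+ $/ }

# Sample inputs: ASCII, Devanagari, and fullwidth digits.
for '42', '४२', '４２' -> $s {
    say $s, ' unicode: ', is-unicode-digits($s), ' ascii: ', is-ascii-digits($s);
}
```

All three samples pass the \d check, but only '42' passes the ASCII one, so a validator built on \d accepts input that downstream ASCII-only code may not expect.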

It is exceedingly rare to want to match <:Nd> (I doubt anyone ever actually used that), and if you absolutely need to, well, you can say <:Nd>, or more likely some more specific range.

It won't even do for extracting numbers from natural language text, as the most common other numeral systems (Roman and Chinese numerals) don't match <:Nd>, since they reuse letters.

Juan Julián Merelo Guervós • Jan 5 '22

They don't really reuse letter codepoints; they use a different codepoint in Unicode. They match <:N> alright, and also <:Nl>:

raku -e 'say "Ⅻ " ~~ /<:Nl>/'
｢Ⅻ｣
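As a rough sketch of how these categories relate, the following compares the dedicated Roman numeral codepoint with the same numeral spelled in Latin letters. The sample strings are assumptions chosen for illustration.

```raku
# 'Ⅻ' is the single codepoint U+216B (category Nl, a letter number);
# 'XII' is three ordinary Latin letters; '7' is an ASCII digit (Nd).
for 'Ⅻ', 'XII', '7' -> $s {
    say $s,
        ' N: ',  so($s ~~ /<:N>/),    # any number category
        ' Nd: ', so($s ~~ /<:Nd>/),   # decimal digits only
        ' Nl: ', so($s ~~ /<:Nl>/);   # letter numbers only
}
```

The dedicated codepoint matches <:N> and <:Nl> but not <:Nd>, while the letter spelling matches none of the three, which is the case that matters for Roman numerals typed as plain letters.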

Tomasz Wegrzanowski • Jan 5 '22

Nice one, I didn't know they had separate characters for Roman numerals in Unicode. I don't think they're actually used in the wild much; still, nice.