Discussion on: 100 Languages Speedrun: Episode 47: Raku (Perl 6) Regular Expressions

And here I thought that documenting a bug would turn it into a feature? :-)

But seriously, all of the matching in Raku is based on Unicode properties. So why should \d be any different? And Tamil programmers that are used to using ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ will be equally surprised to see 0 1 2 3 4 5 6 7 8 9 match \d. Welcome to a world where all is not ASCII!

More generally, when you are working with text, all of us in IT will need to get used to the idea that all is not ASCII. You may argue that using \d is a gigantic WAT? But I'd argue, it should be an eye opener. In that respect, thank you for POINTING THIS OUT in your blog post :-).

I just hope that people will continue to read after "I found a massive bug in Raku Regular Expressions". :-(

Tomasz Wegrzanowski • Jan 5 '22

It doesn't even handle the two most popular number systems (二十 or MMXXII). Meanwhile even actual Tamils use regular ASCII numbers as you can see.

The Raku \d is simply a bug, and it will only cause problems. It's even worse than a 0 prefix turning on octal, which quite a few languages do.

Elizabeth Mattijsen • Jan 5 '22

Actually, Raku is flexible enough to allow for Roman Numerals in a module: Slang::Roman. And who knows, it might actually make it into the language at some point.

Showing a Tamil language page that does not use Tamil numerals is only proof of the fact that at least some Tamil pages do not use Tamil numerals. It does not prove there aren't any other pages that do use Tamil numerals. And there are other uses of text besides the Web :-)

Also, Tamil was just an example. There are about 50 languages in the Unicode standard with their own numeric representation.

Re "二十", yes, perhaps we should make a slang for that as well. Anyone up for Slang::CJK?

Comparing a well-thought-out behaviour of a feature in Raku with a mistake made in the past feels like a disservice. You can disagree with the decision behind this behaviour, but considering it a bug is wrong:

From Wikipedia: "A software bug is an error, flaw or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways". The result is not incorrect, nor unexpected, nor unintended.

Pardon the analogy, but you're like a driver used to driving on the right side of the road, suddenly needing to drive a car with the steering wheel on the right-hand side of the car. And then wondering why the windshield wipers switch on in the very first turn that you need to make.

Tomasz Wegrzanowski • Jan 5 '22

docs.raku.org/language/regexes thinks "௩.௩.௩.௩" is a valid IP address, and good luck pinging that.

Even the people who wrote the page explaining the Raku \d can't actually follow how the broken \d works, and assume it works on ASCII only in the rest of the document.
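To make the IP address point concrete, here is a minimal sketch in Python rather than Raku. It relies on the fact that Python 3's re module also treats \d as "any Unicode decimal digit" for str patterns unless re.ASCII is given. The pattern and the function name looks_like_ipv4 are hypothetical and are not the example from the Raku documentation.

```python
import re

# Hypothetical naive dotted-quad pattern, for illustration only.
# It checks the shape of the string, not the 0..255 range of each field.
DOTTED_QUAD = r"\d{1,3}(?:\.\d{1,3}){3}"

def looks_like_ipv4(text, ascii_only=False):
    """Return True if the whole string matches the naive pattern."""
    # re.ASCII restricts \d to [0-9]; without it \d matches any
    # character in Unicode category Nd.
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(DOTTED_QUAD, text, flags) is not None

tamil = "௩.௩.௩.௩"
print(looks_like_ipv4(tamil))                       # Unicode-aware \d: True
print(looks_like_ipv4(tamil, ascii_only=True))      # ASCII-only \d: False
print(looks_like_ipv4("3.3.3.3", ascii_only=True))  # True
```

The same pattern accepts or rejects the Tamil string depending only on which definition of \d is in effect, which is exactly the disagreement in this thread.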

Elizabeth Mattijsen • Jan 5 '22

Then the documentation is where the error is.

Tomasz Wegrzanowski • Jan 5 '22

Can you find even a single actual Raku program out there, where \d is used, and it intentionally means <:Nd> and it would break the program if \d was changed to match <[0..9]>?

Unfortunately GitHub code search can't handle special characters like backslash so it can't search for \d directly, and it confuses Raku with Perl 5 when filtering, but here's a start: https://github.com/search?q=filename%3A%22*.raku%22+language%3ARaku&type=Code

Just clicking randomly I see a lot of \d, and ALL of them assume that \d will be ASCII digits. Explicit <[0..9]> are very rare. Anyone wanting <:Nd>? I haven't found a single case yet.

Vadim Belman • Jan 6 '22

All I can say about pinging the IP is that your parser is just not able to convert the representation into an unsigned 32-bit integer. But that doesn't mean that no other parser is capable of this. Suffice it to say that dotted notation is a convention. Network addresses are just numbers by nature.
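As a rough illustration of that point, here is a sketch in Python (not Raku) of a parser that does convert such a representation. It assumes Python 3, whose built-in int() accepts any Unicode decimal digits; the function name dotted_to_u32 is hypothetical.

```python
def dotted_to_u32(address):
    """Convert dotted notation to an unsigned 32-bit integer."""
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated fields")
    value = 0
    for part in parts:
        # int() accepts Unicode decimal digits, so "௩" parses as 3.
        octet = int(part)
        if not 0 <= octet <= 255:
            raise ValueError("field out of range 0..255")
        # Shift the accumulated value left one byte and add this field.
        value = (value << 8) | octet
    return value

# Both spellings denote the same number.
print(dotted_to_u32("௩.௩.௩.௩") == dotted_to_u32("3.3.3.3"))  # True
```

Whether a tool such as ping accepts that spelling is a separate question. The sketch only shows that the conversion itself is straightforward once the digits are read by their numeric value.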

Juan Julián Merelo Guervós • Jan 5 '22

But is there any basis for calling it a bug other than that classically \d has matched only Latin decimal digits? (if only because there were no others). At the end of the day, there does not seem to be a standard (other than maybe PCRE, which is a de facto standard?) so making \d == [0..9] or \d == <:Nd> is simply a judgment call. As long as it's properly documented, we're good with that, I guess.

Tomasz Wegrzanowski • Jan 5 '22

It is a bug, because \d is extremely well established to match [0-9]. It is about the most common regexp escape code, programmers will rely on it, and it "almost" works.

I think the fact that Raku documentation has this issue, on the same page even, pretty much proves it. According to that documentation ௩.௩.௩.௩ is a valid IP address.

<:Nd> is such a rare thing you'd have a lot of trouble coming up with a single use case for it. If you think it can find numbers in text whose language you don't know (and how often is that a thing?), it won't even do that (Chinese and Roman numbers being the most obvious). And if you somehow come up with a super rare use case for <:Nd>, you can use <:Nd> - or more likely some much more specific character class like /<:Nd>&<:Tamil>/.
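The category argument can be checked with a small sketch, again in Python rather than Raku and assuming Python 3's unicodedata and re modules. The sample characters and the helper names are illustrative choices. is_tamil_digit is only a rough stand-in for the idea behind <:Nd>&<:Tamil>, since Python's re has no Unicode property classes.

```python
import re
import unicodedata

def is_backslash_d(ch):
    """True if a single character matches Python's Unicode-aware \\d."""
    return re.fullmatch(r"\d", ch) is not None

def is_tamil_digit(ch):
    """Rough analogue of <:Nd>&<:Tamil>: a decimal digit whose
    Unicode name starts with TAMIL DIGIT."""
    return (unicodedata.category(ch) == "Nd"
            and unicodedata.name(ch, "").startswith("TAMIL DIGIT"))

# ASCII digit, Tamil digit, CJK numeral two, Roman numeral twelve.
for ch in ["7", "௩", "二", "Ⅻ"]:
    print(ch, unicodedata.category(ch), is_backslash_d(ch), is_tamil_digit(ch))
```

The CJK numeral is in category Lo and the Roman numeral character in Nl, so neither is matched by a digit class built on Nd.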

Elizabeth Mattijsen • Jan 5 '22

Well, I guess there is one place where \d is intended to match all numerics. And that's the grammar that Raku uses to parse Raku source code. Which allows the example Jonathan Stowe gave to work.

I see that we will not agree on whether the current behaviour is correct or not (even though apparently Raku is not the only one).

I'm looking forward to your coverage of Raku grammars. :-)