Discussion on: 100 Languages Speedrun: Episode 68: Raku (Perl 6) Grammars


Elizabeth Mattijsen • Jan 27 '22

(also) see comments on /r/rakulang

Tomasz Wegrzanowski • Jan 27 '22

The comments are generally wrong:

Raku supports self-recursive regexps. I even used some in the previous episode, so I'm definitely aware of that. What it doesn't support is mutually recursive regexps, which is literally what I just said here. Mutually recursive regexps would be a lot more powerful.
I definitely blame Raku documentation. The word "PEG" isn't mentioned anywhere, and there are multiple kinds of PEG parsers. There's not a word about what kinds of rules are and aren't supported.
There are many kinds of PEG parsers, and some of them (paper example here) support left recursion, depending on how memoization is set up. Documentation says nothing about it.
Grammar::Tracer and Grammar::Debugger debug a single specific match string; they don't report issues with the grammar itself. They might be useful for very simple cases like what I had, but that's nowhere near proper grammar debugging tools.
Raku \d is still 100% broken, and I'm baffled by how many people defend it, but none of them can point to a single use case which would break by fixing it.

p6steve • Mar 10 '22 • Edited

Raku \d is still 100% broken, and I'm baffled by how many people defend it, but none of them can point to a single use case which would break by fixing it.
Hmmm - I hope that you will agree that perl brought a whole new generation of regexes to the fore (resulting in Perl Compatible Regular Expressions). Innovation in PCRE has, to put it mildly, stalled. Since raku (aka perl6) has less need for backward compatibility, I think that it deserves kudos for pushing regexes up to proper multilingual unicode breadth. Yes, it's a breaking change ... but it puts the Thai digit 3 (and all the others) on the same level as the English digit 3. True, that is no longer western centric.

Tomasz Wegrzanowski • Mar 10 '22

This isn't about some ideology of being whatever centric, it's about:

how common is "match exactly ASCII digits 0-9" vs "match anything Unicode calls a digit" (over 1000000:1 easily)
if you actually wanted "match anything Unicode calls a digit", was it hard before? (no it was not, Unicode property matches are super easy already)
how many bugs is it going to cause (a ridiculous number of them, as \d is super common, it always means "match exactly ASCII digits 0-9", and nobody will ever bother testing whether the regexp engine decided to change this)
was it worth breaking backwards compatibility? (obviously no)
also this. Yes, that's how most of the affected languages are written.

This really is a textbook example of a bug.
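The difference between the two readings of \d in the list above can be sketched in a few lines. This is only an illustration in Python, not Raku: Python's `re` module happens to treat \d on str patterns as "any Unicode decimal digit", so it shows the same split between an explicit `[0-9]` class and the Unicode-wide shorthand. The helper names are made up for this sketch.

```python
import re
import unicodedata

THAI_THREE = "\u0e53"  # THAI DIGIT THREE, Unicode category Nd

def is_ascii_digits(text):
    """True only if text is one or more of the ASCII digits 0-9."""
    return re.fullmatch(r"[0-9]+", text) is not None

def is_unicode_digits(text):
    """True if text is one or more Unicode decimal digits.

    Relies on Python's default behaviour for str patterns, where the
    digit shorthand covers every character in category Nd.
    """
    return re.fullmatch(r"\d+", text) is not None

def is_decimal_digit_by_property(ch):
    """Explicit property check that does not depend on any regex shorthand."""
    return unicodedata.category(ch) == "Nd"

for sample in ("3", THAI_THREE):
    print(unicodedata.name(sample),
          is_ascii_digits(sample),
          is_unicode_digits(sample))
```

Code that validates input with the Unicode-wide shorthand and then assumes it holds only 0-9 is the kind of mismatch being argued about here; which default is the right one is exactly the point in dispute.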

p6steve • Mar 11 '22 • Edited

Well - no. This is a feature that is a deliberate part of the design and is well documented: docs.raku.org/language/regexes#\d_....

A bug is when the software does not perform according to the specification. OK you do not like it, so say that. I get that you may not have time to learn new stuff when you are doing a "speedrun".

So let's say a bug is when there is some software out there that a new version of the compiler breaks. Well, no again - because no one is relying on this, since raku is a new language (albeit with deep roots in perl5).

You do not seem to be able to answer my main point - which is that PCRE is stagnant - with every language just blindly copying the perl4 implementation. Also that PCRE is biased to western / latin text.

Let me put it another way - unicode has this really cool set of features called properties that no other regex engine has been able to embrace. So let's say you want to design a new generation regex that supports unicode. Do you take the unicode definition of newline or the ascii definition or both? Sure raku regex is a "breaking change" to PCRE ... but it applies the KISS design principle and embraces all aspects of unicode in a single, unified approach. It eschews the idea that you should have a unicode mode and an ascii mode side by side. This is a good programming principle, right?

Raku has a very standardised design - so it applies the (very comprehensive) unicode properties to all the built in character classes - not just \d, but \w, \n, \c etc. So for a coder who values elegance and power that is standard regardless of local language, this is a better solution than a mode bit (or manual distinctions). So it may be that this is overkill just for \d ... but it is much more straightforward to have it the same everywhere.
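For readers who have not met the "mode bit" approach mentioned above, here is a rough sketch of what it looks like in practice. It uses Python's `re.ASCII` flag as a stand-in, since that is an engine where the same shorthand classes change meaning depending on a flag; it says nothing about how Raku implements its classes. The `matches` helper is hypothetical, written just for this example.

```python
import re

TAMIL_THREE = "\u0be9"  # TAMIL DIGIT THREE
TAMIL_TA = "\u0ba4"     # TAMIL LETTER TA

def matches(pattern, ch, ascii_only=False):
    """Check a one-character pattern with or without the ASCII mode flag."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(pattern, ch, flags) is not None

for label, ch in (("digit", TAMIL_THREE), ("letter", TAMIL_TA)):
    for pattern in (r"\d", r"\w"):
        print(label, pattern,
              "unicode:", matches(pattern, ch),
              "ascii mode:", matches(pattern, ch, ascii_only=True))
```

With the flag off, both classes follow the Unicode properties; with it on, the same patterns silently narrow to ASCII. That two-modes-side-by-side arrangement is what a single unified definition avoids.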

Your Ezhil example is cool, but you do not explain that \w matches (etc) can be done in raku, and you would be stuck regexing Tamil without that, right? And your example will fail if someone uses a Tamil digit char instead of a latin digit char?

What if there were a programming language that could do pretty much what Ezhil can do (yes, including localised / unicode operators in a sub language or 'slang') - but for every system of writing on the planet?

Oh - there is and it's called raku.

p6steve • Mar 10 '22

gist.github.com/raiph/32b3ba969b4e...

A major difference is that PEGs only support a deterministic ordered choice operator. Raku supports that but also a non-deterministic | LTM (Longest Token Matching) choice operator. Instead of trying a sequence of matches, and picking whichever alternative first matches, LTM picks whichever alternative matches the most input against the "declarative" start of its pattern. For example, matching the input aaa against . | .. | ... will match aaa, not a. This is a more natural, succinct, and algorithmically performant way to specify grammar rules than the simplistic ordered choice operator. (More algorithmically performant because the alternatives are compiled into an NFA.)
For more info about why this non-deterministic choice is significant, see the Parsing composed grammars section near the end of What are Raku Grammars? In particular, how do they compare with Parsing Expression Grammars (PEGs)?
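The aaa example in the quote can be reproduced with a small sketch. It is written in Python, whose `re` alternation is ordered, and it simulates longest-match choice by brute force: every alternative is tried and the longest result kept. That is an assumption made for illustration only; it does not model Raku's "declarative prefix" rule or the NFA compilation mentioned above, and both function names are hypothetical.

```python
import re

ALTERNATIVES = [r".", r"..", r"..."]

def ordered_choice(alternatives, text):
    """PEG-style ordered choice: the first alternative that matches wins."""
    for pattern in alternatives:
        m = re.match(pattern, text)
        if m:
            return m.group()
    return None

def longest_alternative(alternatives, text):
    """Longest-match choice: try every alternative, keep the longest match."""
    best = None
    for pattern in alternatives:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best)):
            best = m.group()
    return best

print(ordered_choice(ALTERNATIVES, "aaa"))       # a
print(longest_alternative(ALTERNATIVES, "aaa"))  # aaa
```

With ordered choice, the result depends on the order in which the alternatives are listed; with longest-match it does not, which is part of why it composes more easily when grammars are combined.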