I covered Raku and Raku regular expressions before, now it's time to complete this with an episode about Raku grammars.
Processing text is one of the most common tasks in programming, so most languages since Perl come with some kind of regular expression system. Generally the basic regexps are the same, but each language comes with some unique extensions. Raku notably does not keep the traditional regexp syntax even for the common things, but conceptually they're mostly the same.
A regular expression can extract information from a string based on some pattern, but often we need something more advanced. So a lot of languages also have some way to define tokenizers, which chop the string into pieces based on multiple regular expressions. Notable examples are Perl's /gc flag and Ruby's StringScanner.
Raku takes it one step further, and supports defining grammars as part of the language. Pretty much every other language needs external tools like Yacc or ANTLR for that.
Named regular expressions
Before we get to grammars, let's take a tiny detour to regular expressions. In addition to defining them in one go, we can also define them in pieces:
#!/usr/bin/env raku

my regex digit { <[0..9]> };
my regex nonzero_digit { <[1..9]> };
my regex byte {
  [
  | <digit>                 # 0-9
  | <nonzero_digit> <digit> # 10-99
  | 1 <digit> <digit>       # 100-199
  | 2 <[0..4]> <digit>      # 200-249
  | 25 <[0..5]>             # 250-255
  ]
};
my regex ipv4 { ^ <byte> ** 4 % "." $ };

my @examples = (
  "1.2.3.4",
  "127.0.0.1",
  "100.200.300.400",
  "02.03.04.05",
  "1.2.3.4.5",
  "it's a kitty!",
  "69.420.69.420",
  "10.20.30.40 ",
  "127.0.0.1.",
  "๓.๓.๓.๓", # \d bug still there
);

for @examples {
  say "VALID: '$_'" if $_ ~~ &ipv4;
  say "NOT VALID: '$_'" if $_ !~~ &ipv4;
}
Initially I thought I could even define some mutually recursive regular expressions, and get simple parsers this way, but that is not actually supported.
Notice the unusual sigils we need to use in this case - <digit> inside a regular expression, and &ipv4 outside it.
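Since named regexes are first-class values, we can also store them in data structures and pick one at runtime. This is a small sketch of my own (the %validators hash and its keys are made up for illustration), not something from the original examples:

```raku
#!/usr/bin/env raku

# Named regexes are regular values, referenced with the & sigil,
# so we can keep them in a hash and dispatch on them dynamically.
my regex digits { ^ <[0..9]>+ $ };
my regex letters { ^ <[a..zA..Z]>+ $ };

my %validators = number => &digits, word => &letters;

for <123 cat c4t> -> $s {
    # collect the names of every validator this string matches
    my @kinds = %validators.keys.grep({ $s ~~ %validators{$_} });
    say "$s: @kinds[]";
}
```

The same lookup-by-name also works inside patterns: a `<digits>` call in another regex will find the lexical `my regex digits`.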
Key Value Grammar
Let's start with an extremely simple grammar, one that matches a single key-value pair, with a single word on each side:
#!/usr/bin/env raku

grammar KV {
  rule TOP { ^ <key> "=" <value> $ };
  regex id { <[a..zA..Z0..9]> + };
  regex key { <id> };
  regex value { <id> };
};
my @examples = (
  "cat=cute",
  "4=5",
  "cat = fluffy",
  "cat",
  "cat=dog=ferret",
  "c a t = d o g",
);

for @examples {
  if KV.parse($_) {
    say "PARSED: $_";
    say "MATCH: $/";
    say "KEY IS: $/.<key>";
    say "VALUE IS: $/.<value>";
    say "PARSE TREE:";
    say $/;
  } else {
    say "FAILED: $_";
  }
  say "";
}
$ ./kv.raku
PARSED: cat=cute
MATCH: cat=cute
KEY IS: cat
VALUE IS: cute
PARSE TREE:
「cat=cute」
 key => 「cat」
  id => 「cat」
 value => 「cute」
  id => 「cute」

PARSED: 4=5
MATCH: 4=5
KEY IS: 4
VALUE IS: 5
PARSE TREE:
「4=5」
 key => 「4」
  id => 「4」
 value => 「5」
  id => 「5」

PARSED: cat = fluffy
MATCH: cat = fluffy
KEY IS: cat
VALUE IS: fluffy
PARSE TREE:
「cat = fluffy」
 key => 「cat」
  id => 「cat」
 value => 「fluffy」
  id => 「fluffy」

FAILED: cat
FAILED: cat=dog=ferret
FAILED: c a t = d o g
Let's cover everything that's going on here, and it's a lot:
- we can define a grammar, sort of like we declare a class
- tokens are declared with regex (or token, which has slightly different backtracking rules)
- parse rules are defined with rule, and can refer to other rules, named regexps, or just plain strings
- the parse rule TOP is the default rule for the whole grammar
- by default, whitespace is allowed between tokens - notice that cat = fluffy parsed even though we didn't say anything about optional spaces
- this does not happen within a regex - notice how c a t = d o g failed to parse
- we can override this whitespace parsing by redefining the ws regexp
- we can use KV.parse($string) or KV.parsefile($filename) to parse a string or a file
- the parse result is returned from KV.parse, and also saved to $/, the same variable used for regexp matching
- we can navigate the parse tree with .<name>; we can also stringify any part of it, which returns the matched text
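To illustrate the ws override mentioned above, here's a small sketch of mine (the KVStrict name is made up): redefining ws to horizontal whitespace only, so newlines between tokens are no longer accepted.

```raku
#!/usr/bin/env raku

# Overriding the implicit whitespace rule: ws here matches only
# spaces/tabs (\h), so a newline between tokens becomes a parse error.
grammar KVStrict {
    rule TOP { ^ <key> "=" <value> $ };
    token ws { \h* };
    regex key { <[a..zA..Z0..9]>+ };
    regex value { <[a..zA..Z0..9]>+ };
};

say KVStrict.parse("cat = cute") ?? "parsed" !! "failed";  # spaces still allowed
say KVStrict.parse("cat =\ncute") ?? "parsed" !! "failed"; # newline now rejected
```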
If we use grammars this way, we don't really gain much over just using regular expressions.
Actions
The second part of parsing is actions associated with each rule. In many parsing languages, grammars and their actions are defined together. In Raku they're defined separately, which makes it easy to have multiple sets of actions for the same grammar.
#!/usr/bin/env raku

grammar KV {
  rule TOP { ^ <key> "=" <value> $ };
  regex id { <[a..zA..Z0..9]> + };
  regex key { <id> };
  regex value { <id> };
};

class KVActions {
  method TOP($/) {
    make { $/.<key>.made => $/.<value>.made }
  }
  method key($/) { make $/.<id>.made }
  method value($/) { make $/.<id>.made }
  method id($/) { make $/.Str }
};

my @examples = (
  "cat=cute",
  "4=5",
  "cat = fluffy",
  "cat",
  "cat=dog=ferret",
  "c a t = d o g",
);

for @examples {
  if KV.parse($_, actions => KVActions) {
    say "PARSED: $_";
    say "RESULT: ", $/.made;
  } else {
    say "FAILED: $_";
  }
  say "";
}
$ ./actions.raku
PARSED: cat=cute
RESULT: {cat => cute}
PARSED: 4=5
RESULT: {4 => 5}
PARSED: cat = fluffy
RESULT: {cat => fluffy}
FAILED: cat
FAILED: cat=dog=ferret
FAILED: c a t = d o g
What's going on:
- we define actions with class KVActions { }, and each action with method name($/) { ... }, with name corresponding to a rule in the grammar
- by the way, this is a regular class, so you can do regular OO stuff like inheriting from it, overriding specific actions, metaprogramming etc.
- there are many ways to use it, but generally to create a value you do make somedata, and it will return a match object with somedata associated with the given rule
- you can access subtrees and their associated data with $/.<subrule>.made
- you can access the text matched by different subtrees with $/.<subrule>.Str
- once you parse the whole tree, you can do $/.made to access the result of the whole parse
- this is just the usual way of doing things; there are of course many other ways
RPN Calculator
An RPN calculator grammar is very easy - it's just a list of tokens, with optional spaces between them ignored. We could use a plain tokenizer for this, but let's use a Raku grammar.
#!/usr/bin/env raku

grammar RPN {
  rule TOP { ^ <tok>* $ };
  rule tok { <number> | <op> };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" | "-" | "*" | "/" | "in" | "out" };
}

class RPNActions {
  method TOP($/) { make [$<tok>.map(*.made)] }
  method tok($/) { make $<number> ?? $<number>.made !! $<op>.made }
  method number($/) { make {type => "number", value => +$/.Str} }
  method op($/) { make {type => $/.Str} }
}

sub run($m) {
  my @stack = ();
  for @$m {
    my $type = $_{'type'};
    my $value = $_{'value'};
    given $type {
      when "number" { @stack.push($value) }
      when "in" { @stack.push(+$*IN.get) }
      when "/" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b / $a) }
      when "*" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b * $a) }
      when "+" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b + $a) }
      when "-" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b - $a) }
      when "out" { say @stack.pop }
      default { die "Unknown operator $type" }
    }
  }
}

unless @*ARGS == 1 {
  note "Usage: $*PROGRAM-NAME <file.rpn>";
  exit 1;
}

my $path = @*ARGS[0];
if RPN.parsefile($path, actions => RPNActions) {
  run $/.made
} else {
  note "Parse error";
}
$ ./rpn.raku rpn/temperature_c_to_f.rpn
0
32
$ ./rpn.raku rpn/temperature_c_to_f.rpn
100
212
$ ./rpn.raku rpn/temperature_c_to_f.rpn
25.7
78.26
$ ./rpn.raku rpn/mile_to_km.rpn
1000
1609.34
What's going on:
- the grammar has very few simple rules
- optional whitespace between tokens is implied, because we didn't turn it off. I don't think this should really be the default, as it makes the grammar ambiguous (what is whitespace? just spaces and tabs? newlines? any exotic Unicode stuff?), but it's useful for these toy cases
- $<tok> is the same as $/.<tok>, for both normal regexps and grammars
- the actions for op and number are simple enough; the prefix + operator in Raku converts to a number
- the action for tok already looks bad with just two cases, and it would look awful if we listed every kind of op separately. Raku has better syntax to deal with such alternatives, because it absolutely needs it.
- the action for TOP is interesting - as we had <tok>*, $<tok> is already an array, and as we only care about the tokens' made values, we just .map(*.made) it. Due to Raku having the Perl-style scalar vs list context distinction, we need to wrap that in extra []s.
- the whole process returns a single $/, with an array of tokens as its made value, as that's what we constructed
- it's not really part of the grammar, but I added a simple run function to run our RPN program
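The "better syntax for alternatives" mentioned above is proto tokens: declare a proto token, add one candidate per alternative, and each candidate can get its own action method. This is a sketch of mine with invented names (Ops, OpsActions) rather than a rewrite of the RPN grammar:

```raku
#!/usr/bin/env raku

# Proto tokens: each op:sym<...> candidate is tried automatically,
# and <sym> matches the literal named in the sym<...> part.
grammar Ops {
    token TOP { <op> }
    proto token op {*}
    token op:sym<+>   { <sym> }
    token op:sym<->   { <sym> }
    token op:sym<out> { <sym> }
}

class OpsActions {
    method TOP($/) { make $<op>.made }
    # one action per candidate, instead of one giant ?? !! chain
    method op:sym<+>($/)   { make "plus" }
    method op:sym<->($/)   { make "minus" }
    method op:sym<out>($/) { make "print" }
}

Ops.parse("+", actions => OpsActions);
say $/.made;  # plus
```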
Parser Feedback
OK, let's do something more complicated, and also let's raise our expectations of what we want the parser to do. Here's a simple grammar for sum expressions - and yes, it is ambiguous, as 2+3+4 can be parsed as either 2+(3+4) or (2+3)+4. But we won't be feeding it any such input:
#!/usr/bin/env raku

grammar Math {
  rule TOP { ^ <expr> $ };
  rule expr { <number> | <parens> | <operation> };
  rule parens { "(" <expr> ")" };
  rule operation { <expr> <op> <expr> };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" };
};

my @examples = (
  "100",
  "-4.20",
  "((5))",
  "2+3",
  "(2+3)",
);

for @examples {
  if Math.parse($_) {
    say $/
  } else {
    note "Parse error: $_"
  }
}
Let's give it a try:
$ ./math.raku
「100」
 expr => 「100」
  number => 「100」
「-4.20」
 expr => 「-4.20」
  number => 「-4.20」
「((5))」
 expr => 「((5))」
  parens => 「((5))」
   expr => 「(5)」
    parens => 「(5)」
     expr => 「5」
      number => 「5」
Parse error: 2+3
^C
It completely fails to parse 2+3, and it just hangs on (2+3). What?
Grammar Debugging
I was really baffled by what was going on. I could make the grammar completely unambiguous in a few ways, but that didn't solve the problem at all - different grammars I'd write would result in different baffling issues.
Eventually it occurred to me that this is neither the LL-style nor the LR-style parsing I was expecting, but PEG, which is highly sensitive to the order of alternatives. It basically YOLOs any ambiguity by always picking the first matching alternative as soon as it can, even if it then can't match the rest. I only have experience with LR and LL parsing, not PEG, so I dropped my idea of writing any big grammars in Raku.
Anyway, the problem isn't that I ran into such issues, or that it's PEG parsing. The problem is that traditional parser generator tools normally come with tools to debug grammar issues; and if you write a recursive descent parser by hand, it's just functions calling other functions, which you can debug the usual ways. Raku just lacks such tools. If you have problems with your grammar, tough luck.
And while this problem would apply to every kind of parsing, PEG grammars are far more sensitive to tiny changes than traditional LL and LR grammars, so you'll need debugging tools more.
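To be fair, the ecosystem (not core Raku) has some per-match tracing: the Grammar::Debugger distribution ships a Grammar::Tracer pragma that prints every rule attempt as a single parse runs. It traces one input string rather than analyzing the grammar itself, so it's at best a partial substitute for real grammar debugging tools. A sketch, assuming the module is installed (zef install Grammar::Debugger):

```raku
#!/usr/bin/env raku

# Ecosystem module, not part of core Raku: loading this pragma makes
# every grammar declared after it print a tree of rule attempts
# (successes and failures) while parsing.
use Grammar::Tracer;

grammar Math {
    rule TOP { ^ <expr> $ };
    rule expr { <number> | <parens> | <operation> };
    rule parens { "(" <expr> ")" };
    rule operation { <expr> <op> <expr> };
    regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
    regex op { "+" };
};

# The trace output shows which alternatives expr tries for this input
# and where the match dies, which at least localizes the failure.
Math.parse("2+3");
```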
Working Sums Grammar
I eventually figured out one way to make PEG-style grammars work for simple sums. The problem was that Raku was of no help, and I had to figure it out from theory alone:
#!/usr/bin/env raku

grammar Math {
  rule TOP { ^ <expr> $ };
  rule expr { <value> + % <op> };
  rule value { <parens> | <number> };
  rule parens { "(" <expr> ")" };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" };
};

my @examples = (
  "100",
  "-4.20",
  "((5))",
  "2+3",
  "(2+3)",
  "2+3+4",
);

for @examples {
  if Math.parse($_) {
    say $/
  } else {
    note "Parse error: $_"
  }
}
$ ./math2.raku
「100」
 expr => 「100」
  value => 「100」
   number => 「100」
「-4.20」
 expr => 「-4.20」
  value => 「-4.20」
   number => 「-4.20」
「((5))」
 expr => 「((5))」
  value => 「((5))」
   parens => 「((5))」
    expr => 「(5)」
     value => 「(5)」
      parens => 「(5)」
       expr => 「5」
        value => 「5」
         number => 「5」
「2+3」
 expr => 「2+3」
  value => 「2」
   number => 「2」
  op => 「+」
  value => 「3」
   number => 「3」
「(2+3)」
 expr => 「(2+3)」
  value => 「(2+3)」
   parens => 「(2+3)」
    expr => 「2+3」
     value => 「2」
      number => 「2」
     op => 「+」
     value => 「3」
      number => 「3」
「2+3+4」
 expr => 「2+3+4」
  value => 「2」
   number => 「2」
  op => 「+」
  value => 「3」
   number => 「3」
  op => 「+」
  value => 「4」
   number => 「4」
This could be extended to other operators, and in the end my solution is somewhat similar to what an LL grammar would look like.
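One way the extension to more operators could look, sketched under my own naming (addop/mulop and the extra term level are mine): each precedence level is just "the next tighter level, separated by that level's operators".

```raku
#!/usr/bin/env raku

# Precedence via layered rules: expr is terms separated by +/-,
# term is values separated by */", and parens restart from expr.
grammar Math {
    rule TOP { ^ <expr> $ };
    rule expr { <term> + % <addop> };
    rule term { <value> + % <mulop> };
    rule value { <parens> | <number> };
    rule parens { "(" <expr> ")" };
    regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
    regex addop { "+" | "-" };
    regex mulop { "*" | "/" };
};

say Math.parse("2+3*4") ?? "parsed" !! "failed";
say Math.parse("(2+3)*4") ?? "parsed" !! "failed";
```

In the parse tree, `*` and `/` now bind tighter than `+` and `-` simply because they sit one level deeper.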
Should you use Raku Grammars?
I definitely support the idea of adding grammars to the language, and it's already useful enough for libraries like JSON::Tiny, but right now I don't think the execution is good enough yet.
You, as the grammar writer, won't get any feedback about grammar issues. And users won't get any feedback about why their input isn't matching.
For anyone interested in implementing grammar parsing in their own language, I'd recommend going the LL(*) route like ANTLR does, not the PEG route. PEG grammars can be good enough for some things, but they're much harder to write and debug, and a lot more restrictive. Either way, you really need to work on your tooling a lot more than Raku did.
Code
All code examples for the series will be in this repository.
Code for the Raku (Perl 6) Grammars episode is available here.
Top comments (9)
There is a convenient match method that makes this less cryptic for non Perl/Raku users:
As for the \d "bug" - that behavior is also the Perl norm, which can be explicitly disabled with the /a switch: perldoc.perl.org/perlre#/a-(and-/aa). So I don't think it is fair to blame Raku for breaking backward compatibility. However, I think Raku should get an a switch as well, to be more user friendly.
switch as well to be more user friendly.This is a very recent optional bug in Perl which Raku mindlessly copied. Back when Perl was widely used,
\d
simply meant[0-9]
, and it still does by default in Perl.In PCRE
\d
always meant[0-9]
, in every mode (all UTF-* modes too), and so did most languages derived from Perl regexps like JavaScript, Ruby etc., in every encoding. This isn't anything specific to one encoding or another,[0-9]
is just an extremely common pattern, including with Unicode data, so it makes so much sense that a shortcut for it was added.And there are virtually no use cases for buggy
\d
. There are no protocols or languages or such where Unicode propertyNd
is what people need, and if that somehow happens, they can ask for unicode propertyNd
. For 99.9999% of programs using\d
, it will just match more than author intended due to the\d
bug. Even if I somehow wanted to matchNd
in Raku or other language with broken\d
, I'd just requestNd
property check to make it clear that this is one of the 0.0001% of situations where I want the Unicode property.The only choices language has are - either have correct
\d
meaning[0-9]
, or not have\d
and tell people to use[0-9]
. Buggy\d
is not acceptable choice.I see and egg and chicken issue here. You say that
In PCRE \d always meant [0-9]
, but you forget that Perl defined Perl Compatible Regular Expressions, not the other way around. So PCRE should either follow Perl current behavior or fork into something like OPCRE - Old/Outdated Perl Compatible Regular Expressions to keep\d == [0-9]
. One cannot claim to be "compatible" and at the same time ignore current state of thing one was derived from.As for usecases - you are too much focused on your numeric system. Using the same arguments American can say that
\w+
matching 'zażółć' is a "bug" because there are no other meaningful letters except abcdefghijklmnopqrstuwvxyz. There are many cultures using different numbers - for example in arabic you can see both "100" and "۱۰۰" used. Digit is a digit and I find new behavior to be more correct / consistent.(also) see comments on /r/rakulang
The comments are generally wrong: Grammar::Tracer and Grammar::Debugger debug a single specific match string; they don't report issues with the grammar itself. They might be useful for very simple cases like what I had, but that's nowhere near a proper grammar debugging tool.
gist.github.com/raiph/32b3ba969b4e...
A major difference is that PEGs only support a deterministic ordered choice operator. Raku supports that, but also a non-deterministic | LTM (Longest Token Matching) choice operator. Instead of trying a sequence of matches and picking whichever alternative matches first, LTM picks whichever alternative matches the most input against the "declarative" start of its pattern. For example, matching the input aaa against . | .. | ... will match aaa, not a. This is a more natural, succinct, and algorithmically performant way to specify grammar rules than the simplistic ordered choice operator. (More algorithmically performant because the alternatives are compiled into an NFA.)
For more info about why this non-deterministic choice is significant, see the Parsing composed grammars section near the end of What are Raku Grammars? In particular, how do they compare with Parsing Expression Grammars (PEGs)?
Raku \d is still 100% broken, and I'm baffled by how many people defend it, but none of them can point to a single use case which would break by fixing it.
Hmmm - I hope that you will agree that Perl brought a whole new generation of regexes to the fore (resulting in Perl Compatible Regular Expressions). Innovation in PCRE is, to put it mildly, stalled. Since Raku (aka Perl 6) has less need for backward compatibility, I think it deserves kudos for pushing regexes up to proper multilingual Unicode breadth. Yes, it's a breaking change... but it puts the Thai digit 3 (and all the others) on the same level as the English digit 3. True, it is no longer western-centric.
This isn't about some ideology of being whatever-centric. It's about this: \d is super common, it has always meant "match exactly ASCII digits 0-9", and nobody will ever bother testing whether the regexp engine decided to change this. This really is a textbook example of a bug.
Well - no. This is a feature that is a deliberate part of the design and is well documented: docs.raku.org/language/regexes#\d_....
A bug is when the software does not perform according to the specification. OK, you do not like it, so say that. I get that you may not have time to learn new stuff when you are doing a "speedrun".
So let's say a bug is when there is some software out there that a new version of the compiler breaks. Well, no again - because no one is relying on this, since Raku is a new language (albeit with deep roots in Perl 5).
You do not seem to be able to answer my main point - which is that PCRE is stagnant - with every language just blindly copying the perl4 implementation. Also that PCRE is biased to western / latin text.
Let me put it another way - unicode has this really cool set of features called properties that no other regex engine has been able to embrace. So let's say you want to design a new generation regex that supports unicode. Do you take the unicode definition of newline or the ascii definition or both? Sure raku regex is a "breaking change" to PCRE ... but it applies the KISS design principle and embraces all aspects of unicode in a single, unified approach. It eschews the idea that you should have a unicode mode and an ascii mode side by side. This is a good programming principle, right?
Raku has a very standardised design - so it applies the (very comprehensive) Unicode properties to all the built-in character classes - not just \d, but \w, \n, \c etc. So for a coder who values elegance and power that is standard regardless of local language, this is a better solution than a mode bit (or manual distinctions). It may be that this is overkill just for \d... but it is much more straightforward to have it the same everywhere.
Your Ezhil example is cool, but you do not explain that \w matches (etc) can be done in raku, and you would be stuck regexing Tamil without that, right? And your example will fail if someone uses a Tamil digit char instead of a latin digit char?
What if there were a programming language that could do pretty much what Ezhil can do (yes, including localised / Unicode operators in a sub-language or 'slang') - but for every system of writing on the planet?
Oh - there is and it's called raku.