Introduction to grammars with Raku

#perl6 #grammars #textprocessing #raku

In general, processing a text involves a lot of regular expressions plus splits plus general swearing, huffing and puffing. Grammars fix that. They take text on one side and spew data on the other side. They have been called (and will be called again) regexes on steroids, but while regexes focus on extracting and understanding parts of a text, grammars focus on analyzing the whole structure, understanding the syntax and how different parts of the text relate to each other.
Apparently, Perl6 (now known as Raku) is the only language that includes grammars as a basic feature, and it does so from the get-go, which is why I am going to use it for this post.

Also because there does not seem to be a lot of it on dev.to, so I might as well start with this.

So let's say we want to have a nano-markdown grammar of enhanced text; it's basically text which might or might not have * around specific words for enhancement. So we want to work with paragraphs such as this one:

"This includes one *enhanced* word"

The structure we want to gather from this would effectively be words which might or might not be enhanced. Later on we can do stuff with them, like converting into HTML, but right now what we want is the basic structure. We can express this in Perl6 like so:

grammar Enhanced-Paragraph {
    token TOP { <superword>[ (\s+) <superword>]+ }
    token superword { <word> | <enhanced-word> }
    token word { \w+ }
    token enhanced-word { \* <word> \* }
}

We've got to call it grammar, and Perl6 allows identifiers with dashes in the middle, so we're good with calling it Enhanced-Paragraph. We need the TOP token, which is where parsing starts by default; it says that this kind of paragraph consists of one superword followed by one or more groups, each made of whitespace (\s+, taking the syntax from regular expressions) and another superword; that is what the + behind the square brackets means. Tokens which are going to be defined later are presented between <>, and the body of each token is similar to a block of code surrounded by {}. We need to define tokens, which is what the grammar is composed of. Also rules and regexes, but for the time being, this is enough.
The superword token indicates that a paragraph can include either normal words or (that's what the | means) an enhanced-word. Once again, we can use dashes to name the tokens, no problem.

In fact, Perl6 can use all kinds of alphabetic Unicode symbols. More on that later on.

We go down to the deepest level, defining word, and we have to use a regular expression here: \w+, meaning one or more word characters (letters, digits or the underscore). 0 is a word, as well as why or þor, but not ----. And an enhanced-word would be just like a word, only surrounded by *, which we have to escape using \ since they have a meaning within grammars: 0 or more repetitions of a thing. Whitespace is not significant here; it's just used for making everything a bit more comprehensible.
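
To check claims like these one token at a time, the parse method accepts a :rule argument naming the token to start from instead of TOP. The following is only a sketch: matches-token is a helper name made up for this illustration, not something from the grammar itself.

```raku
grammar Enhanced-Paragraph {
    token TOP { <superword>[ (\s+) <superword>]+ }
    token superword { <word> | <enhanced-word> }
    token word { \w+ }
    token enhanced-word { \* <word> \* }
}

# Hypothetical helper (not part of the post): True if $text, as a whole,
# matches the token called $token.
sub matches-token(Str $text, Str $token --> Bool) {
    return so Enhanced-Paragraph.parse($text, :rule($token));
}

# The candidates mentioned in the text above.
for '0', 'why', 'þor', '----' -> $candidate {
    say $candidate, ' as word: ', matches-token($candidate, 'word');
}

# An enhanced-word needs the asterisks on both sides.
say '*mighty* as enhanced-word: ', matches-token('*mighty*', 'enhanced-word');
say 'mighty as enhanced-word: ',   matches-token('mighty', 'enhanced-word');
```

Since parse has to consume the whole string, ---- is rejected as a word, and a bare mighty is rejected as an enhanced-word.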

So this is it. We can use it right away like this:

grammar Enhanced-Paragraph {
    token TOP { <superword>[ (\s+) <superword>]+ }
    token superword { <word> | <enhanced-word> }
    token word { \w+ }
    token enhanced-word { \* <word> \* }
}

my $paragraph = "þor is *mighty*";
my $parsed = Enhanced-Paragraph.parse($paragraph);
say $parsed;

This will say:

｢þor is *mighty*｣
 superword => ｢þor｣
  word => ｢þor｣
 0 => ｢ ｣
 superword => ｢is｣
  word => ｢is｣
 0 => ｢ ｣
 superword => ｢*mighty*｣
  enhanced-word => ｢*mighty*｣
   word => ｢mighty｣

Giving you the lowdown of the structure: three components, one of which happens to be an enhanced word. Perl6 uses sigils for variables, with $ marking a scalar variable, which can hold any kind of value; also my, which is actually a scope declaration and makes it a lexical variable within the current scope. Meaning: don't care much about type, but use it right here.
We use grammars as if they were objects with the method parse: the grammar parses whatever is thrown its way and, if it understands it, creates a nice data structure, which we are storing in the variable $parsed. We can just print that variable; Perl6 will by default create a nice representation for us, just like the one we have seen. That data structure includes the input ｢þor is *mighty*｣, which uses a nice quoting construct that is also peculiar to Perl6, and then an array of key-value pairs that include as key the kind of construct (like superword) and as value whatever is inside it, once again using the corner quotes.
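
Printing is not the only thing we can do with that structure: the keys shown in the output can be used to index into it. Here is a minimal sketch that collects only the enhanced words; the sub name enhanced-words is an invention for this example, and it assumes the grammar exactly as defined above.

```raku
grammar Enhanced-Paragraph {
    token TOP { <superword>[ (\s+) <superword>]+ }
    token superword { <word> | <enhanced-word> }
    token word { \w+ }
    token enhanced-word { \* <word> \* }
}

# Hypothetical helper (not part of the post): list the words between asterisks.
sub enhanced-words(Str $text) {
    my $parsed = Enhanced-Paragraph.parse($text);
    return [] unless $parsed;    # nothing to report if the text did not parse

    my @found;
    # <superword> matched several times, so this key holds a list of matches.
    for $parsed<superword>.list -> $superword {
        # Only enhanced superwords carry an <enhanced-word> key.
        if $superword<enhanced-word> {
            @found.push: $superword<enhanced-word><word>.Str;
        }
    }
    return @found;
}

say enhanced-words("þor is *mighty*");
say enhanced-words("We *want* *cookies*");
```

Each level of the printed tree corresponds to one more key in the chain, so $superword<enhanced-word><word> reaches the word inside the asterisks.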

We can go a bit further and turn it into an executable:

#!/usr/bin/env perl6

grammar Enhanced-Paragraph {
    token TOP { <superword>[ (\s+) <superword>]+ }
    token superword { <word> | <enhanced-word> }
    token word { \w+ }
    token enhanced-word { \* <word> \* }
}

sub MAIN ( Str $paragraph ) {
    my $parsed = Enhanced-Paragraph.parse($paragraph);
    say $parsed;
}

sub MAIN in Perl6 declares the entry point of a script. It is not strictly needed, but it is nice if you want the command-line arguments to land in a variable with a particular type, like a Str or string in this case, which we assign to $paragraph. This function will destructure the arguments, a simple thing in this case, and put them into the variable for us to use.
It can be run after the chmod +x or whatever incantation is of use in your favourite operating system:

$ ./mini-grammar.p6 "We *want* *cookies*"
｢We *want* *cookies*｣
 superword => ｢We｣
  word => ｢We｣
 0 => ｢ ｣
 superword => ｢*want*｣
  enhanced-word => ｢*want*｣
   word => ｢want｣
 0 => ｢ ｣
 superword => ｢*cookies*｣
  enhanced-word => ｢*cookies*｣
   word => ｢cookies｣
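
The runs so far all use text that fits the grammar. When it does not, parse returns something false instead of a match, so it is worth checking before using the result. A small sketch of that check follows; report is a name made up for this example.

```raku
grammar Enhanced-Paragraph {
    token TOP { <superword>[ (\s+) <superword>]+ }
    token superword { <word> | <enhanced-word> }
    token word { \w+ }
    token enhanced-word { \* <word> \* }
}

# Hypothetical helper (not part of the post): summarize the outcome of a parse.
sub report(Str $paragraph --> Str) {
    my $parsed = Enhanced-Paragraph.parse($paragraph);
    # A failed parse is false in boolean context.
    return $parsed
        ?? "parsed {$parsed<superword>.elems} superwords"
        !! "could not parse: $paragraph";
}

say report("We *want* *cookies*");
say report("cookies");              # a single word: TOP asks for at least two superwords
say report("We want -cookies-");    # dashes are neither word characters nor asterisks
```

The single-word case is a consequence of the + after the square brackets in TOP: at least one whitespace-plus-superword group has to follow the first superword.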

We have done nothing but print the resulting structure so far, but the objective of this short (though actually longer than I intended to begin with) post was to show how grammars turn meaningless text into structure. More on this later on. Maybe.

Acknowledgements

And many thanks to Moritz, who explained to me how actions worked and fixed my code here

Don't leave just yet

Continue to the next installment in the series on matching things with Perl 6 grammars

Top comments (1)

habere-et-dispertire • Mar 4 '24

Thank you for this gentle and informative introduction to raku grammars. 😀

I tried to add clarity to the example by using the tilde for nesting structures and two predefined character classes of <space> and <alpha> :

grammar Enhanced-Paragraph {
    token TOP           { <super-word> [ <space> + <super-word> ] + }
    token super-word    { <word> | <enhanced-word> }
    token word          { <alpha> +                }
    token enhanced-word { \* ~ \* <word>           }
}

I was not sure why there were parentheses, ( ), capturing the whitespace, so I omitted them. I also added whitespace around the + quantifier because that reads better for me.

The output then reads :

｢þor is *mighty*｣
 super-word => ｢þor｣
  word => ｢þor｣
   alpha => ｢þ｣
   alpha => ｢o｣
   alpha => ｢r｣
 space => ｢ ｣
 super-word => ｢is｣
  word => ｢is｣
   alpha => ｢i｣
   alpha => ｢s｣
 space => ｢ ｣
 super-word => ｢*mighty*｣
  enhanced-word => ｢*mighty*｣
   word => ｢mighty｣
    alpha => ｢m｣
    alpha => ｢i｣
    alpha => ｢g｣
    alpha => ｢h｣
    alpha => ｢t｣
    alpha => ｢y｣