Juan Julián Merelo Guervós

Role-ing on the grammars

A blackboard does chalk and graffiti
We already know that Perl6 does roles. But this series is about grammars, so sooner or later we had to find them in an article, right?

Using roles in grammars

We already know that grammars are actually classes, a particular kind of class that returns Matches when parsing a text. But the marked-up texts we are dealing with are actually combinations of many different elements. Paragraphs are made of (maybe enhanced) words, for instance. We do not need to create them hierarchically: we can mix and match the word role in different markdown parsers, from the simplest to the most complicated. And we can create a parser for a semi-decent markdown-like little language like this, starting with the word.

All scripts for this series of articles are on GitHub.

role like-a-word {
    regex like-a-word { «\H+» }
}

role declares, you guessed it, a role. But instead of populating it with methods, we use grammar stuff like regex. Grammar roles are just roles.
And regexen are just regular expressions, in the same way we saw them in the Match article. They match things. Tokens do that too; that is what we have used so far. But tokens do not backtrack: once they have started to match and find something that does not correspond to the rule, they fail and do not go back and say, wait, maybe it matches this little other thing. Regexes do go back and try again. In a word, they behave just like regular regular expressions.

I always wanted to say that.
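The difference between the two can be sketched with a toy pattern. This is only an illustration: the names with-backtracking and no-backtracking are made up for this example and are not part of the article's scripts.

```raku
# Same pattern, declared once as a backtracking regex and once as a token.
# Both names are hypothetical, chosen only for this sketch.
my regex with-backtracking { \w+ 'ing' }
my token no-backtracking   { \w+ 'ing' }

# \w+ first grabs all of "parsing". Only the regex gives letters back
# so that the literal 'ing' still has something to match.
say so "parsing" ~~ / ^ <with-backtracking> $ /;   # True
say so "parsing" ~~ / ^ <no-backtracking> $ /;     # False
```

The token fails because, once \w+ has swallowed the whole word, it never releases the final "ing".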

In this case we use a regex because we do not know where this rule will end up. Allowing it to backtrack will prove useful later on. And the regex itself might seem weird, with the «» and all. In Perl6, they are just word boundaries: « marks the left one and » the right one. This regex will match a run of anything that is not horizontal whitespace, starting and ending at a word boundary. Keep in mind that \H only rules out horizontal whitespace, so a newline is not excluded by it; for the single-line strings we parse here that makes no difference. It is a pretty general way of describing words.
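As a quick check of how the boundaries behave, here is a small sketch that applies the same pattern directly, outside of any role. The @found variable is just a helper for this example.

```raku
# « and » are the left and right word boundaries;
# \H is any character that is not horizontal whitespace.
my @found;
for "Simple **thing**", "**thing**" -> $text {
    my $match = $text ~~ / «\H+» /;
    @found.push: $match.Str;
}
.put for @found;
# Simple
# thing
```

In the second string the match has to give back the trailing asterisks to end on a word boundary, which is where backtracking earns its keep.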

We want some other structure to do words. Like this:

role span does like-a-word {
    regex span { <like-a-word>(\s+ <like-a-word>)* } 
}

span is also a role, and declaring that it does like-a-word allows it to use the regex of the same name declared in that role. A span is just a group of things that look like a word. But we can build on that:

role pair-quoted does span {
    proto regex quoted {*}
    regex quoted:sym<em> { '*' ~ '*' <span> }
    regex quoted:sym<alsoem> { '~' ~ '~' <span> }
    regex quoted:sym<code> { '`' ~ '`' <span> }
    regex quoted:sym<strong> { '**' ~ '**' <span> }
    regex quoted:sym<strike> { '~~' ~ '~~' <span> }
}

We want to surround these spans with quote-like things that express emphasis or other kinds of things. We use proto, which makes all functions use the same signature but work with different code, depending on what they have to deal with. Syntax again might get in the way, but we'll get to that later on. Suffice it to say that we are declaring here different kinds of quoted spans.
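To see the proto mechanism in isolation, here is a minimal sketch with a made-up grammar. The names tiny-marker and marker are hypothetical and only serve this example.

```raku
# A proto with two candidates; calling <marker> tries them as alternatives.
grammar tiny-marker {
    token TOP { <marker> }
    proto token marker {*}
    token marker:sym<star>  { '*' }
    token marker:sym<tilde> { '~' }
}

say so tiny-marker.parse('*');   # True
say so tiny-marker.parse('~');   # True
say so tiny-marker.parse('#');   # False
```

Adding a new kind of marker only takes one more candidate; TOP does not need to change.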
Theoretically, we could already use this to match things; however, since these roles do not declare TOP, they have to be used in conjunction with a real grammar. Just like this one:

grammar better-paragraph does pair-quoted {
    token TOP { <chunk>[ (\s+) <chunk>]* }
    regex chunk {  <quoted> | <span> }
}

This grammar only needs to do the most complicated of the roles we have declared, the one which includes all of them. A chunk is either a quoted (taken from the pair-quoted role) or a span (taken from the span role). By using roles we have simplified the construction of this grammar, and created something that can be easily understood by someone reading it. A better-paragraph is a sequence of chunks, which can be either quoted spans or simple spans.
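The same roles can be mixed into a simpler parser too. As a sketch, here is a grammar that only knows about plain spans; plain-text is a hypothetical name, not one of the scripts in the series.

```raku
role like-a-word {
    regex like-a-word { «\H+» }
}
role span does like-a-word {
    regex span { <like-a-word>(\s+ <like-a-word>)* }
}

# Hypothetical grammar that reuses only the span role.
grammar plain-text does span {
    token TOP { <span> }
}

say plain-text.parse("some plain words")<span>.Str;   # some plain words
```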

Let's put it to use and do the parsing:

my $simple-thing = better-paragraph.parse("Simple **thing**");
$simple-thing<chunk>.map: { .put };

The first line does the parsing as usual, and we know this returns a Match object. This object can be used like a hash whose keys are the tokens that can be parsed from the top. This is what we use in the next line: $simple-thing<chunk>.map: { .put }; has to be read from left to right. $simple-thing<chunk> is a list of the different chunks that have been extracted from the simple text. We map them to a function, in this case simply put, which prints them; that is, .put actually does $_.put, with $_ being the implicit loop variable. We could use our beloved thorn to write it this way:

$simple-thing<chunk>.map: { $^þ.put };

which would do exactly the same, that is printing:

Simple
**thing**
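To poke at the hash-like structure of the Match, the pieces above can be put together in one self-contained script. This is a sketch that copies the roles and grammar as shown in this article; the $match variable and the use of .hash.keys are additions for illustration.

```raku
role like-a-word {
    regex like-a-word { «\H+» }
}
role span does like-a-word {
    regex span { <like-a-word>(\s+ <like-a-word>)* }
}
role pair-quoted does span {
    proto regex quoted {*}
    regex quoted:sym<em> { '*' ~ '*' <span> }
    regex quoted:sym<alsoem> { '~' ~ '~' <span> }
    regex quoted:sym<code> { '`' ~ '`' <span> }
    regex quoted:sym<strong> { '**' ~ '**' <span> }
    regex quoted:sym<strike> { '~~' ~ '~~' <span> }
}
grammar better-paragraph does pair-quoted {
    token TOP { <chunk>[ (\s+) <chunk>]* }
    regex chunk {  <quoted> | <span> }
}

my $match = better-paragraph.parse("Simple **thing**");

# Every chunk is itself a Match; .hash.keys lists its named captures.
for $match<chunk>.list -> $chunk {
    say $chunk.Str, " has keys: ", $chunk.hash.keys;
}

# Going one level deeper into the quoted chunk.
say $match<chunk>[1]<quoted><span>.Str;
```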

We might actually want to get rid of the markers, and just make some note that there was something marking that span. We can do it like so:

$simple-thing<chunk>.map: { so $^þ<quoted> ??
                say "["~$^þ<quoted><span> ~ "]"!!
                $^þ.put};

Is it a quoted thing? That is what so $^þ<quoted> ?? asks: so turns whatever is to its right into a boolean. If the <quoted> key exists, it will be true, and then the next part kicks in:

say "["~$^þ<quoted><span> ~ "]"

Instead of printing the <quoted> part directly, we dive deeper into the Match object and go to the next level, where there should be a <span>. We'll get almost the same as above:

Simple
[thing]

But this is kind of disappointing, right? Going to all that trouble only to not be able to use the actual quotes.

Working with unnamed captures

Anything we put inside parentheses in a rule, token or regex will be captured. Let's slightly change the pair-quoted role this way:

    regex quoted:sym<em> { ('*') ~ '*' <span> }

(and do the same to the rest). We'll have two captures in the Match object; the first will contain the quoting operator used and the second will be the same as before. We can also change the printing map:

$simple-thing<chunk>.map: { so $^þ<quoted> ??
                say $^þ<quoted>[0] ~ " → " ~ $^þ<quoted><span> !!
                $^þ.put};

Now $^þ<quoted>[0] contains the captured operator, and the rest is like before. This would print:

Simple
** → thing
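Positional captures work the same way outside grammars. As a small sketch, with a plain regex that is not taken from the series scripts:

```raku
# Each parenthesized group becomes a numbered capture: $0, $1, ...
if "**thing**" ~~ / ('*'+) (\w+) / {
    say $0.Str;   # **
    say $1.Str;   # thing
}
```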

Nifty, right? We can put that to good use in our eventual markdown grammar. But this will have to wait until the next installment.
