
Juan Julián Merelo Guervós


Matching with Match.


Grammars, and parsing, are about extracting structure from text. But once you've got the structure, there must be a way of actually representing it and working with it, so that whatever the machine has grokked about the text can be put to good use.

Many languages do not fuss about this. You want text this or that way, here's your text. For instance, Python:

match = re.findall(r"\*(\w+)\*", sys.argv[1])

This finds all words, or rather alphanumeric combos, that sit between two asterisks, à la Markdown, again. It works like this:

$ other/match.py "this *is* it"
['is']

That is, a list with all the words that have matched the regex. Also:

$ other/match.py "this *could* be *it*"
['could', 'it']

Two words match the description, here are your words, sir or madam. However, where do those words start? If you want to somehow mark where your desired structure is found, you will have to go back and use whatever function searches for a string within another.
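Here is a minimal sketch of that "go back and search again" approach, assuming nothing beyond Python's standard re module and str.find. The helper name locate_with_find is made up for illustration; it is not part of the original script.

```python
import re


def locate_with_find(text):
    """Hypothetical helper: findall first, then look every word up again."""
    words = re.findall(r"\*(\w+)\*", text)
    positions = []
    cursor = 0  # resume after the previous hit so repeated words are not found twice
    for word in words:
        index = text.find("*" + word + "*", cursor)
        positions.append((word, index))
        cursor = index + len(word) + 2  # skip the word and both asterisks
    return positions


print(locate_with_find("this *could* be *it*"))
```

It works, but the text gets scanned twice and the bookkeeping is all ours.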

There must be a better way to do it, right?

Enter the Match object. This object is returned by re.search, as well as by re.match, which is like search except that it matches only at the beginning of a string.

And with the regex language including just that kind of thing, I really can't fathom why you would need these two methods. Anyway.

match = re.search(r"\*(\w+)\*", sys.argv[1])

applied to the same string as above, will return <_sre.SRE_Match object; span=(5, 12), match='*could*'> (using pprint; newer Python versions show it as re.Match), and, in fact, you can also access groups of captured structure; in this case, match.group(1) will return could.
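To make that concrete, here is a small sketch of what a Match object lets you ask for. It assumes the same example sentence as above, hard-coded instead of read from sys.argv so it can run on its own.

```python
import re

text = "this *could* be *it*"  # same sentence as above, hard-coded for the sketch
match = re.search(r"\*(\w+)\*", text)

print(match.group(0))  # whole matched text, asterisks included
print(match.group(1))  # only the captured word
print(match.span())    # (start, end) of the whole match
print(match.span(1))   # (start, end) of the captured group
```

So the position information that findall throws away is right there in the object.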

But, what happens with the rest of the matching strings? Well, you could do this:

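import pprint
import re
import sys

pp = pprint.PrettyPrinter()
this_string = sys.argv[1]
match = re.search(r"\*(\w+)\*", this_string)
while match is not None:
    pp.pprint(match)
    print(match.group(1))
    # Keep searching in whatever is left after the current match
    this_string = this_string[match.end():]
    match = re.search(r"\*(\w+)\*", this_string)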

because search only finds the first one. Or you could go back to findall and then look for each of the strings...

Or you could try Perl6:
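For completeness, Python's re module also has finditer, which hands back one Match object per hit. The following is a sketch under the assumption that only the standard library is available; starred_words is a name invented here, not something from the original script.

```python
import re


def starred_words(text):
    """Hypothetical helper: every starred word with the span of its capture."""
    return [
        (m.group(1), m.start(1), m.end(1))
        for m in re.finditer(r"\*(\w+)\*", text)
    ]


for word, start, end in starred_words("this *could* be *it*"):
    print(word, start, end)
```

That removes the manual loop, although you still get a flat sequence of separate objects rather than one nested structure.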

my @matches = @*ARGS.map( {$^þ ~~ m:g/\* ~ \* (\w+)/} );

say @matches.map: { $^þ.perl ~ "\n" };

Yes, I know, it's kinda scary, right? But two lines. Even one:

say @*ARGS.map( {$^þ ~~ m:g/\* ~ \* (\w+)/} ).map: { $^þ.perl ~ "\n" };

From the command line:

perl6 -e 'say @*ARGS.map( {$^þ ~~ m:g/\* ~ \* (\w+)/} ).map: { $^þ.perl ~ "\n" };' "It *should* match *all* the *stuff*"

Yep, huge line. But just one.

Anyway, here's a bit of scary-looking Perl6 which is actually not such a big deal once you get the gist of it. First, @*ARGS is an array (thus the @ at the beginning) which contains the arguments given to the script through the command line. This will work not only with the first string, but with every string passed as an argument.

Perl6 is also very functional. Where there's a loop, there should be a map that maps every element of an array into something else by applying an operation. Which operation? The one between parentheses:

{$^þ ~~ m:g/\* ~ \* (\w+)/}

The placeholder variable $^þ (the ^ is the so-called twigil, and þ is the Icelandic letter thorn, which I love and which should be used everywhere) takes in turn the value of every member of the array; in this case it's taking the place of a single element, which is matched (thus the ~~, the matching operator in Perl6) against the regular expression on the right-hand side. You have probably seen the grammar post, so this might not scare you as much as it should, but let's deconstruct it:

m:g/\* ~ \* (\w+)/

The m indicates that what follows is a regular expression that is going to be matched. There are other ops here, but let's not go there for the time being. :g is an adverb that globalizes the search, which means that we will not have to go through (h|l)oops to get all the matches as we did before. This will match all the words in a sentence, not just the first one.

Perls use // as a quoting construct for regexes, and it's just as well. They are so clearly distinguished from other strings. And we have arrived at the regex itself, which is so not the same as in Perl5 or any other language, for that matter.

Actually, most languages copied their regex syntax from Perl. But that's another story.

~ is a matching pair indicator. It will say: "I want my stuff to be surrounded by a matching pair of characters". For instance, '(' ~ ')' will match parentheses, and in our case \* ~ \* will match a matching pair of asterisks (escaped because they actually mean something in regex context). But what's the expression that whatever is inside the pair must follow? It comes right after: (\w+), with the parentheses meaning that we are going to capture just that part of the expression. All in all, we mean "Get me alphanumeric characters which are right in the middle of a pair of asterisks".

Not so difficult once you get the hang of it, right?
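For readers coming from the Python side, here is the same intent spelled out as a commented pattern. This is only a sketch: Python's re has no counterpart to the ~ pair construct, so the closing asterisk simply goes after the capture, and the name STARRED is invented for this example.

```python
import re

# Hypothetical name; verbose mode lets each piece carry its own comment.
STARRED = re.compile(
    r"""
    \*       # opening asterisk, escaped because a bare one is a quantifier
    (\w+)    # capture one or more word characters
    \*       # closing asterisk
    """,
    re.VERBOSE,
)

print(STARRED.findall("It *should* match *all* the *stuff*"))
```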

And powerful. ~~ will return a Perl6 Match, which is what we (not-so-prettily) print using .perl, which returns an expression that can be evaluated to a Perl6 object. Kind of like JSON, but for Perl. PSON, so to say. It looks like this:

($(Match.new(list => (Match.new(list => (), made => Any, pos => 10, hash => Map.new(()), orig => "It *should* match *all* the *stuff*", from => 4),), made => Any, pos => 11, hash => Map.new(()), orig => "It *should* match *all* the *stuff*", from => 3), Match.new(list => (Match.new(list => (), made => Any, pos => 22, hash => Map.new(()), orig => "It *should* match *all* the *stuff*", from => 19),), made => Any, pos => 23, hash => Map.new(()), orig => "It *should* match *all* the *stuff*", from => 18), Match.new(list => (Match.new(list => (), made => Any, pos => 34, hash => Map.new(()), orig => "It *should* match *all* the *stuff*", from => 29),), made => Any, pos => 35, hash => Map.new(()), orig => "It *should* match *all* the *stuff*", from => 28))
)

Once again, scary looking. But we see a couple of things here. It's a nested structure: Matches include Matches. We can do pretty complex things here. We also have from and pos, the position where each match starts and the position right after its end, and orig, which is the string that was being matched. But we also have a powerful data structure from which we can extract, for instance, the matched string and what we have captured:

perl6 -e 'say @*ARGS.map( {$^þ ~~ m:g/\* ~ \* (\w+)/} ).map: { $^þ.list ~ "\n" };' "It *should* match *all* the *stuff*"
(*should* *all* *stuff*
)

using list. If we know there's going to be a single argument, we can simplify it to:

perl6 -e 'say (@*ARGS[0] ~~ m:g/\* ~ \* (\w+)/).map( {$^þ.caps} );' "It *should* match *all* the *stuff*"
((0 => 「should」) (0 => 「all」) (0 => 「stuff」))

caps returns what has been captured as index => value pairs. We have 3 Match objects here, and we can use them in many different ways to present or process the information that is going to be extracted.
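To check the from and pos numbers in the dump above against something familiar, this Python sketch rebuilds a similar nested shape by hand. It is an approximation, not what Perl6 does internally: match_tree and the dictionary keys are chosen here to mirror the field names in the dump.

```python
import re


def match_tree(text):
    """Hypothetical helper: one dict per match, with its capture nested in 'list'."""
    tree = []
    for m in re.finditer(r"\*(\w+)\*", text):
        tree.append({
            "from": m.start(),   # where the whole match starts
            "pos": m.end(),      # position right after the whole match
            "text": m.group(0),
            "list": [{
                "from": m.start(1),
                "pos": m.end(1),
                "text": m.group(1),
            }],
        })
    return tree


for node in match_tree("It *should* match *all* the *stuff*"):
    inner = node["list"][0]
    print(node["from"], node["pos"], inner["from"], inner["pos"], inner["text"])
```

The difference is that in Perl6 nobody had to write this by hand; the Match already comes that way.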

And extracted it will be

We have done it already in the previous post in this series. Unwittingly, we have been hauling Match objects back and forth, and we actually had a look at one. Grammars, same as regexes, also return Match objects. But we can also use them inside a grammar. Which is what we will do in the near future.
