Tomasz Wegrzanowski

Posted on Sep 14, 2022

Open Source Adventures: Episode 81: Exploring Raku Regular Expression API

#raku #perl #regex

In the previous three episodes I explored the regular expression APIs of Ruby, Crystal, and Python, so let's finish by doing the same exercise in Raku.

The problem is the same: there are multiple date formats, and we want to extract information from whichever one matches.

I'm doing it with just 3 regular expressions, but in the real world there could be hundreds. Doing it naively with a list of regular expressions would require massive code duplication and many calls to the regular expression engine, which is generally much slower than matching a|b|c|... once.

The Problem

#!/usr/bin/env raku

use JSON::Fast;

for qw[2015-05-25 2016/06/26 27/07/2017] {
  say to-json(parse_date($_), :!pretty)
}

And the expected output is:

[2015,5,25]
[2016,6,26]
[2017,7,27]

Solution 1

sub parse_date($s) {
  if $s ~~ /(\d\d\d\d)\-(\d\d)\-(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif $s ~~ /(\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif $s ~~ /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}

The first way is to just list every possible regular expression sequentially.

In case you're not familiar with Raku, here are a few minor things to notice:

match groups are numbered from $0 not from $1
these variables aren't strings like in other languages, they're Match objects
we can't to-json a Match object without some kind of conversion, either to a string or to a number, so returning [$0, $1, $2] would just crash
+ converts to a number
match operator is ~~, not =~
\d is Unicode digit, not just 0 to 9
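
The notes above can be checked with a short sketch. The sample strings are my own choice for illustration and are not part of the original code.

```raku
# Illustrative sketch; the sample strings are assumptions for demonstration.
if '2015-05-25' ~~ /(\d\d\d\d)\-(\d\d)\-(\d\d)/ {
    say $0.^name;       # Match
    say ~$1;            # 05  (prefix ~ converts to a string)
    say +$1;            # 5   (prefix + converts to a number)
    say (+$1).^name;    # Int
}

# \d matches any Unicode decimal digit, for example ARABIC-INDIC DIGIT THREE
say so "\x[0663]" ~~ /\d/;        # True
# An explicit character class restricts the match to ASCII digits
say so "\x[0663]" ~~ /<[0..9]>/;  # False
```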

There's also a bigger thing to consider. In other languages, punctuation with no special meaning, like / or -, can be used literally in a regular expression. In retrospect this was a mistake, as it prevents adding new regular expression syntax without breaking backwards compatibility. Raku requires escaping (or quoting) every such punctuation character, even ones that are currently unused, so at some point in the future it can give - or / a meaning.

Anyway, just like solution 1 in other languages, this suffers from both problems: code duplication, and poor performance due to sequential matching.
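
As a small sketch of what this means in practice, literal punctuation can be written either with a backslash or as a quoted string. The sample date is an assumption for illustration.

```raku
# Backslash-escaped punctuation, as in the solutions in this post
say so '2015-05-25' ~~ / \d\d\d\d \- \d\d \- \d\d /;    # True

# The same pattern with the punctuation written as quoted literals
say so '2015-05-25' ~~ / \d\d\d\d '-' \d\d '-' \d\d /;  # True
```

Writing the hyphen bare, with neither a backslash nor quotes, is rejected by the compiler instead of being matched literally.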

Solution 2

sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}

Just like in Perl, we don't need to use ~~. A regular expression used in boolean context automatically matches against $_.

Solution 3

sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d)/ or /(\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}

We can reduce code duplication if groups are in the same order.

Solution 4

sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d) | (\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}

What's going on here? In Raku, $0, $1, $2 don't mean the Nth match group in the whole expression. They mean the Nth match group within the branch that actually matched (at least for | alternatives, the full story is more complicated)!

This is great as we can put as many expressions as we want without any nonsense like +($0 or $3 or $6).

On the other hand, this means positional groups alone give us no way to handle alternatives whose groups are in a different order.
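
A quick way to see the numbering restart is to count the positional captures after a match. This is a sketch that reuses two of the sample dates from the problem statement.

```raku
# Each branch of | numbers its groups from $0 again.
for qw[2015-05-25 2016/06/26] -> $s {
    if $s ~~ / (\d\d\d\d)\-(\d\d)\-(\d\d) | (\d\d\d\d)\/(\d\d)\/(\d\d) / {
        # Either branch fills $0..$2, so the match holds three
        # positional captures, not six.
        say $/.list.elems, ' ', ~$0, ' ', ~$1, ' ', ~$2;
    }
}
# 3 2015 05 25
# 3 2016 06 26
```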

Solution 5

sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d) |
      (\d\d\d\d)\/(\d\d)\/(\d\d) |
      (\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$0, +$1, +$2]
  }
}

This doesn't actually work. The block can't tell which branch matched, so it doesn't know if the groups are in YMD or DMY order.
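
To make the failure concrete, here is Solution 5 run on two of the inputs. The name parse_date_v5 is hypothetical, used only to keep this sketch separate from the other solutions.

```raku
# Solution 5 under a hypothetical name, so the failure can be shown in isolation.
sub parse_date_v5($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d) |
      (\d\d\d\d)\/(\d\d)\/(\d\d) |
      (\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$0, +$1, +$2]
  }
}

say parse_date_v5('2016/06/26');  # [2016 6 26]
say parse_date_v5('27/07/2017');  # [27 7 2017], day and year swapped
```

The regular expression itself matches all three formats. Only the positional lookup in the block is wrong for the DMY branch.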

Solution 6

sub parse_date($_) {
  if /
    $<y>=(\d\d\d\d) \- $<m>=(\d\d) \- $<d>=(\d\d) |
    $<y>=(\d\d\d\d) \/ $<m>=(\d\d) \/ $<d>=(\d\d) |
    $<d>=(\d\d) \/ $<m>=(\d\d) \/ $<y>=(\d\d\d\d)
    / {
    [+$<y>, +$<m>, +$<d>]
  }
}

We can solve this with named captures, and it works great.

Story so far

The syntax and API are both very different from traditional regular expressions, but in the end we got everything we needed.

All the code is on GitHub.

Coming next

I was planning to write another post on how we could improve regular expression APIs, but it turns out the APIs we explored (except Python's disappointingly limited one) actually have features that cover most of what I wanted to say, even if they're rarely used in the real world yet.

So the next episode will be about something completely different.

Top comments (2)

Paweł bbkr Pabian • Sep 26 '22

I have a feeling that you are applying Perl approach to Raku (underscore naming, not being explicit about unicode awareness, escaping instead of quoting constants in regexps).

There is nothing wrong with that - TIMTOWTDI. However, grammars are first-class citizens in Raku, and that allows writing it in a normalized manner:

sub parse-date($_) {

    my token year { <[0..9]> ** 4 }
    my token month { <[0..9]> ** 2 }
    my token day { <[0..9]> ** 2 }

    if /
      <year> '-' <month> '-' <day> |
      <year> '/' <month> '/' <day> |
      <day> '/' <month> '/' <year>
    / {
        [+$<year>, +$<month>, +$<day>]
    }
}

It took me a while to unlearn creating bulky Perl regexps. Old habits die hard. Regexp interpolation in Perl was tricky and normalization was often avoided. But it is really worth it, as it produces much cleaner and less error-prone code in Raku.

BTW: tokens do not backtrack, as opposed to regexes.

librasteve • Sep 19 '22

https://www.reddit.com/r/rakulang/comments/xilsms/comment/ip45t8p/?utm_source=share&utm_medium=web2x&context=3