DEV Community

Tomasz Wegrzanowski

100 Languages Speedrun: Episode 47: Raku (Perl 6) Regular Expressions

What gets to be included in languages and what gets pushed into some third party library is a result of history more than reason.

For example, pretty much every language comes with every possible trigonometric function included and preloaded, so you can run Math.asinh(69.0) without even requiring anything. When was the last time you wrote a program that needed Math.asinh?

Meanwhile string matching and processing is about the most common thing programs do, but before Perl I don't think any general purpose language included regular expressions. They were something either text processing languages like sed and awk did, or something left to third party libraries. Perl fully embraced them, made them far more powerful, and in the post-Perl era including regular expressions is just a normal natural thing languages do.

There's a similar story with package managers. First they didn't exist. Then they existed as poorly integrated third party tools like Ruby's rubygems and JavaScript's npm. Nowadays, we expect every new language to simply have a fully functioning package manager built in.

Anyway, since the days of Perl, Perl Compatible Regular Expressions, or a very close variant of them, are the default regular expression engine for new languages. Nobody serious considers using any of the pre-Perl regular expression systems, with their limitations and irregularities (but if you want to try some, the UNIX grep command still uses stupid 1970s style regexps by default - even a|b is broken!). You can see a comparison chart here; the pre-Perl and post-Perl divide is really apparent, even if they disagree on a few issues.

The only exception seems to be Raku (originally known as Perl 6) (https://taw.hashnode.dev/100-languages-speedrun-episode-26-raku-perl-6), which decided to just design its own regular expression syntax, and that sublanguage is what this episode is about.

I found a massive bug in Raku Regular Expressions

Before I start, let's just get it out of the way - Raku has a massive design bug in its regular expressions:

#!/usr/bin/env raku

sub is_small_int($n) {
  !!($n ~~ /^ \d ** {1..6} $ /)
}

my @examples = (
  # Correctly True
  '0',
  '0001',
  '12345',
  # Correctly False
  '-17',
  '1234567',
  '3.14',
  # Not ASCII digits
  '๓๓๓',
  '௫๓௫๓',
  '១๑໑',
);

for @examples -> $n {
  say "is_small_int($n) = ", is_small_int($n);
}

What it does:

$ ./raku_bug.raku
is_small_int(0) = True
is_small_int(0001) = True
is_small_int(12345) = True
is_small_int(-17) = False
is_small_int(1234567) = False
is_small_int(3.14) = False
is_small_int(๓๓๓) = True
is_small_int(௫๓௫๓) = True
is_small_int(១๑໑) = True

The last three examples are just plain incorrect.

The bug didn't take long to find - documentation for Raku regular expressions literally says this broken way is how \d works in Raku.

So why am I saying it's a bug? Because regular expressions need to be able to process computer data, and matching a digit (ASCII 0 to 9) is about the second most common thing to do after matching literal characters. In the entire history of regular expressions, I don't know if there was even one case when someone actually wanted to match Unicode digits, and it's not like their job was hard, as Raku has zero problems matching by Unicode properties:

#!/usr/bin/env raku

sub is_unicode_digits($n) {
  !!($n ~~ /^ <:N> ** {1..6} $ /)
}

my @examples = (
  # Correctly True
  '0',
  '0001',
  '12345',
  # Correctly False
  '-17',
  '1234567',
  '3.14',
  # Non ASCII digits
  '๓๓๓',
  '௫๓௫๓',
  '១๑໑',
);

for @examples -> $n {
  say "is_unicode_digits($n) = ", is_unicode_digits($n);
}

If I said that wanting to match [0-9] is approximately a MILLION times more common than wanting to match <:N>, I'd be massively understating my case. We might be dividing by zero and getting an infinity here.

And just to show that I'm right, the very same document that defines how \d works, then proceeds to casually assume \d will match ASCII numbers 0 to 9, with examples such as:

/ <[\d] - [13579]> /;
s/ (\d+)\-(\d+)\-(\d+) /today/;
s/ (\d+)\-(\d+)\-(\d+) /$1-$2-$0/;
my regex ipv4-octet { \d ** 1..3 <?{ True }> }
my regex number { \d+ [ \. \d+ ]? }

So without any doubt, Raku \d and \D are completely 100% broken, and hopefully they fix it, as broken \d means basically every regular expression will either be incorrect and potentially introduce security vulnerabilities, or people learn to avoid \d and use the extremely verbose <[0..9]> instead.
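Until that happens, the ASCII-only check can be tucked into a small helper. This is only a minimal sketch: is-ascii-small-int is a hypothetical name, not something from Raku or its documentation, and all it does is swap \d for the explicit <[0..9]> range in the first example.

```raku
#!/usr/bin/env raku

# Hypothetical helper (the name is an assumption, not part of Raku):
# accepts 1 to 6 ASCII digits and nothing else.
sub is-ascii-small-int(Str $s --> Bool) {
    # <[0..9]> is an explicit ASCII range, so Thai or Tamil digits are rejected
    ?($s ~~ /^ <[0..9]> ** 1..6 $ /)
}

for '12345', '1234567', '๓๓๓' -> $s {
    say $s, ' => ', is-ascii-small-int($s);
}
```

Only the first input should come back True. The ? prefix here does the same job as the !! in the examples above.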

This is not a trivial problem. From a quick grep for regexps in a few codebases in a few languages, \d is indeed the most ubiquitous regexp escape code, and it's supposed to mean ASCII digits 0 to 9 every single time.

Raku didn't even come up with this bug; another language was doing the same broken thing before. It's still 100% unquestionably broken.

Regular expression basics

Anyway, now that we got it out of the way, let's talk about Raku regular expressions basics.

Traditional regular expressions really overloaded a few special characters and their combinations to mean so many different things that when a new feature was added, it had to use more and more nasty combinations of the same few special characters. Raku does a big restart, making some common regular expressions more verbose, but now it has a lot more syntax to work with.

As expected, regular expressions go between slashes //. You can match them with ~~. A few common operations like substitution s/// have extra syntax too.

#!/usr/bin/env raku

my $s = "Hello, World!";
say "We are saying Hello" if $s ~~ /Hello/;

$_ = "Hello, World!";
say "We are saying Hello" if m/Hello/;

# Spaces are ignored by default on the regexp side
# but not on substitution side
$_ = "Hello, World!";
s/ World /Alice/;
say $_;

# :i for case insensitive
$_ = "Hello, World!";
s:i/ world /Alice/;
say $_;

my $n = "Alice";
say "It is Alice" if $n ~~ regex {
  ^       # start of string
  (A | a) # lower or upper case A
  l       # lower case l
  i       # lower case i
  c       # lower case c
  e       # lower case e
  $       # end of string
}

There are a few obvious changes:

  • spaces are ignored by default, so you can make regular expressions a lot more readable, with spacing, comments, and so on
  • switches go at the beginning, not the end
  • ^ and $ are start and end of string, with no line stuff, and that's honestly a much more sensible default than complex rules traditional regular expressions had
  • many of the common things like | and () work just the same
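
To illustrate the point about anchors, here is a minimal sketch with a made-up two-line string. It assumes you do occasionally want line-based matching; Raku spells those anchors ^^ and $$ (not covered in the list above), which leaves ^ and $ strictly for the whole string.

```raku
#!/usr/bin/env raku

my $text = "first\nsecond\n";

# ^ and $ only ever match at the very start and very end of the string
say ?($text ~~ /^ second /);     # False
say ?($text ~~ / first $ /);     # False

# ^^ and $$ are the separate start-of-line and end-of-line anchors
say ?($text ~~ /^^ second /);    # True
say ?($text ~~ / first $$ /);    # True
```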

Character classes

Raku decided that the very common task of non-capturing grouping should get [foo] instead of (?:foo). This meant that character classes now needed something more verbose, so [0-9] is now <[0..9]>.

#!/usr/bin/env raku

my $number_regexp = rx/
  ^
  '-'?
  <[0..9]>+
  [
    '.'
    <[0..9]>+
  ]?
  $
/;

my @examples = (
  # Numbers
  '0004',
  '-123',
  '1234.5678',
  '-3.14',
  # Not numbers
  '1.2.3',
  '.8',
  '-5.',
  '๓๓๓',
  '௫๓௫๓',
  '១๑໑',
);

for @examples -> $n {
  say $n, ($n ~~ $number_regexp) ?? " is a number" !! " is NOT a number";
}

$ ./classes.raku
0004 is a number
-123 is a number
1234.5678 is a number
-3.14 is a number
1.2.3 is NOT a number
.8 is NOT a number
-5. is NOT a number
๓๓๓ is NOT a number
௫๓௫๓ is NOT a number
១๑໑ is NOT a number
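
One practical consequence of [ ] being the non-capturing group is that only ( ) shows up among the positional captures. A minimal sketch, with an input string and variable name invented for illustration:

```raku
#!/usr/bin/env raku

my $input = '1234,5678';   # sample input, made up for this sketch

if $input ~~ /^ (<[0..9]>+) [ '.' | ',' ] (<[0..9]>+) $ / {
    # [ '.' | ',' ] grouped the alternation without creating a capture
    say 'captures:    ', $/.list.elems;
    say 'before mark: ', ~$0;
    say 'after mark:  ', ~$1;
}
```

With ( '.' | ',' ) in place of the square brackets, the separator would become a third capture and the second number would move to $2.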

Character class operations

This makes it possible to do some operations on character classes, like + (already possible with traditional regexp with just concatenation) and - (not directly doable).

#!/usr/bin/env raku

# Some letters are too easy to confuse with numbers, filter them out
my $nice_letter_rx = rx/ ^ <[A..Z] + [a..z] - [lIO] > $/;

my @examples = ('a'..'z', 'A'..'Z', '0'..'9').flat;

for @examples -> $c {
  say $c, " is not a nice letter" unless $c ~~ $nice_letter_rx;
}

$ ./classes_math.raku
l is not a nice letter
I is not a nice letter
O is not a nice letter
0 is not a nice letter
1 is not a nice letter
2 is not a nice letter
3 is not a nice letter
4 is not a nice letter
5 is not a nice letter
6 is not a nice letter
7 is not a nice letter
8 is not a nice letter
9 is not a nice letter

Repetition

Traditionally repetition of A to B times used {A,B} syntax. Raku syntax is more verbose but it has more features. Let's start with the basic case. Also notice how special characters generally need to be quoted if you want to use them literally.

#!/usr/bin/env raku

my $rx = rx/
  ^
  <[0..9]> ** {1..3}
  '.'
  <[0..9]> ** {1..3}
  '.'
  <[0..9]> ** {1..3}
  '.'
  <[0..9]> ** {1..3}
  $
/;

my @examples = (
  '127.0.1',
  '8.8.8.8',
  '127.0.0.420',
  '127.0.0.9001',
);

for @examples -> $n {
  say $n, ($n ~~ $rx) ?? " looks like IP address" !! " does NOT look like IP address";
}

$ ./repetition.raku
127.0.1 does NOT look like IP address
8.8.8.8 looks like IP address
127.0.0.420 looks like IP address
127.0.0.9001 does NOT look like IP address
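
The point about quoting special characters is easy to trip over, so here is a minimal sketch with made-up inputs. An unquoted . still means "any character"; to match a literal dot or hyphen it has to be quoted or backslash-escaped.

```raku
#!/usr/bin/env raku

# An unquoted . is still "any character"
say ?('8x8' ~~ /^ 8 . 8 $ /);      # True

# Quoted or backslash-escaped, it is a literal dot
say ?('8x8' ~~ /^ 8 '.' 8 $ /);    # False
say ?('8.8' ~~ /^ 8 '.' 8 $ /);    # True
say ?('8.8' ~~ /^ 8 \. 8 $ /);     # True

# The same goes for other punctuation, such as a hyphen
say ?('a-b' ~~ /^ a '-' b $ /);    # True
```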

Raku supports "repetition with separator" syntax: X ** {2..4} % Y means 2-4 Xs, with Ys in between them:

#!/usr/bin/env raku

my $rx = rx/
  ^
  [ <[0..9]> ** {1..3} ] ** 4 % '.'
  $
/;

my @examples = (
  '127.0.1',
  '8.8.8.8',
  '127.0.0.420',
  '127.0.0.9001',
);

for @examples -> $n {
  say $n, ($n ~~ $rx) ?? " looks like IP address" !! " does NOT look like IP address";
}

This is especially useful if the thing matched is more complex. How many times have you wished you were able to do something like this?

#!/usr/bin/env raku

my $rx = rx/
  ^
  [
  | <[0..9]>            # 0-9
  | <[1..9]> <[0..9]>   # 10-99
  | 1 <[0..9]> ** 2     # 100-199
  | 2 <[0..4]> <[0..9]> # 200-249
  | 25 <[0..5]>         # 250-255
  ] ** 4 % '.'
  $
/;

my @examples = (
  '127.0.1',
  '8.8.8.8',
  '127.0.0.420',
  '127.0.0.9001',
);

for @examples -> $n {
  say $n, ($n ~~ $rx) ?? " looks like IP address" !! " does NOT look like IP address";
}

Notice extra validation:

$ ./ipv4.raku
127.0.1 does NOT look like IP address
8.8.8.8 looks like IP address
127.0.0.420 does NOT look like IP address
127.0.0.9001 does NOT look like IP address

There's also %% which allows for an optional trailing delimiter.
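
A minimal sketch of the difference between % and %%, using a made-up comma-separated list of numbers (the variable names are just for illustration):

```raku
#!/usr/bin/env raku

my $strict  = rx/ ^ [ <[0..9]>+ ]+ %  ',' $ /;   # separators only between items
my $lenient = rx/ ^ [ <[0..9]>+ ]+ %% ',' $ /;   # one trailing separator allowed

for '10,20,30', '10,20,30,', '10,,30' -> $s {
    say $s, ' strict=', ?($s ~~ $strict), ' lenient=', ?($s ~~ $lenient);
}
```

Only the middle input should differ between the two: the trailing comma is rejected by % and accepted by %%. The doubled comma fails both.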

In ( a | b | c ) or [ a | b | c ] alternation you can put an extra initial | for formatting and it is ignored (it does not match empty).

Divides by 3

Regular expressions can be recursive with <~~>.
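
As a warm-up, here is a minimal sketch of <~~> on a simpler problem, balanced parentheses. The variable names and inputs are made up, and it uses the same two-part trick as the example further down: the recursive regex itself has no anchors, and a wrapper adds them.

```raku
#!/usr/bin/env raku

# A pair of parentheses containing zero or more nested pairs;
# <~~> calls this same regex again
my $nested = rx/ '(' <~~>* ')' /;

# Anchors go in a wrapper so they are not repeated on every recursive call
my $whole = /^ $nested $ /;

for '()', '(()())', '(()' -> $s {
    say $s, ($s ~~ $whole) ?? ' is balanced' !! ' is NOT balanced';
}
```

The first two inputs should be reported as balanced and the last one as not.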

Let's do something that's a lot more difficult with traditional regexps, checking if a number divides by 3:

#!/usr/bin/env raku

my $divides_by_three_rx_part = rx/
  [
  | <[0369]>                              # 0
  | <[147]> <~~>? <[258]>                 # 1+2
  | <[147]> <~~>? <[147]> <~~>? <[147]>   # 1+1+1
  | <[258]> <~~>? <[147]>                 # 2+1
  | <[258]> <~~>? <[258]> <~~>? <[258]>   # 2+2+2
  ]
  <~~>?
/;
my $divides_by_three_rx = /^ $divides_by_three_rx_part $/;

for 1234560..1234579  {
  say $_, ($_ ~~ $divides_by_three_rx) ?? " divides by 3" !! " does NOT divide by 3";
}

$ ./divisible_by_three.raku
1234560 divides by 3
1234561 does NOT divide by 3
1234562 does NOT divide by 3
1234563 divides by 3
1234564 does NOT divide by 3
1234565 does NOT divide by 3
1234566 divides by 3
1234567 does NOT divide by 3
1234568 does NOT divide by 3
1234569 divides by 3
1234570 does NOT divide by 3
1234571 does NOT divide by 3
1234572 divides by 3
1234573 does NOT divide by 3
1234574 does NOT divide by 3
1234575 divides by 3
1234576 does NOT divide by 3
1234577 does NOT divide by 3
1234578 divides by 3
1234579 does NOT divide by 3

We still needed to do that in two parts, as anchors are not part of the recursion. I'm not sure if it's possible to do it with some : modifier; none of them seem to match.

FizzBuzz

This lets us do the holy grail of regular expressions, the FizzBuzz regexp! For comparison, we did it with traditional regexp back in the Sed episode(https://taw.hashnode.dev/100-languages-speedrun-episode-07-sed-and-regular-expression-fizzbuzz), and it was far more complex and completely unreadable. This one makes a lot of sense.

We just need one really useful feature - a way to require that two regexps both match. / A && B / matches if both A and B match. In this case we have a regexp for divisibility by 3 and a very simple one for divisibility by 5. Thanks to && it's really easy to get divisibility by 15 from them.

#!/usr/bin/env raku

my $rx3_part = rx/
  [
  | <[0369]>                              # 0
  | <[147]> <~~>? <[258]>                 # 1+2
  | <[147]> <~~>? <[147]> <~~>? <[147]>   # 1+1+1
  | <[258]> <~~>? <[147]>                 # 2+1
  | <[258]> <~~>? <[258]> <~~>? <[258]>   # 2+2+2
  ]
  <~~>?
/;
my $rx3 = /^ $rx3_part $/;
my $rx5 = /^ <[0..9]>* <[05]> $/;
my $rx15 = / $rx3 && $rx5 /;

for 1..100 -> $n {
  # In Raku we need to convert Int to Str, otherwise can't s/// it
  # In Perl it would magically change type for us
  $_ = "$n";
  s/^ $rx15 $/FizzBuzz/;
  s/^ $rx5 $/Buzz/;
  s/^ $rx3 $/Fizz/;
  say $_;
}

$ ./fizzbuzz.raku
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
...
Fizz
97
98
Fizz
Buzz

Should you use Raku Regular Expressions?

Regular expressions have a lot of features, so I could keep going, but these are likely the features you'd use the most.

I think most of the changes are sensible. Allowing free spacing and comments by default was much needed (most languages have some kind of support for it with //x etc., but because x goes on the end this causes a lot of parser confusion). Changing ^$ to be just the start and the end of the string with no special logic was a great change. && was much needed, ** % and ** %% are very clever shortcuts for something very common, recursion can simplify a lot of regexps, [] for non-capturing grouping is quite nice, and so on.

Of course all this needs to be balanced against \d being completely broken, and \d is about the most commonly used regex feature. The good thing is that it can be fixed in a 100% compatible way! Just make \d match 0 to 9 and nothing else. Not only will it not break any software, as nobody in history ever relied on this broken \d behavior, but it will likely fix a lot of bugs, and likely many security vulnerabilities as well.

It either gets fixed, or you'd need to keep telling people to never ever use \d, and good luck with that.

So if you're designing a new language and its regular expression system, you should definitely consider making changes similar to what Raku did. But keep \d correct please.

Also, this is likely not going to be the final Raku episode, as Raku Grammars are another sublanguage I want to cover in this series.

Code

All code examples for the series will be in this repository.

Code for the Raku Regular Expressions episode is available here.

Discussion (20)

Elizabeth Mattijsen

And here I thought that documenting a bug, would turn it into a feature? :-)

But seriously, all of the matching in Raku is based on Unicode properties. So why should \d be any different? And Tamil programmers that are used to using ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ will be equally surprised to see 0 1 2 3 4 5 6 7 8 9 match \d. Welcome to a world where all is not ASCII!

More generally, when you are working with text, all of us in IT will need to get used to the idea that all is not ASCII. You may argue that using \d is a gigantic WAT? But I'd argue, it should be an eye opener. In that respect, thank you for POINTING THIS OUT in your blog post :-).

I just hope that people will continue to read after "I found a massive bug in Raku Regular Expressions". :-(

Tomasz Wegrzanowski Author

It doesn't even handle two most popular number systems (二十 or MMXXII). Meanwhile even actual Tamils use regular ASCII numbers as you can see.

The Raku \d is simply a bug, and it will only cause problems. It's even worse than the 0 prefix turning on octal, which quite a few languages do.

Elizabeth Mattijsen

Actually, Raku is flexible enough to allow for Roman Numerals in a module: Slang::Roman. And who knows, it might actually make it into the language at some point.

Showing a Tamil language page that does not use Tamil numerals is only proof of the fact that at least some Tamil pages do not use Tamil numerals. It does not prove there aren't any other pages that do use Tamil numerals. And there are other uses of text beside the Web :-)

Also, Tamil was just an example. There are about 50 languages in the Unicode standard with their own numeric representation.

Re "二十", yes, perhaps we should make a slang for that as well. Anyone up for Slang::CJK?

Comparing a well thought out behaviour of a feature in Raku with a mistake made in the past, feels like a disservice. You can disagree with the decision of this behaviour, but considering it a bug is wrong:

From Wikipedia: "A software bug is an error, flaw or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways". The result is not incorrect, nor unexpected, nor unintended.

Pardon the analogy, you're like a driver used to drive on the right side of the road, suddenly needing to drive a car with the steering wheel on the right-hand side of the car. And then wondering why the window-wipers switch on in the very first turn that you need to make.

Tomasz Wegrzanowski Author

docs.raku.org/language/regexes thinks "௩.௩.௩.௩" is a valid IP address, and good luck pinging that.

Even people who wrote the page explaining the Raku \d can't actually follow how the broken \d works and assume it works on ASCII only in the rest of the document.

Elizabeth Mattijsen

Then the documentation is where the error is.

Tomasz Wegrzanowski Author

Can you find even a single actual Raku program out there, where \d is used, and it intentionally means <:Nd> and it would break the program if \d was changed to match <[0..9]>?

Unfortunately Github code search can't handle special characters like backslash so it can't search for \d directly, and it confuses Raku with Perl 5 when filtering, but here's a start: https://github.com/search?q=filename%3A%22*.raku%22+language%3ARaku&type=Code

Just clicking randomly I see a lot of \d, and ALL of them assume that \d will be ASCII digits. Explicit <[0..9]> are very rare. Anyone wanting <:Nd>? I haven't found a single case yet.

Vadim Belman

All I could say about pinging the IP is that your parser is just not able to convert the representation into an unsigned 32 bit integer. But it doesn't mean that no other parser is capable of this. Enough to say that dotted notation is a convention. Network addresses are just numbers in their nature.

Juan Julián Merelo Guervós

But is there any basis to calling it a bug other than classically \d has matched only latin decimal digits? (if only because there were no others). At the end of the day, there does not seem to be a standard (other than maybe PCRE, which is a de facto standard?) so making \d == [0..9] or \d == <:Nd> is simply a judgment call. As long as it's properly documented, we're good with that, I guess.

Tomasz Wegrzanowski Author

It is a bug, because \d is extremely well established to match [0-9] and this is about the most common regexp escape code, programmers will rely on this, and this "almost" works.

I think the fact that Raku documentation has this issue, on the same page even, pretty much proves it. According to that documentation ௩.௩.௩.௩ is a valid IP address.

<:Nd> is such a rare thing you'd have a lot of trouble coming with a single use case for it. If you think it can find numbers in text you don't know language of (and how often is that a thing?), it won't even do that (Chinese and Roman numbers being most obvious). And if you somehow come up with a super rare use case for <:Nd>, you can use <:Nd> - or more likely some much more specific character class like /<:Nd>&<:Tamil>/.

Elizabeth Mattijsen

Well, I guess there is one place where \d is intended to match all numerics. And that's the grammar that Raku uses to parse Raku source code. Which allows the example Jonathan Stowe gave to work.

I see that we will not agree on whether the current behaviour is correct or not (even though apparently Raku is not the only one).

I'm looking forward to your coverage of Raku grammars. :-)

Jonathan Stowe

I think you are wildly overstating the \d thing. In Raku a character with the numeric unicode property is a digit:

raku -e 'say ௫๓௫๓'
5353
raku -e 'say ௫๓௫๓ + 1'
5354

Given that, it would be perverse not to match those with \d.

Salve J. Nilsen

How is \d accepting non-arabic numerals a bug?

Maybe you're used to \d meaning <[ 0 .. 9 ]> cause this is what you've always been exposed to, but why should this be the only case allowed? Why should a general-purpose programming language enforce a limitation like that, when it doesn't have to?

The world's a big place with lots of languages, and Raku has been designed to also make it easier to handle issues around internationalization and localization without jumping through crazy hoops... This is a good thing!

So if you due to some cultural (or other) limitation fail to imagine more than a single type of numeric inputs, then maybe you'd want to look for that "bug" somewhere closer to home? Just askin'...

raiph

The following is true for PCRE (and hence PHP because it uses PCRE), and the default regex engines for Python and Java:

  • If input is ASCII, \d only matches 0 thru 9.

  • If input is Unicode and Unicode matching is enabled, \d matches .

You can verify this at regex101.com. Just select a regex flavour, enter \d as the regex, click the flags at the end of the regex to enable the selected regex flavour's Unicode matching, enter as the input string, and note that it matches.


The behaviour described above applies to most regex engines, and Raku too.

Because ASCII is a subset of Unicode, \d will still match 0 thru 9, and only 0 thru 9, if the input is ASCII. This is just as true for Raku as it is older regex engines.

And, just like PCRE/PHP/Python etc, Raku will also match foreign language decimal digits if the input is in a foreign language.

The sole difference is that, with Raku, one doesn't have to switch on Unicode processing, it's on by default.

(Of course, this means that if someone wishes to enforce that input is ASCII, they have to specify that. But that's very easy to do.)


To quote from your article:

Nobody serious considers using any of the pre-Perl regular expression systems ... the pre-Perl and post-Perl divide is really apparent

Indeed. Larry Wall, the lead designer of both Perl and Raku, understood what folk needed.

The only exception seems to be Raku ... which decided to just design its own regular expression syntax

As Larry put it in 2002:

In fact, regular expression culture is a mess, and I share some of the blame for making it that way. Since my mother always told me to clean up my own messes, I suppose I'll have to do just that.


I don't know if there was even one case when someone actually wanted to match Unicode digits

I don't know if there will be literally trillions of cases, but I'd say my guesstimate is as reasonable as yours. But let's put aside guesstimating such a thing for a moment, and focus on verifiable estimates.

Data shows that the western world's share of Internet content by volume is rapidly shrinking. Indians, Chinese, Arabs, and other non-Western world folk are pouring onto the net and writing things online in their mother tongues in already vast and yet also rapidly increasing quantities. And what they write includes digits, written in their native scripts, in what is already truly vast, er, numbers. This can be measured.

At the same time, the western world's dominance of Internet software and developers will also soon be history. Credible estimates suggest the country with the largest population of developers in the world at the moment is the US. But those same estimates suggest the country with the largest population of devs in the world before the middle of this decade arrives will be India, and that by 2030, India and China will be duking it out for top dog, with the US and Europe far behind.

So, while I'm not too surprised you think no one will want to match those trillions of digits, because many western devs think that way, I know that credible estimates suggest Larry has correctly nailed this Raku design aspect, just as with the rest of Raku's regex/grammar engine.


Fwiw, here's my hot take.

The main weakness of the engine is very poor performance. Once that's sorted, which I anticipate later this decade (the reason it's slow is understood and fixable), and NQP is repackaged as a retelling of PCRE, but where the engine is now not just a regex engine but a language platform that's easier to get into than Graal/Truffle/JVM, and without the commercial costs and proprietary control exerted by Oracle, Raku will make western folk suddenly sit up as they realize there's more to its rampant adoption by Indians et al in the middle to latter half of this decade, and the sudden explosion of interoperating new PLs and DSLs, than meets the eye.

Remember, you heard it first on your blog. And why? Because characterizing Unicode era \d behaviour as a "massive bug" stung me into action to try to set the world a little straighter. Do you see I might have a point?

Daniel Sockwell

One (hopefully helpful) tip and one comment:

First the tip: in !!($n ~~ /^ <:N> ** {1..6} $ /), you can replace the "not not" (!!) double-negative with ?, the boolean context operator.

Second, the comment: I don't believe that I agree with your claim that \d would be better off matching only ASCII digits. You gave the example of IP addresses, so let's start there – it may be context dependent, but I'd argue that https://①.①.①.① is a valid IP address. At the very least, it's one that I can navigate to in my browser (firefox).

More broadly, it seems that I'd often want \d to match any digit. For example, when applications require that user passwords contain a digit, they're typically doing so to increase the password's security. But "password๓" is much less likely to be in an attacker's dictionary than "password3" is; rejecting the former but accepting the latter strikes me as perverse at best. (Of course, neither password is decent).

In fact, I'd go further than that: I'd claim that a \d that matches only 0..9 is more likely to cover up bugs than to prevent them. The only time that \d ought to match 0..9 but ought not match other numbers is if the programmer is expecting to get ASCII input but is actually getting utf8 input. But the solution there is to reject non-ASCII input (e.g., test that it matches /^<:ascii>+$/ in Raku) – not just fail to match on non-ASCII numbers. IMO, a more limited definition of \d just hides the problem of not realizing that you're dealing with non-ASCII text (or, put differently, the problem of not correctly handling non-ASCII text).

In any event, I enjoyed the post and am looking forward to the one on grammars :)

Pawel Pabian

You got a point with digits matching, check for example gitlab.com/pheix/net-ethereum-perl.... However I would not call it a bug. Because following this logic you may say that common [abc] is a bug because it does something different than in PCRE. I personally got so used to Raku UTF-ness that my mindset has changed and I always write Unicode aware regexps.

Juan Julián Merelo Guervós

If I got this correctly, you're implying that \d should only match ASCII digits, right? We should use Nd to match any unicode digit, and not \d. The massive bug is to make \d == <:Nd>

Tomasz Wegrzanowski Author

Yes, it is a massive bug. It causes a lot of programs to match a lot more than they expect, including very likely a lot of security validations. Everyone including people who wrote those docs assumes \d matches ASCII digits only, and this is needed for basically any parsing of either machine format or human text.

It is exceedingly rare to want to match <:Nd> (I doubt anyone ever actually used that), and if you absolutely need to, well, you can say <:Nd>, or more likely some more specific range.

It won't even do for extracting numbers from natural language text, as most common numerical systems (Roman and Chinese numerals) don't match <:Nd> as they reuse letters.

Juan Julián Merelo Guervós

They don't really reuse letter codepoints; they use a different codepoint in Unicode. They match <:N> alright, and also <:Nl>:

raku -e 'say "Ⅻ " ~~ /<:Nl>/'
「Ⅻ」

Tomasz Wegrzanowski Author

Nice one, I didn't know they had separate characters for Roman numerals in Unicode. I don't think it's actually used in the wild much, still, nice.

E.R. Nurwijayadi

Cool.