DEV Community

Lucas M.


The importance of the environment in Regex pattern matching

Here’s a small discovery about Ruby regexes and whitespace characters that made me scratch my head for a moment.

Let’s have a look at the following string, extracted from an email body:

From :     John DOE <test@test.com>

Please note that the space character between 'From' and the colon (‘:’) is a Non-Breaking Space Character (NBSC, U+00A0 in Unicode), while the other spaces in this string are regular spaces (U+0020).
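Since the two characters look identical on screen, it can help to list the code points of the string. The snippet below is a minimal sketch: the `unusual_chars` helper is a hypothetical name introduced for illustration, and the no-break space is written as an explicit `\u00A0` escape so that it survives copy-pasting.

```ruby
# The header line from the email, with the no-break space written as an
# explicit escape so it cannot be silently altered by copy-pasting.
line = "From\u00A0:     John DOE <test@test.com>"

# Hypothetical helper: returns [index, code point] pairs for every character
# that is not printable ASCII (regular spaces count as printable ASCII here).
def unusual_chars(str)
  str.each_char.with_index.filter_map do |char, index|
    next if char.match?(/[\x20-\x7E]/)

    [index, format("U+%04X", char.ord)]
  end
end

p unusual_chars(line) # => [[4, "U+00A0"]]
```

The only entry reported is the character at index 4, right after 'From', which is the no-break space. `filter_map` requires Ruby 2.7 or later.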

Let’s now consider the following regex rule, defined in a Ruby constant:

REGEX = /(?:From)\s*:\s*(?:.*?<)?([^<>\s]+@[^>\s]+)(?:>)?/i

When testing the mentioned string against this regex in a Ruby console, we don’t get any match:

REGEX.match("From\u00A0:     John DOE <test@test.com>")
=> nil

Why, you may wonder?
The reason is that, in Ruby, the \s matcher only covers ASCII whitespace (space, tab, newline, carriage return, form feed and vertical tab), so it does not match Non-Breaking Space Characters. To make it work, we need to update the regex to explicitly accept NBSCs, as follows:

REGEX = /(?:From)[\s\u00A0]*:\s*(?:.*?<)?([^<>\s]+@[^>\s]+)(?:>)?/i
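As an alternative to listing U+00A0 by hand, Ruby's POSIX bracket class `[[:space:]]` is Unicode-aware and also covers the no-break space. The following is only a sketch of that variant, not the fix used in this post; the constant names `NBSP_LINE`, `ASCII_REGEX` and `UNICODE_REGEX` are made up for the example.

```ruby
# Test string with the no-break space written as an explicit escape.
NBSP_LINE = "From\u00A0:     John DOE <test@test.com>"

# Original pattern from the post: \s is ASCII-only in Ruby.
ASCII_REGEX = /(?:From)\s*:\s*(?:.*?<)?([^<>\s]+@[^>\s]+)(?:>)?/i

# Same pattern with every \s swapped for the Unicode-aware POSIX class.
UNICODE_REGEX = /(?:From)[[:space:]]*:[[:space:]]*(?:.*?<)?([^<>[:space:]]+@[^>[:space:]]+)(?:>)?/i

p ASCII_REGEX.match(NBSP_LINE)      # => nil
p UNICODE_REGEX.match(NBSP_LINE)[1] # => "test@test.com"
```

The trade-off is that `[[:space:]]` accepts every Unicode whitespace character, which may be broader than what you want for a strict header parser.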

Everything looks fine up to that point.

However, and this is where it gets weird, when testing the original regex rule (the first one, without the \u00A0 part) on the same string in an interactive visualiser (https://regexr.com/ for instance), there is a match:

Screenshot of the test made on Regexr

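My understanding of the situation is that the interactive regex visualiser actually converts the NBSC to a regular whitespace when the string is pasted into its text input, simply because the browser interprets it as a regular whitespace in its HTML rendering. Another explanation worth checking is the regex engine itself: in JavaScript regexes, \s does match U+00A0, unlike in Ruby, so a browser-based tool relying on the JavaScript engine would report a match without any conversion taking place.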

This little experiment highlights the importance of testing regex patterns in the exact environment where they will be used. While online tools can be helpful for quick tests, they don't always accurately represent how the regex will behave in your production environment.
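One way to apply this in practice is to probe the target runtime directly. The sketch below prints which whitespace-like characters `\s` and `[[:space:]]` accept in the Ruby interpreter that runs it; the `CANDIDATES` list and the `whitespace_report` helper are illustrative names, and the list is far from exhaustive.

```ruby
# A few whitespace-like characters, keyed by their Unicode names.
CANDIDATES = {
  "SPACE" => "\u0020",
  "CHARACTER TABULATION" => "\u0009",
  "NO-BREAK SPACE" => "\u00A0",
  "EM SPACE" => "\u2003",
  "IDEOGRAPHIC SPACE" => "\u3000"
}.freeze

# Hypothetical helper: for each candidate, record whether \s and
# [[:space:]] match it in the current Ruby runtime.
def whitespace_report(candidates = CANDIDATES)
  candidates.to_h do |name, char|
    [name, { backslash_s: char.match?(/\s/), posix_space: char.match?(/[[:space:]]/) }]
  end
end

whitespace_report.each do |name, result|
  puts format("%-22s \\s=%-5s [[:space:]]=%s", name, result[:backslash_s], result[:posix_space])
end
```

Running the equivalent probe in another engine (a browser console, for instance) and comparing the two reports shows where the engines disagree.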

PS: It is worth mentioning that the string under scrutiny was copy-pasted from the original email at every stage of this experiment, meaning that the string itself wasn’t transformed by the copy-pasting operation.

Top comments (3)

Adrien Cohen

Nice catch! I may have run into a bug like this one this week. I'll make sure to identify the underlying Unicode character.

Olivier F

Interesting discovery indeed! I can imagine it was a pain to debug...
Did you try to contact the authors of the online interpreter to see if they're aware of this bug on their side?

Lucas M.

My feeling is that it's not a pattern-matching issue, but rather an implicit conversion of characters in the tool's text input (NBSCs and tabulation characters are displayed the same way in the tool, which makes me think the former is somehow converted into the latter). The interesting part is that you can't be 100% sure what you are matching your regex against, because of implicit transformations such as this one.
I'll drop an issue on their repo to try and dig a bit deeper!