The black art of Regular Expressions

#programming #regex #dry #javascript

The Portable Operating System Interface (POSIX) is a family of standards specified by the IEEE Computer Society. One of the POSIX standards, the one covering regular expressions, has been adopted (with some minor variations) by many programming languages including C, Java, Python and even JavaScript. Yet many developers are wary of using this powerful tool, despite the fact that not using it runs contrary to one of software engineering's guiding principles: Don't Repeat Yourself (DRY).

RegExp v DRY

As incomprehensible as Regular Expressions (RegEx) are to many developers, they provide an effective method of matching and tokenizing text. Yet by not using RegEx, developers have to reinvent the mechanism in some other non-standard way. Not exactly the most effective use of developers' time and effort.

Short introduction to RegEx

A regular expression is a string of text, sometimes with related flags, used to define a pattern of text you want to find.

For example, if we take the text "The quick brown fox jumps over the lazy dog" we can use a RegEx pattern, such as /\s/ (using the JavaScript syntax) to split the string into individual words.

const text = 'The quick brown fox jumps over the lazy dog';
const regExpPattern = /\s/;
const words = text.split(regExpPattern);
console.log(words.length); // 9

The RegEx pattern /\s/ matches a single whitespace character. There are of course several ways to achieve the same split operation. You don't even need RegEx to split on a matching string, but this is a simple 'introductory' example and RegEx is capable of much more.
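As a rough sketch of those alternatives, the snippet below compares a plain string separator with two patterns. The `messy` string is a made-up input (not from the example above) chosen to show where /\s/ and /\s+/ differ.

```javascript
const text = 'The quick brown fox jumps over the lazy dog';

// A plain string separator gives the same nine words as /\s/ here.
const bySpace = text.split(' ');

// Hypothetical input with a double space, a tab and a newline.
const messy = 'The  quick\tbrown\nfox';

// /\s/ splits on every single whitespace character,
// so the double space produces an empty string.
const bySingle = messy.split(/\s/);

// /\s+/ treats a run of whitespace as one separator.
const byRun = messy.split(/\s+/);

console.log(bySpace.length); // 9
console.log(bySingle);       // [ 'The', '', 'quick', 'brown', 'fox' ]
console.log(byRun);          // [ 'The', 'quick', 'brown', 'fox' ]
```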

If we analyse the 'text' using the following JavaScript we get an array with some additional properties in return.

const matches = text.match(/the/);
console.log(matches);

/*
[
  'the',
  index: 31,
  input: 'The quick brown fox jumps over the lazy dog',
  groups: undefined
]
*/

The first element is the exact text that was matched, and the index property indicates where in the source text the match was found. The input property is the text on which the match was performed. The groups property is not used in this example and is outside the scope of this post.

Notice how it was 'the' and not 'The' that was matched. RegEx patterns are case sensitive by default. To match 'The', the pattern could be changed to /The/. Alternatively, the pattern could be changed to /[Tt]he/ to broaden our options. Another option is to use the 'i' flag (/the/i) to make the match case insensitive. However, the last two approaches can match either instance of 'the' in the subject text: match() returns only the first one found ('The' at index 0) unless the 'g' flag is added, in which case it returns both. Prefixing the pattern with '^' (/^the/i) means only the instance of 'the' at the start of the subject text will be matched.
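The variations described above can be tried side by side. This is only a small sketch; the 'g' flag used in the last two lines is an addition to show every match at once.

```javascript
const text = 'The quick brown fox jumps over the lazy dog';

// Case sensitive: only the lowercase 'the' matches.
const lowerIndex = text.match(/the/).index;

// A character class or the 'i' flag also accepts the leading 'The'.
const classIndex = text.match(/[Tt]he/).index;
const flagIndex = text.match(/the/i).index;

// With the 'g' flag, match() returns all matched strings.
const allMatches = text.match(/the/gi);

// Anchoring with ^ keeps only the match at the start of the text.
const anchoredMatches = text.match(/^the/gi);

console.log(lowerIndex);      // 31
console.log(classIndex);      // 0
console.log(flagIndex);       // 0
console.log(allMatches);      // [ 'The', 'the' ]
console.log(anchoredMatches); // [ 'The' ]
```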

Of course, regular expressions can get far more complicated than the examples above. Crafting (or should I say conjuring) them is considered by many to be something of a black art, akin to sorcery or alchemy, full of hazards and pitfalls.

Guidance I have found helpful

Test, test and test some more

It is vital to exercise RegEx patterns not only with positive cases, to ensure they detect what you intended, but also with negative cases, to ensure they don't pick up matches they should not. You can't exercise every permutation, so as a guide to which tests to include it can be useful to understand the routes through the pattern. These are illustrated in the banner at the top of this post, which can be generated at Debuggex[1].

In the illustration the pattern /^Reg(ular )?Exp(ression)?$/ will match both 'RegExp' and 'Regular Expression', which might have been the intention. But it probably was not the intention to also match 'RegExpression' or 'Regular Exp'. It is all too easy to make such a mistake, so care has to be taken.
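A quick table of positive and negative cases makes this kind of slip visible. In the sketch below, the `stricter` pattern is only one possible fix, assuming the intention really was to accept just the two spellings; 'Regex' is an extra negative case added for illustration.

```javascript
const pattern = /^Reg(ular )?Exp(ression)?$/;

// Hypothetical stricter alternative: spell out the two accepted forms.
const stricter = /^(RegExp|Regular Expression)$/;

const candidates = [
  'RegExp',
  'Regular Expression',
  'RegExpression', // probably unintended
  'Regular Exp',   // probably unintended
  'Regex',         // negative case
];

// Record how each pattern treats each candidate.
const results = candidates.map((s) => ({
  input: s,
  original: pattern.test(s),
  stricter: stricter.test(s),
}));

console.table(results);
// original: true, true, true, true, false
// stricter: true, true, false, false, false
```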

Focus the pattern by topping and tailing

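When a pattern is to be applied to the beginning, the end or the entire source text, it is a good idea to use the start-of-line ^ and/or end-of-line $ anchors in the pattern. In JavaScript, without the 'm' flag, these match the start and end of the whole input rather than of each line.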
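To illustrate the difference anchors make, here is a minimal sketch using a made-up requirement (a value that must be exactly four digits); the inputs are hypothetical.

```javascript
// Without anchors the pattern matches anywhere inside the text.
const unanchored = /\d{4}/;

// Topped and tailed, the whole input must be exactly four digits.
const anchored = /^\d{4}$/;

const looseHit = unanchored.test('order 12345 shipped');
const tooLong = anchored.test('12345');
const exact = anchored.test('2024');

console.log(looseHit); // true
console.log(tooLong);  // false
console.log(exact);    // true
```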

Limit repetition whenever possible

There are a couple of quantifiers (+ and *) to deal with multiple occurrences within a pattern, but these should be used with caution as they are open-ended and so potential vectors for abuse. If an upper limit can be assumed, instead of using + for 1 or more, or * for any number of occurrences, the range syntax {min,max} (written without a space) is preferable.

E.g.
Instead of /A+/ to match A, AA, or an unlimited number of As, which is unlikely to be the requirement, it might be better to assume an upper limit such as 20 and use /A{1,20}/.
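Likewise, in place of using /AB*C/, if we can assume there will be any number between 0 and 6 Bs in between A and C, a better pattern might be /AB{0,6}C/. Note that JavaScript does not accept the {,6} shorthand some other languages allow; the lower bound must be written explicitly.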
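The bounded forms can be checked quickly. The sketch below adds ^ and $ anchors, which is an assumption on top of the patterns above: without them, a bounded pattern still finds a match inside a longer run.

```javascript
// Unanchored, /A{1,20}/ still matches within a run of 21 As.
const unanchoredRun = /A{1,20}/.test('A'.repeat(21));

// Anchored, the upper limit is enforced on the whole input.
const anchoredRun = /^A{1,20}$/.test('A'.repeat(21));

const bounded = /^AB{0,6}C$/;
const noBs = bounded.test('AC');
const threeBs = bounded.test('ABBBC');
const sevenBs = bounded.test('A' + 'B'.repeat(7) + 'C');

// Without the 'u' flag, JavaScript treats {,6} as literal characters,
// not as a quantifier.
const shorthand = /^AB{,6}C$/;
const shorthandOnBs = shorthand.test('ABBBC');
const shorthandLiteral = shorthand.test('AB{,6}C');

console.log(unanchoredRun);    // true
console.log(anchoredRun);      // false
console.log(noBs);             // true
console.log(threeBs);          // true
console.log(sevenBs);          // false
console.log(shorthandOnBs);    // false
console.log(shorthandLiteral); // true
```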