Steven Godson

Posted on Oct 10, 2021

Introduction to RegEx

#regex #php #javascript #tutorial

So I thought it was about time I did some discovery and learning around the use of RegEx and what it can bring to a project.

To ensure that I got a structured introduction I undertook the course called REGULAR EXPRESSIONS FOR BEGINNERS – UNIVERSAL on Udemy by Edwin Diaz, which I thoroughly recommend as Edwin is great at boiling the essence of a topic down into something that is easily understood.

The following are the notes that I took while working through the course along with some worked examples, some of which were derived from the course and some of which I’ve implemented in other projects.

Hopefully this will be of some use to you and give you a broad understanding of RegEx.

I recommend that you use an online tool for working through this as it will help bring it to life. Personally I use https://RegEx101.com/ as it works with a number of languages, has a dictionary of syntax and will actually explain to you what your expression is doing as you write it out.

REGEX OPTIONS

In its most basic form RegEx will match against a specified set of characters within a target string:

Example:

/car/g

will search for every instance of the string “car” within a target string of text.

/car/gi

will do the same as the above but will be case insensitive.

/car/gim

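will do the same as above, and the m (multiline) option also makes the ^ and $ anchors match at the start and end of every line rather than only at the start and end of the whole text.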

/car/s

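will search for “car” with the s (single line or “dotall”) option switched on, which makes the full stop/period wildcard match new line characters as well. It makes no difference to this particular expression as there is no full stop in it.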

/car/imgu

will be case insensitive, apply ^ and $ to every line, search globally and treat the expression and the text as Unicode.

It is important that you get the setting of these options correct in your expressions to ensure that you’re matching against exactly what you want.
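To see the options side by side, here is a minimal JavaScript sketch. The sample text is made up for illustration, and other engines may name or handle the options slightly differently, so treat it as a rough guide only.

```javascript
// Made-up sample text, three lines long.
const text = "Car park\ncar wash\nscar";

// g: find every match, not just the first one.
const lower = text.match(/car/g);
// gi: the same, but ignoring case.
const anyCase = text.match(/car/gi);
// m: ^ now matches at the start of every line.
const lineStarts = text.match(/^car/gim);
// s: the dot is allowed to match a new line character.
const acrossLines = /park.car/s.test(text);
const sameLineOnly = /park.car/.test(text);

console.log(lower);        // [ 'car', 'car' ]
console.log(anyCase);      // [ 'Car', 'car', 'car' ]
console.log(lineStarts);   // [ 'Car', 'car' ]
console.log(acrossLines);  // true
console.log(sameLineOnly); // false
```

Note that the lowercase search still finds the “car” inside “scar”, which is exactly the sort of thing the options alone will not protect you from.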

META CHARACTERS

/c.r/i

adding a full stop/period anywhere in your expression is essentially adding a wildcard for a single character. This means that, under this example, the expression will match against any three characters that start with the letter c and end with the letter r, such as “car”, “cur” or “c-r”. It will not care if there is a letter or a symbol in between them (only a new line is excluded by default), so be careful if you specifically want to search for an actual full stop/period.

ESCAPING

/c\.r/g

so if you want to search specifically for a full stop/period, or if your search string includes something like a “/” because you are searching a URL for example, then you will need to escape that particular character. This is done by adding a “\” in front of the character, so the example expression above will only return matches for “c.r”.

some language engines will automatically escape characters so RTFD…
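The difference between the wildcard and the escaped full stop is easiest to see with a handful of test strings. This is only a sketch in JavaScript, and the sample strings and the example.com URL are made up for illustration.

```javascript
// Made-up sample strings for illustration.
const samples = ["car", "c.r", "c r", "cr", "caar"];

// Unescaped dot: exactly one character of any kind between c and r.
const wildcard = samples.filter((s) => /c.r/.test(s));

// Escaped dot: only a literal full stop/period will do.
const literal = samples.filter((s) => /c\.r/.test(s));

// A forward slash has to be escaped inside a /.../ literal.
const hasPath = /example\.com\/blog/.test("https://example.com/blog");

console.log(wildcard); // [ 'car', 'c.r', 'c r' ]
console.log(literal);  // [ 'c.r' ]
console.log(hasPath);  // true
```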

MORE CHARACTERS

escaping can also be used to match against a whole host of different characters or actions within your RegEx expression, for example

/C\n\tR/

will specifically look for C, then a new line, then a tab and then an R.

But be mindful that these characters work in different ways depending on what language engine you are working with e.g. JavaScript, .NET or PHP, so refer to the documentation.

RANGES

[car]

treats this as a set of letters to search for and will return every instance of each individual letter.

[a-z]

search for every lowercase alphabetical letter and return every match. This is case sensitive, so this example will only search for lowercase letters whereas [A-Z] will only search for uppercase. They can both be combined in the same range, i.e. [A-Za-z]. [0-9] will also do the same thing, but for the range of digits from zero to nine. The ranges do not have to start or finish as stated above and could just as easily be [b-f] for example.

[abdq]werty

search for any one of the letters within the range followed immediately by “werty”, so it would match “qwerty” and “awerty” but not “zwerty”.

the start and end point of the range can be anything you want them to be as long as they are separated by a hyphen (-) within square brackets.
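Here is a rough JavaScript sketch of the ranges above, using made-up sample strings. Each match call returns every single character (or word) that the range picks out.

```javascript
// Made-up sample string.
const sample = "Cab 42";

// A set of individual letters; the capital C is not in it.
const fromSet = sample.match(/[abc]/g);
// Lowercase only.
const lowercase = sample.match(/[a-z]/g);
// Upper and lowercase combined in one range.
const letters = sample.match(/[A-Za-z]/g);
// Digits zero to nine.
const digits = sample.match(/[0-9]/g);
// One letter from the set followed immediately by "werty".
const keyboards = "qwerty awerty zwerty".match(/[abdq]werty/g);

console.log(fromSet);   // [ 'a', 'b' ]
console.log(lowercase); // [ 'a', 'b' ]
console.log(letters);   // [ 'C', 'a', 'b' ]
console.log(digits);    // [ '4', '2' ]
console.log(keyboards); // [ 'qwerty', 'awerty' ]
```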

NEGATION

negation is when you tell the expression to exclude something, which is done using the ^ symbol (shift + 6 on a Windows keyboard) as the first character inside the square brackets. An example of this would be [^cat] which would tell your expression to match any single character that is not one of the letters within the range.
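A quick JavaScript sketch of negation, with made-up sample strings. The second line shows a fairly common use, which is stripping out everything that is not in the range.

```javascript
// Every character that is NOT c, a or t.
const leftovers = "catnip".match(/[^cat]/g);

// Remove everything that is not a digit.
const digitsOnly = "2021-10-10".replace(/[^0-9]/g, "");

console.log(leftovers);  // [ 'n', 'i', 'p' ]
console.log(digitsOnly); // 20211010
```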

SHORTHAND

so shorthand is, and don’t shout at me for this, a bit like a macro or short name function as you can type in \ followed by a specific character or letter and it will produce the equivalent of typing out a longer expression range. I have added a couple of examples below but for full details refer to the documentation for your language engine:

\s - looks for any whitespace character.

\S - looks for any non-whitespace character.

\d - looks for any digit.

\D - looks for any non-digit.

\w - looks for any word character.

and the list goes on. What is good about this is that you can combine them within ranges and negation to make your code shorter.

But be mindful as some of them produce slightly odd results such as \b, which looks at the boundary of what it considers to be a word. Its idea of a word includes digits and underscores, e.g. \b\w+\b would match against all of Lettuce468.

These can also be used to create a pattern against which to match if you need to be very specific, for example;

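/\w\w\w\w\w\w@xenos-design\.co\.uk/

would match against my email address, or any email address under the same domain where the @ symbol has six word characters directly in front of it. Note that there are no square brackets around the expression as they would turn the whole thing into a range of single characters.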
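The shorthand characters can be tried out with a few lines of JavaScript. This is just a sketch: the sample strings, the telephone number and the mailbox names in front of the @ symbol are all made up for illustration, and the email expression is written without square brackets so that it is treated as a sequence rather than a range.

```javascript
// \s splits on any whitespace, including the tab.
const tokens = "a b\tc".split(/\s/);

// \D is any non-digit, so this leaves only the digits behind.
const digitsOnly = "Tel: 555 0100".replace(/\D/g, "");

// \w counts digits as word characters too.
const wordChars = "Lettuce468".match(/\w/g).length;

// Six word characters, then the domain with its full stops escaped.
const email = /\w\w\w\w\w\w@xenos-design\.co\.uk/;
const sixBefore = email.test("abcdef@xenos-design.co.uk");
const threeBefore = email.test("abc@xenos-design.co.uk");

console.log(tokens);      // [ 'a', 'b', 'c' ]
console.log(digitsOnly);  // 5550100
console.log(wordChars);   // 10
console.log(sixBefore);   // true
console.log(threeBefore); // false
```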

REPETITION

Quantifiers – these are meta characters that, when added, tell your expression to select varying amounts of the character that precedes them, for example;

a? - will match zero or one of a, where a represents what you want to match against.

a* - will look for zero or more of a. I have also seen this described as a greedy quantifier because it will match as many times as possible.

a+ - will look for one or more of a.

a{x} - will look for the specified number of a where x equals a number.

a{x,} - will look for x or more of a where x is a number.

a{x,y} - will look for the number of a’s between x and y.

a*? - This will match a zero or more times, but as few times as possible. This is known as a lazy or reluctant quantifier.

These can be combined to create an expression that will search for a pattern, an example of this would be

/\d{5}-\d{4}/

would match any set of values that look like this: 12345-6789. An obvious use case for this could be if you’re searching for telephone numbers in a dataset where there is a specified format. The same could obviously be done for text strings as well, or indeed combinations of both.
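The quantifiers are easier to follow when you can see what each one grabs. The following is a small JavaScript sketch with made-up sample strings; the greedy versus lazy comparison near the end is the one that tends to catch people out.

```javascript
// ? makes the u optional.
const optional = ["color", "colour"].every((w) => /colou?r/.test(w));

// + takes one or more, as many as it can.
const oneOrMore = "baaad".match(/a+/)[0];

// {2,3} takes between two and three, preferring three.
const bounded = "aaaa".match(/a{2,3}/)[0];

// Greedy .* runs to the last closing tag, lazy .*? stops at the first.
const html = "<b>one</b><b>two</b>";
const greedy = html.match(/<b>.*<\/b>/)[0];
const lazy = html.match(/<b>.*?<\/b>/)[0];

// Five digits, a hyphen, then four digits.
const formatted = /\d{5}-\d{4}/.test("12345-6789");
const wrongShape = /\d{5}-\d{4}/.test("1234-56789");

console.log(optional);   // true
console.log(oneOrMore);  // aaa
console.log(bounded);    // aaa
console.log(greedy);     // <b>one</b><b>two</b>
console.log(lazy);       // <b>one</b>
console.log(formatted);  // true
console.log(wrongShape); // false
```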

GROUPING

must be done outside of the character set/range, otherwise the parentheses will just be treated as literal characters. However, a character set/range can be put inside a grouping, so ([0-9]) will work as a group but [()] will only match a literal “(” or “)”.

An example of how this would work is save(d)? which makes ‘d’ optional and therefore would match against both ‘save’ and ‘saved’.
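As a minimal JavaScript sketch (the sample sentences are made up), this shows the optional group in save(d)? and what happens when parentheses are placed inside square brackets.

```javascript
// The group is optional, so both spellings match.
const withD = "I saved it".match(/save(d)?/);
const withoutD = "I save it".match(/save(d)?/);

// Inside square brackets the parentheses are just literal characters.
const brackets = "(ok)".match(/[()]/g);

// A quantifier after a group repeats the whole group.
const laughs = "ha haha".match(/(ha)+/g);

console.log(withD[0], withD[1]);       // saved d
console.log(withoutD[0], withoutD[1]); // save undefined
console.log(brackets);                 // [ '(', ')' ]
console.log(laughs);                   // [ 'ha', 'haha' ]
```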

ALTERNATION

the use of the pipe symbol ‘|’ (shift + \ on a windows keyboard) effectively works as an OR statement. However some nuances of it are;

- whatever is written on the left takes precedence.

- global does not need to be switched on for it to work, but without it only the first match is returned.

- it can be used as many times as needed.

a more effective way to use this is to include grouping i.e. (Bat|Super)man will return against both ‘Superman’ and ‘Batman’.

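Example to work on – (\w+|file\d{3}_export\.sql) against file201_export.sql. Remember that the left of the pipe sign takes precedence and the alternation is classed as eager, i.e. it stops at the first option that matches. As \w+ is looking for all word characters it will match against an underscore but not a hyphen or a full stop, so here it returns file201_export and the second option is never tried.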

Alternation can also be used in a nested group i.e. (soup (bowl|spoon)) will return against ‘soup bowl’ and ‘soup spoon’. Be mindful of the spacing as this is key to it working.
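To see how the order of the options changes the result, here is a rough JavaScript sketch using the examples from this section. The file name pattern is written with the full stop escaped, which is my assumption about how it was meant to read.

```javascript
// Grouping keeps "man" outside the alternation.
const heroes = "Batman and Superman".match(/(Bat|Super)man/g);

// The option on the left is tried first and wins if it matches.
const file = "file201_export.sql";
const leftWins = file.match(/\w+|file\d{3}_export\.sql/)[0];
const swapped = file.match(/file\d{3}_export\.sql|\w+/)[0];

// Nested group: index 1 holds whichever option matched.
const utensil = "soup spoon".match(/soup (bowl|spoon)/)[1];

console.log(heroes);   // [ 'Batman', 'Superman' ]
console.log(leftWins); // file201_export
console.log(swapped);  // file201_export.sql
console.log(utensil);  // spoon
```

Putting the more specific option first is usually the safer choice when one option can match part of what the other would match.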

ANCHORS

^s - will look for an ‘s’ at the beginning of a string i.e. it has to be the first character.

s$ - as above but at the end of the string.

^[a-z] - will look for any text string that starts with a lowercase letter.
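A small JavaScript sketch of the anchors, with made-up sample strings. The last pair combines both anchors so that the whole string has to fit the pattern, which is handy for validation.

```javascript
const startsWithS = /^s/.test("steven");   // s is the first character
const notAtStart = /^s/.test("Godson");    // there is an s, but not first
const endsWithS = /s$/.test("notes");      // s is the last character
const lowerStart = /^[a-z]/.test("regex"); // starts with a lowercase letter
const upperStart = /^[a-z]/.test("RegEx"); // starts with an uppercase letter

// Both anchors together: exactly five digits and nothing else.
const exactlyFive = /^\d{5}$/.test("12345");
const tooLong = /^\d{5}$/.test("123456");

console.log(startsWithS, notAtStart, endsWithS); // true false true
console.log(lowerStart, upperStart);             // true false
console.log(exactlyFive, tooLong);               // true false
```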

WORD BOUNDARIES

\b - this will match at the boundary of each word, i.e. the position between a word character and a non-word character (or the start or end of the string).

\B - this will match on a non-word boundary, which is somewhat confusing, as it matches at every position that \b does not. It will not match anywhere in a single word character on its own, but in a longer word it will match at the positions between the letters, so in ‘test’ it matches between ‘t’ and ‘e’, ‘e’ and ‘s’, and ‘s’ and ‘t’ but not before the first ‘t’ or after the last one.
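Because \b and \B match positions rather than characters, I find it helps to make those positions visible. This is a JavaScript sketch with made-up sample strings; the last line drops a hyphen into every position that \B matches.

```javascript
// Only the standalone word is matched, not "concat" or "cats".
const wholeWord = "cat concat cats".match(/\bcat\b/g);

// Digits count as word characters, so this is one word.
const words = "Lettuce468 x".match(/\b\w+\b/g);

// \B matches the gaps between letters, but not the two ends.
const gaps = "test".replace(/\B/g, "-");

console.log(wholeWord); // [ 'cat' ]
console.log(words);     // [ 'Lettuce468', 'x' ]
console.log(gaps);      // t-e-s-t
```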

BACK REFERENCES

A back reference is a way of referring to the string of text or digits matched within a grouping.

Typically engines let you use at least 9 numbered back references, shown as follows day(light) \1 with the ‘\1’ being the reference to the “variable”. This would only match if the text string it is searching is written as follows: daylight light.

Example:

<p id="para">Steven Godson</p>

var para = document.getElementById('para').innerHTML;

var pattern = /(\w+)\s(\w+)/;

var newString = para.replace(pattern, "$2");

console.log(newString);

so you can add this to an HTML file, then run it in a browser and look at the console to see that all that has been console logged is the second part of my name i.e. “Godson”. The JavaScript gets the inner HTML from the paragraph element and applies the pattern over it, which is essentially searching for a ‘word space word’ pattern and assigning the variables $1 and $2 to the two groupings.

The matched text is then replaced with the value of $2, the result is assigned to the variable newString, and this is what is console logged.

NON-CAPTURING GROUP

In this example we see yet another way a ‘?’ can be used to do something different within your expression.

/(food) and (?:travel) and \1/

will match against “food and travel and food” because it is repeating the first variable, whereas

/(?:food) and (travel) and \1/

will match against “food and travel and travel” because the non-capturing group is not counted, which makes travel the first variable.

Using the “?:” switches off the variable for that group, so the remaining groups are numbered as if it were not there.
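Here is a short JavaScript sketch of back references with and without a non-capturing group. One thing worth checking for yourself is the numbering: a group marked with ?: is not counted, so in this sketch the one remaining capturing group is referred to as \1 in both expressions.

```javascript
// \1 repeats whatever the first capturing group matched.
const daylight = /day(light) \1/.test("daylight light");
const mismatch = /day(light) \1/.test("daylight dark");

// (travel) is non-capturing, so (food) is group 1.
const foodAgain = /(food) and (?:travel) and \1/.test("food and travel and food");

// (food) is non-capturing, so (travel) becomes group 1.
const travelAgain = /(?:food) and (travel) and \1/.test("food and travel and travel");

// The match array holds the full match plus one entry per capturing group.
const groups = "food and travel".match(/(?:food) and (travel)/);

console.log(daylight, mismatch);     // true false
console.log(foodAgain, travelAgain); // true true
console.log(groups.length);          // 2
console.log(groups[1]);              // travel
```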

POSITIVE AND NEGATIVE ASSERTIONS

/[A-Za-z]+(?=,)/

this is a positive look ahead and will search for any string of upper or lowercase letters that is followed by a comma. The comma itself is not included in the match.

/[A-Za-z]+(?!,)/

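this literally does the reverse (if != ,) i.e. does not equal comma and so will match strings of upper and lowercase letters that are not followed by a comma. Be aware that the expression is allowed to stop one letter short, so against “cat,” it will still match “ca”, as the next character is then a “t” and not a comma.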
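The two expressions above can be compared in a few lines of JavaScript. This is a sketch with a made-up sample string, and the last expression adds word boundaries, which is my own suggestion rather than something from the course, to stop the negative version from matching part of a word.

```javascript
// Made-up sample string.
const list = "apples, pears, plums";

// Words followed by a comma; the comma is not part of the match.
const beforeComma = list.match(/[A-Za-z]+(?=,)/g);

// On its own the negative version gives up a letter to satisfy the check.
const loose = list.match(/[A-Za-z]+(?!,)/g);

// Word boundaries force whole words only.
const strict = list.match(/\b[A-Za-z]+\b(?!,)/g);

console.log(beforeComma); // [ 'apples', 'pears' ]
console.log(loose);       // [ 'apple', 'pear', 'plums' ]
console.log(strict);      // [ 'plums' ]
```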

POSITIVE AND NEGATIVE LOOK BEHIND

/(?<=,)[A-Za-z]+/

very similar to the previous section except that by adding the “<” you’re telling the expression to look at what comes before i.e. under this example it would look for every text string that is preceded by a comma. Note that the look behind goes in front of the letters it applies to.

/(?<!,)[A-Za-z]+/

again, literally the reverse, where you are looking for everything that is not preceded by a comma.
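Below is a minimal JavaScript sketch of looking at what comes before a match. It assumes an engine that supports this syntax (recent versions of Node and the major browsers do, but not every engine), the sample string is made up, and the check is written in front of the letters because that is the position it has to inspect.

```javascript
// Made-up sample string.
const text = "one,two three,four";

// Words that have a comma directly in front of them.
const afterComma = text.match(/(?<=,)[A-Za-z]+/g);

// Whole words that do not have a comma directly in front of them.
const notAfterComma = text.match(/(?<!,)\b[A-Za-z]+/g);

// Placed after the letters, the check can never pass, because the
// character just before that point is always a letter.
const wrongPlace = /[A-Za-z]+(?<=,)/.test(text);

console.log(afterComma);    // [ 'two', 'four' ]
console.log(notAfterComma); // [ 'one', 'three' ]
console.log(wrongPlace);    // false
```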

MULTILANGUAGE SYMBOL SUPPORT

RegEx includes support for Unicode, so no matter what language you are searching against you will be able to use Unicode to create a match in your expression.

You can find the full Unicode listings at https://home.unicode.org/.

It is included in your RegEx expression as follows: \u2022, using the “\” to escape the u so that it is turned into the Unicode character. This is the JavaScript style and other engines may write it differently, so check the documentation.
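As a rough JavaScript sketch, the first line below uses the \u2022 escape from above. The rest goes slightly beyond my notes and uses the \p{L} property escape, which in JavaScript needs the u option and means “any letter in any language”; treat that part as an optional extra. The sample text is made up and written with escapes so it does not depend on file encoding.

```javascript
// \u2022 is the bullet character.
const hasBullet = /\u2022/.test("\u2022 first item");

// Made-up sample: "naive cafe" with accented letters.
const text = "na\u00EFve caf\u00E9";

// [a-z] only knows the unaccented letters, so the words get chopped up.
const asciiOnly = text.match(/[a-z]+/g);

// \p{L} with the u option matches letters from any language.
const anyLetters = text.match(/\p{L}+/gu);

console.log(hasBullet);         // true
console.log(asciiOnly.length);  // 3
console.log(anyLetters.length); // 2
```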

EXAMPLES

Password Validation – the following is an example of an expression that could be used to validate the contents of the user’s chosen password to ensure that it matches the policy in place on our project.

/^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!$£#])\S{5,20}$/gm

(?=.*[A-Z]) – this validates against the password having at least one upper case character.

(?=.*[a-z]) – this validates against the password having at least one lowercase character.

(?=.*\d) – this validates against the password having at least one number.

(?=.*[!$£#]) – this validates against password having at least one of the identified symbols within the square brackets.

\S{5,20} – this validates the password being a minimum of 5 and a maximum of 20 characters, with no whitespace.
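To try the policy out, here is a small JavaScript sketch. The passwords are made up, and I have left off the g and m options here as an assumption that each password is checked one at a time (in JavaScript the g option also makes test() remember its position between calls, which can give surprising results).

```javascript
// Same expression as above, without the g and m options.
const policy = /^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!$£#])\S{5,20}$/;

const valid = policy.test("Passw0rd!");       // meets every rule
const noUpper = policy.test("password1!");    // no uppercase letter
const tooShort = policy.test("Pa1!");         // only four characters
const hasSpace = policy.test("Pass w0rd!");   // contains whitespace
const noSymbol = policy.test("Password123");  // none of ! $ £ #

console.log(valid);    // true
console.log(noUpper);  // false
console.log(tooShort); // false
console.log(hasSpace); // false
console.log(noSymbol); // false
```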

Pretty URLs – the following example is something that I have deployed myself and is commonly seen on websites to make the URL in the browser more human readable. This example is specific to PHP running on an Apache server.

Step One – ensure your Apache server has the rewrite engine switched on.

Step Two – create a new file called .htaccess in your website’s root directory.

Step Three – open it with your code editor and add the following:

RewriteEngine on

this switches the rewrite engine on

RewriteRule ^post/(\d+)$ post.php?p_id=$1 [NC,L]

this tells the server to take any request for post/post number e.g. domainname.com/post/178 and serve it from post.php?p_id=$1, where $1 equals the number identified in the group, with the NC denoting that it is case insensitive and the L denoting that this is the last rule that should be processed.
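If you want to see what the rule does to a path before touching your server, the same substitution can be mimicked in JavaScript. This is purely an illustration: the rewrite function is a made-up helper, the i option stands in for NC, and Apache does the real work in practice.

```javascript
// The pattern from the rule; the i option plays the part of NC.
const rule = /^post\/(\d+)$/i;

// Hypothetical helper that applies the substitution to a request path.
function rewrite(path) {
  return path.replace(rule, "post.php?p_id=$1");
}

console.log(rewrite("post/178")); // post.php?p_id=178
console.log(rewrite("POST/42"));  // post.php?p_id=42
console.log(rewrite("post/abc")); // post/abc (no digits, so left alone)
```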

SUMMARY

Hopefully this brief introduction to RegEx has sparked your interest to go and explore more for yourself and understand how this very powerful tool could be used in your projects.

I’ve enjoyed learning about something that seemed to be quite daunting before, but now seems quite simple once you understand the syntax.

I have added some references below to language specific documentation and a couple of tools that I found useful during this learning process.

REFERENCES AND RESOURCES

.net - https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference

JS - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

PHP - https://www.php.net/manual/en/reference.pcre.pattern.syntax.php

Java - https://www.w3schools.com/java/java_regex.asp

Golang - https://golang.org/pkg/regexp/syntax/

Online Tool - https://RegEx101.com/
