How YOU can learn enough RegEx in JavaScript to be dangerous

Chris Noring ・11 min read

Follow me on Twitter, happy to take your suggestions on topics or improvements /Chris

I'm writing this to my future self. In fact, a lot of my articles are written for a future self who has forgotten how to do something. RegEx, or Regular Expressions, is a really powerful tool in our toolbox. Sadly, we tend to refer to it as black magic, the devil and other charming things. It doesn't have to be like that. RegEx is different from normal programming for sure, but it is also really powerful. Let's learn how it works and how to apply it to everyday problems that you recognize.

TLDR; Is this long? Yes, but it does go through the major constructs in RegEx. Also, I have some nice recipes at the end on how to do things like RegEx for email, passwords, date format conversions and how to process URLs. If you have never worked with RegEx before or you struggle to see past all that weird magic - this is for you. Happy reading :)

References

There are some great resources out there for RegEx that I consult regularly. Take the time to read them. Sometimes they explain how RegEx is processed and can explain why the magic happens:

How to practice

  • Node.js REPL, if you have Node.js installed I recommend just typing node in the terminal. This will start the REPL, which is a great way to test patterns
  • JavaScript REPL, this is a VS Code extension that evaluates what you type. You will get instant feedback on results
  • Browser, pulling up Dev Tools in your browser and using the Console will work fine as well

  • RegEx 101
    Great sandbox environment. Thanks for the tip, Lukasz :)

Regular Expressions

Regular Expressions or RegEx is about pattern matching. A lot of what we do is really about pattern matching if we think about it. RegEx is really good at matching patterns and extracting values from found patterns. So what kind of problems can we solve?

  • URL, a URL contains a lot of interesting information like hostname, route, port, route parameters and query parameters. We want to be able to extract this information but also validate the correctness.
  • Password, the longer the password the better, is usually what we want. There are other dimensions as well like complexity. With complexity, we mean our password should contain for example numbers, special characters and a lot more.
  • Find and extract data, having the ability to find data on a web page, for example, can be made really easy using a couple of well written Regular Expressions. There is actually a whole category of computer programs dedicated to this called screen scrapers.

A regular expression is created either like this:

/pattern/

It starts and ends with /.

Or like this, where we create an object from the RegExp class:

new RegExp(/pattern/)
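The constructor form earns its keep when part of the pattern is only known at runtime, because it also accepts a plain string. Here is a minimal sketch of that idea. The names word and escapeRegExp are made up for this example, and escapeRegExp is a helper I define here, not a built-in:

```javascript
// Literal form: the pattern is fixed when the script is written.
const literal = /products/i

// Constructor form: the pattern is built from a string at runtime.
// `word` is a hypothetical value standing in for dynamic input.
const word = 'products'
const fromString = new RegExp(`^/${word}/?$`, 'i')

// Special characters in dynamic input have to be escaped first.
// `escapeRegExp` is a helper written for this example, not a built-in.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
const price = new RegExp(escapeRegExp('$9.99'))

console.log(literal.test('/Products'))      // true
console.log(fromString.test('/products/'))  // true
console.log(price.test('only $9.99 today')) // true
```

Note that / does not need escaping inside a string pattern, but characters like $ and . do if you want them matched literally.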

Methods

There are a few different methods meant for different types of usage. Learning to use the correct method is important.

  • exec(), Executes a search for a match in a string. It returns an array of information or null on a mismatch.
  • test(), tests for a match in string, answers with true or false
  • match(), Returns an array containing all of the matches, including capturing groups, or null if no match is found.
  • matchAll(), Returns an iterator containing all of the matches, including capturing groups.
  • search(), Tests for a match in a string. It returns the index of the match, or -1 if the search fails.
  • replace(), Executes a search for a match in a string, and replaces the matched substring with a replacement substring.
  • split(), Uses a regular expression or a fixed string to break a string into an array of substrings.

Let's show some examples given the above methods.

test(), test string for true/false

Let's look at an example using test():

/\w+/.test('abc123') // true

Above we are testing the string abc123 against the pattern \w+, which means one or more word characters (letters, digits or _). We are answering the question: do you contain any word characters?

match(), find matches

Let's look at an example:

'orders/items'.match(/\w+/) // [ 'orders', index: 0, input: 'orders/items', groups: undefined ]

The above array response tells us we are able to match orders with our pattern \w+. We didn't capture any named groups, as indicated by groups: undefined, and our match was found at index: 0. If we wanted to match all the words in the string we would need to add the g flag. g indicates a global match, like so:

'orders/items'.match(/\w+/g) // ['orders', 'items']
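The remaining methods from the list above follow the same idea. Here is a small sketch of exec(), search(), replace(), split() and matchAll() on the same string. The variable names are just for illustration:

```javascript
const path = 'orders/items'

// exec() returns the first match plus its position, or null.
const first = /\w+/.exec(path)
console.log(first[0], first.index) // orders 0

// search() returns the index of the first match, or -1.
console.log(path.search(/items/)) // 7
console.log(path.search(/\d/))    // -1

// replace() swaps the matched part for a replacement string.
console.log(path.replace(/items/, 'products')) // orders/products

// split() breaks the string wherever the pattern matches.
console.log(path.split(/\//)) // [ 'orders', 'items' ]

// matchAll() needs the g flag and returns an iterator of match arrays.
const words = [...path.matchAll(/\w+/g)].map((m) => m[0])
console.log(words) // [ 'orders', 'items' ]
```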

Groups

We also have the concept of groups. To start using groups we need to wrap our pattern in parentheses like so:

const matchedGroup = 'orders/114'.match(/(?<order>\d+)/) // ['114', '114', index: 7, input: 'orders/114', groups: { order: '114' }]

The usage of the construct ?<order> creates a so-called named group.
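Here is a short sketch of how you might read that result, by position or by name. The variable names are my own, and note that captured values come back as strings:

```javascript
const match = 'orders/114'.match(/(?<order>\d+)/)

console.log(match[0])           // 114, the full match
console.log(match[1])           // 114, the first capturing group
console.log(match.groups.order) // 114, the same group by name
console.log(match.index)        // 7, where the match starts

// Captured values are strings, so convert when a number is needed.
const orderId = Number(match.groups.order)
console.log(typeof match.groups.order, typeof orderId) // string number
```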

Flags

There are different flags. Let's list some of them. All flags are added at the end of the Regular expression. So a typical usage looks like this:

var re = /pattern/flags;

  • g, a global match: find every occurrence in the string, not just the first one
  • i, case-insensitive matching
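To see what the two flags change in practice, here is a tiny sketch on a made-up string:

```javascript
const text = 'Cat, cat, CAT'

// No flags: only the first case-sensitive match is reported.
const firstOnly = text.match(/cat/)
console.log(firstOnly[0], firstOnly.index) // cat 5

// g: every case-sensitive match.
console.log(text.match(/cat/g)) // [ 'cat' ]

// g and i combined: every match, ignoring case.
console.log(text.match(/cat/gi)) // [ 'Cat', 'cat', 'CAT' ]
```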

Assertions

There are different types of assertions:

  • Boundary, this is for matching at the beginning and the end of the input (^ and $) or of a word (\b)
  • Other assertions, here we are talking about look ahead, look behind and conditional assertions

Let's look at some examples:

/^test/.test('test123') // true

Above we are testing whether the string test123 starts with the word test, which is what ^ asserts.

The reverse, testing whether a string ends with something, uses $ and looks like this:

/test$/.test('123test') // true
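The other assertion types mentioned above work in a similar way: they check what surrounds a match without becoming part of it. Here is a minimal sketch of a word boundary, a look ahead and a look behind. The sample strings are made up, and look behind needs a reasonably modern JavaScript engine:

```javascript
// \b is a word boundary: 'cat' must stand on its own.
const ownWord = /\bcat\b/.test('a cat sat')    // true
const insideWord = /\bcat\b/.test('concatenate') // false

// Look ahead (?=...): digits only when followed by 'px'.
// The 'px' itself is not part of the match.
const pixels = 'width: 100px'.match(/\d+(?=px)/)[0] // '100'

// Look behind (?<=...): digits only when preceded by '$'.
const dollars = 'total: $42'.match(/(?<=\$)\d+/)[0] // '42'

console.log(ownWord, insideWord, pixels, dollars)
```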

Character classes

Character classes are about different kinds of characters like letters and digits. Let's list some of them:

  • ., matches any single character except for line terminators like \n or \r
  • \d, matches digits, equivalent with [0-9]
  • \D, this is a negation of matching a digit. So anything that is not a digit. Equivalent to [^0-9]
  • \w, matches any word character, that is a letter, a digit or _. Equivalent to [a-zA-Z0-9_]
  • \W, a negation of the above. Matches a % for example
  • \s, matches white space characters
  • \t, matches a tab
  • \r, matches a carriage return
  • \n, matches a line feed
  • \, escape character. It can be used to match a / like so \/. Also used to give characters special meaning
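Here is a quick sketch that runs a few of the classes above against a made-up sample string, so you can see what each one picks out:

```javascript
const sample = 'Order #42\tshipped_ok 100%'

const digits = sample.match(/\d/g)   // [ '4', '2', '1', '0', '0' ]
const words = sample.match(/\w+/g)   // [ 'Order', '42', 'shipped_ok', '100' ]
const nonWords = sample.match(/\W/g) // [ ' ', '#', '\t', ' ', '%' ]
const spaces = sample.match(/\s/g)   // [ ' ', '\t', ' ' ]

// . stops at a line terminator, so only the first line is matched.
const firstLine = 'line one\nline two'.match(/.+/)[0] // 'line one'

// [^0-9] is the character-class way of writing \D.
const noDigits = /^[^0-9]+$/.test('abc') // true

console.log(digits, words, nonWords, spaces, firstLine, noDigits)
```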

Quantifiers

Quantifiers are about the number of characters to match:

  • *, 0 to many characters
  • +, 1 to many characters
  • {n}, match n characters
  • {n,}, match >= n characters
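  • {n,m}, match >= n && <= m characters
  • ?, 0 or 1 occurrence, so the preceding item is optional. Placed after another quantifier, as in +? or *?, it makes that quantifier non-greedy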

Let's look at some examples

/\w*/.test('abc123') // true
/\w*/.test('') // true. * = 0 to many

In the next example we use the ?:

/\/products\/?/.test('/products') // true
/\/products\/?/.test('/products/') // true

Above we can see how the usage of ? makes the ending / optional when we use this type of matching \/?.
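The counted quantifiers, and the difference between greedy and non-greedy matching, are easier to see in code. A small sketch with made-up inputs:

```javascript
// {n}: exactly n characters.
const fourDigits = /^\d{4}$/.test('2020') // true
const twoDigits = /^\d{4}$/.test('20')    // false

// {n,}: n or more. {n,m}: between n and m.
const tooShort = /^\w{3,}$/.test('ab')    // false
const tooLong = /^\w{2,4}$/.test('abcde') // false

// Greedy: .+ grabs as much as it can.
const html = '<b>bold</b>'
const greedy = html.match(/<.+>/)[0] // '<b>bold</b>'

// Non-greedy: .+? stops at the first possible end.
const lazy = html.match(/<.+?>/)[0]  // '<b>'

console.log(fourDigits, twoDigits, tooShort, tooLong, greedy, lazy)
```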

 DEMO

Ok, that's a lot of theory mixed with some examples. Let's look at some realistic matching next, matchings that we would actually use in production.

If you are using JavaScript on the backend you are probably already using frameworks like Express, Koa or maybe Nest.js. Do you know what these frameworks do for you in terms of route matching, parameters and more? Well, it's about time to find out.

Matching a route

A route as simple as /products, how do we match it? Well, we know our URL should end with that part, so writing a RegEx for it is quite simple. Let's also account for the fact that some will type /products and others will type /products/:

/\/products\/?$/.test('/products') // true

The above RegEx fulfills all our needs, from matching / with \/ to matching an optional / at the end with \/?.

 Extract/match route parameter

Ok, let's take a similar case. /products/112. The route /products with a number at the end. Let's start to see if the incoming route matches:

/\/products\/\d+$/.test('/products/112') // true
/\/products\/\d+$/.test('/products/') // false

To extract the route parameter we can write it like this:

const [, productId] = '/products/112'.match(/\/products\/(\d+)/)
// productId = '112' (captured values are strings)

 Match/extract Several route parameters

Ok, let's say you have a route looking like this /orders/113/items/55. This roughly translates to order with id 113 and with order item id 55. First we want to ensure that our incoming URL matches so let's look at the RegEx for that:

/\/orders\/\d+\/items\/\d+\/?/.test('/orders/99/items/22') // true

The above RegEx reads as the following, match /orders/[1-n digits]/items/[1-n digits][optional /]

Now we know we are able to match the above route. Let's grab those parameters next. We can do so using named groups:

var { groups: { orderId, itemId } } = '/orders/99/items/22'.match(/(?<orderId>\d+)\/items\/(?<itemId>\d+)\/?/)
// orderId = '99'
// itemId = '22'

The above expression introduces groups by creating named groups orderId and itemId with constructs (?<orderId>\d+) and (?<itemId>\d+) respectively. The pattern is very similar to the one used with the test() method.
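Putting matching and extraction together, a framework-style route matcher could look roughly like this. This is only a sketch of the idea, not how Express or Koa actually do it. The names routes and matchRoute are hypothetical, and I've anchored the patterns with ^ and $ so that only the whole path matches:

```javascript
// `routes` and `matchRoute` are hypothetical names for this sketch.
const routes = [
  { name: 'product', pattern: /^\/products\/(?<productId>\d+)\/?$/ },
  { name: 'orderItem', pattern: /^\/orders\/(?<orderId>\d+)\/items\/(?<itemId>\d+)\/?$/ },
]

function matchRoute(url) {
  // Try each pattern in turn and return the first one that matches.
  for (const { name, pattern } of routes) {
    const match = url.match(pattern)
    if (match) return { name, params: { ...match.groups } }
  }
  return null // no route matched
}

console.log(matchRoute('/orders/99/items/22'))
// { name: 'orderItem', params: { orderId: '99', itemId: '22' } }
console.log(matchRoute('/products/112/'))
// { name: 'product', params: { productId: '112' } }
console.log(matchRoute('/unknown')) // null
```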

 Route classifier

I'm sure you've seen how a URL can be split up into several parts like protocol, host, route, port and query parameters.

That's quite easy to do. Let's assume we are looking at a URL looking like this http://localhost:8000/products?page=1&pageSize=20. We want to parse that URL and ideally get something nice to work with, like this:

{
  protocol: 'http',
  host: 'localhost',
  route: '/products?page=1&pageSize=20',
  port: 8000
}

How do we get there? Well, what you are looking at follows a very predictable pattern and RegEx is the Mjolnir of Hammers when it comes to pattern matching. Let's do this :)

var http = 'http://localhost:8000/products?page=1&pageSize=20'
  .match(/(?<protocol>\w+):\/{2}(?<host>\w+):(?<port>\d+)(?<route>.*)/)

// http.groups = { protocol: 'http', host: 'localhost', port: '8000', route: '/products?page=1&pageSize=20' }

Let's take the above and break it down:

  • (?<protocol>\w+):, this matches one or more word characters followed by a :. Additionally, the match is placed into the named group protocol
  • \/{2}, this just says we have //, as in http://.
  • (?<host>\w+):, this matches one or more word characters followed by a :, so in this case, it matches localhost. Additionally, the match is placed into the named group host.
  • (?<port>\d+), this matches the digits that follow after the host, which would be the port. Additionally, the match is placed into the named group port.
  • (?<route>.*), lastly, we have the route matching, which just matches any remaining characters. This ensures we get the part /products?page=1&pageSize=20. Additionally, the match is placed into the named group route.
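If you also want the path and the query string as separate groups, and a port that is optional, the pattern can be extended along these lines. This is a sketch under my own assumptions: the group names path and query and the variable urlPattern are made up, and it is nowhere near a full URL parser:

```javascript
// `urlPattern`, `path` and `query` are names chosen for this sketch.
const urlPattern =
  /^(?<protocol>\w+):\/{2}(?<host>[\w.-]+)(?::(?<port>\d+))?(?<path>\/[^?]*)?(?:\?(?<query>.*))?$/

const { groups } = 'http://localhost:8000/products?page=1&pageSize=20'.match(urlPattern)

console.log(groups.protocol) // http
console.log(groups.host)     // localhost
console.log(groups.port)     // 8000
console.log(groups.path)     // /products
console.log(groups.query)    // page=1&pageSize=20

// Groups that did not take part in the match are undefined.
const noPort = 'https://example.com/a'.match(urlPattern).groups
console.log(noPort.host, noPort.port, noPort.query) // example.com undefined undefined
```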

To parse out the query parameters we just need a RegEx and one call to reduce(), like so:

const queryMatches = http.groups.route.match(/(\w+=\w+)/g) // ['page=1', 'pageSize=20']
const queryParams = queryMatches.reduce((acc, curr) => {
  const [key, value] = curr.split('=')
  return { ...acc, [key]: value } // copy the accumulator and add the new pair
}, {}) // { page: '1', pageSize: '20' }

Above we are working with the response from our first pattern matching, http.groups.route. We are now constructing a pattern that matches the following: [one or more word characters]=[one or more word characters]. Additionally, because we have a global match g, we get an array of responses. This corresponds to all of our query parameters. Lastly, we call reduce() and turn the array into an object. Note that the values come out as strings.
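An alternative sketch of the same step uses matchAll() with named groups and Object.fromEntries() instead of reduce(). The helper name parseQuery is made up, and like the version above it does no URL decoding:

```javascript
// `parseQuery` is a hypothetical helper name for this sketch.
function parseQuery(route) {
  // Each match carries one key=value pair in its named groups.
  const pairs = [...route.matchAll(/(?<key>\w+)=(?<value>\w*)/g)]
  return Object.fromEntries(pairs.map((m) => [m.groups.key, m.groups.value]))
}

console.log(parseQuery('/products?page=1&pageSize=20')) // { page: '1', pageSize: '20' }
console.log(parseQuery('/products')) // {}
```

One small benefit of this shape is that a route without any query parameters gives you an empty object, whereas match() with the g flag returns null when nothing matches.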

 Password complexity

The thing with password complexity is that it comes with different criteria like:

  • length, it should be more than n characters and maybe less than m characters
  • numbers, should contain a number
  • special character, should contain special characters

Are we safe then? Well, safer. Don't forget 2FA, using an app rather than your phone number.

Let's look at a RegEx for this:

// checking for at least 1 number
var pwd = /\d+/.test('password1')

// checking for at least 8 consecutive word characters (letters, digits or _)
var pwdNCharacters = /\w{8,}/.test('password1')

// checking for at least one of &, ?, !, -
var specialCharacters = /&|\?|\!|\-+/.test('password1-')

As you can see I construct each requirement as its own pattern matching. You need to take your password through each of the matchings to ensure it's valid.
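Running a password through each check could be wrapped up like this. It's a sketch: checkPassword and the rule names are made up, and I've used ^.{8,}$ for the length rule so that special characters count towards the length too, which is a choice of mine and not the pattern above:

```javascript
// `rules` and `checkPassword` are hypothetical names; adjust the rules to your own policy.
const rules = {
  hasNumber: /\d/,        // at least one digit
  longEnough: /^.{8,}$/,  // at least 8 characters of any kind
  hasSpecial: /[&?!-]/,   // at least one of &, ?, !, -
}

function checkPassword(password) {
  // Collect the names of the rules that the password fails.
  return Object.entries(rules)
    .filter(([, pattern]) => !pattern.test(password))
    .map(([name]) => name)
}

console.log(checkPassword('password1-')) // []
console.log(checkPassword('pass'))       // [ 'hasNumber', 'longEnough', 'hasSpecial' ]
```

Returning the failed rule names instead of a single true/false makes it easier to tell the user what to fix.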

The perfect date

In my current job I encounter colleagues who all think their date format is the one the rest of us should use. Currently, that means my poor brain has to deal with:

// YY/MM/DD, European ISO standard
// DD/MM/YY, British
// MM/DD/YY, American, US

So you can imagine I need to know the nationality of the one who sent me the email every time I get an email with a date in it. It's painful :). So let's build a RegEx so we can easily swap this as needed.

Let's say we get a US date, like so MM/DD/YY. We want to extract the important parts and swap the date so someone European/British can understand this. Let's also assume that our input below is American:

var toBritish = '12/22/20'.replace(/(?<month>\d{2})\/(?<day>\d{2})\/(?<year>\d{2})/, '$2/$1/$3') // '22/12/20'
var toEuropeanISO = '12/22/20'.replace(/(?<month>\d{2})\/(?<day>\d{2})\/(?<year>\d{2})/, '$3/$1/$2') // '20/12/22'

Above we are able to do just that. In our first parameter to replace() we give it our RegEx. Our second parameter is how we want to swap it, where $1, $2 and $3 refer to the first, second and third group. For a British date, we just swap month and day and everybody is happy. For a European date, we need to do a bit more as we want it to start with the year, followed by month and then day.
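Since the groups are already named, the replacement string can refer to them by name with $<name> instead of by position, which I find a bit easier to read. A small sketch, where usDate is just a name I picked:

```javascript
// `usDate` is a name chosen for this sketch.
const usDate = /(?<month>\d{2})\/(?<day>\d{2})\/(?<year>\d{2})/

// $<name> in the replacement string inserts the named group.
const british = '12/22/20'.replace(usDate, '$<day>/$<month>/$<year>')
const european = '12/22/20'.replace(usDate, '$<year>/$<month>/$<day>')

console.log(british)  // 22/12/20
console.log(european) // 20/12/22
```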

Email

Ok so for email we need to think about a few things

  • @, should have an @ character somewhere in the middle
  • first name, people can have long names, with and without a dash/hyphen. Which means people can be called, per, per-albin and so on
  • last name, they need a last name, or the email is just a last name or a first name
  • domain, we need to white list several domains like .com, .gov, .edu

With all that in mind, I give you the mother of all RegEx:

var isEmail = /^(\w+\-?\w+\.)*(\w+){1}@\w+\.(\w+\.)*(edu|gov|com)$/.test('per-albin.hansson@sweden.gov') // true

Let's break this down, cause it's wordy:

  1. ^, this means it starts with.
  2. (\w+\-?\w+\.)*, this one means a word with or without -, as we have the pattern \-?, and ending with a ., so per. or per-albin. for example. Also, we end with * so 0 to many of that one.
  3. (\w+){1}, this one means exactly one word, like an email consisting of just a last name or just a first name. This opens up for a combination of 2) + 3), so per-albin.hansson or per.hansson, or 3) alone, which would be per or hansson.
  4. @, we need to match one @ character
  5. \w+\., here we are matching a name that ends in ., e.g. sweden.
  6. (\w+\.)*, here we are opening up for a number of subdomains or none, given the *, e.g. sthlm.region. etc.
  7. (edu|gov|com), domain name, here we are listing allowed domains to be edu, gov or com
  8. $, needs to end with, this means we ensure that someone doesn't input some crap after the domain name
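To get a feel for what the pattern accepts and rejects, here is a sketch that runs it against a handful of made-up addresses. The variable name emailPattern is mine:

```javascript
// `emailPattern` is a name chosen for this sketch; the pattern is the one from above.
const emailPattern = /^(\w+\-?\w+\.)*(\w+){1}@\w+\.(\w+\.)*(edu|gov|com)$/

const results = {
  fullName: emailPattern.test('per-albin.hansson@sweden.gov'),          // true
  singleName: emailPattern.test('per@sweden.gov'),                       // true
  subdomains: emailPattern.test('per.hansson@sthlm.region.sweden.gov'),  // true
  otherDomain: emailPattern.test('per@sweden.se'),                       // false, .se is not white listed
  trailingText: emailPattern.test('per@sweden.gov and more'),            // false, because of $
  hyphenOnly: emailPattern.test('per-albin@sweden.gov'),                 // false, see the note below
}

console.log(results)
```

The last case shows a limitation: a hyphenated name is only accepted when it is followed by a . and another word, because the single-word part (\w+){1} does not allow a -.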

Summary

You got all the way here. We really covered a lot of ground on the topic of RegEx. Hopefully, you now have a better grasp of what components it consists of. Additionally, I hope the real-world examples made you realize that you might not need to install that extra node module. Hopefully, with a little practice, you will feel that RegEx is useful and can really make your code a whole lot shorter, more elegant and even readable. Yes, I said readable. RegEx is quite readable once you get the hang of how things are being evaluated. You will find that the more time you spend on it the more it pays off. Stop trying to banish it back to a Demon dimension and give it a chance :)

Posted on Jun 29 by:

Chris Noring

@softchris

https://twitter.com/chris_noring Cloud Developer Advocate at Microsoft, Google Developer Expert

Discussion

I know that pain (and I've bookmarked this article for the next time I feel it!)

The only thing I'd point out is that in your example of making a regex at the top using the constructor you are using a regex as the parameter - I'd say the real benefit of the constructor version is that it can take a plain string - not only an existing regex. Also it's new RegExp() in JS.

     const myRegex = new RegExp(`(${magicWord})|(blah)`, 'i')
 

hi Mike. Thanks for the feedback. Yeah, definitely, it needs escape characters though, so it's not a 1-1 replacement. I learned that the hard way :)

 
 

Great post. The information is very useful. 💯
Quick newbie question: could you give some examples of line terminators?

 

hi Jane, can you please tell me more in detail what you are after?

 

In regex there is /./ that matches all characters apart from line terminators.
Examples of line terminators are...?

Line terminators are \n and \r. If you have code like 'aa \n bb'.match(/.*/), it only matches aa, because . does not match \n and so the match stops at the end of the first line (\n separates line one and two). However, if you use the flag g, the search carries on after the line break. So 'aa \n bb'.match(/.*/g) would match both aa and bb, along with a couple of empty matches.

 

Try using vim as an editor for a while and you will end up getting very used to writing regular expressions (if not, you're not using vim correctly). It's very nice for both learning and continuously practicing it.

 

Shameless plug, but I have a post about Regexes here at Dev too. It's more about how it works internally, and perhaps it can also hint at why using regex to check if your password is strong enough is not necessarily the best idea.

 

Great post! I have to parse very long and poorly formatted strings for work, and thankfully found RegEx before digging myself into a hole of split() functions! Pretty powerful stuff

 

Thanks. Yea, it completely changed the way I do code. I used to split() a lot before it. Still need split once in a while but RegEx takes you far.

 

knows regex
*also likes to live dangerously

 
 

This is great! Regex still seems like a super power to me XD

 

yep, I know, feels like I'm wearing a cape on a daily basis :)

 

I strongly endorse executeprogram.com/courses/regexes

The spaced repetition helped me internalize regex once and for all after struggling with it for years

 

A useful website to help you practice: regexr.com/