Regular expressions (regex) are one of those things folks tend to make fun of, mostly because they don't understand them or only partially understand them.
I decided to write this post after Ben Hong Tweeted out asking for good regex resources.
Is this post going to make you a regex expert? No, but it will teach you about some of the pitfalls developers fall into when writing them.
The example code snippets in this post are regular expressions in JavaScript, but you should be able to use them in your language of choice, or at least apply the concepts if the syntax is slightly different.
Be Specific
Know exactly what you're looking for. This may sound obvious on the surface, but it's not always the case. Let's say you want to find instances of three in a text file because you need to replace all instances of three with the number 3. You've done a bit of Googling and/or checked out regex101.com. You're feeling pretty good, so you write out this regular expression.
const reMatchThree = /three/g
Note: If you're new to regular expressions, everything between the starting / and the ending / is the regular expression. The g after the last / means global, as in find all instances.
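To see what the g flag changes, here is a minimal sketch using String.prototype.replace. The sample sentence and variable names are made up for illustration.

```javascript
// Hypothetical sample text, used only for illustration.
const sample = "three blind mice and three little pigs";

// Without the g flag, replace() stops after the first match.
const firstOnly = sample.replace(/three/, "3");

// With the g flag, every match is replaced.
const allMatches = sample.replace(/three/g, "3");

console.log(firstOnly);  // "3 blind mice and three little pigs"
console.log(allMatches); // "3 blind mice and 3 little pigs"
```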
You run the regular expression to match all instances of three so it can be replaced with 3. You look at what got replaced in the text and you're a little perplexed.
- There were three little pigs who lived in their own houses to stay safe from the big bad wolf who was thirty-three years old.
+ There were 3 little pigs who lived in their own houses to stay safe from the big bad wolf who was thirty-3 years old.
three got replaced by 3 everywhere in the file, but why was thirty-three replaced? You only wanted standalone threes replaced. And here we have our first lesson: be specific. We only want to match the word three on its own, so we need to beef up this regex a little. We want to find three only when it's the first word in a sentence, has white space before and after it or some punctuation before and/or after it, or when it's the last word in a sentence. With those criteria, the regex might look like this now.
const reMatchThree = /\b(three)\b/g
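Note: Don't worry if you're not familiar with all the syntax. \b matches a word boundary: a zero-width position between a word character (a letter, digit or underscore) and a non-word character, or the start or end of the text. It stops matches inside longer words such as threefold. Be aware that a hyphen is not a word character, so \b on its own still matches the three in thirty-three; excluding hyphenated words takes a stricter pattern, for example one that uses lookarounds to reject a neighbouring hyphen.
When parts of a regex are wrapped in parentheses, they form a capturing group, and whatever the group matched is returned alongside the overall match.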
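It's worth checking how the word-boundary version behaves before trusting it. The sketch below runs it against the sample sentence and against a made-up threefold example. The reStrictThree pattern is my own assumption, not part of the original regex: it uses a lookbehind and a lookahead to reject a word character or hyphen on either side, which is one possible way to leave thirty-three alone.

```javascript
const text =
  "There were three little pigs who lived in their own houses " +
  "to stay safe from the big bad wolf who was thirty-three years old.";

// Plain pattern: matches "three" anywhere, even inside a longer word.
const insideWord = "a threefold increase".replace(/three/g, "3");

// Word boundaries: "threefold" is left alone.
const boundedWord = "a threefold increase".replace(/\b(three)\b/g, "3");

// But "-" is a non-word character, so "thirty-three" still matches.
const bounded = text.replace(/\b(three)\b/g, "3");

// Assumed stricter variant (hypothetical, not from the article):
// no word character or hyphen may sit directly before or after "three".
const reStrictThree = /(?<![\w-])(three)(?![\w-])/g;
const strict = text.replace(reStrictThree, "3");

console.log(insideWord);  // "a 3fold increase"
console.log(boundedWord); // "a threefold increase"
console.log(bounded);     // ends with "thirty-3 years old."
console.log(strict);      // ends with "thirty-three years old."
```

Whether thirty-three should count as a match depends on what you're replacing and why, which is exactly the kind of thing to decide up front.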
Don't Be Too Greedy
Greed is usually not a good thing and greed in regex is no exception. Let's say you're tasked with finding all the text snippets between double quotes. For the sake of this example, we are going to assume the happy path, i.e. no double-quoted strings within double-quoted strings.
You set out to build your regex.
const reMatchBetweenDoubleQuotes = /"(.+)"/g
Remember that ( and ) represent a group. The . character matches any character except a line break. Another special character is +, which means one or more of whatever precedes it.
You're feeling good and you run this regex over the file you need to extract the texts from.
Hi there "this text is in double quotes". As well, "this text is in double quotes too".
The results come in and here are the texts that the regex matched for texts within double quotes:
this text is in double quotes". As well, "this text is in double quotes too
Wait a minute!? That's not what you were expecting. There are clearly two sets of text within double quotes, so what went wrong? Lesson number two. Don't be greedy.
If we look again at the regex you created, it contains .+ which literally means match any character as many times as possible. That's why we end up with a single match, this text is in double quotes". As well, "this text is in double quotes too: the " character counts as any character, so .+ runs straight through the inner quotes and only stops at the last " in the text. You got greedy, or more specifically the regex did.
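Here is a small sketch that reproduces the greedy result with String.prototype.matchAll. The variable names are made up for illustration; the text and the regex are the ones from above.

```javascript
const quotedText =
  'Hi there "this text is in double quotes". ' +
  'As well, "this text is in double quotes too".';

const reGreedy = /"(.+)"/g;

// Collect what the capturing group matched for every match.
const greedyCaptures = Array.from(quotedText.matchAll(reGreedy), (m) => m[1]);

console.log(greedyCaptures.length); // 1
console.log(greedyCaptures[0]);
// this text is in double quotes". As well, "this text is in double quotes too
```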
There are a couple of ways to approach this. We can use the non-greedy version of + by replacing it with +?
const reMatchBetweenDoubleQuotes = /"(.+?)"/g
Which means find a ", start a capturing group, then match as few characters as possible until you hit the next ".
Another approach, which I prefer, is the following:
const reMatchBetweenDoubleQuotes = /"([^"]+)"/g
Which means find a ", start a capturing group, then find as many characters as possible that aren't " before you hit a ".
Note: We've introduced some more special characters. [ and ] are a way to say match any one of the enclosed characters. In our use case, we're using them with ^, i.e. [^, to say match any character except the ones listed. In our case, that means any character other than ".
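To compare the two fixes side by side, here is a minimal sketch that runs both over the sample text. The multi-line string at the end is a made-up example; it shows one practical difference between the two, assuming the s (dotAll) flag is not set: . does not match a line break, while [^"] does.

```javascript
const sampleText =
  'Hi there "this text is in double quotes". ' +
  'As well, "this text is in double quotes too".';

const reLazy = /"(.+?)"/g;
const reNegated = /"([^"]+)"/g;

// Helper (hypothetical name): list what the capturing group matched.
const captures = (re, str) => Array.from(str.matchAll(re), (m) => m[1]);

console.log(captures(reLazy, sampleText));
// [ 'this text is in double quotes', 'this text is in double quotes too' ]
console.log(captures(reNegated, sampleText));
// [ 'this text is in double quotes', 'this text is in double quotes too' ]

// Made-up input where the quoted text spans two lines.
const multiLine = 'He said "line one\nline two"';

console.log(captures(reLazy, multiLine).length);    // 0
console.log(captures(reNegated, multiLine).length); // 1
```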
Focus on What You’re Searching For
Now that we’ve gone through some common pitfalls, it’s worth noting that it’s OK to be greedy or not be as specific. The main thing I want you to take away is to really think about what you’re searching for and how much you want to find.
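As one example of greed working in your favour, here is a sketch that pulls the directory part out of a path by matching up to the last /. The path and variable names are hypothetical, chosen only to illustrate the point.

```javascript
// Hypothetical path, used only for illustration.
const filePath = "/usr/local/bin/node";

// Greedy: .* grabs as much as it can, so the match ends at the LAST slash.
const greedyDir = filePath.match(/^(.*)\//)[1];

// Lazy: .*? grabs as little as it can, so the match ends at the FIRST slash.
const lazyDir = filePath.match(/^(.*?)\//)[1];

console.log(greedyDir); // "/usr/local/bin"
console.log(lazyDir);   // "" (empty, because the first character is already a slash)
```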
Regexes are super powerful for manipulating text, and now you’re armed with some knowledge you can put in your regex tool belt! Until next time folks!
Resources
- regex101.com
- regular-expressions.info
- Mastering Regular Expressions, 3rd Edition
- Regular Expressions | MDN
- regexper (Thanks @link2twenty!)
- VerbalExpressions repository (Thanks @citizen428!)
Top comments (14)
Nice Article. Will there be more? Was thinking about writing a beginners guide myself.
I'd like to emphasize a bit more how tools like regextester.com and regex101.com can be a great resource: you can learn by looking at existing expressions or playing with your own, then hovering over the expression to get an explanation of what's happening.
(btw: s missing in "Matering..." link at the end)
This was a one off, but if folks want to see more about regexes, I'd be happy to write some more about them. 😎
If you have a beginner's guide in mind, go for it! A different perspective on a topic can only be good for the community!
Also, thanks for catching the typo! I wrote this post late last night lol.
Would love a beginner's guide!
Fascinating how just one short comment like yours gets one out of procrastination and into being productive! :D
dev.to/mktcode/regular-expressions...
I found there are already so many good resources for technical people, so I tried to write something specifically for non-technical people. Hope it works.
Hey
Thanks for the article! Great introduction!
I wanted to write about it because I struggled to understand it at the start, but didn't know how to start because it's a pretty broad subject (to understand it I had to learn several non-intuitive concepts and run a lot of tests). Now I have some good basics. I didn't know the "+?" combination trick, that's great.
The app that helped me a lot (and that you can maybe add to the resource list) is regexr.com (it's open source!). You can even save your patterns (like this one), publish and browse existing patterns, and read a cheat sheet, and the colours and interface really help with understanding. I really like the "explanations on hover with selections".
Other websites that look fun are learn-regex.com and regex-one.com.
A few suggestions to enhance it:
Thanks for the feedback Samuel!
I quite like Regexper: you paste in a regex and it turns it into a nice railroad diagram.
Thanks for the share Andrew!
Didn't read all, just till:
Why don't you simply use:
const reMatchThree = /\b(three)\b/g
?
I completely forgot about word boundaries while I was writing it late at night lol. I’ve updated the article. Thanks for this. 😎
GREAT Intro!
[[ Pingback ]]
This article was curated in the 17th issue of Software Testing Notes.
softwaretestingnotes.substack.com/...