Dinys Monvoisin

Regex isn't that hard

Regex is the thing that you only learn when you need it. Unless you are processing a considerable amount of data, you likely won’t use it.

Does that imply that, as software engineers, we should forget about it and worry about it only when that time comes? Are we not supposed to take responsibility for learning it?

Programmers think that Regex is hard. As with every skill, it requires practice to master. To help you with it, I wrote this article to cover the basics of Regex and show a simple application of how you can use it.

Content

  • Reasons to learn Regex
  • Understand Regex
  • Regex structure and special characters
  • Example using Regex and JavaScript
  • Resources

Reasons to learn Regex

Stuck in limbo, googling for the Regex pattern that solves the problem at hand. Does this sound familiar? I bet at least one of you has been in a comparable situation before. But don't you think it would be easier to know the ins and outs of Regex? Indeed, this would have reduced the time spent searching for answers.

Regex provides a more concise way of solving problems that need some form of parsing. An example is the split function. Turning your string into tokens before applying some sort of logic is lengthy to put in place. It turns out that this approach is also limited compared to using Regex.

Hopefully, the next part excites you as we are going to cover more of Regex.

Understand Regex

Regex is short for regular expression. It is a sequence of characters that defines a pattern for the data you are looking for. The idea dates back to the 1950s, and it has been built into text-processing tools since the late 1960s, primarily for searching and parsing strings.

An example of a Regex that looks for an email address with a ".com" domain is: /.+@.+\.com/.

Don't worry if it does not make sense now. In the next part I will cover what the characters in the above expression mean.
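To get a feel for it before the details, here is a minimal sketch that runs the pattern above against a few made-up addresses (the sample strings are assumptions for illustration only). It uses the standard test() method, which returns true when the pattern is found anywhere in the string.

```javascript
// The pattern from the text: one or more characters, "@",
// one or more characters, then a literal ".com".
const dotComEmail = /.+@.+\.com/

// Hypothetical sample inputs
const samples = ['jane@example.com', 'jane@example.org', 'name@xcom']
const results = samples.map((s) => dotComEmail.test(s))

console.log(results) // [ true, false, false ]
```

Note that the pattern is not anchored, so it also accepts longer strings that merely contain such an address.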

Regex structure and special characters

The first thing to know is that there are two ways to define a Regex pattern:

Using a regular expression literal

var pattern = /abc/

Calling the RegExp constructor

var pattern = new RegExp('abc')

When to use which? Use a regular expression literal when you know the pattern in advance. Use the RegExp constructor when the pattern is built from dynamic data at runtime.
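As a rough sketch of the difference, the snippet below builds the same pattern both ways. The variable searchTerm is a hypothetical stand-in for a value that is only known at runtime, such as user input.

```javascript
// Literal: the pattern is fixed when the script is written.
const fixedPattern = /abc/

// Constructor: the pattern is assembled at runtime from a string.
// `searchTerm` is a hypothetical value, e.g. something a user typed.
const searchTerm = 'abc'
const dynamicPattern = new RegExp(searchTerm)

console.log(fixedPattern.test('xxabcxx'))   // true
console.log(dynamicPattern.test('xxabcxx')) // true

// Inside a string, a backslash must itself be escaped,
// so the literal /\d+/ becomes the string '\\d+'.
const digitsFromString = new RegExp('\\d+')
console.log(digitsFromString.source) // \d+
```

One thing to keep in mind with the constructor: if the runtime string may contain special characters such as a period, they are interpreted as Regex syntax unless they are escaped first.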

Special characters in Regex make it possible to create more complex Regex patterns. Let's look at some fundamental ones.

The string "From: dinys18@dinmon.tech" will be used in each of the scenarios below. An arrow is used to show the result of each Regex pattern. Note that the arrow notation is only for illustration; it is not valid JavaScript.

^ - The caret symbol matches the start of a string

var re = /^From:/ => From:

$ - The dollar sign matches the end of a string

var re = /tech$/ => tech

. - The period matches any single character (except a line break)

var re = /.@/ => 8@ // Any single character followed by the @ sign

[0-9] - Character set. Matches any single character listed within the brackets.

var re = /[0-9]/ => 1, then 8 // Each match is a single digit, never 18 as a whole

* - The asterisk matches the character before it zero or more times.

var re = /.*:/ => From: // Any number of characters up to and including the colon

+ - The plus sign matches the character before it one or more times.

var re = /@[a-z]+/ => @dinmon // The @ sign followed by one or more lowercase letters

Lastly, characters like the asterisk, plus sign and period are special characters in Regex. What if you want to match them literally in your pattern? Thankfully, there is a way: you escape them. That means adding a \ (backslash) in front of them, so that they are no longer treated as special characters but as regular ones.

var re = /\..*/ => .tech // Start at the period character, include any characters afterwards
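Since the arrow notation is not runnable, here is a small sketch that checks these patterns in real JavaScript against the same sample string. The helper firstMatch is a hypothetical name introduced only for this example; it wraps the standard match() method and returns the matched text or null.

```javascript
const sample = 'From: dinys18@dinmon.tech'

// Hypothetical helper: return the first matched text, or null.
const firstMatch = (re) => {
  const m = sample.match(re)
  return m ? m[0] : null
}

console.log(firstMatch(/^From:/))   // From:
console.log(firstMatch(/tech$/))    // tech
console.log(firstMatch(/.@/))       // 8@
console.log(firstMatch(/.*:/))      // From:
console.log(firstMatch(/@[a-z]+/))  // @dinmon
console.log(firstMatch(/\..*/))     // .tech

// With the global flag, match() returns every match instead of only the first.
console.log(sample.match(/[0-9]/g)) // [ '1', '8' ]
```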

Now that we have covered various ways to construct a regular expression, let's go ahead and combine it with JavaScript. That will allow us to perform more complex operations like extraction, replacement and so forth.

Example using Regex and JavaScript

In this section I will cover how to use Regex combined with JavaScript to extract values from a string. For that, I will implement a file simulator that allows the creation of duplicate folder names.

To avoid duplicate folder names, we need to append a string to the folder name to make the new folder’s name unique. For this, we will add an index enclosed in brackets that represents the number of times the folder has been duplicated.

Before we start constructing the regular expression, let's break down the scenarios to handle:

  • A folder's name with any characters, e.g. python
  • A folder's name with any characters and a number enclosed in brackets, e.g. python (0)

First, we need to get the duplicated folder's name, which can contain any characters.

var regex = /.+/

Then look for the enclosed bracket with a number.

var regex2 = /\([0-9]+\)/

You will notice that we escaped the two brackets that surround the number by using a backslash. Between the brackets, we used a character set from zero to nine to define a digit. As the number can have more than one digit, we added the plus sign, which matches one or more digits.

This sounds good, but isn’t it redundant to use two Regex expressions on the single string we are trying to parse? What if we could do that in one line? To achieve this, we will extract both the folder’s name and the number by wrapping each of them in unescaped round brackets, which creates capture groups.
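As a quick check, the sketch below tries that pattern on three made-up folder names (the names and the variable indexInParens are assumptions for illustration).

```javascript
// Escaped brackets around one or more digits
const indexInParens = /\([0-9]+\)/

console.log('python (0)'.match(indexInParens)[0])  // (0)
console.log('python (12)'.match(indexInParens)[0]) // (12)
console.log('python'.match(indexInParens))         // null
```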

The final expression will look like:

var regex = /(.+) \(([0-9]+)\)/

To execute the Regex expression, call the match function with the above expression as an argument.

var name = 'Folder (0)'
var matchFound = name.match(regex) => ['Folder (0)', 'Folder', '0']

The match function returns null if no match is found; otherwise it returns an array with the extracted values. Check the match() function reference for more detail.

Note: The first value of the array is the full text that matched the pattern, and the rest are the extracted values.

I leave the next part for you to complete so that the function getDuplicateName returns the folder’s name with the index at the end if it is a duplicate.
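A minimal sketch of reading that result, assuming the same pattern as above: array destructuring picks out the captured groups, and the ?? operator supplies an empty array when match() returns null. The variable names are chosen for this example only.

```javascript
const folderPattern = /(.+) \(([0-9]+)\)/

const found = 'Folder (0)'.match(folderPattern)
// found[0] is the whole match; found[1] and found[2] are the captured groups
const [whole, baseName, index] = found
console.log(whole, '|', baseName, '|', index) // Folder (0) | Folder | 0

// No match returns null, so fall back to an empty array before destructuring
const [, missingBase] = 'Folder'.match(folderPattern) ?? []
console.log(missingBase) // undefined
```

The captured index is a string, so it needs a conversion such as Number(index) before doing arithmetic on it.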

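function getDuplicateName(list, name) {
    var regex = /(.+) \(([0-9]+)\)/
    // match() returns null when nothing matches, so fall back to an empty array
    var matchFound = name.match(regex) ?? []

    // Skip the full match; keep the folder name and the index
    var [, baseName, index] = matchFound

    // An already indexed name always gets a new index; a plain name is done if it is unused
    var isDone = (matchFound.length > 0) ? !baseName : !list.includes(name)
    var count = index ? Number(index) + 1 : 0
    var newName = name
    baseName = baseName ?? name

    // Increase the index until the generated name is not in the list
    while (!isDone) {
        newName = `${baseName} (${count})`
        if (!list.includes(newName)) {
            isDone = true
            continue
        }
        count++
    }

    return newName
}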
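For comparison, here is a compact variation on the same idea. It is only a sketch, not the code from the repository: the function name nextFolderName and the sample list are hypothetical, the pattern is anchored with ^ and $ so that the index must sit at the very end of the name, and a name that is not yet in the list is returned unchanged.

```javascript
// Hypothetical helper: returns a name that does not clash with `existing`.
function nextFolderName(existing, wanted) {
  // A name that is not taken yet can be used as-is
  if (!existing.includes(wanted)) return wanted

  // Split "base (n)" into its two captured parts, if the name has an index
  const found = wanted.match(/^(.+) \(([0-9]+)\)$/)
  const base = found ? found[1] : wanted
  let count = found ? Number(found[2]) + 1 : 0

  // Increase the index until the generated name is free
  while (existing.includes(`${base} (${count})`)) count++
  return `${base} (${count})`
}

const folders = ['python', 'python (0)']
console.log(nextFolderName(folders, 'ruby'))       // ruby
console.log(nextFolderName(folders, 'python'))     // python (1)
console.log(nextFolderName(folders, 'python (0)')) // python (1)
```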

Resources

If you want to look at the full source code, visit the GitHub repository or the demo of the file simulator.

If you like what you read, consider following on Twitter to find valuable content.

Top comments (24)

Riccardo Bernardini • Edited

Am I the only one in the world who actually loves regular expressions? I learned them when I was studying about compilers and I always found them a very powerful tool, not necessarily for niche applications or very large amount of data.

I use them to build "tokenizers" or extract information from text files in just a couple of lines of code (in Ruby, mostly), for example (the first thing that came to my mind)

$stdin.each do |line|
  next unless line =~ /^([a-z]+) *: *(.*)$/
  name = $1
  value = $2
end

The syntax is not great, I agree, it looks much like line noise. I always wondered about an alternative syntax, but everything I tried (not much, to be honest) was not a really huge improvement.

Oh, yes, and let's not forget search-and-replace-regexp in emacs... You can do wonderful stuff with a single command.

Dinys Monvoisin

It must be a great job to work on regular expressions all day, but it's not my cup of tea. I would prefer mixing Regex with some sort of development.

Oh, is Emacs your favourite editor then? I can't imagine how good it must be to customise a replace operation using Regex.

pentacular

Regex can be useful, but can also be a trap.

When you use regex, be sure to use just enough abstraction that you can swap out the regex implementation with a parser later on.

There are three main traps with regex:

  1. Regex do not handle recursive structures.
  2. Regex do not handle irregular languages.
  3. Regex scale rapidly to become impossible for a human to understand.

It is quite difficult to predict when you'll hit one of these limits, so a little abstraction goes a long way.

Instead of putting regex directly in your code, abstract them with a procedure that does something: e.g., getName(foo) instead of (foo.match(/([^/]+)/) || [])[1]; :)

Catherine Mohan

Great article! I've been using RegEx patterns a lot recently in Powershell as some commands return values as a very long string instead of a proper object. RegEx patterns make pulling the data much easier. I use RegEx 101 to help build my pattern strings. It has very helpful color coding and a dictionary of all the different RegEx operators.

Dinys Monvoisin

Thank you, Catherine, for contributing to this article and providing the readers with additional resources. I wonder what you were using Regex for in PowerShell. Are you using Grep?

Catherine Mohan • Edited

No, Powershell can use RegEx natively for working with strings. I mostly use Select-String -Pattern to pull substrings out of large string responses. Some string commands even use it by default and you have to remember that or else you'll be a bit confused why some of your code is not responding the way you hope.

-split and -replace will use RegEx to match strings, but .Split() and .Replace() don't. So "catherine.mohan" -split "." returns only empty strings, because the period matches every character, and ("catherine.mohan").Split(".") returns catherine and mohan as expected.
You can escape the period and it'll work too.

Dinys Monvoisin

Are your files scattered all over the place, and is that why you are using the command line? Often you would just use a program to do all of this.

Catherine Mohan

I'm not sure what file you're referring to. I use RegEx in the Powershell CLI, Powershell scripts, and in Powershell apps that I create. Mostly for parsing strings, and occasionally for searching strings. Powershell commands usually return objects with properties, but recently I've had to use some commands that return objects with a single property that is just a long string with all the values in a list. Since I can't use the typical $object.property notation to get values, I have to use RegEx to parse the giant string looking for the values I need.

Dinys Monvoisin

Oh, using a string itself in PowerShell. Interesting. May I have more context about the application of it?

Catherine Mohan

Sure! One of the recent times I've used regular expressions is when I needed to search the Windows Event Logs. In the GUI, you can only reliably search by Event ID even though the actual event has lots of info. You can get that info with the Get-EventLog Powershell command. It's all in the Message property, but that property is just a very long string even if it looks like this:

Computer: comp-01
User: catherine.mohan
CreationTime: 9/12/2020 9:31:00 PM

Since I can't save it to a variable and access it like $var.User as you would expect, I have to do this instead to get the User value.

$matches = $event.Message | Select-String -Pattern "User: (.*?)\n"
$matches.Matches.Groups[1]

# Output: catherine.mohan

If I need the same info from a lot of results, I will make arrays of my own custom objects so I only have to do the matching process once in a loop. Now that I can get the values, I can use them to filter the results and search for the events I need with greater accuracy.

Dinys Monvoisin

Wow, that's so cool. I did not know that you could access EventLog through PowerShell. Thanks for sharing. I will try to explore interesting stuff you can do with PowerShell when I have time.

Abhigyan
NodeJS 14.1 REPL

> Regex + JS == RegJSex;
> true
 
Dinys Monvoisin

You make me laugh. It's good to have comments like that sometimes.

Abhigyan

I think that I am a very frank guy. I always make things so funny that they appear easy, even if they are difficult.

John Robertson

Regex is invaluable for software dev and sysadmin. Just put in the effort to learn it - I guarantee it will be worth your time.

Dinys Monvoisin

Try telling that to the new people learning programming. All of them are learning web development only to create beautiful screens.

John Robertson

I suppose it all depends on your goals. Web development using $WEB_DEV_PLATFORM_DU_JOUR creates an initial perception of rapid progress. If you are one of us that is tasked with completing a complex project all the way to sustainable production, then traditional computer science concepts and tools become essential.

Dinys Monvoisin

With a rapidly changing market, it is better to build a quick and dirty prototype. However, I do agree that a strong understanding of CS concepts is fundamental.

By the way, do you speak French? What's with "DU_JOUR"?

John Robertson

"it is better to build a quick and dirty prototype"
Again, this depends on your goals. As a consultant I only get paid for a working product. If you are working as an employee, then the best strategy is to throw together the prototype, get some kudos, and move on to the next project.
Sadly I do not speak French. "du jour" is a French phrase adopted by English speakers for some time.

Franco Traversaro

Pay attention, the first example is wrong: /.+@.com/ matches something like "name@xcom". A better example (still not covering a lot of peculiar cases) could be /.+@.+\.com/

Dinys Monvoisin

Thank you for pointing out this mistake. I guess many people do not read thoroughly.

Madza

To me regex has always been regexr.com or regex101.com. 😄

Vesa Piittinen

I do an awful joke.

What is the difference between CoffeeScript and RegExp?

You can actually understand what the author of the code was going for when reading RegExp.

Dinys Monvoisin

Hahaha, this was a good one. As CoffeeScript tries to simplify JavaScript, it gets hard to read sometimes.