👋 Introduction
Throughout my Self-Taught-Programmer-Relying-on-Stack-Overflow Career, I encountered regular expressions many times, without having any idea what they were or how they worked. To me, they looked like a bunch of garbled characters that would be impossible to decipher. How could r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' possibly be read to match a string?
This example seems like a pretty long regular expression (or regex), but it's also probably the most common: an email address validator. I'm going to explain exactly how this and other regular expressions work and what all the little characters mean so you can build your own.
It turns out they're pretty simple to learn and fun to play with once you've got the hang of them! Not to mention, regex can boil several lines of string-parsing logic down to one simple expression.
🤓 What Are Regular Expressions?
At its core, a regular expression is a sequence of characters that forms a pattern that can be matched against any string of text. These patterns are used to search, match, and manipulate text based on certain rules. Regular expressions provide a concise and flexible means of expressing complex text patterns. They help with tasks such as text validation, search, extraction, and more.
📋 Basic Syntax and Characters
Regular expressions are read left to right, like the strings they represent. Regex breaks the string being searched into pieces and evaluates each piece against a piece of the regular expression itself.
This is how r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' can represent each piece in an email address. An email address's pieces would look like [text]@[text].[text]. We'll look closer at this example later.
Regular expressions consist of various elements, including letters, numbers, brackets, and other characters. Here's a brief (NOT comprehensive) overview of some essential inputs and their meanings:
/ Literal Characters /
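Literal characters match exactly the characters written in the pattern. They are often written between forward slashes for readability (some languages require the slashes as delimiters), but the slashes are not part of the pattern itself. For example, if we only wanted to accept emails that ended in .com domains, the last piece of our regular expression would be `/\.com/`. The period is escaped with a backslash because an unescaped period is a metacharacter that matches any character.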
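Here's a small sketch of why that period matters. The addresses are made-up sample data, and `loose` and `strict` are just names I picked for illustration:

```python
import re

# An unescaped "." is a wildcard, so ".com" also matches "xcom".
loose = re.compile(r".com")
# Escaping the period makes every character in the pattern a literal.
strict = re.compile(r"\.com")

addresses = ["me@example.com", "me@examplexcom"]  # made-up sample data

loose_hits = [a for a in addresses if loose.search(a)]
strict_hits = [a for a in addresses if strict.search(a)]

print(loose_hits)   # ['me@example.com', 'me@examplexcom']
print(strict_hits)  # ['me@example.com']
```

Only the escaped version insists on a real period before com.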
[ Character Classes ]
A character class lists the characters that are allowed at the current position, and it matches exactly one character from that list. Below are some examples of character classes.
Input | Meaning |
---|---|
[ab12] | one instance of any of the characters a, b, 1, or 2 |
[^yz89] | one instance of any character besides y, z, 8, and 9 |
[a-p] | any lowercase letter between a and p inclusive |
[A-Z] | any uppercase letter |
[1-5] | any digit between 1 and 5 inclusive |
[0-9] | any digit |
[a-z0-4] | any lowercase letter or a digit between 0 and 4 inclusive |
[a-zA-Z] | any upper or lowercase letter (avoid [A-z], which also matches the characters [ \ ] ^ _ and ` that sit between Z and a in ASCII) |
In our example, [a-zA-Z0-9._%+-] refers to the user name of the address, and it will accept any alphanumeric character as well as the characters . _ % + and -.
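If you want to poke at character classes yourself, here's a rough sketch in Python. The `accepted` helper and the sample characters are my own inventions for illustration, not anything from a library:

```python
import re

samples = ["a", "z", "Q", "7", "_", "-"]  # made-up test characters

def accepted(pattern, chars):
    """Return the characters that a one-character pattern matches."""
    return [c for c in chars if re.fullmatch(pattern, c)]

print(accepted(r"[ab12]", samples))    # ['a']
print(accepted(r"[^yz89]", samples))   # ['a', 'Q', '7', '_', '-']
print(accepted(r"[a-p]", samples))     # ['a']
print(accepted(r"[a-zA-Z]", samples))  # ['a', 'z', 'Q']
# [A-z] spans the ASCII range from A to z, which includes the underscore.
print(accepted(r"[A-z]", samples))     # ['a', 'z', 'Q', '_']
```

The last line is a common gotcha: a range follows character codes, not the alphabet.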
? Quantifiers *
Quantifiers specify how many times the preceding character or class may repeat.
Input | Meaning |
---|---|
? | zero or one occurrence of the preceding character or class |
* | zero or more occurrences of the preceding character or class |
+ | one or more occurrences of the preceding character or class |
{5} | exactly 5 occurrences of the preceding character or class |
{2,5} | between 2 and 5 occurrences of the preceding character or class (written without a space after the comma) |
{7,} | 7 or more occurrences of the preceding character or class |
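A quick way to get a feel for quantifiers is to try them against short strings. This is only a sketch, and `matches` is a helper name I made up. It uses `re.fullmatch` so the whole string has to fit the pattern:

```python
import re

def matches(pattern, text):
    """True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"colou?r", "color"))         # True: the "u" may be absent
print(matches(r"colou?r", "colour"))        # True: or appear once
print(matches(r"ab*", "a"))                 # True: zero "b"s is fine
print(matches(r"ab+", "a"))                 # False: at least one "b" required
print(matches(r"[0-9]{5}", "12345"))        # True: exactly five digits
print(matches(r"[0-9]{2,5}", "1"))          # False: too short
print(matches(r"[0-9]{2,5}", "1234"))       # True
print(matches(r"[0-9]{7,}", "12345678"))    # True: seven or more
# In Python's re module a space inside the braces turns the whole
# thing into literal text instead of a quantifier.
print(matches(r"[0-9]{2, 5}", "1234"))      # False
```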
^ Metacharacters $
Metacharacters are special characters with a predefined meaning, but many of them can be escaped with a backslash, as seen in the following section. Below are a few of the most common metacharacters.
Input | Meaning |
---|---|
. | any single character (except a newline, by default) |
^ | signifies the beginning of a line or string |
[^] | negates the character class when ^ is the first character inside the brackets |
$ | signifies the end of a line or string |
`\|` | matches either the expression before it or the expression after it |
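Here's a short sketch of the dot and the two anchors in action using Python's `re` module. The words and variable names are made up for the example:

```python
import re

# "." stands in for any single character except a newline (by default).
dot_hit = re.search(r"c.t", "cat") is not None       # True
dot_newline = re.search(r"c.t", "c\nt") is not None  # False

# "^" pins the match to the start, "$" pins it to the end.
words = ["catalog", "bobcat"]
starts_with_cat = [w for w in words if re.search(r"^cat", w)]  # ['catalog']
ends_with_cat = [w for w in words if re.search(r"cat$", w)]    # ['bobcat']

# With re.MULTILINE, "^" also matches right after each newline.
text = "first\nsecond"
one_line = re.findall(r"^[a-z]+", text)                        # ['first']
every_line = re.findall(r"^[a-z]+", text, flags=re.MULTILINE)  # ['first', 'second']

print(dot_hit, dot_newline, starts_with_cat, ends_with_cat, one_line, every_line)
```

Whether ^ and $ mean "line" or "whole string" depends on flags like `re.MULTILINE`, so it's worth checking the docs for the language you're in.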
\ Escaped Metacharacters
A backslash changes the meaning of the character that follows it. It turns a metacharacter into a plain literal (\. is just a period), and it turns certain letters into shorthand for a whole class of characters (\d is any digit). These can be used independently or within a character class. Below are the most common examples. Notice that a capital letter negates the value of its lowercase counterpart.
Input | Meaning |
---|---|
\. | accepts a period (.) |
\s | accepts a whitespace character (space, tab, newline) |
\S | accepts any character that isn't whitespace |
\d | accepts any digit 0-9 |
\D | accepts any character other than digits 0-9 |
\w | accepts letters, numbers, and underscores |
\W | accepts anything other than letters, numbers, and underscores |
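To see a few of these shorthands side by side, here's a sketch that runs them over one made-up sentence with `re.findall`:

```python
import re

text = "Order 66 shipped to Jo_Ann on day 5."  # made-up sample sentence

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits, underscores
chunks = re.findall(r"\S+", text)   # runs of anything that isn't whitespace
periods = re.findall(r"\.", text)   # literal periods only

print(digits)   # ['66', '5']
print(words)    # ['Order', '66', 'shipped', 'to', 'Jo_Ann', 'on', 'day', '5']
print(chunks)   # ['Order', '66', 'shipped', 'to', 'Jo_Ann', 'on', 'day', '5.']
print(periods)  # ['.']
```

Notice that \w keeps Jo_Ann together because the underscore counts as a word character, while \S also swallows the trailing period.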
📧 Breaking Down an Email Validator
Let's break down the email address validator from before.
An email address is made up of three main pieces: the user name, followed by two pieces of the domain name separated by a period.
[user name]@[domain].[com]
Without regex, we might solve this problem like so:
def is_valid_email(email):
    # There must be exactly one '@' separating the two halves
    if '@' not in email:
        return False
    parts = email.split('@')
    if len(parts) != 2:
        return False
    local_part, domain_part = parts

    # Neither half may be empty, and spaces are not allowed anywhere
    if len(local_part) == 0 or len(domain_part) == 0:
        return False
    if ' ' in email:
        return False

    # Neither half may start or end with a period
    if local_part[0] == '.' or local_part[-1] == '.':
        return False
    if domain_part[0] == '.' or domain_part[-1] == '.':
        return False

    # The domain needs at least two non-empty pieces separated by periods
    domain_parts = domain_part.split('.')
    if len(domain_parts) < 2:
        return False
    for part in domain_parts:
        if len(part) == 0:
            return False

    # Reject a handful of special characters
    disallowed_chars = ['?', '!', '#', '$', '%', '^', '&', '*', '(', ')', '[', ']', '{', '}', '<', '>', ',', ';', ':', '/', '\\']
    for char in disallowed_chars:
        if char in email:
            return False
    return True
With regex, however, we can boil this function down to two lines:
import re

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

print(is_valid_email("example@email.com"))  # Output: True
Let's go through the regex piece by piece.
1. `^` beginning of the string
2. `[a-zA-Z0-9._%+-]` allow any alphanumeric character or one of those special characters
3. `+` make sure at least one of the characters in the class from Step 2 is present (in the user name)
4. `@` require an @ symbol before the next piece
5. `[a-zA-Z0-9.-]` allow any alphanumeric character, a period, or a dash
6. `+` make sure at least one of the characters in the class from Step 5 is present (in the first half of the domain)
7. `\.` require a period before moving on to the next piece
8. `[a-zA-Z]` accept any upper or lowercase letter
9. `{2,}` require at least 2 of the characters in the class from Step 8 (in the last half of the domain)
10. `$` end of the string
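re.match checks whether the pattern matches starting at the beginning of the string given to email. Because our pattern ends with $, the match also has to run to the end of the string, so the whole address must fit the pattern. More on regex functions below.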
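To convince yourself the pattern behaves the way the walkthrough says, you can feed it a few addresses. This is a sketch with made-up inputs; `is_valid_email_strict` is a hypothetical variant I added to show `re.fullmatch`, which refuses a trailing newline that `$` on its own lets through:

```python
import re

EMAIL_PATTERN = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

def is_valid_email(email):
    # Same check as in the article: match from the start of the string.
    return re.match(EMAIL_PATTERN, email) is not None

def is_valid_email_strict(email):
    # Hypothetical variant: the match must cover every character.
    return re.fullmatch(EMAIL_PATTERN, email) is not None

print(is_valid_email("example@email.com"))               # True
print(is_valid_email("first.last+tag@sub.example.org"))  # True
print(is_valid_email("no-at-sign.com"))                  # False: no @
print(is_valid_email("user@domain"))                     # False: no period in the domain
print(is_valid_email("user@domain.c"))                   # False: last piece too short

# "$" also matches just before a final newline, so re.match accepts this one.
print(is_valid_email("example@email.com\n"))             # True
print(is_valid_email_strict("example@email.com\n"))      # False
```

If your input might carry a stray newline (say, from reading a file), the stricter version or a quick `.strip()` is probably the safer choice.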
Boom! You've validated an email using regex! Much nicer than writing a whole validation function, right?
Now think about this: How would you build the validator to only accept addresses with a .com domain name?
Hint: test out your answer with regex101.com
🖥️ Regex Functions
Regex can be used to match, search, substitute, and extract pieces of a text string. I won't be going over those here, but there is great documentation for utilizing these functions. See Resources below.
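Just to give a taste, here's a minimal sketch of four functions from Python's built-in `re` module. The sentence is made up, and the pattern is the email pattern from earlier with the ^ and $ anchors removed so it can match in the middle of a longer string:

```python
import re

text = "Contact alice@example.com or bob@test.org for details."  # made-up text
email = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# search: find the first match anywhere in the string
first = re.search(email, text)
print(first.group())  # alice@example.com

# findall: extract every match as a list of strings
all_emails = re.findall(email, text)
print(all_emails)     # ['alice@example.com', 'bob@test.org']

# sub: substitute every match with replacement text
redacted = re.sub(email, "[hidden]", text)
print(redacted)       # Contact [hidden] or [hidden] for details.

# match: only succeeds if the pattern matches at the very start
at_start = re.match(email, text)
print(at_start)       # None
```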
📈 Conclusion
Regex supports advanced techniques such as capturing groups, lookahead and lookbehind assertions, and backreferences. This was by no means a comprehensive guide to regex.
In fact, most languages have packages that utilize slightly different versions of regex. Make sure to read the official documentation for the package version you're using and experiment with some examples yourself.
Regular expressions are a versatile tool for text processing tasks. While the syntax may seem daunting to new programmers at first glance, once you understand them they become fun tools with which to solve complex problems.