DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»

DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’» is a community of 963,503 amazing developers

We're a place where coders share, stay up-to-date and grow their careers.

Create account Log in
abhinav2499
abhinav2499

Posted on

Pattern Matching with Regular Expressions in Python: Part 1

We’re all familiar with searching for text by pressing CTRL + F and typing in the words that we’re looking for. Regular expressions go one step further: they allow us to specify a pattern of text to search for. Someone may not know a business’s exact landline number, but if they live in India, they know it’ll be 2 to 4 digits, followed by a hyphen, and then seven more digits. This is how we, as a human, know a phone number when we see it: 524-2768494 is a landline number, but 5,242,768,494 is not.
Regular expressions are helpful, but not many people know about them even though most of the modern text editors and word processors such as Microsoft Word or LibreOffice, have find and find-and-replace features that can search based on regular expressions. They are huge time-savers not just for software users but also for programmers.
This series is divided into two parts, in this part 1, we’ll start by writing a program to find text patterns without using regular expressions and then see how to use regular expressions to make the code much concise and clean. We’ll use basic matching with regular expressions and then in the next part we’ll move on to some of its more powerful features.
Finding Patterns of Text Without Regular Expressions
Let’s say we want to find a landline number in a string which contains 3 numbers, followed by a hyphen and then 7 more numbers such as 524-2768494.
Let’s use a function named isPhoneNumber() to check whether a string matches this pattern, returning either True or False. Open a Jupyter notebook and enter the following code

Image description
The isPhoneNumber() function has code that does several checks to see whether the string in text is a valid landline number. If any of these checks fail, the function returns False. First, the code checks that the string is exactly 11 characters. Then, it checks that the area code (that is, the first three characters in text) consists of only numeric characters. The rest of the function checks that the string follows the pattern of a landline number. The number must have a hyphen followed by seven more numeric characters.
Calling isPhoneNumber() with the argument β€˜524-2768494’ will return True while calling isPhoneNumber() with β€˜Abhinav Yadav’ will return False as the first test fails because β€˜Abhinav Yadav’ is not 11 characters long.
We would have to add even more code to find this pattern of text in a larger string, it could be millions of characters long and the program would still run in less than a second similar to a program with regular expression but regular expressions make it quicker to write these programs.
Finding Patterns of Text With Regular Expressions
The previous landline number finding program works, but it uses a lot of code to do something limited: This isPhoneNumber() function is 12 lines long but can find only one pattern of phone numbers. What about a phone number formatted like 0524-2768494 or (524) 2768494? What if the phone number only had two digits before the hyphen, like 11-2768494? The isPhoneNumber() function would fail to validate them. We could add yet more code for these additional patterns, but there’s an easier way.
Regular expressions, called regex for short, are descriptions for a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral from 0 to 9. The regex \d\d\d-\d\d\d\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen followed by seven more numbers. Any other string would not match the \d\d\d-\d\d\d\d\d\d\d regex.
But a regex can be more sophisticated as well. For example, adding a 6 in curly brackets ({6}) after a pattern is like saying, β€œMatch this pattern 6 times.” So the slightly shorter regex \d{3}-\d{7} also matches the correct phone number format.
All the regex functions in Python are in re module so we’ll have to import it first. To create a regex that matches the phone number pattern, enter the following into Jupyter notebook.

Image description
Passing a string value representing our regex to re.compile returns a Regex pattern object which is stored in phoneNumberRegex. Then we call search() on phoneNumberRegex and pass search() the string we want to search a match for. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method returns a Match() object. Match objects have a group() method that will return the actual matched text from the searched string. The result of the search gets stored in the variable output. In this example, we know that our pattern will be found in the string, so we know that a Match object Weill be returned. Knowing that output contains a Match object and not the null value None, we can call group() on output to return the match. Writing output.group() inside our print statement displays the whole match, 524-2768494.

Oldest comments (0)

🌚 Friends don't let friends browse without dark mode.

Sorry, it's true.