Here are some regular expressions (regex) that can match the majority of names and addresses. Don't copy and paste them blindly, though: on their own, they can't guarantee a 100% match.
Name Regex
We’re discussing the name regex first on purpose, because an address often includes a human name. It’ll be clearer if we talk about names first.
The regex here can be applied to a first-name or last-name text field. We’ll focus on a single data field for a human name and ignore the details that differentiate first names from surnames.
The patterns in common names like “James,” “William,” “Elizabeth,” and “Mary” are trivial and can easily be matched with a simple pattern. How about names with more variation? Many languages have their own naming conventions. We’ll try to sort the different types of names into a few basic groups:
- Hyphenated names, e.g., Lloyd-Atkinson, Smith-Jones
- Names with apostrophes, e.g., D’Angelo, D’Esposito
- Names with spaces in-between, e.g., Van der Humpton, De Jong, Di Lorenzo
Carry on reading to see how text extraction can be done with a regex.
import re
test = [
"james",
"william",
"elizabeth",
"mary",
"d'angelo",
"andy",
"lloyd-atkinson",
"van der humpton",
"jo",
]
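# Lowercase letters plus the separators seen in names: space, comma, period, apostrophe, hyphen.
regex = re.compile(
r'^[a-z ,.\'-]+$'
)
# Flatten the per-string findall() results into one list of matched names.
print([match for x in test for match in regex.findall(x)])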
Address Regex
For geographical or language reasons, the format of an address varies all over the world. Here’s a long list describing these formats per country.
Since address formats are so varied, it’s practically impossible for one regex to cover all of these patterns. Even if one managed to do so, it’d be very challenging to test, as the test data set would have to be enormous.
Our address regex will only cover some of the common formats in English-speaking countries. It should do the trick for addresses that start with a number, like “123 Sesame Street.” It’s from this discussion thread, where it received positive feedback.
import re
test = [
"224 Belmont Street APT 220",
"225 N Belmont St 220",
"123 west 2nd ave",
"4 Saffron Hill Road 1",
# will fail as they don't start with a digit
"Flat A, 2 Second Avenue",
"Upper Level 10 ABC Street"
]
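# Groups: house number, optional single letter, street name, last word, optional "APT", optional unit number.
regex = re.compile(
r'^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$'
)
# Each match is a tuple of the six capture groups; non-matching strings contribute nothing.
print([match for x in test for match in regex.findall(x)])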
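To make the six capture groups of the address regex easier to read, here is a small sketch that attaches a label to each group. The pattern is the one above, unchanged; the field names in `FIELDS` and the `parse_address` helper are assumptions made up for this illustration, not part of the original regex.

```python
import re

ADDRESS = re.compile(
    r'^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$'
)

# Hypothetical labels for the six capture groups, in order.
FIELDS = ("number", "prefix", "street", "suffix", "unit_label", "unit_number")


def parse_address(text):
    """Return a dict of labelled parts, or None if the pattern does not match."""
    match = ADDRESS.match(text)
    if match is None:
        return None
    # Optional groups that did not take part in the match come back as None.
    return {field: value or "" for field, value in zip(FIELDS, match.groups())}


parsed = parse_address("224 Belmont Street APT 220")
print(parsed)

# Does not start with a digit, so the anchored pattern rejects it.
rejected = parse_address("Flat A, 2 Second Avenue")
print(rejected)
```

Note that the labels only describe where a piece of text landed in the pattern. The regex has no idea whether “Street” is really a street type; it is simply the last word before the optional unit.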
Limitations of Using Regex to Extract Names and Addresses
Dealing with Uncommon Values
While these regexes may be able to validate a large portion of names and addresses, they will likely miss some, especially those that are non-English or newly introduced. For example, Spanish and German names weren’t considered thoroughly here, probably because the developer wasn’t familiar with these languages.
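As a quick illustration of this gap, the sketch below runs the name regex from earlier against a few names with accented letters or a typographic apostrophe. The `unicode_name` pattern is a hypothetical variant written for this example, not a recommended fix: it relies on `[^\W\d_]` matching any Unicode letter in Python 3, and it will still miss plenty of real names.

```python
import re

# The name pattern from earlier: ASCII lowercase letters plus a few separators.
ascii_name = re.compile(r"^[a-z ,.'-]+$")

# Hypothetical variant: [^\W\d_] is "a word character that is neither a digit
# nor an underscore", i.e. any Unicode letter in Python 3 str patterns.
unicode_name = re.compile(r"^[^\W\d_]+(?:[ '’-][^\W\d_]+)*$")

samples = ["josé", "müller", "o'brien", "d’angelo"]

ascii_hits = [s for s in samples if ascii_name.match(s)]
unicode_hits = [s for s in samples if unicode_name.match(s)]

print(ascii_hits)    # only the plain-ASCII name gets through
print(unicode_hits)  # the variant accepts all four samples
```

Even the wider pattern is only a patch: it assumes precomposed characters and a fixed set of separators, which is exactly the kind of hidden assumption that makes uncommon values slip through.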
No Pattern to Follow
Regex works well on data that follows a strict pattern, and neither names nor addresses belong to that category. They’re ever-changing, with new instances created every day and massive variation among them, so regex isn’t going to do a good job of extracting them. In short, they are not “regular” enough, with no intuitive pattern to follow.
Unable to Find the Likeliest Name
Regex also lacks the ability to pick out the “most likely” name. Let’s take a step back and assume there’s a regex R that can extract names flawlessly from documents that were scanned via an OCR data extraction service. We want to get the recipient’s name from a letter from Ann to Mary:
Dear Mary,
How have you been these days? Lately, Tom and I have been planning to travel around the World.
...
...
...
Love,
Ann
There are three names in the letter: Mary, Tom, and Ann. If you use R to find names, you’ll end up with a list of all three; you won’t get just Mary, the recipient.
So, how can this be achieved? We can give each name a score based on:
- Its position on the document
- How “naive” it is (i.e., how often it appeared in a training data set)
- How likely the name is to be the single target, judging from a training data set
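The scoring idea above can be sketched roughly as follows. This is only an illustration of combining the three signals, not the solution the article refers to: the `score` helper, the weights, and every number in `candidates` are invented for the example, whereas a real system would derive them from training data.

```python
# Each candidate carries three made-up signals (all values are illustrative):
#   position    - 0.0 means top of the document, 1.0 means bottom
#   commonness  - how often the name appeared in a training data set
#   target_rate - how often the name was the single target in that data set
candidates = [
    {"name": "Mary", "position": 0.0, "commonness": 0.8, "target_rate": 0.6},
    {"name": "Tom", "position": 0.3, "commonness": 0.7, "target_rate": 0.1},
    {"name": "Ann", "position": 1.0, "commonness": 0.6, "target_rate": 0.3},
]


def score(candidate, weights=(0.5, 0.2, 0.3)):
    """Weighted sum of the three signals; the weights are arbitrary assumptions."""
    w_position, w_common, w_target = weights
    # When looking for a recipient, names nearer the top score higher.
    return (
        w_position * (1.0 - candidate["position"])
        + w_common * candidate["commonness"]
        + w_target * candidate["target_rate"]
    )


best = max(candidates, key=score)
print(best["name"])
```

With these invented numbers the name at the top of the letter wins, which matches the intuition that a greeting line usually names the recipient. A plain regex has no place to express that kind of preference.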
Unable to Differentiate Name and Address
On paper, names and addresses can look like the same thing. “John” can be a name, or it can be part of an address, like “John Street.” Regexes can’t see this difference and react accordingly. We surely don’t want to get “Sesame Street” as a name and “Mr. Sherlock Holmes” as a street address!
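A short check with the name regex from earlier shows the problem. The sample strings are chosen for this illustration; the pattern is the same one used in the name section.

```python
import re

# The name pattern from earlier in the article.
name_regex = re.compile(r"^[a-z ,.'-]+$")

samples = ["john", "john street", "sesame street"]

# Every sample consists of lowercase letters and spaces, so all of them pass.
accepted = [s for s in samples if name_regex.match(s)]
print(accepted)
```

The pattern only checks which characters appear, so a street name made of letters and spaces is indistinguishable from a person's name.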
So how can you achieve better extraction accuracy? For more details and our proposed solution, please refer to this article! Cheers!