DEV Community

Frank

Need help python text matching

Hi DEV community!

I have a project idea that I would like to dive into but I need your help defining some major points.

The project is about matching each string to the most similar string in a different dataset (table) in Python.

For example:
Table 1
name/city, country
Mark Smith/New York City, USA
Mirko Smirk/New York City, USA
John E. Doe/Paris, USA
Jane Doe/Paris, France

Table 2
name/city, country
Mirko S./NYC, US
Mark S./NYC, US
Jane D/Paris, France
J. Doe/Paris, US

The idea is to match each person with their counterpart in the other table.
For example, row 1 of the first table should correctly match with row 2 of the second table.

As you can see, there are no perfect matches, and sometimes the first column matches but the second doesn't. The second column should carry more weight in the decision.
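One way to express that weighting as code (my own sketch, not from the post): score each candidate pair with a weighted average of per-column similarities, here using the standard library's difflib. The 0.4/0.6 weights are an illustrative choice, not a tuned value.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(person, candidates, name_weight=0.4, location_weight=0.6):
    """Pick the candidate with the highest weighted column score.

    Each record is a (name, location) tuple; the weights reflect the
    idea that the second column should count for more.
    """
    def score(cand):
        return (name_weight * similarity(person[0], cand[0])
                + location_weight * similarity(person[1], cand[1]))
    return max(candidates, key=score)

table2 = [("Mirko S.", "NYC, US"), ("Mark S.", "NYC, US"),
          ("Jane D", "Paris, France"), ("J. Doe", "Paris, US")]

# "Jane Doe" from Paris, France pairs with "Jane D" / Paris, France
print(best_match(("Jane Doe", "Paris, France"), table2))
```

Note that when two candidates share the same location (the two NYC rows), the location term cancels out and only the name similarity decides.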

I first thought of an approach based on Levenshtein distance, using the FuzzyWuzzy package to calculate the differences between sequences. But I ran into issues like the following:
Table 1:
Franklin Delano Roosevelt
Franklin Da Turtle

Table 2:
FDR

Levenshtein distance says that Franklin Da Turtle is a better match than Franklin Delano Roosevelt for FDR, simply because the shorter string requires fewer edits.
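You can reproduce that length bias with a plain Levenshtein implementation (a minimal sketch; FuzzyWuzzy normalizes its scores differently, but the raw distances show the same effect):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

# The shorter (wrong) name needs fewer edits, so it looks "closer":
print(levenshtein("FDR", "Franklin Da Turtle"))         # 16
print(levenshtein("FDR", "Franklin Delano Roosevelt"))  # 22
```

Since the distance is dominated by the number of characters that have to be deleted, raw Levenshtein will almost always favor the shorter candidate when matching an abbreviation.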

In real-life data I will have many columns to help validate the decision, but I'm stuck anyway.

Is there a better approach? I also thought of some vector-based NLP technique, but that isn't my area of expertise.
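On the vector idea: one lightweight, NLP-flavored option (my own sketch, not something from the thread) is to compare character-trigram count vectors with cosine similarity. Unlike raw edit distance, extra trailing words only dilute the score instead of dominating it:

```python
from collections import Counter
from math import sqrt

def trigrams(s: str) -> Counter:
    """Count overlapping character 3-grams of a lowercased string."""
    s = s.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the trigram count vectors of a and b."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(v * v for v in va.values()))
            * sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

# A name that is a prefix of a longer name still scores high:
print(cosine("Hotel Pennsylvania", "Hotel Pennsylvania New York"))
```

This is essentially a bag-of-n-grams model; libraries like scikit-learn can build the same vectors at scale, but for two small tables a plain dict-based version like this is enough to experiment with.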

Any ideas will be well received.

Leave a comment if you know a better approach than FuzzyWuzzy.

Top comments (4)

BADSHAH

I'm also a noob, but I have an idea that may help. Build a dictionary of the correct words you want. Then take each cell, match it against your dictionary, and score the accuracy. If the accuracy is good enough, replace the cell with the dictionary word. Maybe it helps.
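A minimal version of this dictionary idea (my own sketch of the suggestion above), using the standard library's difflib.get_close_matches as the "accuracy" check:

```python
from difflib import get_close_matches

# "Dictionary" of canonical names (table 1 in the post's example)
canonical = ["Mark Smith", "Mirko Smirk", "John E. Doe", "Jane Doe"]

def normalize(cell: str) -> str:
    """Replace a messy cell with its closest canonical form, if any.

    0.6 is difflib's default similarity cutoff; cells that don't clear
    it are left unchanged.
    """
    matches = get_close_matches(cell, canonical, n=1, cutoff=0.6)
    return matches[0] if matches else cell

print(normalize("Mark S."))  # close enough to "Mark Smith"
print(normalize("FDR"))      # no candidate clears the cutoff, kept as-is
```

Note the "FDR" case: abbreviation-style cells fall below any reasonable similarity cutoff, which is exactly the failure mode described in the post.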

Sia

Hi there, Frank

A couple of questions (that I had) before you proceed with what's proposed below:

  • Does the first table always have the long version?
  • Is the abbreviated form more or less the same (by rule)?
  • How much more would you need to expand the rules if #2 is not true?

The idea here assumes the answers to these questions are yes -> yes -> no.

You could write a couple of small methods (or just one, if this doesn't extend to more variations) that do two things:

  • abbreviate every word in table one based on the common rule(s); e.g. your examples show an initial-abbreviation rule (first name kept, last name reduced to an initial) and a city-abbreviation rule, both doable with case and string methods in Python
  • use string methods to parse the returned value (from the table 1 transformation) and compare it to the table 2 value

Again, this assumes the rules do not change too much for table 1 -> table 2 form. Hope it helps. If not, try to abbreviate, then use regex module to remove punctuation and check the character sequence (?maybe). I think NLTK might be used if you have a huuuuuuuuge dataset with collocation finding and measures. But yes, if you don't have to go there yet, a method should do.
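A rough sketch of that abbreviation step, assuming the "first name + last initial" rule guessed from the post's examples actually holds:

```python
def abbreviate(name: str) -> str:
    """'Mark Smith' -> 'Mark S.' (first name kept, last name as initial)."""
    parts = name.split()
    return f"{parts[0]} {parts[-1][0]}."

# Transform table-1 names once, then look table-2 cells up directly:
abbreviations = {abbreviate(n): n for n in
                 ["Mark Smith", "Mirko Smirk", "Jane Doe"]}

print(abbreviations.get("Mark S."))   # maps back to "Mark Smith"
```

If the abbreviation styles vary (the example table 2 also has "Jane D" with no period and "J. Doe" with the first name initialed instead), each style would need its own rule, which is where the "how much do the rules expand" question above starts to bite.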

Frank

Unfortunately, I oversimplified the problem.
With real data I will have to cross a list of hotel names with my client's list of hotels.
Maybe my list says "Hotel Pennsylvania" and the other list says "Hotel Pennsylvania New York".

It's not that I will always have the full name; maybe my client has a fuller version, which is OK.
In order to match these two hotels I will use more columns, like city and country.
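For that hotel case, a token-containment score (a sketch of one possible approach, not something from the thread) treats "Hotel Pennsylvania" as a full match when all of its words appear in the longer client name:

```python
def token_overlap(x: str, y: str) -> float:
    """Fraction of the shorter name's words that appear in the other name."""
    a, b = set(x.lower().split()), set(y.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

print(token_overlap("Hotel Pennsylvania", "Hotel Pennsylvania New York"))  # 1.0
print(token_overlap("Hotel Pennsylvania", "Hilton Paris"))                 # 0.0
```

Because the score divides by the shorter name's word count, a name that is a subset of a longer one scores a perfect 1.0, and ambiguity can then be resolved with the extra columns (city, country) as described.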

Sia • Edited

Gotcha! Then you might be able to combine the two. NLTK has a collocations module where you can check whether a word or phrase is in a block of text and get that word along with the n closest terms on either side of it. If you have both full words and abbreviations, e.g. NYC and New York City, you could use a method to transform and then use collocations to check for both forms of a table 1 value in table 2. If I remember correctly, it's part of nltk.collocations, and there are methods for bi-, tri-, and quadgrams. Still new here so not entirely sure how to share a code snippet, but I hope this jolts some ideas.
P.S. for conjoined words that you may want to split (if there is cleaning involved), there is a wordninja module that does this: e.g. iamtiredtoday -> I am tired today.
Good luck!