Discussion on: Need help python text matching


Sia • Jul 10 '20

Hi there, Frank

A few questions before you go ahead with what I propose below:

1. Does the first table always have the long version?
2. Is the abbreviated form more or less the same (by rule)?
3. How much more would you need to expand the rules if #2 is not true?

The idea here assumes the answers to these questions are yes, yes, and no (no extra rules needed).

You could write a couple of small methods (or just one, if this doesn't extend to more variations) that do two things:

abbreviate every value in table 1 based on the common rule(s), using case and string methods in Python. Your examples show an initial-abbreviation rule, a first name/last initial rule, and a city-abbreviation rule.
use string methods to parse the returned value (from the table 1 transformation) and compare it to the table 2 value
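The two steps above could be sketched roughly like this. The rule itself (every word but the last becomes an initial) and the names `abbreviate_name`, `normalize`, and `same_entry` are my assumptions for illustration, since I don't know your exact rules:

```python
def abbreviate_name(full_name: str) -> str:
    """Hypothetical rule: reduce every word except the last to an initial."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name.strip()
    initials = [part[0].upper() + "." for part in parts[:-1]]
    return " ".join(initials + [parts[-1].title()])


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so the comparison ignores both."""
    return " ".join(text.casefold().split())


def same_entry(table1_value: str, table2_value: str) -> bool:
    """Abbreviate the table 1 value, then compare it to the table 2 value."""
    return normalize(abbreviate_name(table1_value)) == normalize(table2_value)
```

For example, `same_entry("Frank Miller", "f. miller")` should come out true under this rule. A city rule would need its own small function along the same lines.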

Again, this assumes the rules do not change much between the table 1 and table 2 forms. Hope it helps. If not, try abbreviating first, then use the re (regex) module to remove punctuation and check the character sequence (maybe?). I think NLTK, with its collocation finders and association measures, might be worth a look if you have a huge dataset. But if you don't have to go there yet, a method should do.
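A minimal sketch of the regex fallback, assuming "check the character sequence" means checking that the letters of the short form appear, in order, inside the long form. The function names are made up for illustration:

```python
import re


def strip_punctuation(text: str) -> str:
    """Drop everything except letters and digits, and ignore case."""
    return re.sub(r"[\W_]+", "", text).casefold()


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` consumes the iterator up to the match, preserving order.
    return all(ch in remaining for ch in short)


def could_abbreviate(abbrev: str, full: str) -> bool:
    """Loose check: could `abbrev` be a shortened form of `full`?"""
    short = strip_punctuation(abbrev)
    return bool(short) and is_subsequence(short, strip_punctuation(full))
```

This is deliberately loose: a very short abbreviation will match many long strings, so it is better used to narrow down candidates than to confirm a match.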

Frank • Jul 10 '20

Unfortunately, I oversimplified the problem.
In the real data I will have to cross-match a list of hotel names against my client list of hotels.
For example, my list might say "Hotel Pennsylvania" while the other list says "Hotel Pennsylvania New York".

I won't always be the one with the full name; sometimes my clients have the fuller version, which is OK.
To match these two hotels I will use more columns, such as city and country.
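A minimal sketch of that kind of match, assuming each row is a dict with hypothetical `name`, `city`, and `country` keys, and treating two names as the same hotel when the words of one are fully contained in the other:

```python
def name_tokens(name: str) -> set:
    """Split a hotel name into a set of lower-cased words."""
    return set(name.casefold().split())


def same_hotel(mine: dict, theirs: dict) -> bool:
    """City and country must agree; one name must contain the other's words."""
    for column in ("city", "country"):  # hypothetical column names
        if mine[column].casefold() != theirs[column].casefold():
            return False
    a, b = name_tokens(mine["name"]), name_tokens(theirs["name"])
    # Either list may hold the fuller version, so test containment both ways.
    return bool(a) and bool(b) and (a <= b or b <= a)
```

Because the containment test runs in both directions, it does not matter which list has the longer name.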

Sia • Jul 12 '20 • Edited

Gotcha! Then you might be able to combine the two. NLTK has a collocations module (nltk.collocations, if I remember correctly) for finding words that tend to occur together in a block of text, so you can look at a word or phrase along with the terms closest to it. If you have both full words and abbreviations, e.g. NYC and New York City, you could use a method to transform one into the other and then use collocations to check for both forms of a table 1 value in table 2. There are finders for bigrams, trigrams, and quadgrams. Still new here, so not entirely sure how to share a code snippet, but I hope this sparks some ideas.
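To make the "check for both forms" idea concrete, here is a rough sketch in plain Python. It builds n-grams by hand instead of calling NLTK, so nothing below is NLTK's API, and the `ABBREVIATIONS` table and function names are placeholders I made up:

```python
# Hypothetical lookup table; in practice this would come from your own rules.
ABBREVIATIONS = {"nyc": "new york city"}


def ngrams(words: list, n: int) -> list:
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def both_forms(term: str) -> set:
    """The term itself plus its abbreviated or expanded counterpart, if known."""
    term = " ".join(term.casefold().split())
    forms = {term}
    for short, long in ABBREVIATIONS.items():
        if term in (short, long):
            forms.update((short, long))
    return forms


def mentions(term: str, text: str) -> bool:
    """True if any form of `term` appears as consecutive words in `text`."""
    words = text.casefold().split()
    for form in both_forms(term):
        target = tuple(form.split())
        if target in ngrams(words, len(target)):
            return True
    return False
```

With this, `mentions("NYC", "Hotel Pennsylvania New York City")` and `mentions("New York City", "Hotel Pennsylvania NYC")` should both match, since either form counts.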
P.S. for conjoined words that you may want to split (if there is cleaning involved), there is a wordninja module that does this: e.g. iamtiredtoday -> I am tired today.
Good luck!