DEV Community

Discussion on: Need help python text matching

Sia

Hi there, Frank

A few questions I had before you try what I propose below:

  • Does the first table always have the long version?
  • Do the abbreviated forms follow more or less the same rule(s)?
  • Would you need to expand the rules much if #2 is not true?

The idea here assumes the answers to these questions are yes, yes, and no.

You could write a couple of small methods (or just one, if this doesn't extend to more variations) that do two things:

  • abbreviate every value in table 1 using the common rule(s). Your examples show an initials rule, a first/last-name initial rule, and a city abbreviation rule, all of which can be done with Python's case and string methods
  • use string methods to parse the value returned from the table 1 transformation and compare it to the table 2 value
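A minimal sketch of those two steps, assuming two hypothetical rules (word initials, and first initial plus last name). The function names and the rules themselves are illustrations, not taken from your data, so swap in whatever rules your tables actually follow:

```python
def initials(phrase: str) -> str:
    """Abbreviate a phrase to the upper-cased initial of each word."""
    return "".join(word[0].upper() for word in phrase.split())


def first_initial_last_name(full_name: str) -> str:
    """Abbreviate 'First Last' to 'F. Last' (an assumed rule)."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name.strip()
    return f"{parts[0][0].upper()}. {parts[-1]}"


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so formatting does not matter."""
    return " ".join(text.lower().split())


def matches(long_form: str, short_form: str,
            rules=(initials, first_initial_last_name)) -> bool:
    """True if any rule turns the table 1 value into the table 2 value."""
    target = normalize(short_form)
    return any(normalize(rule(long_form)) == target for rule in rules)


print(matches("New York City", "NYC"))
print(matches("Frank Smith", "F. Smith"))
```

Keeping each rule as its own small function means a new abbreviation pattern only needs one more entry in `rules`.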

Again, this assumes the rules do not change too much between the table 1 and table 2 forms. Hope it helps. If not, try abbreviating first, then use the regex module (re) to remove punctuation and check the character sequence (maybe?). I think NLTK, with its collocation finders and association measures, might be worth a look if you have a huuuuuuuuge dataset. But if you don't have to go there yet, a method should do.
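One way to read the regex fallback, as a rough sketch: strip punctuation with `re`, then check that the short form's characters appear in the long form in the same order. Treating "check the character sequence" as a subsequence test is my assumption, and the function names are made up for illustration:

```python
import re


def strip_punctuation(text: str) -> str:
    """Lower-case the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in short)


def loosely_matches(long_form: str, short_form: str) -> bool:
    """Permissive check: cleaned short form is a subsequence of cleaned long form."""
    short = strip_punctuation(short_form)
    if not short:  # an empty string would otherwise match everything
        return False
    return is_subsequence(short, strip_punctuation(long_form))


print(loosely_matches("New York City", "N.Y.C."))
print(loosely_matches("New York City", "L.A."))
```

This is deliberately loose, so expect some false positives on short abbreviations; it is better used to shortlist candidates than to confirm a match.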

Frank

Unfortunately, I oversimplified the problem.
In the real data I will have to cross-reference a list of hotel names against my client's list of hotels.
My list might say "Hotel Pennsylvania" while the other list says "Hotel Pennsylvania New York".

I won't always have the full name; the client's list may have a fuller version, which is OK.
To match these two hotels I will also use more columns, like city and country.
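A rough sketch of that matching idea, assuming each row is a dict with hypothetical `name`, `city`, and `country` keys (the real column names may differ). The rule used here, that city and country must agree and one name's words must all appear in the other, is only one possible interpretation:

```python
import re


def tokens(name: str) -> set:
    """Split a name into a set of lower-cased alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def same_place(a: str, b: str) -> bool:
    """Compare two city or country values, ignoring case and outer spaces."""
    return a.strip().lower() == b.strip().lower()


def same_hotel(mine: dict, theirs: dict) -> bool:
    """True if city and country agree and one name contains the other's words."""
    if not same_place(mine["city"], theirs["city"]):
        return False
    if not same_place(mine["country"], theirs["country"]):
        return False
    a, b = tokens(mine["name"]), tokens(theirs["name"])
    if not a or not b:
        return False
    # Either list may hold the fuller version of the name.
    return a <= b or b <= a


my_row = {"name": "Hotel Pennsylvania", "city": "New York", "country": "USA"}
client_row = {"name": "Hotel Pennsylvania New York", "city": "new york", "country": "usa"}
print(same_hotel(my_row, client_row))
```

Word-set containment ignores word order and spelling variants, so names like "Pennsylvania Hotel" still match but a typo would not.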

Sia • Edited

Gotcha! Then you might be able to combine the two. NLTK has a collocations module that lets you look at which words occur together in a block of text, within a window of n terms. If you have both full words and abbreviations, e.g. NYC and New York City, you could use a method to transform one into the other and then use collocations to check for both forms of a table 1 value in table 2. If I remember correctly, it is part of nltk.collocations, which has finders for bigrams, trigrams, and quadgrams. Still new here, so not entirely sure how to share a code snippet, but I hope this sparks some ideas.
P.S. For conjoined words that you may want to split (if there is cleaning involved), there is a wordninja module that does this, e.g. iamtiredtoday -> I am tired today.
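To make the "check for both forms" idea concrete, here is a small sketch in plain Python. It builds word n-grams by hand rather than calling nltk.collocations, so treat it as a stand-in for the idea and not as NLTK usage. The `ABBREVIATIONS` table and the function names are hypothetical:

```python
# Hypothetical lookup of short form -> long form, both lower-cased.
ABBREVIATIONS = {"nyc": "new york city"}


def ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def contains_phrase(text: str, phrase: str) -> bool:
    """True if `phrase` appears in `text` as a run of whole words."""
    words = text.lower().split()
    target = tuple(phrase.lower().split())
    return bool(target) and target in ngrams(words, len(target))


def mentions(text: str, term: str) -> bool:
    """True if `text` contains `term` in either its short or long form."""
    term = term.lower()
    forms = {term}
    for short, long in ABBREVIATIONS.items():
        if term in (short, long):
            forms.update((short, long))
    return any(contains_phrase(text, form) for form in forms)


print(mentions("Hotel Pennsylvania New York City", "NYC"))
print(mentions("Hotel Pennsylvania NYC", "New York City"))
```

Punctuation is not handled here, so clean the text first if your names contain commas or periods.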
Good luck!