Hi DEV community!
I have a project idea that I would like to dive into but I need your help defining some major points.
The project is about matching each string in one dataset (table) to the most similar string in another dataset, in Python.
Table 1:
Mark Smith/New York City, USA
Mirko Smirk/New York City, USA
John E. Doe/Paris, USA
Jane Doe/Paris, France

Table 2:
Mirko S./NYC, US
Mark S./NYC, US
Jane D/Paris, France
J. Doe/Paris, US
The idea is to match each person with their own record in the other table.
For example, row 1 of the first table would correctly match row 2 of the second table.
As you can see, there are no perfect matches, and sometimes the first column matches while the second doesn't. The second column should carry more weight in the decision.
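To make the weighting idea concrete, here is a minimal sketch of a per-column weighted matcher. It uses Python's stdlib `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy's `ratio`, and the 0.4/0.6 weights are just my assumption to show the location column counting for more:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Levenshtein-style ratio in [0, 1] (stdlib stand-in for fuzz.ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def weighted_score(row_a, row_b, weights=(0.4, 0.6)):
    """Combine per-column similarities; the location column gets more weight."""
    name_sim = similarity(row_a[0], row_b[0])
    loc_sim = similarity(row_a[1], row_b[1])
    return weights[0] * name_sim + weights[1] * loc_sim

def best_match(row, candidates):
    """Return the candidate row with the highest weighted score."""
    return max(candidates, key=lambda c: weighted_score(row, c))

table_a = [
    ("Mark Smith", "New York City, USA"),
    ("Mirko Smirk", "New York City, USA"),
    ("John E. Doe", "Paris, USA"),
    ("Jane Doe", "Paris, France"),
]
table_b = [
    ("Mirko S.", "NYC, US"),
    ("Mark S.", "NYC, US"),
    ("Jane D", "Paris, France"),
    ("J. Doe", "Paris, US"),
]

print(best_match(table_a[0], table_b))  # → ('Mark S.', 'NYC, US')
```

With more real-life columns, `weighted_score` would just take a weight per column; tuning those weights is the open question.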
I first thought of an approach based on Levenshtein distance, using the FuzzyWuzzy package to calculate the differences between sequences. But I ran into issues like the following:
Franklin Delano Roosevelt
Franklin Da Turtle
A Levenshtein-based score can rank Franklin Da Turtle as a better match than Franklin Delano Roosevelt for FDR, simply because the shorter string leaves fewer characters to compare and penalize.
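One common fix for this length bias is token-based matching, like FuzzyWuzzy's `token_set_ratio`. Below is a simplified, stdlib-only sketch of that idea (the contrived "Rooster" query/candidate pair is my own, just to make the bias reproducible): when one string's words are a subset of the other's, the shared tokens score highly no matter how long the rest is.

```python
from difflib import SequenceMatcher

def plain_sim(a: str, b: str) -> float:
    """Straight character-level ratio (the length-biased baseline)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_set_sim(a: str, b: str) -> float:
    """Rough sketch of the token_set_ratio idea: compare the shared
    words against each full, alphabetically sorted word string."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    return max(SequenceMatcher(None, inter, sa).ratio(),
               SequenceMatcher(None, inter, sb).ratio())

query = "Roosevelt"
# The short, wrong candidate wins on a plain character ratio...
print(plain_sim(query, "Rooster"))                        # higher
print(plain_sim(query, "Franklin Delano Roosevelt"))      # lower
# ...but the token-set comparison ranks the full name first.
print(token_set_sim(query, "Franklin Delano Roosevelt"))  # 1.0
```

The real `fuzz.token_set_ratio` is more robust than this sketch (it also handles the no-shared-token case gracefully), but the principle is the same.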
In real-life data I will have many more columns to validate the decision against, but I'm stuck anyway.
Is there a better approach? I also thought of some vector-based NLP technique, but that isn't my area of expertise.
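For the vector direction, one lightweight option that needs no ML expertise is to represent each string as a bag of character n-grams and compare with cosine similarity; this is roughly what TF-IDF matchers such as scikit-learn's `TfidfVectorizer(analyzer='char_wb')` do. A dependency-free sketch, with the trigram size and whitespace padding as my own assumptions:

```python
from collections import Counter
from math import sqrt

def char_ngrams(s: str, n: int = 3) -> Counter:
    """Bag of character n-grams; padding makes word boundaries count."""
    s = f" {s.lower()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_sim(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between the two n-gram count vectors."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * \
           sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# The abbreviation shares trigrams with the full form; an unrelated city shares none:
print(cosine_sim("New York City, USA", "NYC, US"))        # > 0
print(cosine_sim("New York City, USA", "Paris, France"))  # 0.0
```

The practical appeal is scale: you can vectorize both tables once and use fast nearest-neighbor search, instead of computing a fuzzy ratio for every pair of rows.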
Any ideas will be well received.
Leave a comment if you know something better than the FuzzyWuzzy approach.