Discussion on: Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

These days I often work on NLU, so that sounds pretty good.

It would be interesting to apply a normalizing function on the input text, just for matching. Something like:

from flashtext.keyword import KeywordProcessor
from unidecode import unidecode

def normalize(c):
    return unidecode(c).lower()

keyword_processor = KeywordProcessor()
# set_normalizer is the proposed hook: applied to the input only while matching
keyword_processor.set_normalizer(normalize)
keyword_processor.add_keyword('remy', 'Rémy')
keyword_processor.add_keyword('nicolas', 'Nicolas')

new_sentence = keyword_processor.replace_keywords(
    'My name is Remy and unlike nicolas it is written with an accent'
)

This would help normalize how keywords are written without altering the rest of the sentence.
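
To make the idea concrete, here is a minimal sketch in plain Python. It is not FlashText's API: `normalize_char` and `replace_normalized` are hypothetical names, `unicodedata` from the standard library stands in for unidecode, and the input is assumed to use precomposed (NFC) characters. The normalizer maps each character to exactly one character, so a normalized copy can be used for matching while the output is built from the original text.

```python
import unicodedata


def normalize_char(c):
    """Map one character to its lowercase, accent-free form (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", c)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    # Fall back to the original character unless the mapping is one-to-one,
    # so that indexes in the normalized copy line up with the original text.
    return lowered if len(lowered) == 1 else c


def replace_normalized(text, keywords):
    """Replace whole-word matches, comparing on the normalized copy only.

    `keywords` maps an already-normalized keyword (lowercase, no accents)
    to the canonical spelling that should appear in the output.
    """
    shadow = "".join(normalize_char(c) for c in text)
    out = []
    i = 0
    while i < len(text):
        match = None
        at_word_start = i == 0 or not shadow[i - 1].isalnum()
        if at_word_start:
            for key, canonical in keywords.items():
                end = i + len(key)
                if key and shadow.startswith(key, i) and (
                    end == len(shadow) or not shadow[end].isalnum()
                ):
                    match = (end, canonical)
                    break
        if match:
            out.append(match[1])  # canonical spelling of the keyword
            i = match[0]
        else:
            out.append(text[i])  # untouched original character
            i += 1
    return "".join(out)


keywords = {"remy": "Rémy", "nicolas": "Nicolas"}
print(replace_normalized("My name is RÉMY", keywords))
# My name is Rémy
```

Only the matched keywords change; the capital "M" in "My" and any accents elsewhere in the sentence are left alone.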

Vikash Singh • Dec 11 '17

@remy: Sorry, I didn't get that completely. Can you please elaborate on the expected output and how the normalize function makes it happen?

Rémy 🤖 • Dec 11 '17

Suppose that your input is one of

My name is remy
My name is RÉMY
My name is Rémy

Then your output would be

My name is Rémy

It's like when you say you want to replace different instances of JavaScript. If you want JavaScript formatted the same way every time, you can use this technique to achieve that.

Vikash Singh • Dec 11 '17

That can already be done, right?

kp = KeywordProcessor()
kp.add_keyword('remy', 'Rémy')
print(kp.replace_keywords(normalize('My name is remy')))
print(kp.replace_keywords(normalize('My name is RÉMY')))
print(kp.replace_keywords(normalize('My name is Rémy')))

output:
my name is Rémy
my name is Rémy
my name is Rémy

Rémy 🤖 • Dec 11 '17

Yup, but then you're getting "my name is Rémy" instead of "My name is Rémy".

Also, it would allow processing the string without holding several copies of it in memory (and thus possibly working on a stream). If you're dealing with big texts, that might be interesting as well.
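
A rough sketch of what the streaming variant could look like, under simplifying assumptions: keywords are single words, the input uses precomposed (NFC) characters, and `normalize` and `replace_stream` are hypothetical names rather than anything FlashText provides. Characters are read one at a time and only the current word is buffered, so no normalized copy of the whole text is ever built.

```python
import unicodedata


def normalize(word):
    """Strip accents and lowercase a word (stdlib stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).lower()


def replace_stream(chars, keywords):
    """Yield output pieces while consuming `chars` one character at a time.

    `keywords` maps a normalized single-word keyword to its canonical form.
    """
    word = []  # original characters of the word currently being read
    for c in chars:
        if c.isalnum():
            word.append(c)
            continue
        if word:
            original = "".join(word)
            # Normalize only for the lookup; emit the original if no match.
            yield keywords.get(normalize(original), original)
            word = []
        yield c  # separators pass through unchanged
    if word:  # flush the last word when the stream ends
        original = "".join(word)
        yield keywords.get(normalize(original), original)


pieces = replace_stream(iter("My name is RÉMY, not remy"), {"remy": "Rémy"})
print("".join(pieces))
# My name is Rémy, not Rémy
```

Because `chars` can be any iterator, the same function could be fed from a file or a socket instead of an in-memory string.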

I don't have a direct application right now, but from the things I usually do, I'm guessing it would make sense.

Vikash Singh • Dec 11 '17

OK Remy. By the way, if we change the normalize method so that it doesn't lowercase the text, your requirement will be met.

def normalize(c):
    return unidecode(c)

output:
My name is Rémy
My name is Rémy
My name is Rémy

Also, whether I call normalize from within FlashText or outside of it, the memory and computation cost will be the same.

Still, I will keep looking for a possible use case for your suggestion. Thanks for bringing it up :) :)