DEV Community

Discussion on: Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

Rémy 🤖 • Edited

These days I often work on NLU, so this sounds pretty good.

It would be interesting to apply a normalizing function to the input text, used only for matching. Something like:

from flashtext.keyword import KeywordProcessor
from unidecode import unidecode

def normalize(c):
    # Strip accents and lowercase, for matching purposes only
    return unidecode(c).lower()

keyword_processor = KeywordProcessor()
# set_normalizer is the proposed addition, not an existing FlashText method
keyword_processor.set_normalizer(normalize)
keyword_processor.add_keyword('remy', 'Rémy')
keyword_processor.add_keyword('nicolas', 'Nicolas')

new_sentence = keyword_processor.replace_keywords(
    'My name is Remy and unlike nicolas it is written with an accent'
)

This would help normalize the spelling of keywords without altering the rest of the sentence.

Vikash Singh

@remy: Sorry, I didn't get that completely. Can you please elaborate on the expected output and how the normalize function makes it happen?

Rémy 🤖

Suppose that your input is one of

  • My name is remy
  • My name is RÉMY
  • My name is Rémy

Then your output would be

My name is Rémy

It's like when you say you want to replace different spellings of JavaScript. If you want JavaScript formatted the same way every time, you can use this technique to achieve that.

Vikash Singh

That can already be done, right?

print(kp.replace_keywords(normalize('My name is remy')))
print(kp.replace_keywords(normalize('My name is RÉMY')))
print(kp.replace_keywords(normalize('My name is Rémy')))

output:
my name is Rémy
my name is Rémy
my name is Rémy

Rémy 🤖

Yup, but then you get "my name is Rémy" instead of "My name is Rémy".

Also, it would allow processing the string without holding several copies of it in memory (and thus possibly working on a stream). If you're dealing with big texts, that might be interesting as well.

I don't have a direct application right now, but from the things I usually do, I'm guessing it would make sense.
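
For what it's worth, here is a rough sketch of the "normalize only for matching" idea, so the difference with normalizing the whole sentence is easier to see. It is not FlashText code: `normalize_char` and `replace_keywords_normalized` are hypothetical names, the standard-library `unicodedata` module stands in for `unidecode`, and `re` is used for the matching step only to keep the example short. The keyword keys are assumed to be already lowercase and unaccented.

```python
import re
import unicodedata


def normalize_char(c):
    # Hypothetical helper: strip accents and lowercase a single character.
    # Stand-in for unidecode(c).lower(), using only the standard library.
    decomposed = unicodedata.normalize('NFKD', c)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def replace_keywords_normalized(text, keywords):
    # Build a normalized copy of the text, plus a map from each normalized
    # character back to the index of the original character it came from.
    norm_chars, index_map = [], []
    for i, c in enumerate(text):
        for n in normalize_char(c):
            norm_chars.append(n)
            index_map.append(i)
    norm = ''.join(norm_chars)

    # Match whole keywords in the normalized copy (longest first).
    alternatives = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, alternatives)) + r')\b')

    out, last = [], 0
    for m in pattern.finditer(norm):
        start = index_map[m.start()]
        end = index_map[m.end() - 1] + 1
        # Swallow combining marks that belong to the last matched character.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        out.append(text[last:start])       # original text, untouched
        out.append(keywords[m.group(1)])   # canonical spelling of the keyword
        last = end
    out.append(text[last:])
    return ''.join(out)


keywords = {'remy': 'Rémy', 'nicolas': 'Nicolas'}
print(replace_keywords_normalized('My name is RÉMY', keywords))
# → My name is Rémy
print(replace_keywords_normalized('Déjà vu, says REMY', keywords))
# → Déjà vu, says Rémy
```

Only the matched spans are rewritten, so the capital "M" and the accents in non-keyword words survive. This sketch still builds a full normalized copy of the text, so it does not show the streaming benefit; that would need the normalization to happen character by character inside the matcher.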

Vikash Singh

OK Remy. By the way, if we change the normalize function so that it no longer lowercases the text, your requirement is met.

def normalize(c):
    return unidecode(c)

output:
My name is Rémy
My name is Rémy
My name is Rémy

Also, whether normalize is called inside FlashText or outside it, the memory use and computation are the same.

Still, I will keep looking for a possible use case for your suggestion. Thanks for bringing it up :) :)