I needed to augment textual data, and tutorials on this topic are scarce, so I'm writing this post to share how I augmented my data using NLTK and Python.
|This article is part of a series about machine learning for Kormos|
|ML and text processing on emails|
|text data augmentation: synonym replacement (you are here)|
Our data is a set of emails mostly written in French and English. I'm building a model that predicts whether an email corresponds to a website the user is subscribed to.
Hence we have 2 classes represented by a boolean named isAccount.
However, our dataset is very unbalanced.
Generating new data is time-consuming because our data is tagged by hand, so data augmentation seems to be a good solution.
Since our model is basically looking for specific keywords, synonym replacement is a good way to create new useful data.
Synonym replacement is a data augmentation method that consists of replacing words in a sentence with their synonyms.
Let's have a look at how to find synonyms using NLTK's wordnet:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import wordnet
wordnet.synsets("subscribe")
This gives us a list of synsets:
[Synset('subscribe.v.01'), Synset('sign.v.01'), Synset('subscribe.v.03'), Synset('pledge.v.02'), Synset('subscribe.v.05')]
Afterwards we can get the words in each synset with lemma_names().
Hence I made this basic function to get all synonyms for any English word:
from collections import OrderedDict
from nltk.tokenize import word_tokenize

def find_synonyms(word):
    synonyms = []
    for synset in wordnet.synsets(word):
        for syn in synset.lemma_names():
            synonyms.append(syn)
    # drop duplicates while maintaining word order (closest synonyms come first)
    synonyms_without_duplicates = list(OrderedDict.fromkeys(synonyms))
    return synonyms_without_duplicates

find_synonyms("subscribe")
The result for the word "subscribe" is:
['subscribe', 'sign', 'support', 'pledge', 'subscribe_to', 'take']
Some words have a lot of synonyms (50 for "support"!), so I only take the first 6 synonyms given by wordnet.
I also noticed that short words tend to have inadequate synonyms in context, like "iodine" for "I". Hence I ignore words of 3 characters or fewer.
Some synonyms are composed of several words separated by an underscore ('_'), which is why I replace this character with a space.
Here is my function generating new sentences by doing one-word replacements:
def create_set_of_new_sentences(sentence, max_syn_per_word=6):
    new_sentences = []
    for word in word_tokenize(sentence):
        # skip short words: their synonyms are rarely adequate in context
        if len(word) <= 3:
            continue
        for synonym in find_synonyms(word)[0:max_syn_per_word]:
            synonym = synonym.replace('_', ' ')  # restore space character
            new_sentence = sentence.replace(word, synonym)
            new_sentences.append(new_sentence)
    return new_sentences
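One caveat with the function above: str.replace works on substrings, so replacing "sign" also rewrites the start of "signature". Here is a minimal sketch of a whole-word variant, assuming a regular-expression word boundary is good enough for your text. The helper name replace_whole_word is my own invention for illustration and is not part of NLTK.

```python
import re

def replace_whole_word(sentence, word, synonym):
    """Replace `word` only where it stands alone, not inside a longer word."""
    # re.escape keeps punctuation in the token from being read as regex syntax
    pattern = r'\b' + re.escape(word) + r'\b'
    # a function replacement inserts the synonym literally (no backslash escapes)
    return re.sub(pattern, lambda match: synonym, sentence)

sentence = "Please sign the signature page"
print(sentence.replace("sign", "pledge"))              # Please pledge the pledgeature page
print(replace_whole_word(sentence, "sign", "pledge"))  # Please pledge the signature page
```

To try it, swap sentence.replace(word, synonym) for replace_whole_word(sentence, word, synonym) in the function above.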
For those interested in how to merge the original data with the generated data, here is the function I wrote for that.
The argument 'column' specifies which field of your dataframe you want to augment:
import pandas as pd

def data_augment_synonym_replacement(data, column='subject'):
    generated_rows = []
    for index in data.index:
        text_to_augment = data[column][index]
        for generated_sentence in create_set_of_new_sentences(text_to_augment):
            # copy the original row and swap in the generated sentence
            new_entry = data.loc[[index]].copy()
            new_entry[column] = generated_sentence
            generated_rows.append(new_entry)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    if generated_rows:
        generated_data = pd.concat(generated_rows)
    else:
        generated_data = pd.DataFrame(columns=data.columns)
    generated_data_df = generated_data.drop_duplicates()
    augmented_data = pd.concat([data, generated_data_df], ignore_index=True)
    return augmented_data
My original dataset lacked data points where isAccount is False (only 30 lines!). By applying this data augmentation method I now have 298 emails of this class, which multiplies the number of data points by roughly 10.
I noticed that this scales down the impact of emails incorrectly marked as written in English, because wordnet doesn't give synonyms for non-English words. Hence these data points are not augmented.
My method doesn't ensure that the structure of the sentence is preserved. For example, a verb can be replaced by a noun.
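One possible way to reduce this, which I have not tested on my data, is to ask wordnet only for synsets with a given part of speech. wordnet.synsets accepts a pos argument ('n', 'v', 'a' or 'r'). The sketch below assumes you already know the part of speech of the word you are replacing (for example from a tagger, not shown here). The function name and the synsets_fn parameter are hypothetical: synsets_fn only exists so the sketch can be tried with a stand-in lookup, without downloading wordnet.

```python
from collections import OrderedDict

def find_synonyms_by_pos(word, pos, synsets_fn=None):
    """Return synonyms of `word` restricted to one WordNet part of speech.

    pos: 'n' (noun), 'v' (verb), 'a' (adjective) or 'r' (adverb).
    synsets_fn: hypothetical hook for a stand-in lookup; defaults to
    nltk's wordnet.synsets, which needs nltk.download('wordnet').
    """
    if synsets_fn is None:
        from nltk.corpus import wordnet
        synsets_fn = wordnet.synsets
    synonyms = []
    for synset in synsets_fn(word, pos=pos):
        synonyms.extend(synset.lemma_names())
    # drop duplicates while keeping the order returned by the lookup
    return list(OrderedDict.fromkeys(synonyms))
```

With wordnet downloaded, find_synonyms_by_pos("subscribe", "v") only looks at verb synsets, so noun senses of a word are never offered as replacements for a verb.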
I haven't implemented a maximum number of sentences generated for each data point, so my method generates more data for longer sentences. This may cause overfitting.
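A cap is easy to bolt on. The following is only a sketch of one way to do it, by randomly sampling from the generated sentences. The function name, the default of 10 and the seed parameter are all assumptions of mine, not values I tuned.

```python
import random

def cap_generated_sentences(new_sentences, max_per_datapoint=10, seed=None):
    """Keep at most `max_per_datapoint` distinct generated sentences."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = list(dict.fromkeys(new_sentences))
    if len(unique) <= max_per_datapoint:
        return unique
    # sample without replacement so long sentences don't dominate the new data
    rng = random.Random(seed)
    return rng.sample(unique, max_per_datapoint)
```

It can be applied to the output of create_set_of_new_sentences before the generated rows are built, so every original email contributes at most the same number of new data points.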
While looking for tools to perform data augmentation, I found TextAttack, defined by its authors as a Python framework for adversarial attacks and data augmentation in NLP.
I ran into compatibility errors when trying to use it in Google Colab, but it looks promising and worth looking into.
Taken from their documentation, here is the basic code to get it running:
!pip install textattack -q
from textattack.augmentation import WordNetAugmenter
augmenter = WordNetAugmenter()
s = 'What I cannot create, I do not understand.'
augmenter.augment(s)
The results seem similar to what I have done with wordnet: far from perfect but usable.
augmenter.augment(s) returns a big list. The best result in this list is 'What I cannot create, I do not comprehend.', but we see that some meaning is lost in others, for example: 'What I cannot creating, I do not understand.'
I hope this post will help someone to better understand data augmentation for text data.
If you have any feedback to give, I'd be grateful if you took a few minutes to comment!
I'm especially interested in ways to find synonyms in languages other than English.