DEV Community

Rahul Gupta

Processing Hindi text with spaCy(2): Finding Synonyms

In this post, we will explore word embeddings and how we can use them to determine similarity between words, sentences and documents.

So, let's use spaCy to convert raw text into spaCy docs/tokens and look at the vector embeddings.

from spacy.lang.hi import Hindi 
nlp = Hindi()
sent1 = 'मुझे भोजन पसंद है।'  # "I like food."
doc = nlp(sent1)
doc[0].vector
# array([], dtype=float32)

Oops! There is no vector corresponding to the token: spaCy's blank Hindi model ships without word embeddings. Luckily, pretrained Hindi embeddings are available online from Facebook's fastText project, so we will download them and load them into spaCy.

import requests

url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz"
fpath = url.split("/")[-1]
# Stream the download so the ~1 GB file is not held in memory all at once
with requests.get(url, stream=True, allow_redirects=True) as r:
    r.raise_for_status()
    with open(fpath, "wb") as fw:
        for chunk in r.iter_content(chunk_size=1 << 20):
            fw.write(chunk)

The word-vector file is about 1 GB in size, so it will take some time to download.
Let's see how we can use external word embeddings in spaCy.
The spaCy documentation describes how to do this: https://spacy.io/usage/vectors-similarity#converting
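Before converting, it helps to know what the file contains. The `.vec` format is plain text: the first line holds the vocabulary size and the vector dimension, followed by one word per line with its components. Here is a small sketch to peek at the header of the gzipped file (the helper name is my own):

```python
import gzip

def read_vec_header(path):
    # fastText .vec files are plain text: the first line holds
    # "<vocabulary_size> <vector_dimension>", then one word per line.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        n_words, n_dims = f.readline().split()
    return int(n_words), int(n_dims)

# Usage (after the download above):
# read_vec_header("cc.hi.300.vec.gz")  # (vocabulary size, 300)
```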

Once the word vectors are downloaded, let's load them into a spaCy model on the command line:

python -m spacy init-model hi ./hi_vectors_wiki_lg --vectors-loc cc.hi.300.vec.gz

Note: this is the spaCy v2 command. In spaCy v3, init-model was replaced by init vectors, so the equivalent would be python -m spacy init vectors hi cc.hi.300.vec.gz ./hi_vectors_wiki_lg

Now let's load the model in spaCy and put it to work:
import spacy

nlp_hi = spacy.load("./hi_vectors_wiki_lg")
doc = nlp_hi(sent1)
doc[0].vector

Now we see that the vector is available in spaCy. Let's use these embeddings to determine the similarity of two sentences, starting with two very similar ones:

sent2 = 'मैं ऐसे भोजन की सराहना करता हूं जिसका स्वाद अच्छा हो।'  # "I appreciate food that tastes good."
doc1 = nlp_hi(sent1)
doc2 = nlp_hi(sent2)

# sent1 and sent2 are very similar, so we expect a high similarity score
doc1.similarity(doc2) # prints 0.86
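Under the hood, similarity is the cosine similarity between the two documents' vectors (a document's vector is the average of its token vectors). A minimal pure-Python sketch of the formula:

```python
import math

def cosine_similarity(v1, v2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0
```

A score of 0.86 therefore means the averaged vectors of the two sentences point in nearly the same direction.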

Now, let's use these embeddings to find synonyms of a word.

def get_similar_words(word, topn=10):
  vector = word.vector
  # most_similar returns a tuple of (keys, best_rows, scores);
  # it defaults to a single result, so ask for topn neighbours explicitly
  keys, _, _ = nlp_hi.vocab.vectors.most_similar(vector.reshape(1, 300), n=topn)
  return [nlp_hi.vocab.strings[key] for key in keys[0]]

get_similar_words(doc[1])  # nearest neighbours of 'भोजन' in vector space

That's not very useful for finding synonyms.
Nearest neighbours in embedding space tend to be related or co-occurring words and inflected forms rather than true synonyms.
Let's look into the nltk library to see if we can use a Hindi WordNet to find similar words. However, the NLTK documentation mentions that it doesn't support the Hindi language yet. So, the search continues.

After a bit of googling, I found out that a research group at IITB has been developing WordNets for Indian languages for quite a while.
Check out this link for more details.
They published a Python library, pyiwn, for easy access. It hasn't been added to NLTK yet because the coverage of Hindi synsets isn't sufficient for integration.
With that, let's install the library:

pip install pyiwn

and try it out:

import pyiwn
iwn = pyiwn.IndoWordNet(lang=pyiwn.Language.HINDI)
aam_all_synsets = iwn.synsets('आम') # Mango
aam_all_synsets
aam_all_synsets = iwn.synsets('आम') # Mango
aam_all_synsets

# [Synset('कच्चा.adjective.2283'),
# Synset('अधपका.adjective.2697'),
# Synset('आम.noun.3462'),
# Synset('आम.noun.3463'),
# Synset('सामान्य.adjective.3468'),
# Synset('सामूहिक.adjective.3469'),
# Synset('आँव.noun.6253'),
# Synset('आँव.noun.8446'),
# Synset('आम.adjective.39736')]

It's very interesting to see that the synsets include both meanings of the word: mango and common. Let's pick one synset and look at the different synonyms within it.

aam = aam_all_synsets[2]

# Let's look at the definition
aam.gloss()
# prints 'एक फल जो खाया या चूसा जाता है' ("a fruit that is eaten or sucked")

# This will print examples where the word is being used
aam.examples()
# ['तोता पेड़ पर बैठकर आम खा रहा है ।',
# 'शास्त्रों ने आम को इंद्रासनी फल की संज्ञा दी है ।']

# Now, let's look at the synonyms for the word 
aam.lemma_names()
# ['आम',
# 'आम्र',
# 'अंब',
# 'अम्ब',
# 'आँब',
# 'आंब',
# 'रसाल',
# 'च्यूत',
# 'प्रियांबु',
# 'प्रियाम्बु',
# 'केशवायुध',
# 'कामायुध',
# 'कामशर',
# 'कामांग']
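The calls above can be wrapped into a small helper that collects every lemma across all synsets of a word. This is a sketch assuming the pyiwn objects behave as shown above; the helper name is my own:

```python
def get_synonyms(iwn, word):
    # Gather unique lemma names from every synset of `word`,
    # skipping the query word itself and preserving order.
    synonyms = []
    for synset in iwn.synsets(word):
        for lemma in synset.lemma_names():
            if lemma != word and lemma not in synonyms:
                synonyms.append(lemma)
    return synonyms

# Usage with the IndoWordNet instance created earlier:
# get_synonyms(iwn, 'आम')
```

Note that this mixes lemmas from all senses of the word (mango, common, etc.); filtering by part of speech or picking a specific synset first gives cleaner results.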

Let's print some hyponyms of our synset.
A is a hyponym of B if A is a type of B. For example, a pigeon is a bird, so "pigeon" is a hyponym of "bird".

iwn.synset_relation(aam, pyiwn.SynsetRelations.HYPONYMY)[:5]
# [Synset('सफेदा.noun.1294'),
# Synset('अंबिया.noun.2888'),
# Synset('सिंदूरिया.noun.8636'),
# Synset('जरदालू.noun.4724'),
# Synset('तोतापरी.noun.6892')]

Conclusion

Now that we have played around with WordNet for a while, let's recap what a WordNet is. A WordNet stores the meanings of words along with the relationships between them. So, in a sense, WordNet = language dictionary + thesaurus + hierarchical IS-A relationships for nouns + more.

Note: If you want to play around with the notebooks, you can click the link below

Open word-embeddings-with-spacy in Colab

Open synonyms-with-pyiwn in Colab

Top comments (2)

asli0dude

python -m spacy init-model hi ./hi_vectors_wiki_lg --vectors-loc cc.hi.300.vec.gz

This line gives an error in Google Colab. I tried changing init-model to init, but then it gives the error "No such command hi". What should I do?

Chayan Dhaddha

How can I find antonyms for a given word in Hindi?