DEV Community

Rahul Gupta

Processing Hindi text with spaCy(2): Finding Synonyms

In this post, we will explore word embeddings and how we can use them to determine similarity between words, sentences and documents.

So, let's use spaCy to convert raw text into spaCy docs/tokens and look at the vector embeddings.

from spacy.lang.hi import Hindi 
nlp = Hindi()
sent1 = 'मुझे भोजन पसंद है।'  # "I like food."
doc = nlp(sent1)
doc[0].vector
# array([], dtype=float32)

Oops! There is no vector corresponding to the token: spaCy's blank Hindi model ships without word embeddings. Luckily, pretrained Hindi embeddings are available online from Facebook's fastText project, so we will download them and load them into spaCy.

import requests

url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz"
fpath = url.split("/")[-1]
# Stream the download so the ~1 GB file is not held in memory all at once
with requests.get(url, stream=True, allow_redirects=True) as r:
    r.raise_for_status()
    with open(fpath, "wb") as fw:
        for chunk in r.iter_content(chunk_size=1 << 20):
            fw.write(chunk)

The word-vector file is about 1 GB in size, so it will take some time to download.
Let's see how we can use external word embeddings in spaCy.
The spaCy documentation describes how to do this: https://spacy.io/usage/vectors-similarity#converting
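Before converting, it helps to know what the file contains. The `.vec` format is plain text: the first line holds the vocabulary size and the vector dimension, followed by one word per line with its components. Here is a small sketch to peek at the header of the gzipped file (the helper name is my own):

```python
import gzip

def read_vec_header(path):
    # fastText .vec files are plain text: the first line holds
    # "<vocabulary_size> <vector_dimension>", then one word per line.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        n_words, n_dims = f.readline().split()
    return int(n_words), int(n_dims)

# Usage (after the download above):
# read_vec_header("cc.hi.300.vec.gz")  # (vocabulary size, 300)
```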

Once the word vectors are downloaded, let's load them into a spaCy model on the command line:

python -m spacy init-model hi ./hi_vectors_wiki_lg --vectors-loc cc.hi.300.vec.gz

Note: this is the spaCy v2 command. In spaCy v3, init-model was replaced by init vectors, so the equivalent would be python -m spacy init vectors hi cc.hi.300.vec.gz ./hi_vectors_wiki_lg

Now let's load the model in spaCy and put it to work:
import spacy

nlp_hi = spacy.load("./hi_vectors_wiki_lg")
doc = nlp_hi(sent1)
doc[0].vector

Now we see that the vector is available in spaCy. Let's use these embeddings to determine the similarity of two sentences, starting with two very similar ones:

sent2 = 'मैं ऐसे भोजन की सराहना करता हूं जिसका स्वाद अच्छा हो।'  # "I appreciate food that tastes good."
doc1 = nlp_hi(sent1)
doc2 = nlp_hi(sent2)

# sent1 and sent2 are very similar, so we expect a high similarity score
doc1.similarity(doc2) # prints 0.86
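Under the hood, similarity is the cosine similarity between the two documents' vectors (a document's vector is the average of its token vectors). A minimal pure-Python sketch of the formula:

```python
import math

def cosine_similarity(v1, v2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0
```

A score of 0.86 therefore means the averaged vectors of the two sentences point in nearly the same direction.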

Now, let's use these embeddings to find synonyms of a word.

def get_similar_words(word, topn=10):
  vector = word.vector
  # most_similar returns a tuple of (keys, best_rows, scores);
  # it defaults to a single result, so ask for topn neighbours explicitly
  keys, _, _ = nlp_hi.vocab.vectors.most_similar(vector.reshape(1, 300), n=topn)
  return [nlp_hi.vocab.strings[key] for key in keys[0]]

get_similar_words(doc[1])  # nearest neighbours of 'भोजन' in vector space

That's not very useful for finding synonyms.
Nearest neighbours in embedding space tend to be related or co-occurring words and inflected forms rather than true synonyms.
Let's look into the nltk library to see if we can use a Hindi WordNet to find similar words. However, the NLTK documentation mentions that it doesn't support the Hindi language yet. So, the search continues.

After a bit of googling, I found out that a research group at IITB has been developing WordNets for Indian languages for quite a while.
Check out this link for more details.
They published a Python library, pyiwn, for easy access. It hasn't been added to NLTK yet because the coverage of Hindi synsets isn't sufficient for integration.
With that, let's install the library:

pip install pyiwn

and try it out:

import pyiwn
iwn = pyiwn.IndoWordNet(lang=pyiwn.Language.HINDI)
aam_all_synsets = iwn.synsets('आम') # Mango
aam_all_synsets
aam_all_synsets = iwn.synsets('आम') # Mango
aam_all_synsets

# [Synset('कच्चा.adjective.2283'),
# Synset('अधपका.adjective.2697'),
# Synset('आम.noun.3462'),
# Synset('आम.noun.3463'),
# Synset('सामान्य.adjective.3468'),
# Synset('सामूहिक.adjective.3469'),
# Synset('आँव.noun.6253'),
# Synset('आँव.noun.8446'),
# Synset('आम.adjective.39736')]

It's very interesting to see that the synsets include both meanings of the word: mango and common. Let's pick one synset and look at the different synonyms within it.

aam = aam_all_synsets[2]

# Let's look at the definition
aam.gloss()
# prints 'एक फल जो खाया या चूसा जाता है' ("a fruit that is eaten or sucked")

# This will print examples where the word is being used
aam.examples()
# ['तोता पेड़ पर बैठकर आम खा रहा है ।',
# 'शास्त्रों ने आम को इंद्रासनी फल की संज्ञा दी है ।']

# Now, let's look at the synonyms for the word 
aam.lemma_names()
# ['आम',
# 'आम्र',
# 'अंब',
# 'अम्ब',
# 'आँब',
# 'आंब',
# 'रसाल',
# 'च्यूत',
# 'प्रियांबु',
# 'प्रियाम्बु',
# 'केशवायुध',
# 'कामायुध',
# 'कामशर',
# 'कामांग']
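The calls above can be wrapped into a small helper that collects every lemma across all synsets of a word. This is a sketch assuming the pyiwn objects behave as shown above; the helper name is my own:

```python
def get_synonyms(iwn, word):
    # Gather unique lemma names from every synset of `word`,
    # skipping the query word itself and preserving order.
    synonyms = []
    for synset in iwn.synsets(word):
        for lemma in synset.lemma_names():
            if lemma != word and lemma not in synonyms:
                synonyms.append(lemma)
    return synonyms

# Usage with the IndoWordNet instance created earlier:
# get_synonyms(iwn, 'आम')
```

Note that this mixes lemmas from all senses of the word (mango, common, etc.); filtering by part of speech or picking a specific synset first gives cleaner results.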

Let's print some hyponyms of our synset.
A is a hyponym of B if A is a type of B. For example, a pigeon is a bird, so "pigeon" is a hyponym of "bird".

iwn.synset_relation(aam, pyiwn.SynsetRelations.HYPONYMY)[:5]
# [Synset('सफेदा.noun.1294'),
# Synset('अंबिया.noun.2888'),
# Synset('सिंदूरिया.noun.8636'),
# Synset('जरदालू.noun.4724'),
# Synset('तोतापरी.noun.6892')]

Conclusion

Now that we have played around with WordNet for a while, let's recap what a WordNet is. A WordNet stores the meanings of words along with the relationships between them. So, in a sense, WordNet = language dictionary + thesaurus + hierarchical IS-A relationships for nouns + more.

Note: If you want to play around with the notebooks, you can click the link below

Open word-embeddings-with-spacy in Colab

Open synonyms-with-pyiwn in Colab

Top comments (2)

asli0dude

python -m spacy init-model hi ./hi_vectors_wiki_lg --vectors-loc cc.hi.300.vec.gz

This line gives an error in Google Colab. I tried changing init-model to init, but then it gives the error "No such command hi". What should I do?

Chayan Dhaddha

How can I find antonyms for a given word in Hindi?