Rahul Gupta

Processing Hindi text with SpaCy

Note: I understand that this post can be hard to follow for non-Hindi readers, so I have included an English translation after each Hindi word.

Tons of resources are available for processing English text (and text in most other languages written in the Roman script), but not so much for other languages. In this post, we will explore how we can use spaCy for processing Hindi text.

Here we will use the spaCy library for processing and indic-nlp-datasets for getting the data. We will use the text of the novel Devdas by Sharat Chandra to demonstrate common NLP tasks.

Let's install these two libraries.

pip install spacy
pip install indic-nlp-datasets

from idatasets.devdas import load_devdas

devdas = load_devdas()
# devdas.data is a generator of paragraphs
paragraphs = list(devdas.data)
text = " ".join(paragraphs)
words = text.split(" ")

So, words now holds a list of all the words in the novel.
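One caveat with splitting on a single space: if the text contains double spaces or newlines, the result can include empty strings and tokens with a newline inside them. Whether the Devdas text actually has these is an assumption I have not checked; the toy string below is made up purely to show the difference.

```python
# Toy string (not from the novel) with a double space and a newline.
sample = "देवदास ने  कहा\nपार्वती"

# Splitting on a single space keeps an empty string for the double space
# and leaves the newline inside one of the tokens.
by_space = sample.split(" ")

# Splitting with no argument splits on any run of whitespace.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

If the counts look off, switching to text.split() is a cheap thing to try. Note that it can change the exact numbers shown in this post.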

from collections import Counter 
cnt = Counter(words)

cnt.most_common(10)
# prints
# [('के', 696), // of
#  ('ने', 676), // (ergative case marker)
#  ('नही', 672), // not
#  ('से', 626), // from, by
#  ('मे', 562), // in
#  ('की', 480), // of
#  ('है', 444), // is
#  ('देवदास', 437), // Devdas
#  ('को', 336), // to
#  ('पार्वती', 332)] // Parvati

We see that the top words are not especially meaningful; they are mostly connectors and postpositions. Let's use spaCy's Hindi stop word list to get rid of them.

from spacy.lang.hi import STOP_WORDS as STOP_WORDS_HI

# STOP_WORDS_HI is already a set, so there is no need to rebuild it for every word
non_stop_words = [word for word in words if word not in STOP_WORDS_HI]

non_stop_cnt = Counter(non_stop_words)

non_stop_cnt.most_common(10)

# prints 
# [('नही', 782), // not
#  ('देवदास', 472), // Devdas 
#  ('कहा-', 390), // said
#  ('पार्वती', 345), // Parvati
#  ('क्या', 237), // what 
#  ('दिन', 187), // day 
#  ('बात', 168), // talk, matter
#  ('तुम', 168), // you
#  ('मै', 160), // I
#  ('चन्द्रमुखी', 154)] // Chandramukhi

Now we see more interesting words among the most common ones. Three of these 10 words ('देवदास', 'पार्वती', 'चन्द्रमुखी') [Devdas, Parvati, Chandramukhi] correspond to the three main characters around whom the whole love-triangle story revolves.
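Notice that 'कहा-' (said) shows up with a trailing hyphen, because splitting on spaces keeps punctuation attached to the words. A minimal sketch of one way to handle this is to strip punctuation from both ends of each token before counting. The PUNCT set and the normalize helper are my own illustrative names, not part of spaCy or indic-nlp-datasets, and the token list is a made-up sample.

```python
from collections import Counter

# Characters to strip from both ends of a token. This set is an assumption;
# extend it as needed. "।" is the danda, the Hindi full stop.
PUNCT = "-,.!?;:'\"()[]।"


def normalize(token):
    """Strip leading/trailing punctuation; returns '' if nothing is left."""
    return token.strip(PUNCT)


# Made-up sample tokens, not taken from the novel.
tokens = ["कहा-", "कहा", "देवदास,", "देवदास", "-", "पार्वती।"]

cleaned = [normalize(t) for t in tokens]
# Drop tokens that were pure punctuation.
counts = Counter(t for t in cleaned if t)

print(counts.most_common())
```

With this, 'कहा-' and 'कहा' are counted as the same word. Applying it to the novel would change the counts shown above.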

Printing the most common words is great, but it isn't enough to justify a cushy data scientist job. :D So, let's make it prettier using WordCloud.

from wordcloud import WordCloud

import matplotlib.pyplot as plt

wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This gives us the plot below.
[Image: word cloud rendered with the default font]
Wait, where have all the words gone?

After a bit of googling, I found the GitHub issue below, which explains that we need a Devanagari font to render the image correctly.
https://github.com/amueller/word_cloud/issues/70

So, we modify the code to use a custom font file.

# path to a font file that contains Devanagari glyphs
font = "gargi.ttf"

wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
    font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This yields the image below.
[Image: word cloud rendered with the Devanagari font]
You may notice that the word cloud now renders the Hindi letters, but it doesn't contain the most frequent words that we saw before. Also, it doesn't have any of the vowel signs ("मात्रा", matra). So, what's happening here?

The issue below explains that the "\w+" regex pattern doesn't work as expected for languages other than English. An easy workaround is to pass our own regex that matches all Hindi letters, including the vowel signs.
https://github.com/amueller/word_cloud/issues/272
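To see the problem in isolation, here is a small sketch using only Python's re module, outside of WordCloud. As far as I know, \w in Python matches letters and digits but not the Devanagari vowel signs or the virama (halant), so a word gets cut at every such mark. The sample string is made up for illustration.

```python
import re

text_sample = "देवदास पार्वती"

# \w does not match the vowel signs (matras) or the virama,
# so each word is broken into several pieces without them.
with_w = re.findall(r"\w+", text_sample)

# Matching the whole Devanagari Unicode block (U+0900 to U+097F)
# keeps each word in one piece.
with_block = re.findall(r"[\u0900-\u097F]+", text_sample)

print(with_w)
print(with_block)
```

Keep in mind that the U+0900 to U+097F range also covers the danda (।) and the Devanagari digits, so those will be matched as part of a "word" too.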

So, let's fix that.

wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
    regexp=r"[\u0900-\u097F]+", 
    font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This yields the image below.
[Image: word cloud generated with the custom regex and the Devanagari font]

This looks all right. A few things to note here.

  • The names of all the prominent characters show up in the word cloud.
  • The word "नहीं" (not) appears a lot, which suggests that the characters often disagree with each other.

Next up, we will talk about how you can do some other tasks, such as part-of-speech analysis and automatically finding the names of characters, cities, and organizations in a sentence.

Hope you enjoyed reading it.
If you want to play around with it in Colab, check out the link below.
Open In Colab

Top comments (4)

Akshay

अब देवदास को चंद्रमुखी मिल जाएगी। बस शब्द संख्या बढ़ानी होगी। 😊 [Now Devdas will get Chandramukhi. We just have to increase the word count.]

Good stuff!

SM

This is great stuff. This helped me a lot. One problem still persists: कितनी is rendered as कतिनी, किन्तु as कनितु, etc. The इ मात्रा (the vowel sign for "i") is placed on the next letter when the library draws the text on an image.

amananandrai

Thanks I was looking for some Hindi NLP and this post is a great help.

Rahul Gupta

Thanks @amananandrai
I am glad it was useful to you. I have included a Colaboratory link if you want to play around with it.