Processing Hindi text with spaCy

#machinelearning #python

Note: I understand that this post can be hard to follow for non-Hindi readers, so I have included an English translation after each Hindi word.

Tons of resources are available for processing English text (and text in most other Latin-script languages), but not so many for other languages. In this post, we will explore how we can use spaCy for processing Hindi text.

Here we will use the spaCy library for processing and indic-nlp-datasets for getting the data. We will use text from the novel Devdas by Sharat Chandra to demonstrate common NLP tasks.

Let's install these two libraries.

pip install spacy 
pip install indic-nlp-datasets

from idatasets.devdas import load_devdas

devdas = load_devdas()
# devdas.data is a generator of paragraphs
paragraphs = list(devdas.data)
text = " ".join(paragraphs)
words = text.split(" ")

So words now holds a list of all the words in the novel.
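Splitting on a single space keeps punctuation attached to words and can leave empty strings behind. A minimal sketch of a slightly more careful split is shown here; the helper name simple_tokenize, the punctuation list, and the sample sentence are assumptions made for illustration and are not part of the original post or of spaCy.

```python
# Punctuation to trim from the ends of each token (includes the Hindi danda "।").
# This list is an assumption; extend it for your own text.
PUNCTUATION = "।॥,.-?!:;\"'()"


def simple_tokenize(text):
    """Split on any whitespace and strip punctuation from both ends of each token."""
    stripped = (token.strip(PUNCTUATION) for token in text.split())
    # Drop tokens that were nothing but punctuation.
    return [token for token in stripped if token]


# Made-up sample sentence, not taken from the novel.
sample = "देवदास ने कहा- नहीं,  पार्वती।"
tokens = simple_tokenize(sample)
print(tokens)
```

To try it on the novel, words = simple_tokenize(text) could stand in for text.split(" "); the counts would then differ somewhat from the ones shown in this post.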

from collections import Counter 
cnt = Counter(words)

cnt.most_common(10)
# prints
# [('के', 696), // of
#  ('ने', 676), // (ergative case marker)
#  ('नही', 672), // not
#  ('से', 626), // from, with
#  ('मे', 562), // in
#  ('की', 480), // of (feminine)
#  ('है', 444), // is
#  ('देवदास', 437), // Devdas
#  ('को', 336), // to
#  ('पार्वती', 332)] // Parvati

We see that the top words are not especially meaningful; they are mostly connectors and postpositions. Let's use spaCy's Hindi stop word list to get rid of those.

from spacy.lang.hi import STOP_WORDS as STOP_WORDS_HI

# STOP_WORDS_HI is already a set, so membership checks are fast
non_stop_words = [word for word in words if word not in STOP_WORDS_HI]

non_stop_cnt = Counter(non_stop_words)

non_stop_cnt.most_common(10)

# prints 
# [('नही', 782), // not
#  ('देवदास', 472), // Devdas 
#  ('कहा-', 390), // said
#  ('पार्वती', 345), // Parvati
#  ('क्या', 237), // what 
#  ('दिन', 187), // day 
#  ('बात', 168), // talk, matter
#  ('तुम', 168), // you
#  ('मै', 160), // I
#  ('चन्द्रमुखी', 154)] // Chandramukhi

Now we see more interesting words among the most common ones. Three of these 10 words ('देवदास', 'पार्वती', 'चन्द्रमुखी') [Devdas, Parvati, Chandramukhi] correspond to the three main characters around whom the whole love-triangle story revolves.

Printing the most common words is great, but it isn't enough to justify a cushy data scientist job. :D So let's make it prettier using WordCloud (from the wordcloud package; the plot also needs matplotlib).

from wordcloud import WordCloud

import matplotlib.pyplot as plt

wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This gives us the plot below.

Wait, where have all the words gone?

After a bit of googling, I found the GitHub issue below, which explains that we need a Devanagari font to render the image correctly.
https://github.com/amueller/word_cloud/issues/70
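Before passing a font to WordCloud, it can help to confirm that the file is actually there. The sketch below is only an illustration: the helper name first_existing_font is made up for this example, and the candidate paths are assumptions that will differ from system to system.

```python
from pathlib import Path

# Hypothetical candidate locations for a Devanagari font; adjust for your system.
CANDIDATE_FONTS = [
    "gargi.ttf",
    "/usr/share/fonts/truetype/Gargi/Gargi.ttf",
]


def first_existing_font(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


found_font = first_existing_font(CANDIDATE_FONTS)
print(found_font or "No Devanagari font found; download one and add its path to the list.")
```

Any TrueType font that covers Devanagari should do; Gargi is simply the one used in this post.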

So we modify the code to pass a custom font file:


font = "gargi.ttf"  # path to a Devanagari font file (Gargi here)

wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
    font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This yields the image below.

You may notice that WordCloud now renders the Hindi letters, but the image doesn't contain the most frequent words that we saw before. Also, none of the vowel signs ("मात्रा", matra) appear. So what's happening here?

The issue below explains that the "\w+" regex pattern doesn't work as expected for languages other than English: in Hindi it does not match the vowel signs, so words get broken apart. An easy workaround is to pass our own regex that matches all Hindi letters, including the vowel signs.
https://github.com/amueller/word_cloud/issues/272
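The effect can be seen with Python's built-in re module alone. This is a small sketch, assuming the standard re module rather than WordCloud's own tokenization code, and the sample word is chosen just for illustration.

```python
import re

word = "नहीं"  # "not": न + ह + vowel sign ी + anusvara ं

# \w matches the consonants but not the combining vowel signs,
# so the word is broken into pieces.
word_tokens = re.findall(r"\w+", word)
print(word_tokens)

# A character class over the Devanagari Unicode block (U+0900 to U+097F)
# also matches the vowel signs, so the word stays in one piece.
devanagari_tokens = re.findall(r"[\u0900-\u097F]+", word)
print(devanagari_tokens)
```

Note that this block also contains the danda (।) and the Devanagari digits, so those will be matched as well.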

So let's fix that:


wordcloud = WordCloud(
    width=400,
    height=300,
    max_font_size=50, 
    max_words=1000,
    background_color="white", 
    stopwords=STOP_WORDS_HI,
    regexp=r"[\u0900-\u097F]+", 
    font_path=font
).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

This yields the image below.

This looks all right. A few things to note here:

The names of all the prominent characters show up in the word cloud.
The word "नहीं" (not) appears a lot, which suggests that the characters are often not in agreement with each other.

Next up, we will talk about how you can do some other tasks, such as part-of-speech analysis and automatically finding the names of characters, cities, and organizations in a sentence.

Hope you enjoyed reading it.
If you want to play around with it in Colab, check out the link below.

Top comments (4)

Akshay • Aug 21 '20

अब देवदास को चंद्रमुखी मिल जाएगी। बस शब्द संख्या बढ़ानी होगी। 😊 (Now Devdas will get Chandramukhi. We just need to increase the word count.)

Good stuff!

SM • Jan 3 '21

This is great stuff. This helped me a lot. One problem still persists: कितनी is rendered as कतिनी, किन्तु as कनितु, etc. The इ मात्रा (the "i" vowel sign) is placed on the next letter when the library draws the text on the image.