This article was originally written on the Programmer Backpack blog.
A Word Cloud or a Tag Cloud is a data visualization technique where words from a given text are displayed in a chart, with the more important words being written with bigger, bold fonts, while less important words are displayed with smaller, thinner fonts or not displayed at all.
Lots of Natural Language Processing projects can benefit from the use of Word Clouds because this tool can give you a quick glance at what the text you are analyzing is about and because it can help you, as a developer, or your audience, to understand the context of your dataset.
Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
In today's project we are going to download the text from a Wikipedia page, and then generate a word cloud and play with it for a bit - changing colors, removing stopwords and saving the wordcloud to a file.
We need to install a few packages before we begin.
pip3 install wikipedia pip3 install wordcloud
We will also need matplotlib, but this is installed automatically by wordcloud, if you don't have on your local.
Since we've installed our packages, we might as well import them into our project.
import wikipedia from wordcloud import WordCloud, STOPWORDS import matplotlib.pyplot as plt
Now we are going to use the wikipedia package to download the text content. All we need for this is to provide the title of the page. We are going to fetch the text for the Natural Language Processing page on Wikipedia.
page = wikipedia.page("Natural Language Processing") text = page.content
Next we can use this text to generate and display the word cloud.
cloud = WordCloud().generate(text) plt.figure(figsize=(8, 8), facecolor=None) plt.imshow(cloud, interpolation="bilinear") plt.axis("off") plt.tight_layout(pad=0) plt.show()
Understanding this word cloud is very intuitive, you can observe that these words are very common when we read something about Natural Language Processing. Let's see another example, this time with the article about Machine Learning.
Again, very obvious results, but we can use this nice technique of representation to make our audience understand a text or a dataset, in the case of a Natural Language Processing task.
You can do lots of personalization on a word cloud: changing fonts, changing colors and the way they are displayed. For more options, you can always check the documentation of the package.
To demonstrate how easy it is, let's at least change the background color. For that, we only need to add a new parameter to the WordCloud constructor.
cloud = WordCloud(background_color='white').generate(text)
Stop words are words that are very frequently used in a given language and add absolutely no meaning to the text(because other texts will also contain these words, a stop word does not tell us anything about our text corpus).
The wordcloud package already removes the most common stopwords for us. But if you think that your text corpus has some stop words that are not removed, you can always append new words that will be removed. Here's how:
# Additional import here from wordcloud import WordCloud, STOPWORDS ########### stopwords = set(STOPWORDS) stopwords.update(["stopword1", "stopword2"]) cloud = WordCloud(stopwords=stopwords, background_color='white').generate(text)
We can also save our word cloud with a single line of code.
That was quick! We've generated our word cloud in very few lines of code and we can use this to better understand our data whenever we work on some Natural Language Processing project.
Thank you so much for reading this article! Interested in more? Follow me on Twitter at @b_dmarius and I'll post there every new article.