
Geazi Anc

PySpark: A brief analysis of the most common words in Dracula, by Bram Stoker

Note: this article is also available in Portuguese 🌎.

A landmark in Gothic literature, the iconic novel Dracula, written by Bram Stoker in 1897, stirs the emotions of people across the world. Today, to introduce Spark's new concepts and features, we will develop a brief notebook to analyze the most common words in this classic book 🧛🏼‍♂️.

To do this, we will write a notebook in Google Colab, a cloud service built by Google to encourage machine learning and artificial intelligence research.

This notebook is also available on my GitHub 😉.

The novel was obtained through Project Gutenberg, a digital library that gathers public-domain books from around the world.

Before getting started

Before we start, we need to install the PySpark library.

PySpark is the official Python API for Apache Spark. We will develop our data analysis using it 🎲.

So, create a new code cell in Colab and add the following line:

!pip install pyspark

Step one: running Apache Spark

After the installation is complete, we need to run Apache Spark. To do this, create a new code cell and add the following code block:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("The top most common words in Dracula, by Bram Stoker")
         .getOrCreate()
         )

Step two: downloading and reading

In this step, we will download the novel from Project Gutenberg so that we can then load it using PySpark.

We will use the wget tool for this, passing it the book's URL and saving the file in the local directory as Dracula - Bram Stoker.txt.

Again, create a new code cell in Colab and add the following code line:

!wget https://www.gutenberg.org/cache/epub/345/pg345.txt -O "Dracula - Bram Stoker.txt"

Step three: downloading stopwords

In this section, we will download a list of English stopwords. Stopwords normally include prepositions, particles, interjections, conjunctions, adverbs, pronouns, introductory words, the digits 0 to 9, other frequently used function words, symbols, and punctuation. More recently, such lists have been extended with character sequences that are common on the Internet, such as www, com, and http.

This list was obtained through CountWordsFree, a website that centralizes the stopwords used in many languages across the world.

Let's get to work! Create a new code cell in Colab and add the following line:

!wget https://countwordsfree.com/stopwords/english/txt -O "stop_words_english.txt"

After that, let’s load the book using Spark. Create a new code cell and add the following code block:

book = spark.read.text("Dracula - Bram Stoker.txt")

And let’s load the stopwords as well. They will be stored in a list, in the stopwords variable.

with open("stop_words_english.txt", "r") as f:
    text = f.read()
    stopwords = text.splitlines()

len(stopwords), stopwords[:15]

Output

(851,
 ['able',
  'about',
  'above',
  'abroad',
  'according',
  'accordingly',
  'across',
  'actually',
  'adj',
  'after',
  'afterwards',
  'again',
  'against',
  'ago',
  'ahead'])
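
If the downloaded list ever contains blank lines, stray whitespace, or mixed case, a slightly more defensive loader may help. This is only a minimal sketch, not the code used in the notebook: load_stopwords is a hypothetical helper name, and it assumes the file is UTF-8 encoded with one stopword per line.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line, dropping blank lines and normalizing case.

    Hypothetical helper for illustration; assumes a UTF-8 text file.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip().lower() for line in lines if line.strip()]


# Example call (assumes the file downloaded above exists):
# stopwords = load_stopwords("stop_words_english.txt")
```

Lowercasing the list matters because the words from the book are converted to lowercase before they are compared with it.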

Step four: extracting words

Once the book is loaded, we need to extract its words into a DataFrame column.

To do this, we apply the split function to each line, splitting it on the blank spaces between words. The result is a list of words for each line.

from pyspark.sql.functions import split

lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)

Output

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[most, other, par...|
|[whatsoever., You...|
+--------------------+
only showing top 5 rows
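
To see why some rows come out looking empty, here is a plain-Python sketch of the same idea. It is not Spark code, and the sample lines are made up for illustration. Splitting on a single space turns a blank line into a list holding one empty string, and two consecutive spaces leave an empty token in the middle of the list.

```python
# Plain-Python illustration of splitting on a single space (no Spark needed).
sample_lines = [
    "The Project Gutenberg eBook",  # ordinary line
    "",                             # blank line
    "two  spaces",                  # two consecutive spaces
]

# One list of tokens per line, like the "line" column above.
tokens_per_line = [line.split(" ") for line in sample_lines]

for tokens in tokens_per_line:
    print(tokens)
```

These empty tokens are harmless at this point, but they have to be filtered out before counting.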

Step five: exploding the lists of words

Now, let’s turn each list of words into individual rows of a DataFrame column, using the explode function.

from pyspark.sql.functions import explode, col

words = lines.select(explode(col("line")).alias("word"))
words.show(15)

Output

+---------+
|     word|
+---------+
|      The|
|  Project|
|Gutenberg|
|    eBook|
|       of|
| Dracula,|
|       by|
|     Bram|
|   Stoker|
|         |
|     This|
|    eBook|
|       is|
|      for|
|      the|
+---------+
only showing top 15 rows
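
Conceptually, exploding a column of lists is close to flattening a nested list. The following plain-Python sketch (again, not Spark code, with hypothetical sample data) shows the shape of the transformation: every element of every list becomes its own item.

```python
from itertools import chain

# Hypothetical sample: one list of tokens per line, as produced by the split step.
lines = [["The", "Project", "Gutenberg"], [""], ["This", "eBook", "is"]]

# Flatten the nested lists so that every token becomes its own item,
# which is roughly what explode does with one row per array element.
words = list(chain.from_iterable(lines))

print(words)
```

The empty string that came from the blank line survives the flattening, which matches the blank row in the output above.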

Step six: words to lowercase

This is a simple step. We don't want the same word to be counted separately because of capital letters, so we convert the words to lowercase using the lower function.

from pyspark.sql.functions import lower

words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()

Output

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|  dracula,|
|        by|
|      bram|
|    stoker|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
|  anywhere|
|        in|
+----------+
only showing top 20 rows

Step seven: removing punctuation

So that the same word is not counted separately because of trailing punctuation, we need to remove that punctuation.

We'll do this using the regexp_extract function, which extracts the first part of a string that matches a regex.

from pyspark.sql.functions import regexp_extract

words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

words_clean.show()

Output

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
+---------+
only showing top 20 rows
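
Because only the first match of [a-z]+ is kept, it is worth checking what happens to tokens that contain apostrophes, hyphens, or digits. The sketch below mimics that behavior with Python's re module. It is an approximation for illustration: first_letters is a hypothetical helper, and Spark evaluates the pattern with Java's regex engine, although for a pattern this simple the results should agree.

```python
import re


def first_letters(token):
    """Return the first run of lowercase ASCII letters in token, or "" if none."""
    match = re.search(r"[a-z]+", token)
    return match.group(0) if match else ""


samples = ["dracula,", "don't", "well-known", "1897", ""]
cleaned = [first_letters(token) for token in samples]

print(cleaned)  # ['dracula', 'don', 'well', '', '']
```

So contractions and hyphenated words are cut at the first non-letter, and tokens without any letters become empty strings.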

Step eight: removing null values

However, as you can see, there are still "null" values; strictly speaking, they are empty strings.

We need to remove them so that these blank values are not analyzed.

words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()

Output

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
|      the|
+---------+
only showing top 20 rows

Step nine: removing stopwords

We are almost there! The last step is to remove the stopwords so that, again, these words are not analyzed.

words_without_stopwords = words_nonull.filter(
    ~words_nonull.word.isin(stopwords))

words_count_before_removing = words_nonull.count()
words_count_after_removing = words_without_stopwords.count()

words_count_before_removing, words_count_after_removing

Output

(163399, 50222)
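
Using the two counts reported above, a quick calculation shows how much of the text the stopword filter removed. The variable names here are only illustrative.

```python
# Counts taken from the output above.
words_before = 163399
words_after = 50222

removed = words_before - words_after
removed_share = removed / words_before

print(removed)                 # 113177
print(f"{removed_share:.1%}")  # 69.3%
```

In other words, roughly two out of every three words in the book were stopwords.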

Step ten: analyzing the most common words in Dracula, finally!

And, finally, our data is completely cleaned. Now we can analyze the most common words in our book.

First, we’ll group the words and then use an aggregate function to count them.

words_count = (words_without_stopwords.groupby("word")
               .count()
               .orderBy("count", ascending=False)
               )

Then, we show the 20 most common words. This number can be changed through the rank variable.

rank = 20
words_count.show(rank)

Output

+--------+-----+
|    word|count|
+--------+-----+
|    time|  381|
| helsing|  323|
|     van|  322|
|    lucy|  297|
|    good|  256|
|     man|  255|
|    mina|  240|
|    dear|  224|
|   night|  224|
|    hand|  209|
|    room|  207|
|    face|  206|
|jonathan|  206|
|   count|  197|
|    door|  197|
|   sleep|  192|
|    poor|  191|
|    eyes|  188|
|    work|  188|
|      dr|  187|
+--------+-----+
only showing top 20 rows
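
As a sanity check on a small sample, the whole pipeline can be approximated in plain Python with collections.Counter. This is a sketch under stated assumptions, not a replacement for the Spark code: top_words is a hypothetical function name, the sample sentences are invented, and the whole input is processed in memory on a single machine.

```python
import re
from collections import Counter


def top_words(lines, stopwords, rank=20):
    """Count the most common non-stopword tokens in an iterable of text lines."""
    stop = set(stopwords)
    counts = Counter()
    for line in lines:
        for token in line.split(" "):                    # steps four and five
            match = re.search(r"[a-z]+", token.lower())  # steps six and seven
            word = match.group(0) if match else ""
            if word and word not in stop:                # steps eight and nine
                counts[word] += 1
    return counts.most_common(rank)                      # step ten


sample = ["The Count, the Count!", "", "Lucy and the count"]
result = top_words(sample, ["the", "and"], rank=2)

print(result)  # [('count', 3), ('lucy', 1)]
```

Comparing the output of a sketch like this with the Spark result on the same few lines is a cheap way to confirm that each cleaning step behaves as expected.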

Conclusion

That’s all for now, folks! In this article, we analyzed the most common words in Dracula, written by Bram Stoker. To do this, we cleaned the words by removing punctuation, converting uppercase letters to lowercase, and removing stopwords.

I hope you enjoyed it. Keep those stakes sharp, watch out for the shadows that walk at night, and see you next time 🧛🏼‍♂️🍷.

Bibliography

RIOUX, Jonathan. Data Analysis with Python and PySpark.

STOKER, Bram. Dracula.
