Geazi Anc

Posted on Jan 11, 2023

PySpark: A brief analysis of the most common words in Dracula, by Bram Stoker

#python #dataengineering #spark #datascience

Note: this article is also available in Portuguese 🌎.

A landmark in Gothic literature, the iconic novel Dracula, written by Bram Stoker in 1897, stirs the emotions of people across the world. Today, to introduce Spark's new concepts and features, we will develop a brief notebook to analyze the most common words in this classic book 🧛🏼‍♂️.

To do this, we will write a notebook in Google Colab, a cloud service built by Google to encourage machine learning and artificial intelligence research.

This notebook is also available on my GitHub 😉.

The novel was obtained from Project Gutenberg, a digital library that collects public domain books from around the world.

Before getting started

Before we start, we need to install the PySpark library.

PySpark is the official Python API for Apache Spark. We will use it to develop our data analysis 🎲.

So, create a new code cell in Colab and add the following line:

!pip install pyspark

Step one: running Apache Spark

After the installation is complete, we need to run Apache Spark. To do this, create a new code cell and add the following code block:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("The top most common words in Dracula, by Bram Stoker")
         .getOrCreate()
         )

Step two: downloading and reading

In this step, we will download the novel from Project Gutenberg and, after that, load it using PySpark.

We will use the wget tool for this, passing it the book's URL and saving the file in the local directory under the name Dracula - Bram Stoker.txt.

Again, create a new code cell in Colab and add the following code line:

!wget https://www.gutenberg.org/cache/epub/345/pg345.txt -O "Dracula - Bram Stoker.txt"

Step three: stopwords downloading

In this section, we will download a list of English stopwords. Stopwords are very frequent words that carry little meaning on their own: typically prepositions, particles, interjections, conjunctions, adverbs, pronouns, introductory words, the digits 0 to 9, other frequently used function words, symbols, and punctuation. More recently, such lists have also come to include character sequences that are common on the Internet, such as www, com, and http.

This list was obtained through CountWordsFree, a website that centralizes the stopwords used in many languages across the world.

Let's get to work! Create a new code cell in Colab and add the following code line:

!wget https://countwordsfree.com/stopwords/english/txt -O "stop_words_english.txt"

After that, let’s load the book using Spark. Create a new code cell and add the following code block:

book = spark.read.text("Dracula - Bram Stoker.txt")

And let's load the stopwords as well. They will be stored in a list, in the stopwords variable.

with open("stop_words_english.txt", "r") as f:
    text = f.read()
    stopwords = text.splitlines()

len(stopwords), stopwords[:15]

Output

(851,
 ['able',
  'about',
  'above',
  'abroad',
  'according',
  'accordingly',
  'across',
  'actually',
  'adj',
  'after',
  'afterwards',
  'again',
  'against',
  'ago',
  'ahead'])
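A downloaded word list can contain stray whitespace, blank lines, or duplicates, so it may be worth normalizing it before use. The sketch below illustrates the idea on a small made-up sample rather than the real file. The helper name clean_stopwords is hypothetical and not part of the original notebook.

```python
def clean_stopwords(raw_text):
    """Return a sorted list of unique, lowercase, non-empty stopwords."""
    # A set comprehension removes duplicates while stripping and lowercasing
    cleaned = {line.strip().lower() for line in raw_text.splitlines()}
    cleaned.discard("")  # drop the entry produced by blank lines, if any
    return sorted(cleaned)


# Made-up sample standing in for the contents of stop_words_english.txt
sample = "able\nabout\n\nAbove \nabout\n"
print(clean_stopwords(sample))
```

The result is still a plain list, so it could be passed to the rest of the notebook in the same way as the stopwords variable.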

Step four: extracting words

After the loading is complete, we need to extract the words into a dataframe column.

To do this, we apply the split function to each line, splitting it on the blank spaces between words. The result is a list of words for each line.

from pyspark.sql.functions import split

lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)

Output

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[most, other, par...|
|[whatsoever., You...|
+--------------------+
only showing top 5 rows
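The row shown as [] above comes from a blank line in the book: splitting an empty string on a single space still produces one empty-string token, which is why empty words show up later. The sketch below uses plain Python's str.split as a stand-in for the Spark function on two made-up inputs. It is an illustration of the behavior, not the notebook's own code.

```python
import re

blank_line = ""
double_space = "The  Project Gutenberg"  # note the two spaces after "The"

# A blank line still yields a list with one empty-string token
print(blank_line.split(" "))

# Consecutive spaces yield an empty token between them
print(double_space.split(" "))

# Splitting on runs of whitespace avoids the empty token in the middle
print(re.split(r"\s+", double_space))
```

PySpark's split also takes a regular expression as its pattern, so a pattern such as \s+ is an option there too. Either way, the empty strings are removed in step eight.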

Step five: exploding list words

Now, let's turn each list of words into one dataframe row per word, using the explode function.

from pyspark.sql.functions import explode, col

words = lines.select(explode(col("line")).alias("word"))
words.show(15)

Output

+---------+
|     word|
+---------+
|      The|
|  Project|
|Gutenberg|
|    eBook|
|       of|
| Dracula,|
|       by|
|     Bram|
|   Stoker|
|         |
|     This|
|    eBook|
|       is|
|      for|
|      the|
+---------+
only showing top 15 rows
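For readers new to explode, it may help to see the same idea in plain Python: every element of every list becomes its own item, in order. The sketch below uses a made-up sample in which each inner list stands for one row of the line column.

```python
# Each inner list stands for one row of the "line" column (made-up sample)
lines = [["The", "Project", "Gutenberg"], [""], ["This", "eBook"]]

# One output item per list element, in order, like one row per array element
words = [word for line in lines for word in line]
print(words)
```

Note that the empty string from the blank line survives the flattening, just like the empty row in the output above.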

Step six: words to lowercase

This is a simple step. We don't want the same word to be counted as different words because of capital letters, so we convert the words to lowercase using the lower function.

from pyspark.sql.functions import lower

words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()

Output

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|  dracula,|
|        by|
|      bram|
|    stoker|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
|  anywhere|
|        in|
+----------+
only showing top 20 rows

Step seven: removing punctuation

So that the same word is not counted as different words because of punctuation attached to it, we need to remove that punctuation.

We'll do this using the regexp_extract function, which returns the part of a string that matches a regex. Here the pattern [a-z]+ keeps the first run of lowercase letters in each word.

from pyspark.sql.functions import regexp_extract

words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

words_clean.show()

Output

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
+---------+
only showing top 20 rows
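One consequence of this approach is worth knowing: with group 0, regexp_extract returns only the first match, and an empty string when nothing matches. Words with an apostrophe or hyphen are therefore cut at that character, and tokens without letters become empty. The sketch below mirrors that behavior with Python's re module on made-up tokens. The helper name first_letters_run is hypothetical.

```python
import re


def first_letters_run(token):
    """Mimic regexp_extract(token, "[a-z]+", 0): first match or empty string."""
    match = re.search(r"[a-z]+", token)
    return match.group(0) if match else ""


# Made-up sample tokens, already lowercased as in step six
for token in ["dracula,", "don't", "to-morrow", "1897", ""]:
    print(repr(token), "->", repr(first_letters_run(token)))
```

This is fine for a brief analysis like this one, but it does mean that a token such as don't is counted under don.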

Step eight: removing null values

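However, as you can see, there are still empty values in the column. Strictly speaking these are not nulls but empty strings, left over from blank lines and from tokens that contain no letters.

We need to remove them so that these empty values are not analyzed.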

words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()

Output

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
|      the|
+---------+
only showing top 20 rows

Step nine: removing stopwords

We are almost there! The last cleaning step is to remove the stopwords so that, again, these words are not analyzed.

words_without_stopwords = words_nonull.filter(
    ~words_nonull.word.isin(stopwords))

words_count_before_removing = words_nonull.count()
words_count_after_removing = words_without_stopwords.count()

words_count_before_removing, words_count_after_removing

Output

(163399, 50222)
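To put the two counts above in perspective, a quick calculation shows how much of the text the stopword filter removed. It uses only the two numbers from the output, and the variable names are chosen here for illustration.

```python
words_before = 163399  # word count before removing stopwords (output above)
words_after = 50222    # word count after removing stopwords (output above)

removed = words_before - words_after
share_removed = removed / words_before

print(removed)                 # number of stopword occurrences dropped
print(f"{share_removed:.1%}")  # share of all words that were stopwords
```

In other words, roughly two out of every three words in the cleaned text were stopwords.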

Step ten: analyzing the most common words in Dracula, finally!

And, finally, our data is completely clean. Now we can analyze the most common words in our book.

First, we'll group the words and then use an aggregate function to count them.

words_count = (words_without_stopwords.groupby("word")
               .count()
               .orderBy("count", ascending=False)
               )

Then, show the top 20 most common words. This number can be changed through the rank variable.

rank = 20
words_count.show(rank)

Output

+--------+-----+
|    word|count|
+--------+-----+
|    time|  381|
| helsing|  323|
|     van|  322|
|    lucy|  297|
|    good|  256|
|     man|  255|
|    mina|  240|
|    dear|  224|
|   night|  224|
|    hand|  209|
|    room|  207|
|    face|  206|
|jonathan|  206|
|   count|  197|
|    door|  197|
|   sleep|  192|
|    poor|  191|
|    eyes|  188|
|    work|  188|
|      dr|  187|
+--------+-----+
only showing top 20 rows
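As a way to sanity-check the logic on a small input, steps four to ten can be mirrored in plain Python with collections.Counter. This is a rough sketch for tiny samples, not a replacement for the Spark pipeline. The function name top_words and the sample text are made up for illustration.

```python
import re
from collections import Counter


def top_words(text, stopwords, rank):
    """Plain-Python mirror of steps four to ten, meant for small inputs."""
    counts = Counter()
    for line in text.splitlines():
        for token in line.split(" "):                    # steps four and five
            match = re.search(r"[a-z]+", token.lower())  # steps six and seven
            word = match.group(0) if match else ""
            if word and word not in stopwords:           # steps eight and nine
                counts[word] += 1
    return counts.most_common(rank)                      # step ten


# Made-up sample text and stopword list, not taken from the novel
sample_text = "The Count opened the door.\nThe Count smiled; the Count closed the door."
sample_stopwords = {"the"}
print(top_words(sample_text, sample_stopwords, 2))
```

Running the same small sample through both versions is a cheap way to confirm that each cleaning step does what you expect.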

Conclusion

That's all for now, folks! In this article, we analyzed the most common words in Dracula, written by Bram Stoker. To do this, we cleaned the words by removing punctuation, converting uppercase letters to lowercase, and removing stopwords.

I hope you enjoyed it. Keep those stakes sharp, watch out for the shadows that walk at night, and see you next time 🧛🏼‍♂️🍷.

Bibliography

RIOUX, Jonathan. Data Analysis with Python and PySpark.

STOKER, Bram. Dracula.
