Can Python unite the nation?
Yesterday we saw our country's 45th successful transfer of the presidency.
This marked the end of a highly contested election during which our nation at times felt more divided than ever.
But as I sat in my living room today with my parents and watched Biden’s inaugural address, I felt hopeful.
“Some days you need a hand. There are other days when we're called to lend a hand. That's how it has to be, that's what we do for one another. And if we are that way our country will be stronger, more prosperous, more ready for the future. And we can still disagree.” - Joe Biden, 2021 Inaugural Address
As a citizen, I was inspired by Joe's promise for a united nation.
But as a developer, I started to wonder: could I quantify this hope?
The inaugural address is a president’s first speech to the nation. The speech is meticulously written by a team of writers to capture the mood of the nation and the most pressing issues that we face.
Could the specific words used in this speech give us insight into the path ahead?
I compared the top 20 most common words in Biden's speech with the top 20 words in Trump's 2017 inaugural address to see where our country is now compared to four years ago, and what to expect over the next four years.
Using Python to Find the Top 20 Most Common Words
This next section is a tutorial for the Python analysis. If natural language processing doesn't get you excited, you may want to jump to the end (but it's also only 20 lines of code, so it could be fun to learn!)
The goal for this analysis is to take each inaugural address and find the most common words. The analysis consists of two parts:
- Scraping the speech from the web using Beautiful Soup
- Processing the words using NLTK
If you want to run the code at home, this is what you'll need to do to get set up:
- Install Python 3
- Install `requests`, `BeautifulSoup`, and `nltk` with `pip3 install`
- `brew install jupyter` and then open a Jupyter notebook by running `jupyter notebook`
Now you can run all of the commands below in the Jupyter notebook!
If you want to skip the scraping and cleaning, you can download Arctype and use the database credentials at the end to view the data.
1. Web Scraping with Beautiful Soup
Web scraping is the process of collecting information from the web. In this scenario, we're going to be scraping transcripts of each president's inauguration speech.
You can find each president's speech at these websites:
We first use the `requests` package to scrape the entire HTML code from each website.
import requests
URL = 'https://www.yahoo.com/now/full-transcript-joe-bidens-inauguration-175723360.html'
page = requests.get(URL)
Congrats, you've built your first web scraper!
This code makes an HTTP request to retrieve the HTML from the server where the speech is stored.
Now we have to take this mess of HTML and find just the text from each president's speech. We can do this easily with Python's Beautiful Soup package.
from bs4 import BeautifulSoup
biden_speech = BeautifulSoup(page.content, 'html.parser')
In the code above we've converted the HTML from earlier into a Beautiful Soup object that is easy to parse.
Now we have to find the specific HTML block that contains the text we're looking for. We can do this using the browser's DevTools console.
Open the speech in a new tab in your browser and press `cmd+option+I` to open the DevTools console. Highlight the text you're looking for, and you'll be able to see the HTML tag that contains that text in the console on the right.
For Biden's speech, we can see that it's contained in a `<div>` tag labelled with a `caas-body` class name. Switching back to Python, we can find that tag using the `find_all` method on our Beautiful Soup object from before.
biden_speech_content = biden_speech.find_all('div', class_='caas-body')
When we look at the `biden_speech_content` object, we'll still find other HTML tags that aren't related to the speech, such as:
<div class="caas-readmore caas-readmore-collapse">
<button aria-label="" class="link rapid-noclick-resp caas-button
collapse-button" data-ylk="elm:readmore;slk:Story continues"
title="">
Story continues
</button>
</div>
In order to find just the text from Biden's speech, we can filter for the `<p>` tags that aren't labeled with a class:
biden_speech_content_v2 = biden_speech_content[0].find_all('p', attrs={'class': None})
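To see what this kind of class-based filtering is doing under the hood, here's a minimal sketch using only Python's standard-library `html.parser` (the sample HTML below is made up for illustration; the real page is far messier):

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects text from <p> tags that have no class attribute."""
    def __init__(self):
        super().__init__()
        self.in_target_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and "class" not in dict(attrs):
            self.in_target_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_target_p = False

    def handle_data(self, data):
        if self.in_target_p:
            self.paragraphs[-1] += data

# Made-up HTML: one speech paragraph and one class-labelled caption
html = '<div><p>My fellow Americans.</p><p class="caption">Photo credit</p></div>'
parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)
```

Beautiful Soup's `find_all('p', attrs={'class': None})` does the same job in one line, which is why we use it here.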
Now we have all the text, but each sentence is still wrapped in `<p>` tags. We can strip these HTML tags with Beautiful Soup's `get_text` method:
biden_speech_str = ""
for sentence in biden_speech_content_v2:
    text = sentence.get_text()
    biden_speech_str = biden_speech_str + " " + text
Finally, we should be left with a clean speech that we can analyze with the `nltk` package.
2. Finding Word Frequency with NLTK
We're getting close to the end now! The final steps apply some basic natural language processing (NLP) techniques using the Python NLP package, NLTK.
We could do a frequency analysis of the speech now, but this would show words like "I", "We", and "The" as the most common words. In natural language processing these are called stop words.
We can use NLTK's list of English stop words to find just the words that we're interested in.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
nltk.download('punkt')       # tokenizer model (only needed once)
nltk.download('stopwords')   # stop word list (only needed once)
stop_words = set(stopwords.words('english'))
biden_words = word_tokenize(biden_speech_str.lower())
filtered_biden_speech = [w for w in biden_words if not w in stop_words and w.isalpha()]
Let's break down what the code is doing:
- Using `.lower()` to cast the entire speech to lower case so it can be compared to the stop words
- Separating the string into individual words with `word_tokenize`
- Removing stop words: `if not w in stop_words`
- Removing punctuation like periods and commas: `w.isalpha()`
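To see those steps in isolation, here is the same pipeline run on a single sentence, with a tiny hand-written stop word list standing in for NLTK's (illustrative only):

```python
# Tiny, made-up stop word list -- NLTK's English list has ~180 entries
stop_words = {"i", "we", "the", "a", "and", "of", "to", "in"}

speech = "We the people of the United States, in order to form a more perfect union."
words = speech.lower().split()  # crude tokenization via whitespace
filtered = [w for w in words if w not in stop_words and w.isalpha()]
print(filtered)
```

Note that `split()` is cruder than `word_tokenize`: it leaves punctuation attached, so "states," and "union." fail `isalpha()` and are dropped entirely, whereas `word_tokenize` would split the punctuation off and keep the words.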
Now we have a list of words that we can count!
freq = FreqDist(filtered_biden_speech)
print(freq.most_common(20))
But what you might find as you look through the list is that there are separate counts for similar words such as "country" and "countries". In order to count these as one word, we have to lemmatize the list so that every word is converted to its base word.
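The idea can be sketched with a toy lemma table and `collections.Counter` (which is essentially what `FreqDist` is). The mapping below is made up for illustration, not WordNet's:

```python
from collections import Counter

# Toy lemma table standing in for WordNet -- illustrative only
lemmas = {"countries": "country", "americans": "american"}

words = ["country", "countries", "nation", "americans", "american"]
base_words = [lemmas.get(w, w) for w in words]  # map each word to its base form

freq = Counter(base_words)
print(freq.most_common(2))
```

With the inflected forms collapsed, "country" and "countries" are counted as one word, just as the real lemmatizer does below.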
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data (only needed once)
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized_biden = [wordnet_lemmatizer.lemmatize(word) for word in filtered_biden_speech]
freq_lemma = FreqDist(lemmatized_biden)
print(freq_lemma.most_common(20))
Done! You've successfully scraped data from the web and analyzed it with NLP all while supporting democracy. Let's take a look at the results.
Biden's vs. Trump's Inauguration Speeches: Most Frequent Words
import plotly.express as px

k, v = zip(*freq_lemma.most_common(10))
fig = px.bar(x=v, y=k, orientation='h')
fig.update_layout(yaxis=dict(autorange="reversed"))
fig.show()
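The `zip(*...)` idiom on the first line is worth a note: it transposes a list of (word, count) pairs into a tuple of words and a tuple of counts. A standalone sketch with made-up counts:

```python
# Made-up (word, count) pairs for illustration
pairs = [("america", 20), ("nation", 15), ("people", 12)]

# zip(*pairs) unpacks the pairs and zips them back into parallel tuples
k, v = zip(*pairs)
print(k)  # ('america', 'nation', 'people')
print(v)  # (20, 15, 12)
```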
The lemmatizer distorted the top word to "u", but the original word was "us".
These were the top 10 words from Trump's speech in 2017:
What stood out to me is that 50% of the top 10 words for both presidents were the same:
- America
- American
- Nation
- People
- One
The optimistic side in me looks at this data and sees a nation that shares common values. We care about our country, and we care about each other.
But at the same time, we are all facing our own unique issues. If we look at the next 10 most common words for each president's speech we begin to see some differences.
Biden's speech was undeniably a call to bring our nation together in unity. On the other side, we can see Trump appealing to Americans whose jobs are under threat and who need to protect their livelihood and families.
The data shows two groups of people facing their own challenges, but I also see one nation with common values.
We set off to see if we could quantify "hope". And I believe we found an answer.
If two presidents with polar opposite political views can appeal to their supporters with 50% of the same vocabulary, then there is still hope to unite around our similarities.
What are the common objects we as Americans love, that define us as Americans? I think we know. Opportunity, security, liberty, dignity, respect, honor, and yes, the truth. - Joe Biden, 2021 Inaugural Address
A Full Speech Comparison with Arctype
I shared the top 20 words, but there were more than 500 unique words in Biden's inauguration speech. If you want to see more analysis, we've uploaded all the speech data to Arctype so you can skip the scraping and cleaning.
The dataset includes 2 tables:
- Frequencies table: full list of the word frequencies for both speeches
- Sentences table: cleaned sentences for both speeches so you can do your own analysis
Here's how to connect to the data:
- Download the free Arctype SQL Client
- Input the credentials below in Arctype to connect to the database
- Run a query!
Database credentials:
- host: `arctype-pg-demo.c4i5p0deezvq.us-west-2.rds.amazonaws.com`
- port: `5432`
- user: `root`
- password: `HC9x0OkI9vVO4wqprscg`
- database: `inauguration_2021`
If you enjoyed this post, sign up for the Arctype newsletter to receive more posts written by experienced developers to help and inspire other devs.