Lorena

Originally published at lorenaciutacu.com

6 findings from analysing Oscars speeches with Python

On the occasion of the 93rd Oscars Award Ceremony, I was curious to do some text mining on the acceptance speeches. Specifically, I analysed the speeches of the Best Directors between 1941 and 2019. I used a dataset from Kaggle and added missing data for 2017, 2018, and 2019 directly from the Academy Awards Acceptance Speech database.

In total, 74 Best Directors have been awarded and almost all of them gave acceptance speeches, which I analysed with Python and the NLTK library. You can find the Jupyter notebook here. Let's see what the words reveal about the Best Directors and the Oscars!

1. Average speech length

The speech of a Best Director has 104 words on average, but speeches range widely from 8 to 267 words.

Here's how to calculate the number of words in a text:

directing["words"] = directing['Speech_clean'].str.split().str.len()
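To get from per-speech counts to an average and a range, the new column can be summarised directly. Here's a minimal sketch on a made-up three-row dataframe (the speeches and the `directing_demo` name are placeholders, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real dataframe; the speeches are invented for illustration.
directing_demo = pd.DataFrame({
    "Speech_clean": [
        "thank you thank you very much appreciate it",
        "thank you to the academy and to everyone who worked on this film",
        "i want to thank my family",
    ]
})

# Same word count as above: split on whitespace, then count the pieces.
directing_demo["words"] = directing_demo["Speech_clean"].str.split().str.len()

# Mean, minimum and maximum number of words per speech.
summary = directing_demo["words"].agg(["mean", "min", "max"])
print(summary)
```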

2. Longest & shortest speeches

The longest speech runs to 267 words and was given by Mel Gibson at the 68th Academy Awards (honouring films of 1995) for his film Braveheart. This guy had a looot of people to thank and seems to have used up all his words saying pretty much nothing.

The shortest speech was summed up in 8 words by Delbert Mann at the 28th Academy Awards (honouring films of 1955) for his film Marty. I really like his efficient "I came. I won. I thanked." structured speech:

Thank you. Thank you very much. Appreciate it.

Here's how to find the longest and shortest text in a dataframe with pandas:

directing.sort_values(by="words")
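Sorting shows the whole ranking. If only the two extremes are needed, `idxmin` and `idxmax` return the row labels of the shortest and longest speech. A small sketch with invented rows (the `Name` column, the director names, and the `demo` dataframe are placeholders; the real dataset's columns may differ):

```python
import pandas as pd

# Hypothetical rows; only the "words" column mirrors the dataframe used above.
demo = pd.DataFrame({
    "Name": ["Director A", "Director B", "Director C"],
    "words": [120, 8, 267],
})

shortest = demo.loc[demo["words"].idxmin()]  # row with the fewest words
longest = demo.loc[demo["words"].idxmax()]   # row with the most words

print(shortest["Name"], shortest["words"])
print(longest["Name"], longest["words"])
```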

3. Lexical richness

Lexical richness measures how varied the vocabulary of a text is. It is calculated as the number of unique words divided by the total number of words. The higher the score, the richer the vocabulary, and vice versa. Here's how to calculate lexical richness for each speech in the dataframe with Python:

def lexical_richness(text):
    # Split into words first; set() on the raw string would count unique characters
    words = str(text).split()
    return round(len(set(words)) / len(words), 3) if words else 0.0
directing["lex_rich"] = directing["Speech_clean"].apply(lexical_richness)

The speech with the highest score (0.408) is Delbert Mann's, the director of Marty, awarded in 1955. At the other end, the speech with the lowest score (0.034) is Mel Gibson's, the director of Braveheart, awarded in 1995.

A caveat on these two numbers: they look too low to be word-level ratios (3.4% of 267 words would be only about 9 distinct words). They are what you get when the ratio is taken over the characters of the raw string instead of its words, so read them as a ranking of the speeches rather than as the exact share of distinct words.
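The definition above is a ratio over words, so the text has to be split before counting. As a sanity check, here's a small sketch comparing a word-level and a character-level ratio on the shortest speech. It assumes the speech has been lowercased and stripped of punctuation, which may not match the cleaning used in the notebook, and both function names are made up for this example:

```python
def word_level_richness(text):
    # Unique words divided by total words.
    words = str(text).split()
    return round(len(set(words)) / len(words), 3) if words else 0.0


def char_level_richness(text):
    # Unique characters divided by total characters (spaces included).
    chars = str(text)
    return round(len(set(chars)) / len(chars), 3) if chars else 0.0


# Assumed cleaned form of the 8-word speech quoted earlier.
speech = "thank you thank you very much appreciate it"

print(word_level_richness(speech))  # 6 distinct words out of 8
print(char_level_richness(speech))  # 16 distinct characters out of 43
```

The character-level number is much lower for the same text, and it keeps shrinking as a speech gets longer, because the alphabet is finite while the text is not.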

4. Longest words

The longest words used in directors' speeches have 15 letters: administrations, cinematographer, and czechoslovakian.

Here's how to select the longest words in a text:

long_words = [w for w in all_speeches_tokenized if len(w) > 14]
sorted(long_words)
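The threshold of 14 above is hard-coded. If the maximum length isn't known in advance, it can be computed first. A minimal sketch on a made-up token list (`tokens` stands in for `all_speeches_tokenized`):

```python
# Placeholder tokens; in the article this would be all_speeches_tokenized.
tokens = ["thank", "cinematographer", "film", "czechoslovakian", "thank", "academy"]

# Length of the longest token, then every distinct token of that length.
max_len = max(len(w) for w in tokens)
longest_words = sorted({w for w in tokens if len(w) == max_len})

print(max_len, longest_words)
```

Using a set here also removes repeats, so a long word that appears in several speeches is listed only once.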

5. Most common words

The top 10 most common words in all acceptance speeches are: thank (201 occurrences), much (56), like (50), people (48), want (42), would (30), movie (26), film (26), say (24), and many (22).

Interestingly, out of these 10 words, 3 are nouns (people, movie, film), 2 express large quantities (much and many), and 5 are verbs: some express personal feelings (want, like), some express actions (say, thank), and one is the modal would. It's also worth noting that thank is far more frequent than any of the words that follow it, which is understandable in an acceptance speech.

Here's how to find the frequency distribution of words in a text with NLTK:

from nltk import FreqDist
FreqDist(all_speeches_tokenized).most_common(10)
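The top 10 contains no words like I or the, which suggests that stop words were filtered out before counting. That step isn't shown in this post, so treat the filtering below as an assumption. Here's a rough sketch of the idea using `collections.Counter`, which NLTK's `FreqDist` subclasses; the stop-word set and the tokens are made up:

```python
from collections import Counter

# Tiny hand-written stop-word set, for illustration only; NLTK ships a fuller list.
stop_words = {"i", "to", "the", "you", "and", "my"}

# Placeholder tokens standing in for all_speeches_tokenized.
tokens = "i want to thank you and i want to thank the academy thank you".split()

# Drop stop words, then count what is left.
content_words = [w for w in tokens if w not in stop_words]
counts = Counter(content_words)

print(counts.most_common(2))
```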

6. "Thank" to...

Ok, winners thank a lot, but who do they thank? It turns out... you, but also the Pacific Command of the United States, Mr. Harry Cohn, Marlon, the producers, and each one of them.

Here's how to see a word in context with NLTK:

from nltk import Text, word_tokenize
Text(word_tokenize(all_speeches)).concordance('thank')
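A concordance lists each occurrence of a word together with a few words on either side. To make the idea concrete, here's a simplified stand-in in plain Python. The `keyword_in_context` helper and the sample sentence are invented for this example, and NLTK's `concordance` formats its output differently (by character width rather than by word count):

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of `keyword` with up to `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines


# Made-up sample text, split on whitespace instead of using word_tokenize.
sample = "I want to thank the producers and I also thank each one of you".split()

for line in keyword_in_context(sample, "thank"):
    print(line)
```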

That's all, folks! Could/should I have analysed anything else? Let me know what you think in the comments below ⬇️
