Note: This post was written for work done on the 15th of June, 2020. If you want to catch up with the first four days, which you definitely do not need to do, the articles are available on my blog.
It has been a while since I last wrote one of these blog posts, but that was because I really hadn't learned or done anything of value. However, I have now begun work on an incredibly interesting project that I'll be writing about every day until it reaches completion. I'm currently being mentored by Dr. Srivastava and Dr. Chaturvedi at UNC Chapel Hill and working with one of my classmates, Bhargav Vaduri, on the project. Without further ado, let me cover what the project is and what work we've done on it so far. If you want to see the code for the project, here is our GitHub repository.
What we're attempting to do with the project is analyze lines from movies and identify the gender of the character who said each line. Our first big goal is to build a classifier that can identify the gender of a line's speaker with a fairly high degree of accuracy. Hopefully we'll reach that goal soon and can then move on to other goals using the same dataset.
Speaking of our dataset, we'll be using the Cornell Movie-Dialogs Corpus, a massive textual dataset that contains dialogue from 617 movies, totaling 304,713 lines of dialogue. Since we're focusing on the gender of the character who said each utterance (line of dialogue), we are only using two of the files in the dataset: movie_lines.txt and movie_characters_metadata.txt. This is only for the time being; as we make progress with the project, we will most likely use other data from the dataset to perform other analyses.
Most of today was spent preprocessing the data, formatting it to our needs, and reading the original research paper that accompanied the corpus: Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. We wanted to format our data in a specific way that would make it easier for us to work with. The main thing we're doing here is adding the character's gender and their position in the credits to the information already present for each line of dialogue in movie_lines.txt. We then put this combined information into a new text file that we are calling collated_data.txt. This file is 304,713 lines long with 7 fields on each line: the line number in the script, the movie id, the character id, the character gender, the character's position in the credits, the character name, and the text of the utterance. This was a simple task to do with Python and we got it done in the day. We also did some preliminary analysis of the data. I attempted POS tagging of the textual data today, but formatting the output the way I wanted got a little tedious for how tired I was by the end of the day, so I decided to finish it on Tuesday morning. Our results for the day are in the next section. It is pretty sparse since we only really did preprocessing, but Tuesday's results section will be significantly bigger.
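For anyone curious what the collation step might look like, here is a minimal sketch. It is not the exact code from our repository. It assumes the field layout documented in the corpus README (fields separated by " +++$+++ "), and the function names, the output field order, and the made-up sample rows are all my own choices for illustration.

```python
SEP = " +++$+++ "  # field separator used by the corpus files


def load_characters(rows):
    """Map character id -> (gender, credits position).

    Assumed metadata layout: character id, character name, movie id,
    movie title, gender, position in credits.
    """
    characters = {}
    for row in rows:
        fields = row.rstrip("\n").split(SEP)
        if len(fields) < 6:
            continue  # skip malformed rows
        char_id, gender, credit_pos = fields[0], fields[4], fields[5]
        # Lowercasing is a defensive assumption in case labels vary in case.
        characters[char_id] = (gender.strip().lower(), credit_pos.strip())
    return characters


def collate(line_rows, characters):
    """Yield one 7-field record per utterance.

    Assumed movie_lines.txt layout: line id, character id, movie id,
    character name, utterance text.
    """
    for row in line_rows:
        # maxsplit=4 keeps the utterance text intact as the fifth field
        fields = row.rstrip("\n").split(SEP, 4)
        if len(fields) < 5:
            continue  # skip malformed rows
        line_id, char_id, movie_id, name, text = fields
        # "?" stands in for characters missing from the metadata
        gender, credit_pos = characters.get(char_id, ("?", "?"))
        yield [line_id, movie_id, char_id, gender, credit_pos, name, text]


# Made-up rows in the corpus's field layout, for illustration only.
sample_characters = [
    "u0 +++$+++ ALICE +++$+++ m0 +++$+++ example movie +++$+++ f +++$+++ 1",
    "u1 +++$+++ BOB +++$+++ m0 +++$+++ example movie +++$+++ M +++$+++ 2",
]
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there.",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Hi, how are you?",
    "L3 +++$+++ u9 +++$+++ m0 +++$+++ STRANGER +++$+++ Who are you two?",
]

collated = list(collate(sample_lines, load_characters(sample_characters)))
for record in collated:
    print("\t".join(record))  # tab as the output separator is an assumption
```

To run something like this against the real files, you would pass open file handles for movie_characters_metadata.txt and movie_lines.txt instead of the sample lists. The corpus files may not decode as UTF-8, so an explicit encoding such as iso-8859-1 might be needed when opening them.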
Our preliminary analysis focused on seeing how the data breaks down by gender, and this is what we found:
- The number of male characters is 2,049.
- The number of instances of a male character speaking is 170,168.
- The number of female characters is 966.
- The number of instances of female speech is 71,255.
- The number of characters of unknown gender is 6,020.
- The number of instances of speech from a character of unknown gender is 62,690.
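A breakdown like the one above can be computed with a couple of counters. The sketch below is only an illustration under my own assumptions: the function name is hypothetical, the input is a list of (character id, gender) pairs with one pair per utterance, and the toy data is made up. It also only counts characters who actually speak, which may differ from counting rows in the metadata file.

```python
from collections import Counter


def gender_breakdown(utterances):
    """Return (characters per gender, utterances per gender).

    `utterances` is an iterable of (character id, gender) pairs,
    one pair per line of dialogue.
    """
    utterance_counts = Counter()
    gender_by_character = {}
    for char_id, gender in utterances:
        utterance_counts[gender] += 1
        gender_by_character[char_id] = gender  # each character counted once
    character_counts = Counter(gender_by_character.values())
    return character_counts, utterance_counts


# Made-up toy data: three characters, six utterances.
toy = [("u0", "f"), ("u1", "m"), ("u0", "f"), ("u2", "?"), ("u1", "m"), ("u1", "m")]
character_counts, utterance_counts = gender_breakdown(toy)
print(dict(character_counts))
print(dict(utterance_counts))
```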
Like I mentioned at the beginning of today's post, the code for the project is hosted on our GitHub repository, so if you want to check that out, be my guest! Until next time, keep coding!
As always, if you want to keep up to date with my work, consider subscribing to my newsletter here.