Justin L Beall

Posted on Dec 7, 2020

Social Learning Journal - Parsing Audiobooks

#python #learningjournal #aggregates

Since January of 2017, I have been documenting all of my professional learning activities on Twitter. As an established software engineer, I want to share the process it takes to get here. Here is my Twitter account, dev3l_.

I use structured messages for journal-specific events. For example, when I listen to a podcast, I start the Tweet with "Listened to:". I have similar semantic tags for reading books, listening to books, and attending conferences/courses. In addition, I hashtag the Tweets with classification identifiers, such as #agile. Typically, I try to write a few notes about the event. Finally, the event is tagged with a duration using the caret symbol and an amount, such as "^45m" to signify 45 minutes in length.
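As a rough illustration of this format, here is a minimal sketch of how such a message could be picked apart. The helper name `parse_journal_tweet`, the sample text, and the optional hours part of the duration (e.g. "^1h30m") are assumptions made for the example and are not taken from my actual script.

```python
import re

# Hypothetical helper: splits a journal Tweet into tag, hashtags, and duration.
TAG_PATTERN = re.compile(r"^([A-Za-z ]+):")
HASHTAG_PATTERN = re.compile(r"#(\w+)")
# Assumption: durations look like "^45m"; the hours part ("^1h30m") is a guess.
DURATION_PATTERN = re.compile(r"\^(?:(\d+)h)?(?:(\d+)m)?")

def parse_journal_tweet(text: str) -> dict:
    tag_match = TAG_PATTERN.match(text)
    duration_match = DURATION_PATTERN.search(text)
    minutes = None
    # Only report a duration when at least one of hours/minutes was captured.
    if duration_match and (duration_match.group(1) or duration_match.group(2)):
        hours = int(duration_match.group(1) or 0)
        minutes = hours * 60 + int(duration_match.group(2) or 0)
    return {
        "tag": tag_match.group(1) if tag_match else None,
        "hashtags": HASHTAG_PATTERN.findall(text),
        "minutes": minutes,
    }

sample = "Listened to: An example podcast episode\nSome notes here. #agile ^45m"
parsed = parse_journal_tweet(sample)
```

With the sample text, `parsed` holds the tag "Listened to", the hashtag list, and the duration in minutes, which is enough to group or total events later.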

At one point, Gary Vaynerchuk had a video called Document, Don't Create. This journal has been my attempt to do just that.

New Project Setup

I am basing this project on my previous post, Message bot to find a PS5 on sale. I intend this to be a Python Flask API using MongoDB that will support a ReactJS front end. For today, I want to set up CI with Travis, coverage using Coveralls, code quality on Code Climate, and the whole thing hosted on Heroku. If anyone has questions on how to set these technologies up, leave a comment below and I will create some more detailed steps.

Outcomes

For today, my goal is to parse a Twitter data archive to create a list of audiobooks I have listened to this year. On Twitter, you can go to your Settings and privacy and request an archive download of all your data. This is what I will be using as a starting point, as seed data. Inside this archive is a tweet.js file that contains every Tweet from the account in JSON format.

The JSON structure of a Tweet from an archive and the Twitter API is identical. Because the Tweet parser is built first against a static export, it can later be reused in a scheduled job that handles Tweets dynamically in near real-time.

In the future, I imagine pulling from data sources outside of Twitter. Any platform with an available API could be used, like GitHub, LinkedIn, and YouTube.

Structured Messages

When I first started doing this, I did not know the importance of structured messages. A few years ago, I went to create an initial prototype and realized it was pretty hard to pull meaningful data out - it took a lot of hand manipulation. Given an appropriate classifier with machine learning, it would have been possible, but for now, it is much easier to just add a little bit of metadata to each message.

Instead of working with the file tweet.js as a JavaScript file, I removed the JS prefix window.YTD.tweet.part0 = from the file and saved it as tweet.json. This can be imported into Python as a JSON document, and we can start working on it relatively easily.

import json
with open(DATA_SEED_TWITTER_PATH) as data_seed:
    data = json.load(data_seed)

With three lines of code, I can now start manipulating my journaled events, 4556 of them at the time of this article.
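The manual step of removing the JavaScript prefix could also be scripted. The following is only a sketch: the function name `load_archive_tweets` and the sample string are made up for illustration, and it assumes the file is a single assignment followed by a plain JSON array.

```python
import json

# Assumption: the archive file starts with a JavaScript assignment
# ("window.YTD.tweet.part0 = ") followed by a plain JSON array.
def load_archive_tweets(raw: str) -> list:
    # Everything from the first "[" onward is treated as the JSON payload.
    json_start = raw.index("[")
    return json.loads(raw[json_start:])

sample_js = 'window.YTD.tweet.part0 = [{"tweet": {"full_text": "Listened to: example"}}]'
tweets = load_archive_tweets(sample_js)
```

Reading the real file would just mean passing the contents of tweet.js to the function instead of the sample string.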

This is a lot of data when all I want is to see the audiobooks for this year. Next, let's filter the data set to show only items from this year based upon the created_at attribute.

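from dateutil.parser import parse  # assumption: parse comes from python-dateutil

# first_of_year and end_of_year are timezone-aware datetimes bounding the year
def filter_by_this_year(tweet: dict) -> bool:
    created_at = parse(tweet['tweet']['created_at'])
    return first_of_year <= created_at <= end_of_year

tweets_from_this_year = list(filter(filter_by_this_year, data))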
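For readers who want to try the date filter in isolation, here is a self-contained sketch that uses only the standard library. It assumes created_at follows Twitter's classic timestamp layout (for example "Mon Dec 07 14:30:00 +0000 2020"); the format string, the `in_year` helper, and the sample Tweets are illustrative and not from my script.

```python
from datetime import datetime, timezone

# Assumption: created_at uses Twitter's classic format, e.g.
# "Mon Dec 07 14:30:00 +0000 2020".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def in_year(tweet: dict, year: int) -> bool:
    created_at = datetime.strptime(tweet["tweet"]["created_at"], TWITTER_DATE_FORMAT)
    # Half-open interval: from Jan 1 of the year up to, not including, Jan 1 of the next.
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return start <= created_at < end

sample_tweets = [
    {"tweet": {"created_at": "Mon Dec 07 14:30:00 +0000 2020"}},
    {"tweet": {"created_at": "Tue Dec 31 23:59:59 +0000 2019"}},
]
tweets_2020 = [t for t in sample_tweets if in_year(t, 2020)]
```

Comparing timezone-aware datetimes on both sides avoids the error Python raises when an aware timestamp is compared with a naive one.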

Now our data is a little bit more manageable at 449 events. Applying another filter, I'll look for the text of "Started listening to:" to whittle it down to the list of the books I am interested in.

def filter_by_audiobook_start(tweet: dict) -> bool:
    text = tweet['tweet']['full_text']
    return "Started listening to:" in text

audio_books_from_this_year = list(filter(filter_by_audiobook_start, tweets_from_this_year))

At this point, I have found 11 books. This makes sense to me as I have an Audible subscription that allows for one book a month. I'm not perfect with my annotations and sometimes log the end of a book without marking the start of the book. So I changed the filter to include "Finished listening to:" and found one other book.

"Started listening to:" in text or "Finished listening to:" in text
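If more tags get added later, the condition could be written against a tuple of tags rather than a growing chain of `or` clauses. This is a sketch only; the names `AUDIOBOOK_TAGS` and `is_audiobook_event` and the sample Tweets are hypothetical.

```python
# Hypothetical tag list; the two strings are the ones used in the filter above.
AUDIOBOOK_TAGS = ("Started listening to:", "Finished listening to:")

def is_audiobook_event(tweet: dict) -> bool:
    text = tweet["tweet"]["full_text"]
    # True when any of the known tags appears anywhere in the Tweet text.
    return any(tag in text for tag in AUDIOBOOK_TAGS)

sample_tweets = [
    {"tweet": {"full_text": "Started listening to: Example Book\nNotes"}},
    {"tweet": {"full_text": "Finished listening to: Example Book\nDone"}},
    {"tweet": {"full_text": "Listened to: An example podcast"}},
]
audiobook_events = list(filter(is_audiobook_event, sample_tweets))
```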

Audiobook List

Now that we have identified the list of books, it's time for a bit of string manipulation to get to the title. After the type identifier tag, I put the name of the book followed by a newline. Using a reduce function, we can easily put these titles into a set and create a unique list of books listened to.

from functools import reduce

def reduce_book_titles(result: set, tweet: dict) -> set:
    text = tweet['tweet']['full_text']
    # Title sits between the first colon (end of the tag) and the first newline
    title = text.split(":", 1)[1].split("\n")[0].strip()
    result.add(title)
    return result

audio_book_titles = list(reduce(reduce_book_titles, audio_books_from_this_year, set()))
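The same de-duplication can also be expressed with a set comprehension instead of reduce. The sketch below is an alternative, not what my script does; `extract_title` and the sample Tweets are invented for the example.

```python
def extract_title(text: str) -> str:
    # Text after the first colon, up to the first newline, trimmed of spaces.
    return text.split(":", 1)[1].split("\n")[0].strip()

sample_tweets = [
    {"tweet": {"full_text": "Started listening to: Example Book\nNotes #agile"}},
    {"tweet": {"full_text": "Finished listening to: Example Book\nDone ^30m"}},
    {"tweet": {"full_text": "Started listening to: Another Title\nMore notes"}},
]

# A set collapses the start and finish events for the same book into one entry.
titles = sorted({extract_title(t["tweet"]["full_text"]) for t in sample_tweets})
for title in titles:
    print(title)
```

Sorting the set gives a stable order when printing, since sets themselves are unordered.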

Looping through and printing the titles, I can see that I have the following audiobooks under my belt for the year. The full source to this simple yet powerful script can be found here: parse-audiobook-tweets-for-this-year.py

2020 Audiobooks:

The Unicorn Project - A Novel About Developers, Digital Disruption, and Thriving in the Age of Data
Talking to Strangers
Understanding Software - Simplicity, Coding, and How to Suck Less as a Programmer
Escaping the Build Trap - How Effective Product Management Creates Real Value
Good to Great - Why Some Companies Make the Leap...And Others Don't
Agile Conversations - Transform Your Conversations, Transform Your Culture
Creativity, Inc. - Overcoming the Unseen Forces That Stand in the Way of True Inspiration
Doing Agile Right - Transformation Without Chaos
The Pragmatic Programmer
The 7 Habits of Highly Effective People - Powerful Lessons in Personal Change
Sense & Respond - How Successful Organizations Listen to Customers and Create New Products Continuously
The Infinite Game
The Art of War

DEV Community