Justin L Beall

Social Learning Journal - Classification

Building upon my previous Social Learning Journal - Parsing Audiobooks post, let's walk through creating a simplified version of a Naive Bayes classifier.

Typically, I post events related to one of three categories: Software Engineering, Agile, and Leadership. Sometimes I have domain-related content, such as Ballistics. Conveniently, Twitter comes with a built-in classification system, aka hashtags.


Outcomes

In this session, my goal is to classify my journaled events for the year. This will give a raw density count that will help me understand where this year's learning efforts have been primarily focused. Granted, I am focused on the above-mentioned categories, but a similar process can be done for any targeted skill set.

The JSON structure of the Tweet provided by Twitter already has "hashtags" identified in the data set.

"hashtags" : [ {
  "text" : "agile",
  "indices" : [ "263", "269" ]
} ]
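
For reference, here is a minimal sketch of pulling the hashtag text out of a single archived Tweet. It assumes the same nesting the extraction scripts later in this post rely on (tweet -> entities -> hashtags); the sample dictionary is illustrative, not taken from my data.

# Illustrative only: a hand-built tweet dict mirroring the archive structure.
sample_tweet = {
    "tweet": {
        "entities": {
            "hashtags": [{"text": "agile", "indices": ["263", "269"]}]
        }
    }
}

# Lowercase the hashtag text so "Agile" and "agile" count as the same tag.
hashtags = [entity["text"].lower() for entity in sample_tweet["tweet"]["entities"]["hashtags"]]
print(hashtags)  # ['agile']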

Creating the Data Set

Leveraging the script already written in Parsing Audiobooks, we have a way to extract the Tweets for a year.

I am a big fan of Test-Driven Development (TDD), and since we now have a function that we will use in multiple places, let's switch out of the Hacker/Scripter Hat and put on the Production Hat.

If you are new to TDD, writing a test first seems like an unnatural process. Use the acronym ZOMBIES to get started. James Grenning wrote a great post explaining this process, TDD Guided by ZOMBIES.

  • Z – Zero
  • O – One
  • M – Many (or More complex)
  • B – Boundary Behaviors
  • I – Interface definition
  • E – Exercise Exceptional behavior
  • S – Simple Scenarios, Simple Solutions

Zero / Interface Definition

I wrote a test that dictates my contract (red). Oftentimes the Zero step and the Interface definition step are mixed. By laying out a zero case, we are actually defining the inputs and outputs for the class and method.

# time_extractor_test.py
from src.extractors.time_extractor import TimeExtractor


def test_extract_tweets_for_year_zero():
    tweet_data = []

    time_extractor = TimeExtractor(tweet_data)

    tweets_for_year = time_extractor.tweets_for_year("2020")

    assert not tweets_for_year

Even without writing a single line of "production" code, we have already established a lot about how we expect our function to behave. This test fails, so I switch to implementation (green).

# time_extractor.py
from typing import List, Dict


class TimeExtractor:
    tweet_data = None

    def __init__(self, tweet_data: List[Dict]):
        self.tweet_data = tweet_data

    def tweets_for_year(self, year: str) -> List[Dict]:
        return []

With this little bit of code, our test is now green. In TDD, the intention is to write the least amount of code necessary to make the test pass. It's heavily built upon the concept of YAGNI: "you aren't gonna need it". Premature optimization, dead code, gold plating, and over-architecting are the bane of legacy production systems.

One

Given we have an existing data structure, let's mock up positive and negative data for the next test cases. The Twitter data structure is burdensomely large, but because we only care about the "created_at" attribute, it can be simplified as such:

tweet_2020 = {
    'tweet': 
        {
            'created_at': 'Thu Jan 09 20:23:34 +0000 2020', 
        }
}

tweet_2019 = {
    'tweet': 
        {
            'created_at': 'Thu Jan 09 20:23:34 +0000 2019', 
        }
}

The only difference between these two data elements is the year. Our second set of tests looks like this (red):

def test_extract_tweets_for_year_one():
    expected_count = 1
    tweet_data = [tweet_2020]

    time_extractor = TimeExtractor(tweet_data)

    tweets_for_year = time_extractor.tweets_for_year("2020")

    assert expected_count == len(tweets_for_year)


def test_extract_tweets_for_year_not_found():
    expected_count = 0
    tweet_data = [tweet_2020]

    time_extractor = TimeExtractor(tweet_data)

    tweets_for_year = time_extractor.tweets_for_year("2019")

    assert expected_count == len(tweets_for_year)

# results
>       assert expected_count == len(tweets_for_year)
E       assert 1 == 0
E        +  where 0 = len([])

To make this test pass, we will take a lazy approach, laziness being one of The Programmer's Virtues. It seems silly, but we are just going to look for the year text in the date string and return the data if it is there; otherwise, we return an empty list. A step or two down the road, we will implement the actual date filter from the Parsing Audiobooks code.

def tweets_for_year(self, year: str) -> List[Dict]:
    if not self.tweet_data:
        return []

    created_at = self.tweet_data[0]['tweet']['created_at']
    return self.tweet_data if year in created_at else []

With this little bit of code, we have both of our tests passing (green), but it is not a practical implementation.

Many

In this step, we create a test that has both of our data elements and assert that only one of them is returned (red).

def test_extract_tweets_for_year_many():
    expected_count = 1
    tweet_data = [tweet_2019, tweet_2020]

    time_extractor = TimeExtractor(tweet_data)

    tweets_for_year = time_extractor.tweets_for_year("2020")

    assert expected_count == len(tweets_for_year)

Now that we have a more complicated situation, we actually implement some iteration logic to grab the tweets with matching dates.

    def tweets_for_year(self, year: str) -> List[Dict]:
        if not self.tweet_data:
            return []

        tweets_for_year = []

        for tweet in self.tweet_data:
            created_at = tweet['tweet']['created_at']

            if year in created_at:
                tweets_for_year.append(tweet)

        return tweets_for_year

Simple

At this point, we have a viable solution that could be used to grab a list of Tweets by year. With all of our tests passing, we can refactor the code a bit to swap in our more comprehensive date filter. We will use some built-in functions to perform the logic we implemented by hand.

from typing import List, Dict

from dateutil.parser import parse

FIRST_OF_YEAR = "Jan 1 00:00:00 +0000 "
END_OF_YEAR = "Dec 31 23:59:59 +0000 "


class TimeExtractor:
    tweet_data = None

    def __init__(self, tweet_data: List[Dict]):
        self.tweet_data = tweet_data

        self._year = None
        self._first_of_year = None
        self._end_of_year = None

    def tweets_for_year(self, year: str) -> List[Dict]:
        if not self.tweet_data:
            return []

        self._year = year
        self._first_of_year = parse(f'{FIRST_OF_YEAR}{year}')
        self._end_of_year = parse(f'{END_OF_YEAR}{year}')

        return list(filter(self._filter_by_year, self.tweet_data))

    def _filter_by_year(self, tweet: dict) -> bool:
        created_at = parse(tweet['tweet']['created_at'])
        return self._first_of_year <= created_at <= self._end_of_year

Our tests still pass, so we know that our refactored solution still provides the desired behavior. If this were a commercial production system, some edge-behavior tests should be included, i.e., missing attributes, first/end of year, etc., which would cover the Boundaries and Exceptional behaviors in the ZOMBIES acronym.
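
As an illustration, a couple of hedged sketches of such tests might look like the following. The mocked tweets and the expectation that a missing attribute raises a KeyError are my assumptions about sensible edge cases, not part of the current suite.

# time_extractor_boundary_test.py (sketch only)
import pytest

from src.extractors.time_extractor import TimeExtractor


def test_extract_tweets_for_year_first_second_of_year():
    # Boundary: a tweet stamped at the very first second of the year is included.
    tweet_data = [{'tweet': {'created_at': 'Wed Jan 01 00:00:00 +0000 2020'}}]

    time_extractor = TimeExtractor(tweet_data)

    assert len(time_extractor.tweets_for_year("2020")) == 1


def test_extract_tweets_for_year_missing_created_at():
    # Exceptional behavior: the current implementation raises a KeyError on a
    # malformed tweet; a production version might choose to skip or log it.
    tweet_data = [{'tweet': {}}]

    time_extractor = TimeExtractor(tweet_data)

    with pytest.raises(KeyError):
        time_extractor.tweets_for_year("2020")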

Extracting the Data Set

Given the new function we wrote, we can easily grab all of the Tweets for this year.

# hashtags-for-this-year.py
import json
import os
from datetime import datetime

from dotenv import load_dotenv

from src.extractors.time_extractor import TimeExtractor

load_dotenv()

DATA_SEED_TWITTER_PATH = os.environ.get("DATA_SEED_TWITTER_PATH", "./data/tweet.json")

current_year = str(datetime.today().year)

if __name__ == "__main__":
    with open(DATA_SEED_TWITTER_PATH) as data_seed:
        data = json.load(data_seed)

    time_extractor = TimeExtractor(data)
    tweets_from_this_year = time_extractor.tweets_for_year(current_year)

This is the start of the script we will use to extract unique hashtags from the events. All that is left to do is loop through the data (449 items) and add each unique hashtag to a set. With the following code, this is whittled down to 64 unique tags that will be used in the classifier.

hashtag_set = set()
for tweet in tweets_from_this_year:
    hashtags = [hashtag_entity['text'].lower() for hashtag_entity in tweet['tweet']['entities']['hashtags']]
    hashtag_set |= {*hashtags}

for hashtag in sorted(list(hashtag_set)):
    print(hashtag)

Tag Classification

There is no magical way to classify data tags. One of the hardest parts of supervised machine learning is appropriately tagging and classifying a viable set of data. Based on the tags, I created a dictionary variable that maps each hashtag to one of the four classifications indicated earlier: Engineering, Agile, Leadership, and Other. The full list can be found on GitHub here.
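
I won't reproduce the full mapping here, but to give a sense of its shape, a small illustrative excerpt might look like the following. The specific tag-to-category assignments below are stand-ins of my own choosing; the real list lives in src.classifiers.hashtags on GitHub.

# src/classifiers/hashtags.py (illustrative excerpt, not the full mapping)
CLASSIFICATION_ENGINEERING = "Engineering"
CLASSIFICATION_AGILE = "Agile"
CLASSIFICATION_LEADERSHIP = "Leadership"
CLASSIFICATION_OTHER = "Other"

# Every lowercase hashtag maps to exactly one classification.
classified_hashtags = {
    "tdd": CLASSIFICATION_ENGINEERING,
    "python": CLASSIFICATION_ENGINEERING,
    "agile": CLASSIFICATION_AGILE,
    "scrum": CLASSIFICATION_AGILE,
    "leadership": CLASSIFICATION_LEADERSHIP,
    "ballistics": CLASSIFICATION_OTHER,
    # ... remaining tags omitted; see the full list on GitHub
}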

Classifying Tweets

The last step in our classification routine is to apply the tags we created to each event to determine its classification. Using some functional programming concepts, I grab the list of hashtags for each event and then reduce them based upon the max classification count.

# classify-tweets-for-this-year.py
import json
import os
from datetime import datetime
from functools import reduce
from typing import List

from dotenv import load_dotenv

from src.classifiers.hashtags import CLASSIFICATION_ENGINEERING, CLASSIFICATION_AGILE, \
    CLASSIFICATION_LEADERSHIP, CLASSIFICATION_OTHER, classified_hashtags
from src.extractors.time_extractor import TimeExtractor

load_dotenv()

DATA_SEED_TWITTER_PATH = os.environ.get("DATA_SEED_TWITTER_PATH", "./data/tweet.json")

current_year = str(datetime.today().year)

classifications = {
    CLASSIFICATION_ENGINEERING: 0,
    CLASSIFICATION_AGILE: 0,
    CLASSIFICATION_LEADERSHIP: 0,
    CLASSIFICATION_OTHER: 0
}


def classify(classification: str, tweet_hashtags_to_classify: List[str]) -> int:
    return len(list(filter(
        lambda hashtag: classification == classified_hashtags.get(hashtag), tweet_hashtags_to_classify)))


def reduce_classifications(result: dict, tweet_hashtags: list) -> dict:
    classification_keys = result.keys()

    tag_classifications = {
        classification: classify(classification, tweet_hashtags)
        for classification in classification_keys
    }

    is_all_zeros = not sum(tag_classifications.values())
    if is_all_zeros:
        classification = CLASSIFICATION_OTHER
    else:
        classification = max(tag_classifications, key=tag_classifications.get)

    result[classification] = result[classification] + 1

    return result


if __name__ == "__main__":
    with open(DATA_SEED_TWITTER_PATH) as data_seed:
        data = json.load(data_seed)

    time_extractor = TimeExtractor(data)
    tweets_from_this_year = time_extractor.tweets_for_year(current_year)

    tweet_hashtags_from_this_year = [
        [hashtag_entity['text'].lower()
         for hashtag_entity in tweet['tweet']['entities']['hashtags']]
        for tweet in tweets_from_this_year]

    classified_tweets = reduce(reduce_classifications, tweet_hashtags_from_this_year, classifications)

    for key, value in classified_tweets.items():
        print(f'{key}: {value}')

2020 Tweet Distribution

  • Engineering: 86
  • Agile: 162
  • Leadership: 35
  • Other: 166
