DEV Community



Topic modelling with Gensim and SpaCy on startup news

This winter I'm embarking on a new NLP project with the goal of analysing global investment trends in clean energy startups.

One of the best sources of startup news globally is TechCrunch, so to get a high-level overview of all possible startup news topics, I've extracted its articles for the last ten years.

For the topic modelling itself, I am going to use the Gensim library by Radim Rehurek, which is very developer-friendly and easy to use.

1. Text preprocessing

The TechCrunch collection of startup news is absolutely amazing. I've extracted the articles using the API, so the data requires some cleaning.

Let's start by fetching the data from an AWS S3 bucket.

# Getting the data from an AWS S3 bucket
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='process-news', Key='techcrunch_data.csv')

techcrunch_data = pd.read_csv(obj['Body'])

print(techcrunch_data.shape)
# (32634, 54)

The dataset contains more than 30,000 news articles; however, the text of the title and body needs to be extracted from the 'rendered' key.

"{'rendered': 'Irish virtual sports giant’s new startup bets big on rugby in bid for US market share'}"

import ast

# Getting the text from the 'rendered' key
techcrunch_data["clean_title"] = [ast.literal_eval(x)['rendered'] for x in techcrunch_data["title"]]
techcrunch_data["clean_content"] = [ast.literal_eval(x)['rendered'] for x in techcrunch_data["content"]]

Now, to get the complete data for our model, let's concatenate title and body of the article.

# Concatenate title and content in one string
techcrunch_data['clean_text'] = techcrunch_data["clean_title"] + ' ' + techcrunch_data["clean_content"]

The text is still far from being ready for the model though.


article before processing

The article needs to be stripped of HTML markup and unicode artifacts; I also want to replace TechCrunch-specific quotation marks and the \n, \t and \r symbols.

import re
import unicodedata
from bs4 import BeautifulSoup

# getting the text out of the markup
techcrunch_data["clean_text"] = [BeautifulSoup(text, 'lxml').text for text in techcrunch_data["clean_text"]]

# normalise unicode characters
techcrunch_data["clean_text"] = [unicodedata.normalize('NFKD', x) for x in techcrunch_data["clean_text"]]

# remove quotation marks and dashes
techcrunch_data["clean_text"] = [re.sub(r'[“”@()–-]+', ' ', x) for x in techcrunch_data["clean_text"]]

# collapse repeated whitespace (including \n, \t and \r)
techcrunch_data["clean_text"] = [re.sub(r'\s+', ' ', x) for x in techcrunch_data["clean_text"]]

As a result the text looks much cleaner, and we can proceed with further processing and lemmatisation using SpaCy.

Article after text cleaning

2. Cleaning data with SpaCy

SpaCy is one of the most popular NLP libraries, and is very fast and flexible.

I will use it for lemmatisation and to extract only nouns for my topics.

To speed up the processing, I will disable the Named Entity Recognition component of the SpaCy pipeline. The corpus is quite large, so I will also use nlp.pipe to process the texts efficiently.

Additionally, SpaCy will handle stop word removal; stop words are simply the most common words, which carry little information value.

import spacy
from tqdm import tqdm

# loading the model; the pipeline is a series of functions applied to a text:
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
# disabling 'ner' speeds up the processing
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# defining the dataset
dataset = techcrunch_data['clean_text']

docs = []

# nlp.pipe streams the texts through the pipeline, which is more efficient
for doc in tqdm(nlp.pipe(dataset), total=len(dataset)):
    lemmas = [token.lemma_.lower() for token in doc
              if len(token.lemma_) > 1 and token.pos_ == "NOUN"
              and not token.is_stop]
    docs.append(lemmas)

3. Word removal

We already removed stop words with SpaCy; however, there is further processing I decided to do to improve the accuracy of the topic model.

I removed words that appear just once in the whole corpus, as they won't carry any significance.

from collections import Counter
from itertools import chain

# count how many times each word occurs in our corpus
counts_word_occurence = Counter(chain(*docs))

# get the words that appeared only once in the whole corpus
low_freq_words = {key: value for (key, value) in counts_word_occurence.items() if value == 1}

print(len(low_freq_words))
# 14272

# drop words that appear only once in the whole dataset
docs = [[lemma for lemma in text if counts_word_occurence[lemma] > 1] for text in docs]

Additionally, I've removed words that appear in the vast majority of the articles, like startup, founder and so on. All these articles are about startups, and I wanted to dive into more detail.

# Dropping words that occur in more than 40% of the articles

# getting the corpus length
docs_length = len(docs)

# calculate in how many documents each word appears
counts_word_percentage = Counter(chain(*[set(x) for x in docs]))

# convert the counts into the % of all articles a word appears in
counts_word_percentage = {key: (value / docs_length) * 100 for (key, value) in counts_word_percentage.items()}

# get the words with high document frequency
high_freq_words = {key: value for (key, value) in counts_word_percentage.items() if value > 40}

# drop those words from the corpus
docs = [[lemma for lemma in text if lemma not in high_freq_words] for text in docs]

High frequency words

Now that our corpus contains only the noun lemmas we decided to keep, let's see what an article looks like.

import pprint

# our article transformed into a list of lemmas
pp = pprint.PrettyPrinter(compact=True)
pp.pprint(docs[0])

article as a list of cleaned lemmas

Once we removed some words, the article lengths changed, and it would be quite interesting to see their distribution.

# number of lemmas left in each text
lengths = [len(x) for x in docs]

# build the histogram: distribution of lemma counts across the texts
sns.histplot(lengths)

Distribution of lemmas in the text

The list with cleaned lemmas is available on my GitHub.

4. Topic modelling with the best number of topics

One of the most popular algorithms for topic modelling is Latent Dirichlet Allocation (LDA). It's a generative probabilistic model based on Bayesian inference.

In essence, words that repeatedly occur together in a text corpus are grouped together to form topics.

However, topic models are hard to interpret. One of the measures that gives us some insight into a model is the coherence score, which measures how well a subset of words fits together.

Gensim has a CoherenceModel class, so we can use it to evaluate the best number of topics for our corpus. I will try two methods available in Gensim: 'u_mass' and 'c_v'.

Let's write a function that calculates both the 'u_mass' and 'c_v' coherence measures on our model for different numbers of topics.

from gensim import corpora
from gensim.models import ldamodel
from gensim.models.coherencemodel import CoherenceModel

# Defining the dictionary and corpus with Gensim
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(text) for text in docs]

def calculate_coherence(dictionary, corpus, docs, start, stop):
    scores = []
    for topics in range(start, stop):

        # defining the model for the current number of topics
        lda_model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=topics, alpha='auto', eval_every=5)

        # u_mass coherence score
        cm_u_mass = CoherenceModel(model=lda_model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
        u_mass_coherence = cm_u_mass.get_coherence()

        # c_v coherence score
        cm_c_v = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence='c_v')
        c_v_coherence = cm_c_v.get_coherence()

        scores.append([topics, u_mass_coherence, c_v_coherence])

    return scores

# calculate the scores
scores = calculate_coherence(dictionary, corpus, docs, 10, 30)

Having the scores will allow us to assess which number of topics gives the best coherence.

# scores to df
df = pd.DataFrame(scores, columns = ['number_of_topics','u_mass_coherence','c_v_coherence'])

# tidying the df
df = df.melt(id_vars=['number_of_topics'], value_vars=['u_mass_coherence','c_v_coherence'])

# Plotting u_mass_coherence
sns.lineplot(data=df.loc[df['variable'] == 'u_mass_coherence'], x="number_of_topics", y="value").set_title('u_mass coherence')

# Plotting c_v_coherence
sns.lineplot(data=df.loc[df['variable'] == 'c_v_coherence'], x="number_of_topics", y="value").set_title('c_v coherence')

U_mass Coherence

C_V coherence

5. Visualising the results

According to our data, the model that gives the best coherence score has 22 topics. We have already defined the corpus and the dictionary, so it's time to build the final model.

lda_model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=22, alpha='auto', eval_every=5)

# print the topics with their keywords
lda_model.print_topics()


The pyLDAvis library allows us to visualise topics as bubbles with keywords, which is super cool if you want to explore them in more detail.

viz = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(viz)

Visualizing topics

6. Saving the model

Once the model is built, we can save it to disk for future use on unseen data.

from gensim.test.utils import datapath

# save the model to disk
temp_file = datapath('model')
lda_model.save(temp_file)

Link to GitHub:

Useful resources:

Evaluate Topic Models: Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation

Exploring the Space of Topic Coherence Measures

Evaluation of topic modeling topic coherence
