DEV Community



Topic modelling with Gensim and SpaCy on startup news

This winter I'm embarking on a new NLP project with the goal of analysing global investment trends in clean energy startups.

One of the best sources of startup news globally is TechCrunch, so to get a high-level overview of all possible startup news topics, I've extracted its articles for the last ten years.

For the topic modelling itself, I am going to use the Gensim library by Radim Rehurek, which is very developer-friendly and easy to use.

1. Text preprocessing

The TechCrunch collection of startup news is absolutely amazing. I've extracted the articles using the API, so the data requires some cleaning.

Let's start by fetching the data from an AWS S3 bucket.

# Getting the data from an AWS S3 bucket
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='process-news', Key='techcrunch_data.csv')

techcrunch_data = pd.read_csv(obj['Body'])

print(techcrunch_data.shape)
# (32634, 54)

The dataset contains more than 30,000 news articles; however, the text of the title and body needs to be extracted from the 'rendered' key.

"{'rendered': 'Irish virtual sports giant’s new startup bets big on rugby in bid for US market share'}"

import ast

# Getting the text from the 'rendered' key
techcrunch_data["clean_title"] = [ast.literal_eval(x)['rendered'] for x in techcrunch_data["title"]]
techcrunch_data["clean_content"] = [ast.literal_eval(x)['rendered'] for x in techcrunch_data["content"]]

Now, to get the complete data for our model, let's concatenate title and body of the article.

# Concatenate title and content in one string
techcrunch_data['clean_text'] = techcrunch_data["clean_title"] + ' ' + techcrunch_data["clean_content"]

The text is still far from being ready for the model though.


article before processing

The article needs to be stripped of HTML markup and unicode artifacts; I also want to replace TechCrunch-specific quotation marks and the \n, \t and \r symbols.

import re
import unicodedata
from bs4 import BeautifulSoup

# getting the text out of the markup
techcrunch_data["clean_text"] = [BeautifulSoup(text, 'lxml').text for text in techcrunch_data["clean_text"]]

# normalise unicode characters
techcrunch_data["clean_text"] = [unicodedata.normalize('NFKD', x) for x in techcrunch_data["clean_text"]]

# remove quotation marks and dashes
techcrunch_data["clean_text"] = [re.sub(r'[“”@()–-]+', ' ', x) for x in techcrunch_data["clean_text"]]

# collapse repeated whitespace (including \n, \t and \r)
techcrunch_data["clean_text"] = [re.sub(r'\s+', ' ', x) for x in techcrunch_data["clean_text"]]

As a result the text looks much cleaner, and we can proceed with further processing and lemmatisation using SpaCy.

Article after text cleaning

2. Cleaning data with SpaCy

SpaCy is one of the most popular NLP libraries, and is very fast and flexible.

I will use it for lemmatisation and to extract only nouns for my topics.

To speed up the processing, I will disable the Named Entity Recognition component of the SpaCy pipeline. The corpus is quite large, so I will also use nlp.pipe to process the texts efficiently.

Additionally, SpaCy will handle stop word removal; stop words are simply the most common words, which carry little information value.

import spacy
from tqdm import tqdm

# loading the model; the pipeline is a series of functions applied to a text:
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
# disabling 'ner' speeds up the processing
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# defining the dataset
dataset = techcrunch_data['clean_text']

docs = []

# nlp.pipe streams the texts through the pipeline, which is more efficient
for doc in tqdm(nlp.pipe(dataset), total=len(dataset)):
    lemmas = [token.lemma_.lower() for token in doc
              if len(token.lemma_) > 1 and token.pos_ == "NOUN"
              and not token.is_stop]
    docs.append(lemmas)

3. Word removal

We already removed stop words with SpaCy; however, there is further processing I decided to do to improve the accuracy of the topic model.

I removed words that appear just once in the whole corpus, as they won't carry any significance.

from collections import Counter
from itertools import chain

# count how many times each word occurs in our corpus
counts_word_occurence = Counter(chain(*docs))

# get the words that appeared only once in the whole corpus
low_freq_words = {key: value for (key, value) in counts_word_occurence.items() if value == 1}

print(len(low_freq_words))
# 14272

# drop words that appear only once in the whole dataset
docs = [[lemma for lemma in text if counts_word_occurence[lemma] > 1] for text in docs]

Additionally, I've removed words that appear in the vast majority of the articles, like startup, founder and so on. All these articles are about startups, and I wanted to dive into more detail.

# Dropping words that occur in more than 40% of the articles

# getting the corpus length
docs_length = len(docs)

# calculate in how many documents each word appears
counts_word_percentage = Counter(chain(*[set(x) for x in docs]))

# convert the counts into the % of all articles a word appears in
counts_word_percentage = {key: (value / docs_length) * 100 for (key, value) in counts_word_percentage.items()}

# get the words with high document frequency
high_freq_words = {key: value for (key, value) in counts_word_percentage.items() if value > 40}

# drop those words from the corpus
docs = [[lemma for lemma in text if lemma not in high_freq_words] for text in docs]

High frequency words

Now that our corpus contains only the noun lemmas we decided to keep, let's see what an article looks like.

import pprint

# our article transformed into a list of lemmas
pp = pprint.PrettyPrinter(compact=True)
pp.pprint(docs[0])

article as a list of cleaned lemmas

Once we removed some words, the article lengths changed, and it would be quite interesting to see their distribution.

# number of lemmas left in each text
lengths = [len(x) for x in docs]

# build the histogram: distribution of lemma counts across the texts
sns.histplot(lengths)

Distribution of lemmas in the text

The list with cleaned lemmas is available on my GitHub.

4. Topic modelling with the best number of topics

One of the most popular algorithms for topic modelling is Latent Dirichlet Allocation (LDA). It's a generative probabilistic model based on Bayesian inference.

In essence, words that repeatedly occur together in a text corpus are grouped together to form topics.

However, topic models are hard to interpret. One of the measures that gives us some insight into a model is the coherence score, which measures how well a subset of words fits together.

Gensim has a CoherenceModel class, so we can use it to evaluate the best number of topics for our corpus. I will try two methods available in Gensim: 'u_mass' and 'c_v'.

Let's write a function that calculates both the 'u_mass' and 'c_v' coherence measures on our model for different numbers of topics.

from gensim import corpora
from gensim.models import ldamodel
from gensim.models.coherencemodel import CoherenceModel

# Defining the dictionary and corpus with Gensim
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(text) for text in docs]

def calculate_coherence(dictionary, corpus, docs, start, stop):
    scores = []
    for topics in range(start, stop):

        # defining the model for the current number of topics
        lda_model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=topics, alpha='auto', eval_every=5)

        # u_mass coherence score
        cm_u_mass = CoherenceModel(model=lda_model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
        u_mass_coherence = cm_u_mass.get_coherence()

        # c_v coherence score
        cm_c_v = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence='c_v')
        c_v_coherence = cm_c_v.get_coherence()

        scores.append([topics, u_mass_coherence, c_v_coherence])

    return scores

# calculate the scores
scores = calculate_coherence(dictionary, corpus, docs, 10, 30)

Having the scores will allow us to assess which number of topics gives the best coherence.

# scores to df
df = pd.DataFrame(scores, columns = ['number_of_topics','u_mass_coherence','c_v_coherence'])

# tidying the df
df = df.melt(id_vars=['number_of_topics'], value_vars=['u_mass_coherence','c_v_coherence'])

# Plotting u_mass_coherence
sns.lineplot(data=df.loc[df['variable'] == 'u_mass_coherence'], x="number_of_topics", y="value").set_title('u_mass coherence')

# Plotting c_v_coherence
sns.lineplot(data=df.loc[df['variable'] == 'c_v_coherence'], x="number_of_topics", y="value").set_title('c_v coherence')

U_mass Coherence

C_V coherence

5. Visualising the results

According to our data, the model that gives the best coherence score has 22 topics. We have already defined the corpus and the dictionary, so it's time to build the final model.

lda_model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=22, alpha='auto', eval_every=5)

# print the topics with their keywords
lda_model.print_topics()


The pyLDAvis library allows us to visualise topics as bubbles with keywords, which is super cool if you want to explore them in more detail.

viz = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(viz)

Visualizing topics

6. Saving the model

Once the model is built, we can save it to disk for future use on unseen data.

from gensim.test.utils import datapath

# save the model to disk
temp_file = datapath('model')
lda_model.save(temp_file)

Link to GitHub:

Useful resources:

Evaluate Topic Models: Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation

Exploring the Space of Topic Coherence Measures

Evaluation of topic modeling topic coherence
