Python Keywords Extraction - Machine Learning Project Series: Part 2

#python #machinelearning #programming #nlp

This article was originally published at https://programmerbackpack.com.

Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.

What is Keywords Extraction

Keywords extraction is a subtask of the Information Extraction field which is responsible for extracting keywords from a given text or from a collection of texts to help us summarize the content. This is useful given the huge amount of information we deal with every day: we need to index this information, organise it and retrieve it later. Keywords extraction is becoming more and more important, and keywords extraction algorithms are continuously researched and improved.

Today we are going to discuss TextRank, one of the most famous algorithms for keywords extraction and text summarization, and play with a short implementation in Python.

In the first article in this series I talked about starting my journey into Machine Learning with a personal project: a personal knowledge management system that can help me track the things I learn.

In the previous article in this series, we discussed named entity recognition. Another must-have feature is the ability to automatically extract keywords from the content I save to my application. This way, I can search for it easily in the future and organise my content faster.

Understand TextRank for Keywords Extraction

If the name TextRank sounds familiar to you, that's because you may be thinking of another famous algorithm, PageRank. And you'd be right to think of PageRank, because not only the name but also the basic principles are inspired by it.

PageRank is an algorithm that Google uses to rank web pages in response to a user's search query. It measures the importance of a web page by observing the links/references between web pages (by number, and by the quality and importance of the pages they come from). The assumption here is that the more references a web page receives, the more important that web page should be.

You can check out the Wikipedia page for PageRank for a mathematical explanation of the algorithm if you're interested in more details, but the main takeaway is this: important web pages are referenced by other important web pages. By applying the PageRank algorithm, we estimate the probability that a user who follows links at random will end up on a given web page.
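To make the idea a bit more concrete, here is a minimal sketch of PageRank computed by repeated score passing (power iteration). The four-page toy web, the function name and the damping value of 0.85 are assumptions chosen for illustration; this is not how Google's production system works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "home" is linked to by every other page.
toy_web = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web, "home" ends up with the highest score because it receives links from every other page, including the well-linked "about" page, while "blog" and "contact" receive no links and stay at the baseline.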

Now back to TextRank, where the same logic is applied. TextRank is a graph-based algorithm and we will represent the data like this:

The nodes in the graph will be the words in our text
The edges of the graph will be weighted by the similarity scores between two given nodes (so two given words in our text)

Basically, the steps for applying the TextRank algorithm are the following:

Split the whole text into words
Calculate word embeddings using any word embedding representation
Calculate similarity scores between words, using any similarity metric on the word embeddings you obtained in the previous step
Build the graph using the words as nodes and the similarity scores as edge weights
Rank the nodes with the PageRank scoring and get the first n words, choosing n to serve your purposes
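The steps above can be sketched in a few dozen lines of plain Python. Everything here is an assumption made for illustration: the "embedding" is just a count of neighbouring words (a stand-in for real word embeddings), the similarity metric is cosine similarity, and the function names are hypothetical. It is not the Gensim implementation, and it skips stopword removal, so common words can rank highly.

```python
import math
import re
from collections import Counter, defaultdict


def context_vectors(words, window=2):
    """Toy 'embedding': count the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank_keywords(text, n=3, damping=0.85, iterations=50):
    # Split the text into lowercase words.
    words = re.findall(r"[a-z]+", text.lower())
    # Build one vector per distinct word.
    vectors = context_vectors(words)
    vocab = sorted(vectors)
    # Similarity scores become the weighted edges of the word graph.
    weights = {word: {} for word in vocab}
    for i, first in enumerate(vocab):
        for second in vocab[i + 1:]:
            score = cosine(vectors[first], vectors[second])
            if score > 0:
                weights[first][second] = score
                weights[second][first] = score
    # Weighted PageRank by power iteration: each word passes its score to
    # its neighbours in proportion to the edge weights.
    rank = {word: 1.0 / len(vocab) for word in vocab}
    for _ in range(iterations):
        new_rank = {}
        for word in vocab:
            incoming = sum(
                rank[other] * weight / sum(weights[other].values())
                for other, weight in weights[word].items()
            )
            new_rank[word] = (1 - damping) / len(vocab) + damping * incoming
        rank = new_rank
    # Keep the n best-scoring words.
    return sorted(vocab, key=lambda word: -rank[word])[:n]


sample = ("keywords extraction helps us index text "
          "and keywords extraction helps us organise text")
print(textrank_keywords(sample, n=3))
```

Swapping `context_vectors` for real pre-trained embeddings would only change how the vectors are built; the graph construction and ranking would stay the same.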

The math behind the TextRank algorithm is beyond the scope of this article, because we would also like to play with the algorithm a little bit. If you think I should write a more detailed blog post about it, please let me know and I'll gladly do so. Now let's move on to the fun stuff.

Python Keyword Extraction using Gensim

Gensim is an open-source Python library for unsupervised topic modelling and advanced natural language processing. It is very easy to use and very powerful, making it perfect for our project. This library contains a TextRank implementation that we can use with very few lines of code.

One important thing to note here is that at the moment the Gensim implementation of TextRank only works for English. During the TextRank algorithm, words are stemmed and stopwords are removed. This is a language-dependent process, so the library only contains the implementation for English.
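To see why this preprocessing is language-dependent, here is a rough sketch of stopword removal and stemming. The tiny stopword list and the suffix-stripping rule are assumptions for illustration only; a real pipeline relies on a proper stemmer and a much larger stopword list, both of which are specific to English.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists hold hundreds of words.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "we"}
# Crude suffix stripping for illustration only; this is NOT a real stemmer.
SUFFIXES = ("ing", "ion", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, then stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(word) for word in words if word not in STOPWORDS]


print(preprocess("We are extracting the keywords of a text"))
# ['extract', 'keyword', 'text']
```

Both the stopword list and the suffixes only make sense for English, which is why a different language would need its own version of this step.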

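We need very few dependencies installed for this project. Note that the gensim.summarization module used below was removed in Gensim 4.0, so we need a 3.x release.

pip3 install "gensim<4.0"
pip3 install networkx
pip3 install matplotlib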

Next up we should import everything we need for this project. We will use the keywords method from gensim for extracting the keywords and the get_graph method to build a graph of our text. matplotlib and networkx are used for visualisation purposes.

from gensim.summarization import keywords
from gensim.summarization.keywords import get_graph
import networkx as nx
import matplotlib.pyplot as plt

Getting the keywords of a text with Gensim is very easy; it's actually a matter of two lines of code. To show how well this algorithm works, I will provide as input the first paragraph of this blog post, the one in which we talk about keywords extraction. This text is returned by the get_text() helper, which is not shown here.

if __name__ == "__main__":
    text = get_text()
    print(keywords(text).split('\n'))

And the result is as follows:

['extracting', 'keywords extraction']

As you can see, the two keywords (or keyphrases) are exactly what I would like to obtain for a paragraph like this one. The text is about keywords extraction, and that is what I obtained.

We can now build the graph of the text so that we can see how our words are related to each other. To do that, we can use this code.

def displayGraph(textGraph):
    # Copy the Gensim graph into a networkx graph, one weighted edge at a time.
    # Adding an edge also adds its two nodes, so no separate add_node calls are needed.
    graph = nx.Graph()
    for edge in textGraph.edges():
        graph.add_edge(edge[0], edge[1], weight=textGraph.edge_weight(edge))

    # Lay the nodes out and draw them, labelling each node with its (stemmed) word.
    pos = nx.spring_layout(graph)
    plt.figure()
    nx.draw(graph, pos, edge_color='black', width=1, linewidths=1,
            node_size=500, node_color='seagreen', alpha=0.9,
            labels={node: node for node in graph.nodes()})
    plt.axis('off')
    plt.show()

if __name__ == "__main__":
    text = get_text()
    displayGraph(get_graph(text))

And the result should look like this. Don't worry if the words seem a little incomplete to you. It's the result of the stemming and other transformations Gensim does during the TextRank algorithm.

What if we try with the whole text of this blog post?

['python', 'algorithms', 'algorithm', 'extraction', 'extracting', 'extract', 'textrank', 'web', 'words', 'word', 'given text', 'blog', 'continuously', 'steps', 'step', 'machine', 'like', 'topic', 'article', 'entity', 'search', 'language', 'keywords', 'keyword', 'implementation', 'texts']

Then we get a whole lot more keywords, but keep in mind they are ordered by importance. So if we wanted to extract only 3 words for this blog post, after removing the duplicates we would get: python, algorithms and extraction. Sounds like a great result to me!
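Removing the duplicates can be done by hand, but here is one possible sketch of doing it in code. It assumes that two keywords count as duplicates when they share the same crude stem; the suffix rule and the function names are hypothetical and not part of Gensim.

```python
def crude_stem(word):
    """Rough suffix stripping so 'algorithms' and 'algorithm' compare equal."""
    for suffix in ("ing", "ion", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_unique(keywords_in_order, n):
    """Keep the first n keywords whose stem has not been seen yet."""
    seen, result = set(), []
    for keyword in keywords_in_order:
        if len(result) >= n:
            break
        stem = crude_stem(keyword)
        if stem not in seen:
            seen.add(stem)
            result.append(keyword)
    return result


# The start of the keyword list above, in the order Gensim returned it.
ranked = ['python', 'algorithms', 'algorithm', 'extraction',
          'extracting', 'extract', 'textrank']
print(top_unique(ranked, 3))  # ['python', 'algorithms', 'extraction']
```

Because the list is already ordered by importance, keeping the first occurrence of each stem preserves the ranking.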

Wrapping up

So today we talked a bit about the TextRank algorithm for extracting keywords from a given text. We saw some fundamental principles behind the algorithm and played with an implementation in the Gensim library. And I feel we actually got good results!

Thank you so much for reading this!