If you're interested in Data Analytics, you will find learning about Natural Language Processing very useful. A good project to start learning about NLP is to write a summarizer - an algorithm that reduces a body of text while keeping its original meaning, or that gives great insight into the original text.
There are many libraries for NLP. For this project, we will be using NLTK - the Natural Language Toolkit.
Let's start by writing down the steps necessary to build our project.
4 steps to build a Summarizer
- Remove stop words (defined below) from the analysis
- Create a frequency table of words - how many times each word appears in the text
- Assign a score to each sentence depending on the words it contains and the frequency table
- Build the summary by adding every sentence above a certain score threshold
That's it! And the Python implementation is also short and straightforward.
What are stop words?
Any word that does not add value to the meaning of a sentence. For example, let's say we have the sentence
A group of people run every day from a bank in Alafaya to the nearest Chipotle
By removing the sentence's stop words, we can reduce the number of words and preserve the meaning:
Group of people run every day from bank Alafaya to nearest Chipotle
We usually remove stop words from the analyzed text because knowing their frequency doesn't give any insight into the body of text. In this example, we removed the instances of the words a, in, and the.
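To make this concrete, here is a minimal sketch of stop word removal with NLTK (the sentence variable and the filtering line are just for illustration; note that NLTK's stop word list is more aggressive than the hand-filtered example above -- it also drops of, from, and to):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopWords = set(stopwords.words("english"))
sentence = "A group of people run every day from a bank in Alafaya to the nearest Chipotle"
# Keep only the tokens that are not stop words
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stopWords]
print(" ".join(filtered))
# roughly: group people run every day bank Alafaya nearest Chipotle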
Now, let's start!
There are two NLTK modules that will be necessary for building an efficient summarizer.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
Note: There are more NLTK tools that can make our summarizer better; one example is discussed at the end of this article.
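Note: if this is your first time using NLTK, you will likely need to download the stop words corpus and the punkt tokenizer models once. A one-time setup, assuming a default NLTK installation:
import nltk

nltk.download("stopwords")  # the pre-determined stop words used below
nltk.download("punkt")  # the models behind word_tokenize and sent_tokenize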
Corpus
A corpus is a collection of text. It could be a data set of poems by a certain poet, the body of work of a certain author, etc. In this case, we are going to use a data set of pre-determined stop words.
Tokenizers
Basically, a tokenizer divides a text into a series of tokens. There are three main tokenizers - word, sentence, and regex tokenizers. For this specific project, we will only use the word and sentence tokenizers.
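As a quick illustration of the difference between the two (the sample string is arbitrary):
from nltk.tokenize import word_tokenize, sent_tokenize

sample = "NLTK is handy. It splits text for us."
print(sent_tokenize(sample))
# ['NLTK is handy.', 'It splits text for us.']
print(word_tokenize(sample))
# ['NLTK', 'is', 'handy', '.', 'It', 'splits', 'text', 'for', 'us', '.']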
Removing stop words and making frequency table
First, we create two collections - a set of stop words, and a list of every word in the body of text.
Let's use text as the original body of text.
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
Second, we create a dictionary for the word frequency table. For this, we should only use the words that are not part of the stopWords set.
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
Now, we can use the freqTable dictionary over every sentence to know which sentences have the most relevant insight into the overall purpose of the text.
Assigning a score to every sentence
We already have a sentence tokenizer, so we just need to run the sent_tokenize() method to create the array of sentences. Second, we will need a dictionary to keep the score of each sentence; this way, we can later go through the dictionary to generate the summary.
sentences = sent_tokenize(text)
sentenceValue = dict()
Now it's time to go through every sentence and give it a score depending on the words it has. There are many algorithms to do this - basically, any consistent way to score a sentence by its words will work. I went for a basic algorithm: adding the frequency of every non-stop word in a sentence.
for sentence in sentences:
    for wordValue in freqTable.items():  # .items() yields (word, frequency) pairs
        if wordValue[0] in sentence.lower():
            if sentence[:12] in sentenceValue:
                sentenceValue[sentence[:12]] += wordValue[1]
            else:
                sentenceValue[sentence[:12]] = wordValue[1]
Note: Index 0 of wordValue returns the word itself, and index 1 the number of instances.
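Since looping over a dictionary directly yields only its keys, freqTable.items() is what gives us those (word, frequency) pairs. A quick illustration with a made-up table:
freqTable = {"people": 3, "run": 2}
for wordValue in freqTable.items():
    print(wordValue[0], wordValue[1])
# people 3
# run 2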
If sentence[:12] caught your eye, nice catch. This is just a simple way to hash each sentence into the dictionary.
Notice that a potential issue with our scoring algorithm is that long sentences will have an advantage over short sentences. To solve this, divide every sentence's score by the number of words in the sentence.
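The code in this article doesn't apply that normalization, but a minimal sketch, run right after the scoring loop above, could look like this (it assumes no two sentences share the same first 12 characters):
# Hypothetical normalization: divide each score by the sentence's word count
for sentence in sentences:
    if sentence[:12] in sentenceValue:
        sentenceValue[sentence[:12]] /= len(word_tokenize(sentence))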
So, what value can we use to compare our scores to?
A simple approach to this question is to find the average score of a sentence. From there, finding a threshold will be easy peasy lemon squeezy.
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))
So, what's a good threshold? The wrong value could give a summary that is too short or too long.
The average itself can be a good threshold. For my project, I decided to go for a shorter summary, so the threshold I use is one-and-a-half times the average.
Now, let's apply our threshold and store our sentences in order into our summary.
summary = ''
for sentence in sentences:
    if sentence[:12] in sentenceValue and sentenceValue[sentence[:12]] > (1.5 * average):
        summary += " " + sentence
You made it!! You can now print(summary) and you'll see how good our summary is.
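To see all the pieces in one place, here is a minimal end-to-end sketch of the steps above (the summarize function name is my own; the logic is the same as in this article):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

def summarize(text):
    # Steps 1 and 2: build the frequency table, skipping stop words
    stopWords = set(stopwords.words("english"))
    freqTable = dict()
    for word in word_tokenize(text):
        word = word.lower()
        if word in stopWords:
            continue
        freqTable[word] = freqTable.get(word, 0) + 1

    # Step 3: score each sentence by the frequencies of the words it contains
    sentences = sent_tokenize(text)
    sentenceValue = dict()
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                sentenceValue[sentence[:12]] = sentenceValue.get(sentence[:12], 0) + freq

    # Step 4: keep every sentence scoring above 1.5x the average
    average = int(sum(sentenceValue.values()) / len(sentenceValue))
    summary = ''
    for sentence in sentences:
        if sentence[:12] in sentenceValue and sentenceValue[sentence[:12]] > (1.5 * average):
            summary += " " + sentence
    return summary

# Usage: print(summarize(text))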
Optional enhancement: Make smarter word frequency tables
Sometimes, we want two very similar words to add importance to the same entry, e.g., mother, mom, and mommy. For this, we use a stemmer - an algorithm that reduces words to their root form.
To implement a stemmer, we can use NLTK's stem module. You'll notice there are many stemmers; each one is a different algorithm to find the root word, and one algorithm may be better than another for specific scenarios.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
Then, pass every word through the stemmer before adding it to our freqTable. It is also important to stem every word when going through each sentence, before adding up the scores of the words in it.
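A minimal sketch of that change in the frequency table loop (only the ps.stem() call is new relative to the code above):
from nltk.stem import PorterStemmer

ps = PorterStemmer()
freqTable = dict()
for word in words:
    word = ps.stem(word.lower())  # e.g. "running" and "runs" both become "run"
    if word in stopWords:
        continue
    freqTable[word] = freqTable.get(word, 0) + 1
Remember to stem the sentence's words the same way in the scoring loop, so the lookups match.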
And we're done!
Congratulations! Let me know if you have any other questions or enhancements to this summarizer.
Thanks for reading my first article! Good vibes
Top comments (37)
Excellent post, you are absolutely amazing ❤️
I got one question though: when adding up the sentence values, why would you want the key in the sentenceValue dictionary to be only the first 12 characters of the sentence? It might cause some trouble if a sentence is shorter than 12 characters, or if two different sentences start with the exact same 12 characters.
I assume you did it as a way to reduce overhead, but to be honest, performance-wise I don't think the difference would be that significant. I would much rather drop the [:12] and sacrifice the tiny performance increase.
I would love to hear your opinion on this matter.
If anyone got any errors running the code, copy paste my version.
That said, it does not work perfectly; it has some flaws. I tried to summarize this article as a test. Here is the result (the threshold is 1.5 * average):
"For example, the Center for a New American Dream envisions "... a focus on more of what really matters, such as creating a meaningful life, contributing to community and society, valuing nature, and spending time with family and friends."
Thank you very much, Sebastian!
I agree with you -- having the whole sentence as the dictionary key will bring better reliability to the program compared to the first 12 characters of the sentence; my decision was mainly about the overhead, but as you said, it is almost negligible. One bug I would look for is the use of special characters in the text, mainly the presence of quotes and braces, but this is an easily fixable issue (I believe using triple quotes as you are currently doing will avoid it).
I summarized the same article and got the following summary:
Feel free to use my version for comparison!
How short your summary came out may be a result of the way you are using the stemmer; I would suggest testing the same article without it to verify. Besides that, your code is looking on point -- clean and concise. If you are looking for ways to improve your results, I would suggest you explore the following ideas:
Thanks for the suggestion!
Cool website you got yourself there!
I got a question I forgot to ask. Why do you turn the stopwords list into a set()? First I thought it was because you probably intended to remove duplicate items from the list, but then it struck me: why would there be duplicate items in a corpus list containing stop words? When I compared the length of the list before and after turning it into a set, there was no difference:
len(stopwords.words("english")) == len(set(stopwords.words("english")))
Outputs: True
Tracing the variable throughout the script, I must admit I cannot figure out why you turned it into a set. I assume it is a mistake?
Or do you have any specific reason for it?
Hmm, I believe the first time I used the list of stop words from NLTK there were some duplicates; if not, I am curious too, lol. It may be time to change it to a list.
Thanks for the note!
If you ever try your implementation using TF-IDF, let me know how it goes.
Excellent post!
@davidisrawi can you please help me with text extraction (3-4 Keywords) from each paragraph from an article.
I went through your article and got stuck with an error: "string index out of range".
sentenceValue[sentence[:12]] += wordValue[1]
IndexError: string index out of range
I have tried changing [sentence[:12]] to 7, 8, and 9, but I am unable to resolve the error.
Please help regarding this.
Thank you very much Dhairya!
This bug could happen whenever your list of sentences contains one of length < 12. A good workaround is to remove the [:12] index completely and use the whole sentence as the sentenceValue key. Does that make sense? Let me know if that fixes the problem!
I have changed it, but it is still giving me a KeyError and showing only the first 3-4 lines from my text.
How do I solve this error?
Hm, sounds like you may have forgotten to remove the [:12] index from the other parts of your code where you use sentenceValue; maybe that is the issue? If not, feel free to share a snippet of your code so we can be on the same page.
Hi, I am still facing the same error, index out of range... help me on this.
The code should be like this..
for sentence in sentences:
    for wordValue in freqTable:
        if wordValue in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freqTable[wordValue]
            else:
                sentenceValue[sentence] = freqTable[wordValue]
sentenceValue[sentence[:12]] += wordValue[1]
string index out of range
Hey! You may have a sentence that is shorter than 12 characters. In this case, you can set the index to sentence[:10], or a lower number depending on your shortest sentence.
Lowering the number of characters used to hash the sentence value can bring some issues -- two sentences with the same 7, 8, or 9 starting characters will then store/retrieve their value from the same key in the dictionary. That's why it's important to keep the hash length as high as you can.
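A tiny illustration of that collision risk (the sentences are made up):
a = "The quick brown fox jumps."
b = "The quick brown dog sleeps."
print(a[:12] == b[:12])
# True - both sentences would share one dictionary key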
Any number I use gives the same error:
sentenceValue[sentence[:2]] += wordValue[1]
IndexError: string index out of range
Interesting. For debugging it, I would print all your sentences and find if there's an empty one (or a very short one), I think that may be the issue.
Let me know if that works or if you found the issue
I am also facing the string index out of range problem. What is the issue?
You may have a string that is shorter than your slice length, sentence[:2]. I would recommend printing the strings to see if this is the case.
I have solved it. It was actually a punctuation problem in my case. I just handled the dot (.) character when giving words as values.
Could you please help me? I'm facing the same problem here and I can't handle it. Thank you.
Is this your error?
IndexError: string index out of range
If so, potential solutions could be:
If that doesn't solve it, let me know!
It is still giving me an error even when the text is longer than 12 characters and the sentence (when printed through the loop) is "You notice a wall of text in twitch chat and your hand instinctively goes to the mouse.", which is the first line of the paragraph. I found that even when you take out the range, the same error occurs.
The bug may be in how you are storing your sentences; make sure you print out the sentences as you store them instead of when you retrieve them. Hopefully that'll help you find the issue. If not, let me know if I can help!
Thanks @davidisrawi for this simple and interesting text summarizer program.
I read and analyzed your code. The most common error I found is index out of range, and most people seem to run into it. The one thing I am confused about is this part of the code: why and how is only 1.5 times the average used? Also, what about a large one-line text? It does not get summarized.
For example:
I am using Python 3, and I resolved the index out of range error as:
error as:Thanks a lot! This post is really helpful! If you have other resources including making chatbot can be really helpful to me.
I am little bit interesting about how to implement the Text Summarizer using machine learning model. I am looking for this too...
You can directly send information at
sushant1234gautam@gmail.com
Great post David,
I have been trying to wrap my head around machine learning and NLP for a few months now. Developing intuition has been a slow process. Articles like yours are a source of "aha moments". I am trying to build a blog post summary app. Being a newbie, I am using an API (AYLIEN) and following this summary generator tutorial. Having something working gives me motivation to read in-depth articles.
Thanks for your comments Vikram, best of luck with the summary app!
This isn't working right for me and I think it comes down to wordValue[0] not working for me the way you said. Do you know why that could be?
Like if I do:
for wordValue in freqTable:
    print(wordValue[0])
I only get the first letters:
q
b
f
j
m
.
s
b
s
l
It seems like your bug comes from separating the paragraphs into letters instead of words.
The program should do the following steps in the respective order:
I wouldn't be able to tell in which step the bug is, but it seems as if you are finding the frequency of each letter instead of each word. Make sure you are keeping track of your arrays by printing them throughout your code; it seems like you're almost there.
I too am confused about this. According to my understanding, iterating over a dictionary only goes through its keys, so it would make sense that the program is printing only the first letter of each word.
I guess to get the key-value pairs, we need to use the items() function:
for wordValue in freqTable.items():
Hello sir, could you suggest a way to make the summarizer more effective? Sometimes sentences with lower sentence values can be very important for the summary; in that case, if we leave those out, the summary may not make sense.
That's a good point. I think what you might be referring to is some kind of adjacency value - a sentence might be worth more than we think because it's next to a really important sentence.
Another aspect you could change in the scoring algorithm is the use of TF-IDF. Let me know if you end up using it; I would like to see what that would look like.
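For anyone curious, here is a rough sketch of what TF-IDF sentence scoring might look like using scikit-learn (an extra dependency beyond NLTK, and entirely an illustration of mine, not part of the original article):
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize

def tfidf_sentence_scores(text):
    # Treat each sentence as a "document" so rare but distinctive words weigh more
    sentences = sent_tokenize(text)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Score each sentence by the sum of its TF-IDF weights
    return {s: matrix[i].sum() for i, s in enumerate(sentences)}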
In Python, a string is a one-dimensional array of characters. "String index out of range" means that the index you are trying to access does not exist: you are trying to get a character at a position that is not inside the string. Indexes in Python start at 0, so the maximum index for any string will always be its length - 1. There are several ways to account for this; knowing the length of your string (using the len() function) can certainly help you avoid going past the end.
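One subtlety worth illustrating (an aside of mine): indexing past the end of a string raises this error, but slicing never does, so sentence[:12] by itself is safe -- the IndexError in this thread comes from indexing, such as wordValue[1] on a one-character string:
s = "hi"
print(s[:12])  # 'hi' - slicing past the end is safe
print(s[len(s) - 1])  # 'i' - the maximum valid index is length - 1
print(s[5])  # IndexError: string index out of range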
Hi, I have a problem with this:
sumValues += sentenceValue[sentence]
TypeError: unsupported operand type(s) for +=: 'int' and 'str'
Hi Viqi. Seems like you are storing a string in your sentenceValue dictionary instead of an actual number; it is supposed to be an int. Fixing that may solve the problem!
Where do we import the text file to run this?
That would be up to you!
In my implementation, I put everything in one method, so I can run it through the command line, passing the actual string of text. Having said that, it totally depends on your use case; in some scenarios it might be worth receiving a text file instead.
Here is my implementation if you want to take a look at it: github.com/DavidIsrawi/SummarizeMe...