Discussion on: Build a quick Summarizer with Python and NLTK

View post

Excellent post, you are absolutely amazing ❤️
I got one question though, when adding up the sentenceValues, why would you like the key in the sentenceValue dictionary to only be the first 12 characters of the sentence? I mean it might cause some troubles if the sentence is lower than 12 characters or if two different sentences starts with the exact same 12 characters.

I assume you did it as a way to reduce overheat, but to be honest. Perfomance wise I don't think the difference would be that significant, I would much rather prefer:

Added readability (as you won't have to think about the [:12]
Errors with sentences less than 12 characters
Issues with two different sentences starting with the same 12 chars.

As a sacrifies for a tiny performance increase.

I would love to hear your opinion on this matter.

If anyone got any errors running the code, copy paste my version.

That said, it does not work properly, it has some flaws, I tried to summarize this article as a test. Here is the result: (The threshold is: (1.5*average) )

"For example, the Center for a New American Dream envisions "... a focus on more of what really matters, such as creating a meaningful life, contributing to community and society, valuing nature, and spending time with family and friends."

David Israwi • Mar 18 '18 • Edited

Thank you very much, Sebastian!

I agree with you -- having the whole sentence as the dictionary key will bring a better reliability to the program compared to the first 12 characters of the sentence, my decision was mainly regarding the overheat, but as you said: it is almost negligible. One bug that I would look for is the use of special characters in the text, mainly the presence of quotes and braces, but this is an easily fixable issue (I believe using the three quotes as you are currently doing will avoid this issue)

I summarized the same article and got the following summary:

It boldly proclaims: "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. President Lincoln extended the American Dream to slaves with the Emancipation Proclamation. How the American Dream Changed Throughout U.S. history, the definition of happiness changed as well. After the 1920s, many presidents supported the idea of the Dream as a pursuit of material benefits. While running for President in 2008, Hillary Clinton proposed her American Dream Plan. Did the Great Recession Create a New American Dream? Some people think the Great Recession and rising income inequality spelled the end of the American Dream for many. Instead, many are turning to a new definition of the American Dream that better reflects the values of the country for which it was named. For example, the Center for a New American Dream envisions "... a focus on more of what really matters, such as creating a meaningful life, contributing to community and society, valuing nature, and spending time with family and friends." Financial adviser Suze Orman described the new American Dream as one "... where you actually get more pleasure out of saving than you do spending. (Source: Suze Orman on the New American Dream, ABC.) Both of these new visions reject the American Dream based on materialism. But perhaps there is no need to create a New American Dream from scratch.

Feel free to use my version for comparison!

How short your summary was may be a result of the way you are using the Stemmer, I would suggest testing the same article without it to verify this. Besides that, your code is looking on point -- clean and concise. If you are looking for ways to improve your results, I would suggest you explore the following ideas:

Having a variable threshold
Using TFIDF instead of our word value algorithm (not sure if it'll bring better results but worth the try)
Having some kind of derivated value added from the previous sentence for consistency

Thanks for the suggestion!

Sebastian-Nielsen • Mar 18 '18 • Edited

Cool website you got yourself there!

I got a question I forgot to ask. Why do you turn the 'stopwords' list into a set()? First I thought it was because you properly intented to remove duplicate items from the list, but then it stroke me.. Why would there be duplicate items in a corpus list containing stop words? When I compared the length of the list before and after turning it into a set. There was no difference:

len(stopwords.words("english") == len(set(stopwords.words("english")))
Outputs: True

Tracing the variable throughout the script, I most admit, I can not figure out why you turned it into a set. I assume it is a mistake?
Or do you have any specific reason for it?

by the way, thanks for the TFIDF suggestion, I am currently working on improving the algorithm by implementing the tfidf concept.

David Israwi • Mar 23 '18

Hmm, I believe the first time I used the list of stop words from NLTK there were some duplicates, if not I am curious too, lol. It may be time to change it to a list.

Thanks for the note!

If you ever try your implementation using TFIDF, let me know how it goes.