Discussion on: Build a quick Summarizer with Python and NLTK

Sebastian-Nielsen • Edited

Cool website you got yourself there!

I have a question I forgot to ask. Why do you turn the 'stopwords' list into a set()? At first I thought it was because you probably intended to remove duplicate items from the list, but then it struck me: why would there be duplicate items in a corpus list containing stop words? When I compared the length of the list before and after turning it into a set, there was no difference:

from nltk.corpus import stopwords
len(stopwords.words("english")) == len(set(stopwords.words("english")))
Outputs: True

Tracing the variable throughout the script, I must admit I cannot figure out why you turned it into a set. I assume it is a mistake?
Or do you have any specific reason for it?
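For reference, stop words in this kind of summarizer are typically only used for membership tests while filtering tokens, so list vs. set affects lookup speed rather than the filtered result. A minimal sketch (hypothetical code, not the script from the article):

import nltk
from nltk.corpus import stopwords

# nltk.download("stopwords")  # uncomment on the first run

# set() gives O(1) average membership tests; a list would give O(n).
# Either way, the same words end up filtered out.
stop_words = set(stopwords.words("english"))

sentence = "this is a small example sentence for the summarizer"
content_words = [w for w in sentence.split() if w not in stop_words]
print(content_words)  # ['small', 'example', 'sentence', 'summarizer']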

  • By the way, thanks for the TF-IDF suggestion; I am currently working on improving the algorithm by implementing the TF-IDF concept. A rough sketch of the idea is below.
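A minimal illustration of what TF-IDF-based sentence scoring could look like, in plain Python; the sample corpus, tokenization, and averaging choices are purely illustrative and not taken from the article:

import math
from collections import Counter

# Treat each sentence as its own "document"; a real version would reuse
# NLTK's tokenizers and the stop-word filtering discussed above.
sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

tokenized = [s.split() for s in sentences]
n_docs = len(tokenized)

# Document frequency: in how many sentences does each word appear?
df = Counter()
for words in tokenized:
    df.update(set(words))

def tfidf_score(words):
    # Average TF-IDF over the unique words of one sentence.
    tf = Counter(words)
    total = len(words)
    score = 0.0
    for word, count in tf.items():
        idf = math.log(n_docs / df[word])
        score += (count / total) * idf
    return score / len(tf)

# Rank sentences by score; the top-ranked ones would form the summary.
ranked = sorted(zip(sentences, map(tfidf_score, tokenized)),
                key=lambda pair: pair[1], reverse=True)
for sent, score in ranked:
    print(f"{score:.3f}  {sent}")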
David Israwi

Hmm, I believe the first time I used the list of stop words from NLTK there were some duplicates; if not, I am curious too, lol. It may be time to change it to a list.

Thanks for the note!

If you ever try your implementation using TF-IDF, let me know how it goes.