DEV Community

Shun Yamada

How to extract high-frequency words in NLTK

While reading the official documentation for NLTK (Natural Language Toolkit), I tried extracting the words that appear most frequently in a sample text. This time, I display the three most frequent words.

Development

  • Python
  • NLTK

Install NLTK

$ pip install nltk
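To confirm the install worked, you can check the package version from Python (assuming a standard pip environment):

```python
import nltk

# If the import succeeds, NLTK is installed; __version__ shows which release.
print(nltk.__version__)
```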

Extract high-frequency words

Let the coding begin. First, download punkt and averaged_perceptron_tagger, which are needed for word tokenization and part-of-speech tagging. Next, read a sample text and tokenize it into words. Then remove everything that is not a noun from the result. Finally, get the most frequent words.

Download

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Import nltk, then download punkt and averaged_perceptron_tagger. Once they are downloaded to your environment, you don't have to do it again.
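If you want the script to skip the download when the data is already present, one sketch is to probe for each resource first (assumption: the standard nltk_data resource paths shown below; `nltk.data.find` raises `LookupError` when a resource is missing):

```python
import nltk

# Download each resource only if it is not already installed.
for pkg, path in [('punkt', 'tokenizers/punkt'),
                  ('averaged_perceptron_tagger', 'taggers/averaged_perceptron_tagger')]:
    try:
        nltk.data.find(path)   # raises LookupError if the resource is absent
    except LookupError:
        nltk.download(pkg)
```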

Tokenize the text into words

raw = open('sample.txt').read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

tokens_l = [w.lower() for w in tokens]

Prepare an essay or some other long text. After reading it in, tokenize it into words. Then convert everything to lowercase so that differently-cased forms of a word are recognized as the same.
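To see why the lowercasing step matters, compare raw counts with case-folded counts. This is a plain-Python illustration with a made-up token list, not from the original post:

```python
from collections import Counter

tokens = ["Apple", "apple", "APPLE", "Banana"]

# Without case folding, each spelling is counted as a separate word.
print(Counter(tokens).most_common())
# [('Apple', 1), ('apple', 1), ('APPLE', 1), ('Banana', 1)]

# After lowering, the three spellings of "apple" collapse into one count.
print(Counter(w.lower() for w in tokens).most_common())
# [('apple', 3), ('banana', 1)]
```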

Extract only nouns

pos = nltk.pos_tag(tokens_l)
only_nn = [x for (x, y) in pos if y == 'NN']

freq = nltk.FreqDist(only_nn)

Tag every token with its part of speech, keep only the nouns, and then count how often each one occurs.
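As a self-contained sketch, here is the same filter applied to a hand-tagged sample, so no tagger download is needed. The sentence and tags are made up for illustration; 'NN' is the Penn Treebank tag for a singular common noun:

```python
import nltk

# (word, tag) pairs in the same shape nltk.pos_tag returns.
pos = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
       ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

# Keep only tokens tagged exactly 'NN', then count them.
only_nn = [x for (x, y) in pos if y == 'NN']
freq = nltk.FreqDist(only_nn)
print(freq.most_common(2))  # [('cat', 1), ('mat', 1)]
```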

Get the three most frequent words

print(freq.most_common(3))

After counting the words, most_common() returns the top three as (word, count) pairs.
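For example, on a small made-up token list, most_common(3) returns the pairs sorted by count, highest first:

```python
import nltk

# FreqDist is a Counter subclass, so most_common works the same way.
words = ['data', 'model', 'data', 'text', 'data', 'model']
freq = nltk.FreqDist(words)
print(freq.most_common(3))  # [('data', 3), ('model', 2), ('text', 1)]
```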
