Basic NLP with Python

Davide Santangelo

NLP

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics that focuses on the interaction between computers and human (natural) languages. NLP techniques are used to process and analyze large amounts of text data, and to enable computers to understand, generate, and manipulate human language.

Some examples of NLP tasks include:

  1. Sentiment analysis: determining the sentiment (positive, negative, or neutral) of a piece of text
  2. Named entity recognition: identifying and labeling named entities (e.g. people, organizations, locations) in text
  3. Part-of-speech tagging: labeling words in text according to their part of speech (e.g. noun, verb, adjective)
  4. Machine translation: translating text from one language to another
  5. Text summarization: creating a condensed version of a text document that retains its key information and ideas

NLP has a wide range of applications in various fields, including social media, customer service, marketing, and healthcare. It is an active area of research and development in both academia and industry.

scikit-learn

scikit-learn is an open-source machine learning library for Python. It provides a range of tools and algorithms for supervised and unsupervised learning, as well as for preprocessing, model selection, and evaluation.

Some of the key features of scikit-learn include:

  1. Support for a wide range of machine learning models, including linear models, decision trees, and clustering algorithms
  2. A consistent and easy-to-use API that allows users to quickly build and experiment with different models
  3. Built-in functions for splitting datasets into training and test sets, evaluating model performance, and tuning hyperparameters
  4. Integration with other popular Python scientific computing libraries, such as NumPy, Pandas, and Matplotlib

scikit-learn is widely used by data scientists, researchers, and developers for a variety of machine learning tasks, and it is a popular choice for building and deploying machine learning models in production environments.
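
To illustrate this consistent API, here is a minimal sketch of the usual split/fit/evaluate workflow. It uses the built-in iris dataset purely as an example; this snippet is not part of the original article:

# Load a toy dataset, split it into train and test sets, fit a model, and evaluate it
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))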

sklearn.feature_extraction

For our example we use sklearn.feature_extraction.text, a module in the scikit-learn library for natural language processing (NLP) tasks in Python. It provides classes and functions for extracting features from text documents.

Some of the key classes and functions in the sklearn.feature_extraction.text module include:

  1. CountVectorizer: converts a collection of text documents to a matrix of word counts. It can be used to extract features from text documents for use in machine learning models.
  2. TfidfVectorizer: converts a collection of text documents to a matrix of TF-IDF (term frequency-inverse document frequency) values. It can be used to extract features from text documents and down-weight common words that appear in many documents.
  3. TfidfTransformer: transforms a count matrix into a normalized TF-IDF representation. It can be used to apply TF-IDF weighting to a set of features already extracted from text documents.

These classes and functions can be used in various NLP tasks, such as text classification, topic modeling, and keyword extraction.

Note that this is just a brief overview of the sklearn.feature_extraction.text module. There are many other features and capabilities that are not mentioned here. For more information, you can refer to the official documentation for the scikit-learn library.

# Import necessary modules
from sklearn.feature_extraction.text import CountVectorizer

# Define the documents to analyze
documents = [
    "This is a sentence about cats.",
    "This sentence is about dogs.",
    "This sentence is about birds and animals."
]

# Create the CountVectorizer object
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents
vectorizer.fit(documents)

# Print the list of keywords
print(list(vectorizer.get_feature_names_out()))

In this example, the CountVectorizer object is used to learn a vocabulary of word counts for the given documents. The get_feature_names_out method is then used to print the list of keywords, i.e. the unique lowercased words found in the documents. Note that the default tokenizer drops single-character tokens such as "a", but common words such as "is" and "this" are only removed if you pass stop_words='english' to the vectorizer.

The output of this code would be: ['about', 'and', 'animals', 'birds', 'cats', 'dogs', 'is', 'sentence', 'this'].
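
The underlying count matrix can be inspected as well. The following sketch (not in the original article) uses fit_transform, which learns the vocabulary and builds the document-term matrix in one step; each row corresponds to a document and each column to one of the words printed above:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "This is a sentence about cats.",
    "This sentence is about dogs.",
    "This sentence is about birds and animals."
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

# counts is a sparse matrix; toarray() converts it to a dense 3 x 9 array
# with one row per document and one column per vocabulary word
print(counts.toarray())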

Note that this is just one way to implement keyword extraction in Python. There may be other approaches and libraries that you can use, depending on your specific requirements and preferences.
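
For instance, TfidfVectorizer can be used instead of CountVectorizer to weight each word by TF-IDF rather than raw counts. Here is a minimal sketch, not part of the original example; note that stop_words='english' filters common English words, including "about":

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is a sentence about cats.",
    "This sentence is about dogs.",
    "This sentence is about birds and animals."
]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single step
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(documents)

# Print the TF-IDF weights for the first document; words that appear in fewer
# documents (e.g. "cats") receive higher weights than words shared by all
# documents (e.g. "sentence")
for word, score in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    print(word, round(score, 3))

The same result can be obtained by chaining CountVectorizer with TfidfTransformer, which applies TF-IDF weighting to an existing count matrix.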

Testing

Here is an example of how you could add some tests to the keyword extraction code from the previous example:

# Import necessary modules
import unittest
from sklearn.feature_extraction.text import CountVectorizer

# Define the documents to analyze
documents = [
    "This is a sentence about cats.",
    "This sentence is about dogs.",
    "This sentence is about birds and animals."
]

class KeywordExtractorTest(unittest.TestCase):

    def test_keywords(self):
        # Create the CountVectorizer object
        vectorizer = CountVectorizer()

        # Fit the vectorizer to the documents
        vectorizer.fit(documents)

        # Get the list of keywords
        keywords = list(vectorizer.get_feature_names_out())

        # Assert that the correct keywords are extracted
        self.assertIn('about', keywords)
        self.assertIn('animals', keywords)
        self.assertIn('birds', keywords)
        self.assertIn('cats', keywords)
        self.assertIn('dogs', keywords)
        self.assertIn('sentence', keywords)

if __name__ == '__main__':
    unittest.main()

In this example, the KeywordExtractorTest class inherits from unittest.TestCase and defines a single test method, test_keywords. This method creates a CountVectorizer object, fits it to the documents, and extracts the keywords using the get_feature_names_out method. The test then uses the assertIn method to check that the expected keywords are present.

To run the tests, you can call unittest.main(), which will automatically discover and run any test methods defined in the KeywordExtractorTest class.

Note that this is just one way to add tests to the keyword extraction code. There may be other ways to structure and organize the tests, depending on the specific requirements and constraints of your project.
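
For example, the same check could be written as a plain pytest-style test function. This is a sketch assuming pytest is installed (pytest is not used in the original article, and the file name below is only illustrative):

# Illustrative file name: test_keywords.py; run with `pytest test_keywords.py`
from sklearn.feature_extraction.text import CountVectorizer

def test_keywords():
    documents = [
        "This is a sentence about cats.",
        "This sentence is about dogs.",
        "This sentence is about birds and animals."
    ]
    vectorizer = CountVectorizer()
    vectorizer.fit(documents)
    keywords = set(vectorizer.get_feature_names_out())

    # All of the expected content words should appear in the extracted vocabulary
    assert {"about", "animals", "birds", "cats", "dogs", "sentence"} <= keywords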

Top comments (3)

Leonard Püttmann

Great article Davide, I think it's a great start into NLP! TF-IDF is still useful today for a lot of tasks I think, even with all these powerful transformer models we have these days.

Davide Santangelo

thanks Leonard. Yes TF-IDF is still very useful!

Sukhman

Thank u for sharing