Hunter Johnson for Educative

Posted on • Originally published at educative.io

Data Science Simplified: Top 5 NLP tasks that use Hugging Face

Hugging Face is a company devoted to developing NLP technologies and democratizing artificial intelligence through natural language. Its teams have changed the way we approach NLP by providing easy-to-understand language model architectures.

The Hugging Face Transformers pipeline is an easy way to perform different NLP tasks. It can be used to tackle a variety of NLP projects with state-of-the-art models and techniques.

Today, I want to introduce you to the Hugging Face pipeline by showing you 5 tasks you can achieve with their tools.

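We will go over:

  • Sentiment Analysis
  • Question Answering
  • Text Generation
  • Summarization
  • Translation
  • What to learn next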

1. Sentiment Analysis

Sentiment analysis refers to classifying a given text as POSITIVE or NEGATIVE based on its sentiment, along with a probability score for that label.

Here, we pass in two sentences and print each one's label with its probability score rounded to 4 decimal places.

from transformers import pipeline

# Load the default sentiment-analysis pipeline
nlp = pipeline("sentiment-analysis")

# First sentence
result = nlp("I love trekking and yoga.")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

# Second sentence
result = nlp("Racial discrimination should be outright boycotted.")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

The output for the first sentence is:

label: POSITIVE, with score: 0.9992

The output for the second sentence is:

label: NEGATIVE, with score: 0.9991
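
To see where a score like this can come from, here is a minimal sketch in plain Python. It assumes the classifier produces one raw score (a logit) per label and that a softmax turns those into probabilities. The logits and the label order below are made-up values for illustration, not real model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most probable label and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits for one sentence (illustrative numbers only).
result = to_prediction([-3.2, 3.9])
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```

The wider the gap between the two logits, the closer the winning label's score gets to 1, which is why clear-cut sentences tend to receive very high scores.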

2. Question Answering

Question answering refers to extracting the answer to a question from information given to the model in the form of a paragraph. That information is known as the context. The answer is a small portion of that same context.

Below, a paragraph about prime numbers is given as the context, and 2 questions are asked about it. This context paragraph is taken from the SQuAD dataset.

from transformers import pipeline

# Load the default question-answering pipeline
nlp = pipeline("question-answering")

context = r"""
The property of being prime (or not) is called primality.
A simple but slow method of verifying the primality of a given number n is known as trial division.
It consists of testing whether n is a multiple of any integer between 2 and itself.
Algorithms much more efficient than trial division have been devised to test the primality of large numbers.
These include the Miller–Rabin primality test, which is fast but has a small probability of error, and the AKS primality test, which always produces the correct answer in polynomial time but is too slow to be practical.
Particularly fast methods are available for numbers of special forms, such as Mersenne numbers.
As of January 2016, the largest known prime number has 22,338,618 decimal digits.
"""

# Question 1
result = nlp(question="What is a simple method to verify primality?", context=context)
print(f"Answer: '{result['answer']}'")

# Question 2
result = nlp(question="As of January 2016 how many digits does the largest known prime consist of?", context=context)
print(f"Answer: '{result['answer']}'")

The answer to the first question is:

Answer: 'trial division'

The answer to the second question is:

Answer: '22,338,618'
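
Because the answer is always a portion of the context, one common way to picture the task is as choosing a start position and an end position in the text. The sketch below is a simplified illustration of that idea, not the pipeline's actual code: it assumes the model assigns each token a start score and an end score, and the scores shown are invented for the example.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, score) maximizing start + end score with start <= end."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider spans that begin at s and are at most max_answer_len tokens long.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

tokens = ["is", "known", "as", "trial", "division", "."]
# Hypothetical per-token scores (illustrative numbers only).
start_scores = [0.1, 0.2, 0.1, 4.0, 0.5, 0.0]
end_scores = [0.0, 0.1, 0.2, 0.6, 4.2, 0.3]

start, end, _ = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(f"Answer: '{answer}'")
```

Requiring the start to come before the end and capping the span length keeps the selected answer short and well-formed.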

3. Text Generation

Text generation is one of the most popular NLP tasks. GPT-3 is one example of a text generation model: it produces text that continues an input prompt.

Below, we will generate text based on the prompt "A person must always work hard and". The model will then produce a short paragraph in response. As you'll see, the output is not very coherent because the default model used here has far fewer parameters than a model like GPT-3.

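from transformers import pipeline

# Load the default text-generation pipeline
text_generator = pipeline("text-generation")

# do_sample=False disables random sampling, so the output is deterministic
text = text_generator("A person must always work hard and", max_length=50, do_sample=False)[0]

print(text['generated_text'])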

The output for the above code is:

A person must always work hard and be prepared to do so.

The following are some of the things that you should do to help yourself:

1. Be prepared to work hard.

2. Be prepared to work hard.
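
Notice the repeated line in the output. With do_sample=False, the generator picks the most likely continuation at every step, and that can lead it into a loop. The toy sketch below illustrates the effect with a hand-written table of next-word probabilities. The table, the function name, and the word-level max_length are all hypothetical simplifications (a real model works on tokens and learns its probabilities).

```python
# Hypothetical next-word probabilities for a toy model (made-up numbers).
NEXT_WORD = {
    "work": {"hard": 0.7, "smart": 0.3},
    "hard": {"and": 0.6, ".": 0.4},
    "and": {"work": 0.8, "rest": 0.2},
    "smart": {".": 1.0},
    "rest": {".": 1.0},
}

def greedy_generate(prompt, max_length):
    """Extend the prompt by always choosing the most probable next word."""
    words = prompt.split()
    while len(words) < max_length:
        options = NEXT_WORD.get(words[-1])
        if not options:  # no known continuation, so stop early
            break
        words.append(max(options, key=options.get))
    return " ".join(words)

text = greedy_generate("work hard and", max_length=9)
print(text)
```

Since the most probable word is chosen every time, the same prompt always yields the same text, and once the sequence revisits a word it follows the same path again. Sampling from the probabilities instead of always taking the maximum is one way to break such loops.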

4. Summarization

Text summarization is the process of comprehending a large chunk of textual data and then responding with a brief summary of that data. Below, we generate a summary of a passage about the Apollo program.

from transformers import pipeline

# Load the default summarization pipeline
summarizer = pipeline("summarization")

ARTICLE = """The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972.
First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space,
Apollo was later dedicated to President John F. Kennedy's national goal of "landing a man on the Moon and returning him safely to the Earth" by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. 
Project Mercury was followed by the two-man Project Gemini (1962–66). 
The first manned flight of Apollo was in 1968.
Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. 
Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions.
Apollo used Saturn family rockets as launch vehicles. 
Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973–74, and the Apollo–Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.
"""

summary = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)[0]

print(summary['summary_text'])

The summary generated for the above paragraph is:

The Apollo program, also known as Project Apollo, was the third U.S. human spaceflight program carried out by the National Aeronautics and Space Administration (NASA) The first manned flight of Apollo was in 1968. The program was dedicated to President Kennedy's national goal of "landing a man on the Moon and returning him safely to the Earth"
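
For comparison, here is a much simpler baseline you can run without any model. To be clear, this is not what the pipeline does: it is a rough extractive sketch that keeps the sentences containing the most frequent words. The function name, the short-word filter, and the sample text (three sentences borrowed from the article above) are choices made for this illustration.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Keep the sentences whose words are most frequent in the text overall."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

sample = (
    "Apollo used Saturn family rockets as launch vehicles. "
    "The first manned flight of Apollo was in 1968. "
    "Gemini missions developed some of the space travel techniques "
    "that were necessary for the success of the Apollo missions."
)
summary = extractive_summary(sample, num_sentences=1)
print(summary)
```

A baseline like this can only copy sentences verbatim, whereas the pipeline's output above rephrases and merges parts of the article.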

5. Translation

Translation is the process of converting text from one language to another. NLP is used to generate automatic translations between languages. Below, we will translate a proverbial sentence from English to German.

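from transformers import pipeline

# Load the default English-to-German translation pipeline
translator = pipeline("translation_en_to_de")

result = translator("A great obstacle to happiness is to expect too much happiness.", max_length=40)[0]
print(result['translation_text'])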

The translated sentence is:

Ein großes Hindernis für das Glück besteht darin, zu viel Glück zu erwarten. 

What to learn next

NLP is a powerful tool, and there is so much to learn. If you are interested in exploring NLP on your own or designing projects using Hugging Face, consider starting with the following concepts:

  • Embeddings
  • Language Models
  • Bidirectional LSTM
  • Seq2Seq Models

Check out Educative's course Natural Language Processing with Machine Learning to get started with these topics and beyond. You'll learn the techniques for processing text data, creating word embeddings, and using LSTM networks for NLP tasks. After completing this course, you will be able to solve important day-to-day NLP problems on your own.

Happy learning!

Start a discussion

What is your personal favorite tool (or least favorite) for NLP tasks? Was this article helpful? Let us know in the comments below!
