
Text and token classification in NLP

Following our introduction to this Hugging Face series, we'll focus on Natural Language Processing (NLP) tasks in this blog post.

Before going further, let's install the required libraries. We'll need transformers for every task, and sentencepiece is a library we need specifically for NLP tasks.

!pip install transformers
!pip install sentencepiece

Text classification

A text can be classified according to a number of criteria. For example, a reader may evaluate whether a passage sounds positive or negative, or whether it's grammatically correct.

In order to use text classifiers, we need to import the respective pipeline (text-classification).

from transformers import pipeline

textClassifier = pipeline("text-classification")

By default, it uses a DistilBERT model. It's a general-purpose checkpoint fine-tuned for sentiment classification, i.e. deciding whether the text input is positive or negative.
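If you'd rather not depend on the pipeline default, you can pin the checkpoint explicitly. A minimal sketch, assuming distilbert-base-uncased-finetuned-sst-2-english is still the default checkpoint for this pipeline (verify against your transformers version):

from transformers import pipeline

# Pin the sentiment model explicitly instead of relying on the pipeline default.
textClassifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)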

textClassifier("He was over the moon to hear the good news.")

Output:

[{'label': 'POSITIVE', 'score': 0.9995835423469543}]

It's pretty straightforward, so no prizes for guessing it's a positive text. But the confidence score (>99%) is quite impressive here. If you're curious about the score of every class rather than just the top one, see the quick aside below; after that, let's try some other examples.
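By default, the pipeline returns only the best class. A minimal sketch for getting all class scores, assuming a recent transformers version where the top_k argument is supported (top_k=None returns every class):

# top_k=None returns the score for every class, not just the best one.
textClassifier("He was over the moon to hear the good news.", top_k=None)
# Expect two entries (POSITIVE and NEGATIVE) whose scores sum to 1.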

textClassifier("As you are aware 2023 has been a challenging year. The continuous stream of unprecedented global challenges, surging energy prices and volatile market conditions have had a significant impact on the logistics industry. The situation is particularly more concerning for Pakistan whereby economic uncertainty combined with rising inflation, elevated operational cost, and fluctuations in currency exchange rates have affected businesses across various sectors, including DHL Express.")

Output:

[{'label': 'POSITIVE', 'score': 0.9881656169891357}]

Wow! That's pretty optimistic of the model to declare it positive, and with such high confidence.

textClassifier("Dont you think you would attract attention? said the Medical Man. Our ancestors had no great tolerance for anachronisms.")

[{'label': 'NEGATIVE', 'score': 0.9981589913368225}]

Now, I'll try a fairly neutral sentence and see how well this model does.

textClassifier("There were others coming, and presently a little group of perhaps eight or ten of these exquisite creatures were about me. One of them addressed me")

Output:

[{'label': 'POSITIVE', 'score': 0.9976465106010437}]

It sounds like a rather black-and-white model. That's expected, though: the default checkpoint was fine-tuned on a binary sentiment dataset (SST-2), so it has no neutral class and must choose either positive or negative. Now, we'll try another model to check whether a sentence or text is grammatically correct.

By the way, the sentiment-analysis pipeline is just an alias here: importing it yields the same default model as above.
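If you actually need a neutral class, you can swap in a three-class sentiment model. A sketch, assuming the cardiffnlp/twitter-roberta-base-sentiment-latest checkpoint (which predicts negative, neutral, and positive) suits your text:

# A three-class sentiment model: negative / neutral / positive.
neutralAwareClassifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)
neutralAwareClassifier("There were others coming, and presently a little group of perhaps eight or ten of these exquisite creatures were about me.")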

Grammatical correctness

For grammatical correctness, we have a model trained on CoLA (I'll talk about it in a while). Let's test this.

It returns LABEL_0 for unacceptable and LABEL_1 for acceptable.

grammaticalClassifier = pipeline("text-classification", model="textattack/distilbert-base-uncased-CoLA")

Let's try it out a bit.

Test 1:

grammaticalClassifier("It surprises me a lot when I sees the images of airlines flying the empty flights.")# Output: [{'label': 'LABEL_1', 'score': 0.9570088982582092}]

Test 2:

grammaticalClassifier("you doesn't deserve this after what have you went through")# Output: [{'label': 'LABEL_0', 'score': 0.5380246639251709}]

Test 3:

grammaticalClassifier("I doesn't understand why schools is need to be closed in the summers.")# Output: [{'label': 'LABEL_0', 'score': 0.8870238065719604}]
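Since LABEL_0 and LABEL_1 aren't very readable, you can map them to friendlier names. A minimal sketch, based on the label convention stated above:

# Map the generic CoLA labels to readable names.
labelNames = {"LABEL_0": "unacceptable", "LABEL_1": "acceptable"}
result = grammaticalClassifier("I doesn't understand why schools is need to be closed in the summers.")[0]
print(labelNames[result["label"]], round(result["score"], 3))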

Coming back to the Corpus of Linguistic Acceptability (CoLA): it's a standard dataset used to train models for grammatical acceptability.

While I'm not completely sure, my intuition says that this corpus would have been used by grammar checkers (like Grammarly), though fine-tuned further on the heaps of data they acquire from users.

👨‍💻 Apify makes it easy to get data from the web for your LLMs and generative AI models.

Natural Language Inference (NLI)

In NLI, we check whether one statement (the hypothesis) follows from, contradicts, or is neutral with respect to another statement (the premise).

By the way, before loading a new model, it's a good idea to delete the unused ones to free up memory. For NLI, the de facto model is RoBERTa, fine-tuned on the MultiNLI dataset.

del textClassifier
del grammaticalClassifier

nliClassifier = pipeline("text-classification", model="roberta-large-mnli")

Test 1:

nliClassifier("Yemen has five sites on the list of World Heritage Sites. The first site from Yemen on the list, the Old Walled City of Shibam was designated in 1982.")# Output: [{'label': 'NEUTRAL', 'score': 0.9883605241775513}]

Test 2:

nliClassifier("South Africa is one of the best cricket teams in the world. South Africa hasn't won any world cup yet.")# Output: [{'label': 'CONTRADICTION', 'score': 0.912165105342865}]
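Note that both sentences went in as a single string, so the model sees one concatenated sequence. The text-classification pipeline can also take an explicit premise/hypothesis pair; a sketch, assuming a transformers version that accepts dict inputs with text and text_pair keys:

# Pass premise and hypothesis explicitly as a sentence pair.
nliClassifier({
    "text": "South Africa is one of the best cricket teams in the world.",  # premise
    "text_pair": "South Africa hasn't won any world cup yet.",  # hypothesis
})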

Token classification

There are situations where we need to extract information about specific words in a text. Determining the parts of speech within a sentence, for example, requires fine-grained classification of individual words rather than the sentence as a whole. This is where token classification comes in handy.

Hugging Face provides a number of pre-trained token classification models to choose from. We can use any of them for the task.

PoS tagging

Token classification can be pretty useful for part-of-speech tagging.

tokenClassifier = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

tokenClassifier("A cat is sitting on the table.")

Output:

[{'entity': 'DET', 'score': 0.9995196, 'index': 1, 'word': 'a', 'start': 0, 'end': 1},
 {'entity': 'NOUN', 'score': 0.99896586, 'index': 2, 'word': 'cat', 'start': 2, 'end': 5},
 {'entity': 'AUX', 'score': 0.9972844, 'index': 3, 'word': 'is', 'start': 6, 'end': 8},
 {'entity': 'VERB', 'score': 0.99938405, 'index': 4, 'word': 'sitting', 'start': 9, 'end': 16},
 {'entity': 'ADP', 'score': 0.99917114, 'index': 5, 'word': 'on', 'start': 17, 'end': 19},
 {'entity': 'DET', 'score': 0.9995147, 'index': 6, 'word': 'the', 'start': 20, 'end': 23},
 {'entity': 'NOUN', 'score': 0.9988354, 'index': 7, 'word': 'table', 'start': 24, 'end': 29},
 {'entity': 'PUNCT', 'score': 0.9996613, 'index': 8, 'word': '.', 'start': 29, 'end': 30}]
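The raw output is quite verbose. A minimal sketch for reducing it to (word, tag) pairs:

# Keep only each word and its part-of-speech tag.
tags = [(t["word"], t["entity"]) for t in tokenClassifier("A cat is sitting on the table.")]
print(tags)
# Expect pairs like ('a', 'DET'), ('cat', 'NOUN'), ('is', 'AUX'), ...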

Named Entity Recognition (NER)

We can also use token classification for named entity recognition (NER), where we identify proper nouns (names of persons, cities, countries, etc.). It can be used either with the token-classification pipeline or with its own dedicated pipeline (ner).

namedEntityRecognizer = pipeline("ner")

namedEntityRecognizer("Mount Everest lies on the border of Nepal and China. It perplexes me to know that it is closer to equator than Lahore.")

Output:

[{'entity': 'I-LOC', 'score': 0.6424599, 'index': 1, 'word': 'Mount', 'start': 0, 'end': 5},
 {'entity': 'I-LOC', 'score': 0.8563523, 'index': 2, 'word': 'Everest', 'start': 6, 'end': 13},
 {'entity': 'I-LOC', 'score': 0.9997261, 'index': 8, 'word': 'Nepal', 'start': 36, 'end': 41},
 {'entity': 'I-LOC', 'score': 0.9998282, 'index': 10, 'word': 'China', 'start': 46, 'end': 51},
 {'entity': 'I-LOC', 'score': 0.99936837, 'index': 28, 'word': 'Lahore', 'start': 111, 'end': 117}]
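Notice that Mount and Everest come back as two separate I-LOC entries. The token-classification/ner pipeline has an aggregation_strategy parameter that merges sub-tokens into whole entities; a sketch using the "simple" strategy:

# Merge adjacent tokens of the same entity into one span.
groupedRecognizer = pipeline("ner", aggregation_strategy="simple")
groupedRecognizer("Mount Everest lies on the border of Nepal and China.")
# Expect a single entry for 'Mount Everest' instead of two.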

As you can see, there are a number of variants in both text and token classification. Each of them is useful in its own way. That's it from me for now. See you in the next installment of this series: machine translation with Hugging Face.
