Ecaterina Teodoroiu

Getting started with Hugging Face: A Machine Learning tutorial in Python

Open source AI communities

There have been many developments in the open source artificial intelligence (AI) community over the past few years.

Some of the most significant trends include:

  • Increasing adoption of AI and machine learning (ML) techniques in a variety of industries, including healthcare, finance, and retail. This has led to a growing demand for tools and frameworks that can help developers build and deploy AI and ML models.

  • The emergence of new open source libraries and frameworks for building and training AI models, such as TensorFlow, PyTorch, and scikit-learn. These libraries have become popular choices for developers due to their ease of use and strong support for a wide range of AI and ML applications.

  • The development of new tools and platforms for managing and deploying AI models in production, such as Kubernetes, Hugging Face, and Google's GPipe. These tools have made it easier for organizations to scale their AI deployments and ensure they can handle the demands of real-time production environments.

  • The growth of online communities and resources for AI developers, such as forums, blogs, and meetups. These communities provide a way for developers to learn from one another and stay up-to-date on the latest trends and best practices in the field.

Overall, it is an exciting time for the AI community, as advances in technology and open source tools continue to drive innovation and enable the development of new applications and solutions.

Hugging Face

Hugging Face is a prominent name in the artificial intelligence (AI) open source landscape because it is a leader in developing and promoting natural language processing (NLP) tools. NLP is a subfield of AI that deals with the interaction between computers and human languages, and it is an important area of research and development in the field.

Hugging Face is known for its popular open source library called "Transformers," which is a collection of pre-trained models and tools for NLP tasks such as language translation, text summarization, and question answering. These models are trained on large datasets and are able to perform many NLP tasks with high accuracy and efficiency.

In addition to its work on the Transformers library, Hugging Face is also known for its contributions to the broader AI community, including hosting workshops and conferences, and providing resources and support for developers. Its open source approach and commitment to advancing the field of NLP have helped make it a prominent name in the AI community.

How to use Hugging Face with Python?

To use Hugging Face's natural language processing (NLP) library with Python, you will need to install the library and its dependencies. Here are the steps you can follow to get started:

Install the Transformers library using pip. Open a terminal or command prompt, and enter the following command:

pip install transformers


This will install the Transformers library and its dependencies.

Import the library in your Python code. To use the Transformers library in your Python code, you will need to import it. You can do this by adding the following line at the top of your Python script:

import transformers


Choose a pre-trained model and task. The Transformers library includes a wide range of pre-trained models that can be used for a variety of NLP tasks, such as language translation, text classification, and question answering. You will need to choose a model that is appropriate for your task. You can find a list of available models and their corresponding tasks on the Hugging Face documentation page.

Load the model and use it to perform the task. Once you have chosen a model and task, you can use the Transformers library to load the model and use it to perform the task. Note that the model must match the task: an encoder-only checkpoint such as bert-base-cased cannot translate text. To translate English to French, for example, you can use the translation pipeline, which downloads a pre-trained sequence-to-sequence model (t5-base by default):

# Load an English-to-French translation pipeline
translator = transformers.pipeline("translation_en_to_fr")

# Define the input text
input_text = "Hello, how are you today?"

# Use the pipeline to translate the text
output = translator(input_text)

print(output[0]["translation_text"])

This code will download a pre-trained translation model and use it to translate the input text from English to French.
Let’s see how we can use Hugging Face with simple API calls.

Using Hugging Face, we will be performing Named Entity Recognition.

An example of a machine learning tutorial for Named Entity Recognition (NER) using Hugging Face

1. Introduction:

Named Entity Recognition (NER) is a subfield of Natural Language Processing (NLP) that focuses on identifying named entities (such as people, organizations, and locations) in text. NER is useful for a wide range of applications, including information extraction, information retrieval, and question answering. In this tutorial, we will use Hugging Face, a popular NLP library, to build and train a machine learning model for NER.
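
Before we train anything ourselves, it helps to see what NER output looks like. Here is a minimal sketch using the high-level pipeline API; the checkpoint name (dslim/bert-base-NER) is just one popular fine-tuned NER model on the Hub, and the pipeline's default checkpoint works too:

from transformers import pipeline

# Load a pipeline backed by a checkpoint fine-tuned for NER, grouping
# subword predictions into whole entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

print(ner("Hugging Face is a company based in New York City."))
# e.g. [{'entity_group': 'ORG', 'word': 'Hugging Face', ...},
#       {'entity_group': 'LOC', 'word': 'New York City', ...}]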

2. Data Preprocessing:

Before we can begin training a machine learning model for NER, we need to preprocess our data. This typically involves the following steps:

  • Tokenization:
    Split the text into individual words or subwords (called tokens).

  • Part-of-speech tagging: Identify the part of speech (e.g., noun, verb, adjective) for each token.

  • Chunking: Group tokens into larger chunks (called chunks or named entity chunks).

  • Labeling: Assign a label to each chunk, indicating the type of named entity it represents (e.g., person, organization, location).

In Hugging Face, we can use the Transformers library to perform these preprocessing steps. For example, to tokenize a piece of text, we can use the AutoTokenizer class:

from transformers import AutoTokenizer

# Load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Split the text into subword tokens
tokens = tokenizer.tokenize("This is a sample sentence.")


Part-of-speech tagging and chunking are both token classification tasks, so they are handled by the AutoModelForTokenClassification class (AutoModelForSequenceClassification, by contrast, classifies whole sequences and is not what we want here). The checkpoint must also be fine-tuned for the task; a bare bert-base-cased checkpoint has a randomly initialized classification head. The easiest way to run such a model is through the token-classification pipeline. The checkpoint below is one community-contributed POS model on the Hub and is an assumption; any POS-tuned checkpoint works the same way:

from transformers import pipeline

# Load a token-classification pipeline from a checkpoint fine-tuned
# for part-of-speech tagging
pos_tagger = pipeline("token-classification",
                      model="vblagoje/bert-english-uncased-finetuned-pos")

pos_tags = pos_tagger("This is a sample sentence.")

3. Feature Extraction:

Once our data has been preprocessed, we need to extract features that can be used to train a machine learning model. In NER, common features include:

  • Word embeddings: Represent each token as a dense vector, capturing its semantic meaning.
  • Part-of-speech tags: Use the part-of-speech tags assigned during preprocessing as features.
  • Chunk embeddings: Represent each chunk as a dense vector, capturing its meaning as a group of tokens.

In Hugging Face, we can use the AutoModel class to extract contextual word embeddings:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
embedding_model = AutoModel.from_pretrained("bert-base-cased")

# Encode the text and take the final hidden states: one vector per token
inputs = tokenizer("This is a sample sentence.", return_tensors="pt")
with torch.no_grad():
    word_embeddings = embedding_model(**inputs).last_hidden_state

A chunk embedding can then be obtained by pooling (for example, averaging) the vectors of the tokens that make up a chunk.

4. Model Training:

There are many different machine learning models that can be used for NER, including:

  • Conditional Random Fields (CRFs): CRFs are a type of probabilistic model that can be used to predict a label for each token in a sequence, given the sequence of tokens and the labels of surrounding tokens. CRFs are often used for NER because they can capture the dependencies between tokens and labels in the input sequence.

  • Recurrent Neural Networks (RNNs): RNNs are a type of neural network that are well-suited to processing sequential data, such as natural language. RNNs can be trained to predict a label for each token in a sequence by learning to process the input tokens one at a time, taking into account the context provided by previous tokens.

  • Transformer Models: Transformer models are a type of neural network that use self-attention mechanisms to process sequential data. They have been shown to be very effective for NER, particularly when trained on large amounts of data.

  • Support Vector Machines (SVMs): SVMs are a type of linear classifier that can be used to predict a label for each token in a sequence, based on the features extracted from the input tokens. SVMs can be effective for NER when the input features are carefully chosen to capture important information about the input tokens.
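
In the Transformers ecosystem, the usual choice today is the last neural option above: fine-tuning a pre-trained transformer for token classification. Here is a minimal sketch using the Trainer API, assuming the CoNLL-2003 dataset from the Hugging Face Hub; the output directory and training settings are illustrative:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list))

def tokenize_and_align(example):
    # Tokenize pre-split words and align word-level NER tags with the
    # resulting subwords; special tokens get -100 so the loss ignores them
    tokenized = tokenizer(example["tokens"], truncation=True,
                          is_split_into_words=True)
    tokenized["labels"] = [
        -100 if idx is None else example["ner_tags"][idx]
        for idx in tokenized.word_ids()
    ]
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=3),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()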

5. Model Evaluation:

Model evaluation is an important step in the machine learning process, as it helps you determine how well your model is performing on a given task. In the context of Named Entity Recognition (NER), there are a number of metrics that you can use to evaluate your model's performance:

  1. Precision: Precision is the number of correct named entity predictions made by the model, divided by the total number of named entity predictions made by the model. A high precision score indicates that the model's predictions are usually correct, but the model may still miss named entities that are present in the text.

  2. Recall: Recall is the number of correct named entity predictions made by the model, divided by the total number of named entities present in the text. A high recall score indicates that the model finds most of the named entities in the text, but it may also make more false positive predictions.

  3. F1 score: The F1 score is the harmonic mean of precision and recall. It is a balance between precision and recall, and is a good overall measure of a model's performance.

  4. Confusion matrix: A confusion matrix is a table that shows the number of true positive, true negative, false positive, and false negative predictions made by the model. It can be useful for identifying specific areas where the model is making errors, and for comparing the performance of different models.

  5. Classification report: A classification report is a summary of the performance of a model on a classification task. It includes precision, recall, and F1 score for each class, as well as a micro-averaged and macro-averaged F1 score.
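
For sequence labeling tasks like NER, these metrics are usually computed at the entity level rather than the token level. Here is a minimal sketch using the seqeval library (pip install seqeval); the gold and predicted label sequences below are made up for illustration:

from seqeval.metrics import classification_report, f1_score

# One list of IOB tags per sentence: gold labels and model predictions
y_true = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["O", "B-PER", "O", "O", "B-LOC"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))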

6. Model Deployment:

Model deployment refers to the process of making a trained machine learning model available for use in production. In the context of named entity recognition (NER), this may involve integrating the NER model into an application or system that can perform NER on new, unseen data.
There are various ways to deploy a machine learning model for NER, including:

  1. Training the model on a powerful server and deploying the model on the server as an API or web service. This allows the model to be accessed and used by other applications or systems over the internet.

  2. Embedding the model directly into an application, such as a mobile app or a web application. This allows the model to be used locally, without the need to communicate with a server.

  3. Using a cloud-based machine learning platform, such as Amazon SageMaker or Google Cloud ML Engine, which provides tools and infrastructure for training, deploying, and managing machine learning models.

Regardless of the approach taken, it is important to carefully evaluate the performance and reliability of the deployed model and to monitor it regularly to ensure it continues to perform well.
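
As a minimal sketch of the first option, an NER pipeline can be wrapped in a small web service. FastAPI is used here as an assumption (any web framework would do), and the checkpoint name is again just one fine-tuned NER model from the Hub:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

class NERRequest(BaseModel):
    text: str

@app.post("/ner")
def extract_entities(request: NERRequest):
    # Run NER on the submitted text and return JSON-serializable entities
    entities = ner(request.text)
    return {"entities": [
        {"text": e["word"], "label": e["entity_group"], "score": float(e["score"])}
        for e in entities
    ]}

Run it with, for example, uvicorn app:app, and POST JSON like {"text": "..."} to the /ner endpoint.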

Here's another helpful tutorial on Text Summarization
Another interesting article on this topic: 5 NLP tasks using Hugging Face pipeline
A very useful video tutorial: ML Hyperproductivity with Hugging Face

Conclusion:

Overall, Hugging Face is a powerful and convenient tool for working with NLP models, and is well worth considering for any NLP project.
