This blog will help you generate embeddings for your datasets so that semantically related sentences stay close to each other. In other words, it will show you how to fine-tune the commonly available SBERT (Sentence-BERT) models on Hugging Face using your own dataset.
LITTLE BACKGROUND ABOUT SBERT
Sentence BERT was first introduced in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In this paper, the authors have proposed a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
This blog is not about how SBERT works but rather how to fine-tune a pre-trained SBERT model, so let's go ahead.
WHY FINETUNE
Sometimes, when you try to retrieve information using a distance metric like cosine similarity, the retriever might fetch unintended information. The reason is that the unintended information sits closer to your query in vector space.
In the above image, your question vector and an irrelevant vector are close to each other. Why does this happen?
A few possible reasons:
Wrong choice of embedding model - the model might have been trained on a dataset from a different domain.
The terms or words that you use might have been unseen during model training.
SO WHAT'S THE SOLUTION
If you find that your use case involves unseen words, or you have a better dataset that you believe could make the model generate higher-quality embeddings, you can go for fine-tuning.
FINE-TUNING SENTENCE BERT FROM HUGGING FACE
We are going to use the all-MiniLM-L6-v2 model from Hugging Face (a quick sanity check of the base model is shown right after the install commands below).
Required Libraries
pip3 install torch
pip3 install pandas
pip3 install -U sentence-transformers
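Once these are installed, you can load the base model and check the embeddings it produces. A minimal sketch - the sentences here are just placeholders:

from sentence_transformers import SentenceTransformer

base_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Placeholder sentences - replace with text from your own domain
embeddings = base_model.encode(["How do I reset my password?",
                                "Steps to recover account access"])
print(embeddings.shape)  # (2, 384) - all-MiniLM-L6-v2 produces 384-dimensional vectors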
Little Bit Of Clarity
By fine-tuning we mean asking the model to treat the pairs of sentences that we send as training data points as close to each other. There are several ways to organize your training data, and a table explaining them is given below.
In this blog, each training data point is a pair of positive sentences without a label, where the pair denotes two closely related sentences. The corresponding loss function is MultipleNegativesRankingLoss.
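As a minimal sketch (the sentences are placeholders), a single training data point is just an InputExample holding two related texts and no label; MultipleNegativesRankingLoss then treats the other pairs in the same batch as negatives:

from sentence_transformers import InputExample

# One training data point: two semantically related sentences, no label
example = InputExample(texts=["How do I reset my password?",
                              "Steps to recover account access"])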
TRAINING
import pandas as pd
import os
from sentence_transformers import SentenceTransformer
from sentence_transformers import InputExample
from sentence_transformers import losses
from torch.utils.data import DataLoader


class trainSBERT:
    def prepare_training_data(self, source_sentence_list, target_sentence_list):
        """
        Each training data point must have two similar sentences inside a list
        Eg - [sentence 1, sentence 2]
        INPUT
        source_sentence_list - List : All source sentences
        target_sentence_list - List : All target sentences
        RETURNS
        train_dataloader - Pytorch dataloader object
        """
        train_data_list = []
        for source, target in zip(source_sentence_list, target_sentence_list):
            # Each pair of related sentences becomes one unlabeled InputExample
            train_data_list.append(InputExample(texts=[source, target]))
        train_dataloader = DataLoader(train_data_list, shuffle=True, batch_size=64)
        return train_dataloader

    def train_sbert(self, model_name_list, n_epochs, source_sentence_list, target_sentence_list, path_to_save_model):
        """
        Used to train various sentence bert models
        INPUT
        model_name_list - List : List of model names from hugging face to be trained
        n_epochs - Int : Epochs to be trained for
        source_sentence_list - List : All source sentences
        target_sentence_list - List : All target sentences
        path_to_save_model - String : Path to save trained model
        RETURNS
        None
        """
        train_dataloader = self.prepare_training_data(source_sentence_list, target_sentence_list)
        for model_name in model_name_list:
            sbert_model = SentenceTransformer(model_name)
            train_loss = losses.MultipleNegativesRankingLoss(model=sbert_model)
            warmup_steps = int(len(train_dataloader) * n_epochs * 0.1)  # 10% of total training steps
            sbert_model.fit(train_objectives=[(train_dataloader, train_loss)],
                            epochs=n_epochs,
                            warmup_steps=warmup_steps)
            os.makedirs(f"{path_to_save_model}/{model_name.replace('/', '_')}", exist_ok=True)
            sbert_model.save(f"{path_to_save_model}/{model_name.replace('/', '_')}")
We are creating a class with 2 functions:
prepare_training_data - Used to convert the training data into PyTorch dataloader format.
train_sbert - Used to train SBERT models and save them to your local directory.
This is what your training data CSV file should look like (an example layout is sketched below).
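The column names match what we pass to train_sbert below - source_sentence and target_sentence; the rows here are made-up placeholders:

source_sentence,target_sentence
"How do I reset my password?","Steps to recover account access"
"What is the refund policy?","How long does it take to get my money back?"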
df = pd.read_csv('training_data.csv')
obj = trainSBERT()
obj.train_sbert(['sentence-transformers/all-MiniLM-L6-v2'], 500, df['source_sentence'].tolist(), df['target_sentence'].tolist(), "/Users/praveen/Desktop/praveen/github/training/model/sbert")
After 500 epochs the trained model will be saved to /Users/praveen/Desktop/praveen/github/training/model/sbert/sentence-transformers_all-MiniLM-L6-v2
All of the files below will be saved to your local directory inside the sentence-transformers_all-MiniLM-L6-v2 folder.
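The exact contents depend on your sentence-transformers version, but the saved folder typically contains files such as:

modules.json
config.json
config_sentence_transformers.json
sentence_bert_config.json
pytorch_model.bin (or model.safetensors in newer versions)
tokenizer.json
tokenizer_config.json
vocab.txt
special_tokens_map.json
1_Pooling/config.json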
HOW TO USE THE TRAINED MODEL TO GENERATE EMBEDDINGS
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('/Users/praveen/Desktop/praveen/github/training/model/sbert/sentence-transformers_all-MiniLM-L6-v2')

question = "your question text here"  # replace with your own sentence
answer = "your answer text here"      # replace with your own sentence

question_embeddings = model.encode([question], convert_to_tensor=True)
answer_embeddings = model.encode([answer], convert_to_tensor=True)

print("Question Embeddings : ", question_embeddings)
print("Answer Embeddings : ", answer_embeddings)
Now you can compare these two embeddings using cosine similarity to measure how close they are.
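A minimal sketch using the cos_sim helper from sentence-transformers, which works directly on the tensors produced above:

from sentence_transformers import util

similarity = util.cos_sim(question_embeddings, answer_embeddings)
print("Cosine Similarity : ", similarity.item())  # ranges from -1 to 1, higher means semantically closer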
Hope this helps :))
LINKED IN : https://www.linkedin.com/in/praveenr2998/