Amit Kayal for AWS Community Builders

Posted on Oct 16, 2022 • Edited on Jan 17, 2023

My journey into Sentence Transformer

#aws #nlp #datascience #machinelearning

Over the last few days I have been exploring sentence transformers, and this page documents my notes and understanding. It explains the basics of sentence transformers and how to deploy one through a SageMaker endpoint. The endpoint is then accessed from a Lambda function, and Terraform is used for the deployment.

What is Sentence Embedding?

I came across a nice example posted by Mathias about sentence comparison. Consider the following statements: “Nuclear power is dangerous!” and “Nuclear power is the future of energy!” Are these two statements similar?

If we are talking about the topic, then definitely yes: both statements are opinions on nuclear power. So in that sense, they are very similar.
However, if we are talking about sentiment, then the answer is a resounding no. They are about as dissimilar in terms of sentiment as we can get.

What is Transformer and BERT?

BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based seq2seq/NMT architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT). This means the encoder's attention mask is fully visible, like BERT, and the decoder's attention mask is causal, like GPT-2.

While BERT was trained by using a simple token masking technique, BART empowers the BERT encoder by using more challenging kinds of masking mechanisms in its pre-training. Once we get the token and sentence-level representation of an input text sequence, a decoder needs to interpret these to map with the output target.
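
To make the difference between the two kinds of corruption concrete, here is a toy sketch. It assumes whitespace tokenization and hand-picked positions instead of random sampling, and the helper names `mask_tokens` and `infill_spans` are illustrative, not taken from any library. BERT-style masking replaces each chosen token with its own mask, while BART-style text infilling collapses a whole span into a single mask, so the model also has to work out how many tokens are missing.

```python
MASK = "[MASK]"


def mask_tokens(tokens, positions):
    """BERT-style token masking: each chosen token becomes one [MASK]."""
    chosen = set(positions)
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


def infill_spans(tokens, spans):
    """BART-style text infilling: each (start, end) span becomes ONE [MASK].

    Spans are half-open and assumed not to overlap.
    """
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.extend(tokens[cursor:start])  # keep the text before the span
        out.append(MASK)                  # the whole span collapses to one mask
        cursor = end
    out.extend(tokens[cursor:])           # keep the text after the last span
    return out


tokens = "nuclear power is the future of energy".split()
masked = mask_tokens(tokens, [1, 4])
infilled = infill_spans(tokens, [(3, 6)])
print(" ".join(masked))
print(" ".join(infilled))
```

Note that the masked sequence keeps its original length, while the infilled one gets shorter.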

Building task specific Transformer based solution

The green portion is the pretrained model.
The other (purple) portion is a custom head, which we train further.
The QA head predicts the span of text in the context that contains the answer.
The classification head predicts a binary value.

BERT for Sentence Similarity

Transformers work using word or token-level embeddings, not sentence-level embeddings.

Regular transformers produce sentence embeddings by performing some pooling operation, such as the element-wise arithmetic mean, on their token-level embeddings. A good pooling choice for BERT is CLS pooling. BERT has a special [CLS] token that is supposed to capture all the sequence information. It gets tuned on next-sentence prediction (NSP) during pre-training.
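
The two pooling strategies mentioned above can be sketched in a few lines of plain Python. This is a toy illustration on hand-written 2-dimensional vectors, not real BERT output. The function names and the attention-mask convention (1 for real tokens, 0 for padding) are assumptions made for the example.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    return [value / max(count, 1) for value in total]


def cls_pool(token_embeddings):
    """Use the vector of the first token, which is [CLS] for BERT inputs."""
    return list(token_embeddings[0])


# Toy "token embeddings" for the sequence: [CLS] hello world [PAD]
token_embeddings = [[1.0, 0.0], [2.0, 4.0], [3.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

mean_vec = mean_pool(token_embeddings, attention_mask)
cls_vec = cls_pool(token_embeddings)
print(mean_vec, cls_vec)
```

The padding vector is ignored by the mean, which is why the attention mask matters when sentences in a batch have different lengths.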

Before sentence transformers, the approach to calculating accurate sentence similarity with BERT was to use a cross-encoder structure. This meant that we would pass two sentences to BERT, add a classification head on top of BERT, and use this to output a similarity score.

The BERT cross-encoder architecture consists of a BERT model which consumes sentences A and B. Both are processed in the same sequence, separated by a [SEP] token. All of this is followed by a feedforward NN classifier that outputs a similarity score.
The cross-encoder network does produce very accurate similarity scores (better than SBERT), but it’s not scalable. If we wanted to perform a similarity search through a small 100K sentence dataset, we would need to complete the cross-encoder inference computation 100K times.
To cluster sentences, we would need to compare every pair of sentences in our 100K dataset, resulting in just under 5 billion comparisons (n(n-1)/2 with n = 100,000), which is simply not realistic.
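
The number of pairwise comparisons grows quadratically with the dataset size. As a quick check of the arithmetic, the count of unordered pairs among n sentences is n(n-1)/2. The function name below is illustrative.

```python
from math import comb


def pairwise_comparisons(n_sentences):
    """Number of unordered sentence pairs: n * (n - 1) / 2."""
    return comb(n_sentences, 2)


pairs_10k = pairwise_comparisons(10_000)
pairs_100k = pairwise_comparisons(100_000)
print(f"{pairs_10k:,}")   # 49,995,000
print(f"{pairs_100k:,}")  # 4,999,950,000
```

So 10,000 sentences already need just under 50 million cross-encoder passes, and 100,000 sentences need just under 5 billion.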
Ideally, we need to pre-compute sentence vectors that can be stored and then used whenever required. If these vector representations are good, all we need to do is calculate the cosine similarity between each.
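
A minimal sketch of that comparison step, in plain Python. The stored vectors and their names are made up for illustration; in practice they would be the embeddings produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical pre-computed sentence vectors, stored once and reused
stored = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 2.0], "doc_c": [3.0, 3.0]}
query = [2.0, 0.0]

scores = {name: cosine_similarity(query, vec) for name, vec in stored.items()}
print(scores)
```

Because cosine similarity only looks at direction, `doc_a` scores 1.0 against the query even though the two vectors have different lengths.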
With the original BERT (and other transformers), we can build a sentence embedding by averaging the values across all token embeddings output by BERT (if we input 512 tokens, we output 512 embeddings). Alternatively, we can use the output of the first [CLS] token (a BERT-specific token whose output embedding is used in classification tasks).
Either of these two approaches gives us sentence embeddings that can be stored and compared much faster, shifting search times from about 65 hours to around 5 seconds (the figures reported in the SBERT paper for a collection of 10,000 sentences). However, the accuracy is not good, and is worse than using averaged GloVe embeddings (which were developed in 2014).

Sentence Transformer?

How Does it Work?

SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. Embeddings can be computed for 100+ languages and they can be easily used for common tasks like semantic text similarity, semantic search, and paraphrase mining.

This lack of an accurate model with reasonable latency was addressed by Nils Reimers and Iryna Gurevych in 2019 with the introduction of Sentence-BERT (SBERT) and the sentence-transformers library.

SBERT produces sentence embeddings — so we do not need to perform a whole inference computation for every sentence-pair comparison.

SBERT is fine-tuned on sentence pairs using a siamese architecture. We can think of this as having two identical BERTs in parallel that share the exact same network weights.

In reality, we are using a single BERT model. However, because we process sentence A followed by sentence B as pairs during training, it is easier to think of this as two models with tied weights.

Use cases

Sentence Transformer Architecture Changes

SBERT uses a siamese architecture where it contains 2 BERT architectures that are essentially identical and share the same weights, and SBERT processes 2 sentences as pairs during training.

The training process of sentence transformers is especially designed with semantic similarity in mind.

Cross-encoders

A cross-encoder is thus trained on sentence pairs along with a ground-truth label of how semantically similar they are.

Although cross-encoders tend to perform very well on sentence-level tasks, they suffer from a major drawback: they do not produce sentence embeddings. In the context of information retrieval, this implies that we cannot pre-compute document embeddings and efficiently compare these to a query embedding. We are also not able to index document embeddings for efficient search.

Bi-encoders

Bi-encoders remove this limitation. Each sentence is encoded independently into its own embedding, so in the context of information retrieval we can pre-compute document embeddings, index them for efficient search, and compare them to a query embedding.
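
With a bi-encoder, document embeddings can be computed once and searched many times. The sketch below illustrates that idea with a brute-force index: vectors are normalized offline, so a search is just a dot product per document followed by a top-k selection. The embeddings, document ids, and function names are invented for the example; a real system would use model outputs and usually an approximate nearest-neighbour index.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(doc_embeddings):
    """Normalize once, offline, so that search only needs dot products."""
    return {doc_id: normalize(vec) for doc_id, vec in doc_embeddings.items()}


def search(index, query_embedding, top_k=2):
    """Return the top_k (doc_id, score) pairs, best match first."""
    q = normalize(query_embedding)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(top_k, scored)]


# Hypothetical pre-computed document embeddings
doc_embeddings = {
    "population": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.2, 0.9],
}
index = build_index(doc_embeddings)
results = search(index, [1.0, 0.2, 0.0], top_k=2)
print(results)
```

Only the query has to be embedded at request time, which is what makes this approach scale where a cross-encoder does not.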

In SBERT, we feed sentence A to BERT A and sentence B to BERT B. Each BERT outputs a pooled sentence embedding. The original research paper tried several pooling methods and found mean pooling to be the best approach. Pooling condenses the token-level outputs into a single fixed-size vector, and mean pooling does this by averaging the token embeddings produced by BERT.
After pooling, we have two embeddings: one for sentence A and one for sentence B. During training, SBERT concatenates the two embeddings (together with their element-wise absolute difference in the original paper), passes the result through a softmax classifier, and trains it with a softmax loss function.
At inference, when the model actually begins predicting, the two embeddings are compared using a cosine similarity function, which outputs a similarity score for the two sentences. Here is a diagram for SBERT when it is fine-tuned and at inference.
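
The training-time step described above can be sketched as follows. This is a simplified illustration, not the library's implementation: the feature vector follows the (u, v, |u - v|) concatenation from the SBERT paper's classification objective, and the weight matrix is an untrained placeholder with three rows, assuming three output classes as in NLI-style training data.

```python
import math


def classification_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(u, v, weights):
    """weights holds one row per class, each of length 3 * len(u)."""
    features = classification_features(u, v)
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)


u, v = [1.0, 2.0], [1.0, 0.0]               # pooled embeddings of sentences A and B
features = classification_features(u, v)
weights = [[0.0] * 6 for _ in range(3)]     # placeholder: untrained classifier
probs = classify(u, v, weights)
print(features, probs)
```

With all-zero weights every class gets the same probability. Training adjusts both the classifier weights and the shared BERT weights, and the classifier is dropped at inference in favour of cosine similarity.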

How can you use SBERT with a SageMaker endpoint?

SBERT has its own Python library (sentence-transformers). Using it is as simple as using a model from the Hugging Face transformers library. Here, we use the multi-qa-MiniLM-L6-cos-v1 model for sentence similarity.

Below, I show how we can deploy this model as a SageMaker endpoint.

# Choose transformer model for embeddings
from transformers import AutoTokenizer, AutoModel
import os
import sagemaker
import time
saved_model_dir = 'transformer'
os.makedirs(saved_model_dir, exist_ok=True)

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1") 

tokenizer.save_pretrained(saved_model_dir)
model.save_pretrained(saved_model_dir)

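# Create a SageMaker session (which provides the default bucket) and look up the execution role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Notebook shell command: pack the saved files at the root of model.tar.gz
!cd transformer && tar czvf ../model.tar.gz *

# Upload the archive to the session's default S3 bucket
model_data = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='autofaiss-demo/huggingface-models')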

from sagemaker.huggingface.model import HuggingFaceModel


# Create the Hugging Face model object
huggingface_model = HuggingFaceModel(
    model_data=model_data,        # S3 URI of the uploaded model.tar.gz
    entry_point='predict.py',     # custom inference script
    source_dir='source_dir',      # local directory that contains predict.py
    role=role,                    # IAM role with permissions to create an endpoint
    transformers_version="4.12",  # transformers version used
    pytorch_version="1.9",        # PyTorch version used
    py_version='py38',            # Python version used
)

# Deploy the model to a real-time endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge"
)
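
The `!cd ... && tar ...` line above only works inside a notebook. Outside a notebook, the same archive could be built with the standard library, as in the sketch below. The helper name `make_model_archive` and the placeholder files are assumptions for the demo; the important detail is that the files sit at the root of the archive rather than under a `transformer/` folder.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_dir, archive_path):
    """Pack every entry of model_dir at the ROOT of a .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname drops the directory prefix so files land at the archive root
            tar.add(os.path.join(model_dir, name), arcname=name)
    return archive_path


# Demo with placeholder files standing in for the saved model and tokenizer
workdir = tempfile.mkdtemp()
demo_dir = os.path.join(workdir, "transformer")
os.makedirs(demo_dir)
for filename in ("config.json", "tokenizer.json"):
    with open(os.path.join(demo_dir, filename), "w") as handle:
        handle.write("{}")

archive = make_model_archive(demo_dir, os.path.join(workdir, "model.tar.gz"))
with tarfile.open(archive) as tar:
    members = tar.getnames()
print(members)
```

In the real workflow, `model_dir` would be the `transformer` directory written by `save_pretrained`, and the resulting file would be passed to `upload_data` as before.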

Here, I have created a lambda_handler function where the above endpoint is being called for similarity prediction.

import json
import logging
import time

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)
ENDPOINT_NAME = "huggingface-pytorch-inference-2022-10-14-20-02-16-258"

# @tracer.capture_method
def invoke_endpoint(payload, endpoint_name):
    """Send a JSON payload to a SageMaker endpoint and return the parsed response.

    payload: dict with the query and the candidate sentences
    endpoint_name: name of the deployed SageMaker endpoint
    Returns: dict with the similarity scores
    """
    runtime = boto3.client('runtime.sagemaker')
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                      ContentType="application/json",
                                      Body=json.dumps(payload))
    scores = json.loads(response["Body"].read())
    return scores

# @tracer.capture_lambda_handler
def lambda_handler(event, context=None):
    # context is optional here so the handler can also be called directly for testing
    start = time.time()
    similarity_scores = invoke_endpoint(event, ENDPOINT_NAME)
    end = time.time()
    logger.info(f"Profiling: \n Getting similarity scores: {1000*(end-start)} milliseconds")
    return similarity_scores


json_event = {  
    "query_from_app" : "How many people live in London?",
    "actual_queries" : ["Around 9 Million people live in London", "London is known for its financial district"]
}

lambda_handler(json_event)

The sample output is as shown below...

{'Scores': [['Around 9 Million people live in London', 0.9156370759010315],
  ['London is known for its financial district', 0.49475768208503723]]}
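
The `predict.py` entry point passed to `HuggingFaceModel` is not shown in this post. The sketch below is one way such a script could produce a response of the shape above; it is an assumption, not the original script. It relies on the `model_fn` and `predict_fn` override hooks of the SageMaker Hugging Face inference toolkit, and on the mean pooling, normalization, and dot-product scoring described on the model card for multi-qa-MiniLM-L6-cos-v1. `format_scores` is a hypothetical helper.

```python
def format_scores(candidates, scores):
    """Pair each candidate with its score, best match first (hypothetical helper)."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return {"Scores": [[text, float(score)] for text, score in ranked]}


def model_fn(model_dir):
    """Load the tokenizer and model from the unpacked model.tar.gz."""
    # Imported lazily so the helper above can be tested without these packages
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    """Score every entry of data['actual_queries'] against data['query_from_app']."""
    import torch
    import torch.nn.functional as F

    model, tokenizer = model_and_tokenizer
    candidates = data["actual_queries"]
    encoded = tokenizer([data["query_from_app"]] + candidates,
                        padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**encoded).last_hidden_state

    # Mean pooling over real tokens only, then L2 normalization
    mask = encoded["attention_mask"].unsqueeze(-1).float()
    embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    embeddings = F.normalize(embeddings, p=2, dim=1)

    # Row 0 is the query; with unit vectors the dot product equals cosine similarity
    scores = (embeddings[1:] @ embeddings[0]).tolist()
    return format_scores(candidates, scores)


example = format_scores(["first candidate", "second candidate"], [0.2, 0.9])
print(example)
```

Keeping the ranking logic in a small pure function makes it easy to unit test without loading the model.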

Serverless deployment of Sentence Transformer

I have shared below our lambda code and terraform code for deployment.

Lambda Code

I have used Terraform to deploy the Lambda function, and the endpoint name is passed to it as an environment variable.

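import json
import logging
import os
import time

import boto3

from query_request import query  # local project module (not shown here)

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Endpoint name and region are injected by Terraform as environment variables
ENDPOINT_NAME = os.environ["ServiceConfiguration__ENDPOINT_NAME"]
REGION = os.environ["ServiceConfiguration__REGION"]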

def invoke_endpoint(payload, endpoint_name):
    runtime = boto3.client('runtime.sagemaker')
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                      ContentType="application/json",
                                      Body=json.dumps(payload))
    embeddings = json.loads((response["Body"].read()))
    return embeddings

def lambda_handler(event, context):
    start = time.time()
    similarity_scores = invoke_endpoint(event, ENDPOINT_NAME)
    end = time.time()
    logger.info(f"Similarity scores: {similarity_scores}")
    logger.info(f"Profiling: \n Getting Embeddings: {1000*(end-start)} milliseconds")   
    return similarity_scores
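
Since the handler forwards the event straight to the endpoint, it can help to reject malformed events before paying for an endpoint call. The sketch below is an optional addition, not part of the deployed code. `validate_event` is a hypothetical helper, and the expected keys are taken from the sample event used earlier in this post.

```python
def validate_event(event):
    """Return (query, candidates) or raise ValueError for a malformed event."""
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    query = event.get("query_from_app")
    candidates = event.get("actual_queries")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query_from_app' must be a non-empty string")
    if not isinstance(candidates, list) or not candidates:
        raise ValueError("'actual_queries' must be a non-empty list")
    if not all(isinstance(item, str) for item in candidates):
        raise ValueError("every entry in 'actual_queries' must be a string")
    return query, candidates


sample_event = {
    "query_from_app": "How many people live in London?",
    "actual_queries": ["Around 9 Million people live in London",
                       "London is known for its financial district"],
}
checked_query, checked_candidates = validate_event(sample_event)
print(checked_query, len(checked_candidates))
```

Calling `validate_event(event)` at the top of `lambda_handler` would turn a bad request into a clear error instead of an opaque endpoint failure.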

Terraform

module "questn_similarity_classification" {
  source  = "terraform-module/lambda/aws"
  version = "2.12.6"

  function_name = "questn_similarity_classification"
  filename      = data.archive_file.questn_similarity_classification-zip.output_path
  source_code_hash = data.archive_file.questn_similarity_classification-zip.output_base64sha256
  description      = "questn_similarity_classification"
  handler        = "questn_similarity_classification.lambda_handler"
  runtime        = "python3.7"
  memory_size    = "1280"
  concurrency    = "25"
  lambda_timeout = "120"
  log_retention  = "30"
  publish        = true
  role_arn       = aws_iam_role.questn_similarity_classification_role.arn
  tracing_config = { mode = "Active" }
 # layers = [aws_lambda_layer_version.numpy_layer_37.arn, data.aws_lambda_layer_version.ml_faiss_layer_version.arn]

  vpc_config = {
    subnet_ids         = tolist(data.aws_subnet.efs_subnet.*.id)
    security_group_ids = [data.aws_security_group.default_sec_grp.id]
  }
  environment = {
    ServiceConfiguration__ENDPOINT_NAME    = var.ServiceConfiguration__ENDPOINT_NAME
    ServiceConfiguration__REGION = var.ServiceConfiguration__REGION
  }
  file_system_config = {
    # file_system_arn              = data.aws_efs_access_point.knn_efs.arn
    efs_access_point_arn = data.aws_efs_access_point.knn_efs.arn
    local_mount_path     = var.file_system_local_mount_path # Local mount path inside the lambda function. Must start with '/mnt/'.
    # file_system_local_mount_path = var.file_system_local_mount_path # Local mount path inside the lambda function. Must start with '/mnt/'.

  }
}
