

SageMaker Model Deployment and Integration



SageMaker Feature Store

SageMaker Feature Store is a purpose-built solution for ML feature management. It helps data science teams reuse ML features across teams and models, serve features for model predictions at scale with low latency, and train and deploy new models more quickly and effectively.

Refer to the notebook for more details.
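As a small sketch of how a feature row reaches the online store, the record shape expected by the `PutRecord` API of the `sagemaker-featurestore-runtime` service can be built from a plain dict (the feature group name, feature names, and values here are hypothetical):

```python
import datetime

def to_feature_record(row):
    """Convert a plain dict into the FeatureName/ValueAsString record
    shape expected by the sagemaker-featurestore-runtime PutRecord API."""
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in row.items()]

# hypothetical customer feature row
row = {
    "customer_id": "C-1001",   # record identifier feature
    "avg_order_value": 42.5,
    "event_time": datetime.datetime(2024, 1, 1).isoformat(),
}
record = to_feature_record(row)

# With boto3, the record would be written to the online store with:
#   boto3.client("sagemaker-featurestore-runtime").put_record(
#       FeatureGroupName="customers", Record=record)
print(record[0])
```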


Why is feature lineage important?

Imagine trying to manually track which raw data, transformations, and code produced every feature for a large team, multiple teams, or even multiple business units. Lineage tracking and querying make this manageable and help organizations move to ML at scale. The following are four examples of how feature lineage helps scale the ML process:

  • Build confidence for reuse of existing features
  • Avoid reinventing features that are based on the same raw data as existing features
  • Troubleshoot and audit models and model predictions
  • Manage features proactively

AWS ML Lens and built-in models



Deployment Options

ML inference can be done in real time on individual records, such as with a REST API endpoint. Inference can also be done in batch mode as a processing job on a large dataset. While both approaches push data through a model, each has its own target goal when running inference at scale.

|                     | Real Time                   | Micro Batch              | Batch                   |
|---------------------|-----------------------------|--------------------------|-------------------------|
| Execution Mode      | Synchronous                 | Synchronous/Asynchronous | Asynchronous            |
| Prediction Latency  | Subsecond                   | Seconds to minutes       | Indefinite              |
| Data Bounds         | Unbounded/stream            | Bounded                  | Bounded                 |
| Execution Frequency | Variable                    | Variable                 | Variable/fixed          |
| Invocation Mode     | Continuous stream/API calls | Event-based              | Event-based/scheduled   |
| Examples            | Real-time REST API endpoint | Data analyst running a SQL UDF | Scheduled inference job |

Real-time deployment

SageMaker real-time deployment follows the approach below. The key point is that the inference pipeline can be coupled with autoscaling.


There are several ways to deploy a real-time endpoint with SageMaker, ranging from bringing your own model and container to using a prebuilt container.


The simplest option is to use a SageMaker prebuilt container together with its own built-in inference script.

Quite often, we add our own inference script to a prebuilt container; this is straightforward to do.

It is also common to bring your own container and your own trained model along with an inference script. The overall architecture does not change; we still follow the same pattern.
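A custom inference script for a SageMaker prebuilt framework container implements a small handler contract: `model_fn`, `input_fn`, `predict_fn`, and `output_fn`. A minimal local sketch of that contract, with a toy stand-in model (a real script would deserialize the trained artifact in `model_fn`):

```python
import json

# inference.py: handler contract used by SageMaker prebuilt framework containers

def model_fn(model_dir):
    """Load the model from model_dir; stubbed here as a toy doubling 'model'."""
    return lambda x: [2 * v for v in x]

def input_fn(request_body, content_type="application/json"):
    """Deserialize the request payload."""
    assert content_type == "application/json"
    return json.loads(request_body)["instances"]

def predict_fn(data, model):
    """Run the prediction."""
    return model(data)

def output_fn(prediction, accept="application/json"):
    """Serialize the response."""
    return json.dumps({"predictions": prediction})

# local smoke test of the full chain
model = model_fn("/opt/ml/model")
resp = output_fn(predict_fn(input_fn('{"instances": [1, 2]}'), model))
print(resp)  # {"predictions": [2, 4]}
```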


We can set an autoscaling policy for a SageMaker endpoint so it scales out and in automatically.

When setting up the autoscaling policy for the endpoint, note that ServiceNamespace is set to sagemaker and ResourceId references the endpoint name.
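As a sketch, the setup amounts to two Application Auto Scaling requests: registering the variant as a scalable target, then attaching a target-tracking policy (the endpoint and variant names below are hypothetical):

```python
# Request payloads for Application Auto Scaling. With boto3 these would be
# passed to client("application-autoscaling").register_scalable_target(...)
# and put_scaling_policy(...).
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # hypothetical names

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,
}

scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # scale so each instance serves ~100 invocations/minute on average
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
}
print(scalable_target["ServiceNamespace"])
```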


Multi-model endpoints

SageMaker multi-model endpoints work with several frameworks, such as TensorFlow, PyTorch, MXNet, and sklearn, and you can build your own container with a multi-model server. Multi-model endpoints are also supported natively in the following popular SageMaker built-in algorithms: XGBoost, Linear Learner, Random Cut Forest (RCF), and K-Nearest Neighbors (KNN).

Refer to the notebook to understand how to deploy this, and refer to the blog for more details.

  • All of the models hosted on a multi-model endpoint must share the same serving container image.

  • Multi-model endpoints can improve endpoint utilization when your models are of similar size, share the same container image, and have similar invocation latency requirements.

  • All of the models need to share the same S3 location (prefix) that hosts their weights.



Cost advantages

Running 10 models on a single multi-model endpoint instead of 10 separate endpoints can result in savings of around $3,000 per month. Multi-model endpoints can easily scale to hundreds or thousands of models.


How to use?

To create a multi-model endpoint in Amazon SageMaker, choose the multi-model option, provide the inference serving container image path, and provide the Amazon S3 prefix in which the trained model artifacts are stored. You can organize your models in S3 any way you wish, so long as they all use the same prefix.
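A minimal sketch of the container definition that marks a model as multi-model (the image URI and S3 prefix are hypothetical placeholders):

```python
# Container definition for a multi-model endpoint. With boto3 this would be
# passed as client("sagemaker").create_model(..., PrimaryContainer=container).
container = {
    "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
    "Mode": "MultiModel",  # the key difference vs. a SingleModel endpoint
    # S3 prefix under which all the model.tar.gz artifacts are stored
    "ModelDataUrl": "s3://my-bucket/model-artifacts/",
}
print(container["Mode"])
```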

When you invoke the multi-model endpoint, you provide the relative path of a specific model with the new TargetModel parameter of InvokeEndpoint. To add models to the multi-model endpoint, simply store a newly trained model artifact in S3 under the prefix associated with the endpoint. The model will then be immediately available for invocations.
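A sketch of the invocation parameters (the endpoint name and model path are hypothetical):

```python
import json

# InvokeEndpoint parameters for a multi-model endpoint. With boto3:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**params)
params = {
    "EndpointName": "my-multi-model-endpoint",
    # relative path of the artifact under the endpoint's S3 prefix
    "TargetModel": "house-price-v2.tar.gz",
    "ContentType": "application/json",
    "Body": json.dumps({"instances": [[1.5, 2.0]]}),
}
print(params["TargetModel"])
```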

To update a model already in use, add the model to S3 with a new name and begin invoking the endpoint with the new model name. To stop using a model deployed on a multi-model endpoint, stop invoking the model and delete it from S3.

Instead of downloading all the models into the container from S3 when the endpoint is created, Amazon SageMaker multi-model endpoints dynamically load models from S3 when invoked. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download step is skipped and the model returns the inferences with low latency.


Monitoring multi-model endpoints using Amazon CloudWatch metrics

To make price and performance tradeoffs, you will want to test multi-model endpoints with models and representative traffic from your own application. Amazon SageMaker provides additional metrics in CloudWatch for multi-model endpoints so you can determine the endpoint usage and the cache hit rate and optimize your endpoint. The metrics are as follows:

  • ModelLoadingWaitTime – The interval of time that an invocation request waits for the target model to be downloaded or loaded to perform the inference.
  • ModelUnloadingTime – The interval of time that it takes to unload the model through the container’s UnloadModel API call.
  • ModelDownloadingTime – The interval of time that it takes to download the model from S3.
  • ModelLoadingTime – The interval of time that it takes to load the model through the container’s LoadModel API call.
  • ModelCacheHit – The number of InvokeEndpoint requests sent to the endpoint where the model was already loaded. Taking the Average statistic shows the ratio of requests in which the model was already loaded.
  • LoadedModelCount – The number of models loaded in the containers in the endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance, and the Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because you can load a model in multiple containers in the endpoint.

You can use CloudWatch charts to help make ongoing decisions on the optimal choice of instance type, instance count, and number of models that a given endpoint should host.
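As a sketch, the cache hit ratio can be pulled with a GetMetricStatistics request shaped like this (the endpoint and variant names are hypothetical):

```python
import datetime

# CloudWatch request for the multi-model cache hit ratio. With boto3:
#   boto3.client("cloudwatch").get_metric_statistics(**request)
now = datetime.datetime.utcnow()
request = {
    "Namespace": "AWS/SageMaker",
    "MetricName": "ModelCacheHit",
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-multi-model-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "StartTime": now - datetime.timedelta(hours=1),
    "EndTime": now,
    "Period": 60,
    "Statistics": ["Average"],  # Average of ModelCacheHit = cache hit ratio
}
print(request["MetricName"])
```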

SageMaker Inference Pipelines

You can use trained models in an inference pipeline to make real-time predictions directly without performing external preprocessing. When you configure the pipeline, you can choose to use the built-in feature transformers already available in Amazon SageMaker. Or, you can implement your own transformation logic using just a few lines of scikit-learn or Spark code.

Refer to the documentation for more details.

  • An inference pipeline lets you host multiple models behind a single endpoint, where the models form a sequential chain of the steps required for inference. This lets you take your data transformation model, your predictor model, and your post-processing transformer and host them so they run one after another behind a single endpoint.
  • The inference request comes into the endpoint and the first model, the data transformation, is invoked. The output of that model is passed to the next step, the predictor model (for example, XGBoost).
  • That output is then passed on until the final step in the pipeline produces the final, post-processed response to the inference request.
  • This lets you couple your pre- and post-processing code behind the same endpoint and helps ensure that your training and inference code stay synchronized.


SageMaker Production Variants

Amazon SageMaker enables you to test multiple models or model versions behind the same endpoint using production variants. Each production variant identifies a machine learning (ML) model and the resources deployed for hosting it. Using production variants, you can test models that were trained on different datasets, trained with different algorithms or ML frameworks, or deployed to different instance types, or any combination of these. You can distribute endpoint invocation requests across multiple production variants by providing a traffic distribution for each variant, or you can invoke a specific variant directly for each request. In this topic, we look at both methods for testing ML models.

Refer to the notebook for implementation details.

Test models by specifying traffic distribution

Specify the percentage of traffic that gets routed to each model by setting a weight for each production variant in the endpoint configuration.
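A sketch of the ProductionVariants section of such an endpoint configuration (variant and model names are hypothetical):

```python
# Endpoint config splitting traffic 90/10 between two variants. With boto3:
#   client("sagemaker").create_endpoint_config(
#       EndpointConfigName=..., ProductionVariants=production_variants)
production_variants = [
    {
        "VariantName": "model-a",
        "ModelName": "model-a",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.9,  # 90% of traffic
    },
    {
        "VariantName": "model-b",
        "ModelName": "model-b",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.1,  # 10% of traffic
    },
]
total = sum(v["InitialVariantWeight"] for v in production_variants)
print(total)
```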


Test models by invoking specific variants

Specify the version of the model you want to invoke by providing a value for the TargetVariant parameter when you call InvokeEndpoint.
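A sketch of the invocation parameters (the endpoint and variant names are hypothetical):

```python
import json

# Route a single request to one variant via TargetVariant, bypassing the
# weighted traffic split. With boto3:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**params)
params = {
    "EndpointName": "my-ab-test-endpoint",
    "TargetVariant": "model-b",
    "ContentType": "application/json",
    "Body": json.dumps({"instances": [[0.5, 1.2]]}),
}
print(params["TargetVariant"])
```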


Amazon SageMaker Batch Transform: Batch Inference

We’ll use SageMaker Batch Transform jobs and a trained machine learning model. It is assumed that we have already trained the model, pushed the Docker image to ECR, and registered the model in SageMaker.

  • We need the identifier of the SageMaker model we want to use and the location of the input data.
  • You can either use a built-in container for your inference image or bring your own.
  • Batch Transform partitions the Amazon S3 objects in the input by key and maps them to instances. When you have multiple files, one instance might process input1.csv and another instance might process input2.csv.

In Batch Transform, you provide your inference data as an S3 URI, and SageMaker takes care of downloading it, running the predictions, and uploading the results back to S3. You can find more documentation for Batch Transform here.

If you trained a model using the Hugging Face Estimator, call the transformer() method to create a transform job based on the training job (see here for more details):


The batch job configuration specifies:

  • instance count
  • instance type

The transform job itself specifies:

  • data location
  • content type

```python
# create transformer from the trained estimator (values are illustrative)
batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="SingleRecord",
)

# start the transform job against the input data in S3 (placeholder path)
batch_job.transform(
    data="s3://my-bucket/batch-input/input.jsonl",
    content_type="application/json",
    split_type="Line",
)
```

If you want to run your batch transform job later or with a model from the 🤗 Hub, create a HuggingFaceModel instance and then call the transformer() method:

```python
from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration (model ID and task are examples; substitute your own)
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased-finetuned-sst-2-english',
    'HF_TASK': 'text-classification',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,                      # configuration for loading model from Hub
    role=role,                    # IAM role with permissions to create an endpoint
    transformers_version="4.6",   # Transformers version used
    pytorch_version="1.7",        # PyTorch version used
    py_version='py36',            # Python version used
)

# create transformer to run a batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=output_s3_path,   # we are using the same s3 path to save the output with the input
    strategy='SingleRecord',
)

# starts batch transform job and uses S3 data as input
batch_job.transform(
    data=s3_file_uri,             # S3 uri of the input .jsonl file
    content_type='application/json',
    split_type='Line',
)
```

The input.jsonl looks like this:

```json
{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
```

After the job finishes, the results can be downloaded and parsed:

```python
from sagemaker.s3 import S3Downloader, s3_path_join
from ast import literal_eval

# creating s3 uri for result file -> input file + .out
output_file = f"{dataset_jsonl_file}.out"
output_path = s3_path_join(output_s3_path, output_file)

# download file
S3Downloader.download(output_path, '.')

batch_transform_result = []
with open(output_file) as f:
    for line in f:
        # converts jsonline array to normal array
        line = "[" + line.replace("[", "").replace("]", ",") + "]"
        batch_transform_result = literal_eval(line)

# print results
print(batch_transform_result[:3])
```

📓 Open the notebook for an example of how to run a batch transform job for inference.

Speeding up the processing

We have only one instance running, so processing the entire file may take some time. We can increase the number of instances with the instance_count parameter to speed things up, and we can also send multiple concurrent requests to the Docker container. To configure concurrent transformations, use the max_concurrent_transforms parameter.
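A sketch of the relevant transformer parameters (values are illustrative, not tuned recommendations):

```python
# Transformer settings that parallelize a batch job; these map to the
# sagemaker.transformer.Transformer / transformer() keyword arguments.
transform_kwargs = {
    "instance_count": 2,              # shard the input files across 2 instances
    "instance_type": "ml.m5.xlarge",
    "max_concurrent_transforms": 4,   # 4 concurrent requests per container
    "strategy": "SingleRecord",
}

# upper bound on in-flight requests across the whole job
in_flight = transform_kwargs["instance_count"] * transform_kwargs["max_concurrent_transforms"]
print(in_flight)
```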

Processing the output

In the end, we need access to the output. We’ll find the output files in the location specified in the Transformer constructor; every line contains the prediction and the input parameters.
