DEV Community


Seamless Deployment of Hugging Face Models on AWS SageMaker with Terraform: A Comprehensive Guide

When integrating SageMaker with Hugging Face models, the default setup provided by the sagemaker-huggingface-inference-toolkit can be a good starting point. For an IaC setup, the terraform-aws-sagemaker-huggingface module is a handy resource.

However, during my experience, I ran into a few issues with the sagemaker-huggingface-inference-toolkit:

Deployment Flexibility: The docs describe deployment only through the Python SDK, which felt quite restrictive. (In practice you can deploy other ways, for example with the Terraform module mentioned above.)
Code and Model Packaging: Customizing the inference code requires bundling it with the model weights in a single tar file, which felt clunky. I prefer shipping the code as part of the image itself.
Custom Environments: The sagemaker-huggingface-inference-toolkit doesn't allow for custom environment setups, like installing the latest Transformers directly from GitHub.

One specific issue was the lack of support for setting torch_dtype to half precision for the pipelines, which was crucial for my project but not straightforward to implement.

Given these limitations, I decided against rewriting everything on top of the default sagemaker-inference-toolkit and instead explored a solution that simply overrides the get_pipeline function in sagemaker-huggingface-inference-toolkit. Using the following example, you can customize it any way you like.

How to Deploy

Load model weights

The first step is to upload the model weights to an S3 bucket as a model.tar.gz file.
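As a sketch, packing and uploading the archive might look like this; the bucket and path are placeholders, and the dummy config.json stands in for your real model files:

```shell
# Pack the model artifacts (config, tokenizer files, weights) into model.tar.gz.
mkdir -p my-model
echo '{}' > my-model/config.json      # stand-in for the real model artifacts
tar -czf model.tar.gz -C my-model .   # files must sit at the archive root
ls model.tar.gz

# Requires AWS credentials; run once the archive looks right:
# aws s3 cp model.tar.gz s3://<BUCKET>/<PATH>/model.tar.gz
```

Note that SageMaker expects the model files at the root of the archive, not inside a subdirectory, which is why the tar command uses `-C`.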

Make entrypoint

The deployment starts with setting up an entrypoint script. This script acts as the bridge between your model and SageMaker, telling SageMaker how to run your model. Here's a basic template I used:

from pathlib import Path

import torch
from transformers import Pipeline, pipeline
from sagemaker_huggingface_inference_toolkit import transformers_utils, serving

def _get_pipeline(task: str, device: int, model_dir: Path, **kwargs) -> Pipeline:
    # Ignore the toolkit's default device handling: load the model in bfloat16
    # and let device_map="auto" place it on the available GPUs.
    return pipeline(model=str(model_dir), device_map="auto", model_kwargs={"torch_dtype": torch.bfloat16})

# Monkey-patch the toolkit so every pipeline is built by our function
transformers_utils.get_pipeline = _get_pipeline

if __name__ == "__main__":
    serving.main()  # start the MMS-based model server

Build image

Next, you'll need to build a Docker image that SageMaker can use to run your model. Start from a basic Transformers PyTorch image, then install the sagemaker-huggingface-inference-toolkit with MMS (Multi Model Server) and OpenJDK, and configure the entrypoint.

FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive

RUN apt update
RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
RUN python3 -m pip install --no-cache-dir --upgrade pip

ARG REF=main
RUN git clone https://github.com/huggingface/transformers.git && cd transformers && git checkout $REF

# If set to nothing, will install the latest version
ARG PYTORCH='1.13.1'
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu121'

RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
# RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
# RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='$TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA

RUN python3 -m pip install --no-cache-dir -e ./transformers

# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

# MMS runs on the JVM, so a JDK is required
RUN apt-get install -y openjdk-11-jdk-headless
RUN pip install "sagemaker-huggingface-inference-toolkit[mms]"

# Copy the entrypoint script from the previous section into the image
# (the filename entrypoint.py is assumed here -- use whatever you named it)
COPY entrypoint.py /usr/local/bin/entrypoint.py
RUN chmod +x /usr/local/bin/entrypoint.py

RUN mkdir -p /home/model-server/

# Define the entrypoint for the image (filename assumed -- match your script's name)
ENTRYPOINT ["python3", "/usr/local/bin/entrypoint.py"]


Now, push your image to your ECR repository.
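The push step might look like the following sketch; the account ID, region, and repository name are assumptions to replace with your own:

```shell
ACCOUNT=123456789012          # assumption: your AWS account id
REGION=us-east-1              # assumption: your region
REPO=custom-huggingface       # assumption: your ECR repository name
TAG=latest

# This is the image URI you will later reference from Terraform
IMAGE_URI="$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:$TAG"
echo "$IMAGE_URI"

# The commands below need Docker and AWS credentials:
# aws ecr get-login-password --region "$REGION" | docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"
# docker build -t "$IMAGE_URI" .
# docker push "$IMAGE_URI"
```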

Deploy using terraform

Finally, you'll use Terraform to deploy everything to AWS. This includes setting up the endpoint role, model, its endpoint configuration, and the endpoint itself. Here's a simplified version of what the Terraform setup might look like:

resource "aws_sagemaker_model" "customHuggingface" {
  name               = "custom-huggingface"
  execution_role_arn = aws_iam_role.yourRole.arn

  primary_container {
    image          = "<YOUR_ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com/<REPO>:<TAG>"
    model_data_url = "s3://<BUCKET>/<PATH>/model.tar.gz"
  }
}

data "aws_iam_policy_document" "assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "yourRole" {
  name               = "yourRole"
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

data "aws_iam_policy_document" "InferenceAcess" {
  # Read the model archive from S3
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::<yourBucket>/*"]
  }

  # Pull the inference image from ECR
  statement {
    actions = [
      "ecr:BatchGetImage",
      "ecr:GetDownloadUrlForLayer",
    ]
    resources = ["<YOUR_ECR>"]
  }

  # The ECR auth token and CloudWatch logging are not resource-scoped
  statement {
    resources = ["*"]
    actions = [
      "ecr:GetAuthorizationToken",
      "cloudwatch:PutMetricData",
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
    ]
  }
}

resource "aws_iam_policy" "InferenceAcess" {
  name   = "InferenceAcess"
  policy = data.aws_iam_policy_document.InferenceAcess.json
}

resource "aws_iam_role_policy_attachment" "InferenceAcess" {
  role       = aws_iam_role.yourRole.name
  policy_arn = aws_iam_policy.InferenceAcess.arn
}

resource "aws_sagemaker_endpoint_configuration" "customHuggingface" {
  name = "customHuggingface"

  production_variants {
    variant_name           = "variant-1"
    model_name             = aws_sagemaker_model.customHuggingface.name
    initial_instance_count = 1
    instance_type          = "ml.g4dn.xlarge"
  }
}

resource "aws_sagemaker_endpoint" "customHuggingface" {
  name                 = "customHuggingface"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.customHuggingface.name
}


Invoke your endpoint

After everything is deployed, you can test the endpoint with a simple request to make sure it's working as expected.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
body = json.dumps({"inputs": "<Your text>"})
response = runtime.invoke_endpoint(EndpointName="customHuggingface", ContentType="application/json", Body=body)
print(json.loads(response["Body"].read()))

