Atsushi Suzuki

Optimizing Docker Images and Conda Environments in SageMaker Training

Introduction

Previously, I set up a Conda environment inside a Docker container for training in Amazon SageMaker. To align the environments between the debugging environment (EC2/Deep Learning AMI) and the training environment (SageMaker), I exported the environment.yml configuration file from the debugging environment and used it to build the Docker image for the training environment.
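
For reference, the configuration file can be exported from the debugging environment with a command along the following lines (a minimal sketch; it assumes the Conda environment is named environment, and the --no-builds flag is optional but makes the exported file more portable):

$ conda activate environment
$ conda env export --no-builds > environment.yml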

Choosing a SageMaker Training-specific base image resulted in a large image of approximately 11 GB, because many libraries were duplicated between the base image and environment.yml. At the time, I kept using this image due to the urgency of running the training jobs. However, the long image load times and the anticipated difficulty of upgrading key libraries (such as transformers and torch) in the future led me to reorganize the environment.yml and Dockerfile.

This should be helpful for those who want to use a Conda environment built in a debug environment with SageMaker but are unsure how to construct the Docker image.

Implementation

Here are the final versions of the Dockerfile and environment.yml:

# Using Amazon SageMaker's PyTorch GPU training base image
FROM 763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker

# Setting the path for Conda (as it's included in the base image, installation is unnecessary)
ENV PATH /opt/conda/bin:$PATH

# Initializing Conda environment
RUN conda init bash

# Copying the environment.yml file to the container
COPY environment.yml /root/environment.yml

# Creating the environment from environment.yml
RUN conda env create -f /root/environment.yml

# Automatically activating the Conda environment
RUN echo "conda activate environment" >> ~/.bashrc

# Starting bash in the environment when the container launches
CMD [ "/bin/bash" ]

And the environment.yml:

name: environment
channels:
  - anaconda
  - pytorch
  - huggingface
  - conda-forge
dependencies:
  - python=3.8.13
  - pandas=1.4.3
  - scikit-learn=1.1.1
  - transformers=4.26.0
  - numpy=1.23.1
  - libgcc
  - imbalanced-learn=0.10.1
  - pip:
      - wandb
      - python-dotenv

By using SageMaker's PyTorch GPU training base image and avoiding duplicate installations in the Conda environment, I reduced the image size to about 8 GB, a 3 GB reduction from the previous image.
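
The reduced size can be verified locally before pushing, for example with docker images (using whatever repository name and tag the image was built with):

$ docker images <Account ID>.dkr.ecr.ap-northeast-1.amazonaws.com/<Image Name>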

Supplement 1: Building the Image and Pushing to ECR

Here's a supplementary guide on building the Dockerfile and pushing it to a private ECR repository.

Because the base image is pulled from AWS's deep-learning-containers registry, you must first obtain an authentication token for that registry and authenticate the Docker client, specifying the region and registry ID.

$ aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.ap-northeast-1.amazonaws.com

With this, the image can be built.

$ docker build -t <Account ID>.dkr.ecr.ap-northeast-1.amazonaws.com/<Image Name>:<Tag Name> .

※For ARM-based Macs (M1/M2/M3), build with the following command instead:

$ docker buildx build --platform linux/amd64 -f Dockerfile -t <Account ID>.dkr.ecr.ap-northeast-1.amazonaws.com/<Image Name>:<Tag Name> .

Similarly, authenticate with your own account's ECR registry before pushing the image.

$ aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin <Account ID>.dkr.ecr.ap-northeast-1.amazonaws.com

Now, you can push the image.

$ docker push <Account ID>.dkr.ecr.ap-northeast-1.amazonaws.com/<Image Name>:<Tag Name>
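
If the private repository doesn't exist yet, the push will fail; it can be created once beforehand (assuming the repository name matches <Image Name>):

$ aws ecr create-repository --repository-name <Image Name> --region ap-northeast-1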

Supplement 2: Setting Up the Entry Point Script for SageMaker

When setting up a SageMaker Training job, use a Python script like the following. However, this setup alone doesn't automatically switch to the Conda environment, so an additional step is required.

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri="<URL of the container image you created>",
    role=role,
    instance_type="ml.g4dn.2xlarge",
    instance_count=1,
    base_job_name="pre-training",
    output_path="s3://<bucket name>/sagemaker/output_data/pre_training",
    code_location="s3://<bucket name>/sagemaker/output_data/pre_training",
    sagemaker_session=session,
    entry_point="pre-training.sh",
    dependencies=["tabformer"],
    hyperparameters={
        "mlm": True,
        "do_train": True,
        "field_hs": 64,
        "output_dir": "/opt/ml/model/",
        "data_root": "/opt/ml/input/data/input_data/",
        "data_fname": "<file name>"
    }
)
estimator.fit({
    "input_data": "s3://<bucket name>/sagemaker/input_data/<file name>.csv"
})

Therefore, the shell script specified in entry_point must switch to the Conda environment.

#!/bin/bash

# conda activate
source /opt/conda/etc/profile.d/conda.sh
conda activate environment

# pre-training
python main.py

Supplement 3: A Failed Attempt

I also present an example of a Dockerfile and environment.yml aimed at further reducing image size. This example assumes a debug environment using a standard Ubuntu AMI on an EC2 instance.

First, in the Dockerfile, the base image is ubuntu. Since this base image doesn't include conda, Miniconda was installed using wget (specifying py38). Additionally, build-essential and sagemaker-training were installed separately.

※Omitting sagemaker-training results in errors, as this package is required for SageMaker Training.

# Using Ubuntu 22.04 LTS as the base image
FROM ubuntu:22.04

# Installing necessary packages
RUN apt-get update && \
    apt-get install -y wget bzip2 ca-certificates build-essential && \
    rm -rf /var/lib/apt/lists/*

# Downloading and installing Miniconda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-py38_23.11.0-2-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh && \
    /opt/conda/bin/conda clean --all --yes

# Setting the Conda path
ENV PATH /opt/conda/bin:$PATH

# Initializing Conda environment
RUN conda init bash

# Copying the environment.yml file to the container
COPY environment.yml /root/environment.yml

# Creating the environment from environment.yml
RUN conda env create -f /root/environment.yml

# Automatically activating the Conda environment
RUN echo "conda activate environment" >> ~/.bashrc

# Installing the sagemaker-training package
RUN pip install sagemaker-training

# Starting bash in the Conda environment when the container launches
CMD [ "/bin/bash" ]

In this environment.yml, the dependencies that the plain Ubuntu base image does not provide (such as pytorch and cudatoolkit) have to be specified explicitly. As a result, it lists more dependencies than the adopted environment.yml.

name: environment
channels:
  - anaconda
  - pytorch
  - huggingface
  - conda-forge
dependencies:
  - python=3.8.13
  - pip>=21.0
  - pytorch=1.12.1
  - torchaudio=0.12.1
  - torchvision=0.13.1
  - pandas=1.4.3
  - scikit-learn=1.1.1
  - transformers=4.26.0
  - numpy
  - libgcc
  - cudatoolkit=11.6
  - imbalanced-learn=0.10.1
  - pip:
      - transformers==4.26.0
      - wandb
      - python-dotenv

With this setup, the image size was reduced to about 3 GB, and training could start in both the debug and SageMaker environments. However, training unexpectedly ran on the CPU instead of the GPU.
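
A quick way to confirm what PyTorch sees is to run a check like this inside the container on a GPU instance, with the Conda environment activated (a rough diagnostic, not part of the training code):

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"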

Therefore, it wasn't feasible for actual use. This experience highlighted the importance of building environment.yml based on the libraries included in SageMaker's PyTorch GPU training base image.
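
To check which libraries the base image already provides before writing environment.yml, its package list can be dumped directly (assuming the base image has already been pulled after authenticating as in Supplement 1):

$ docker run --rm 763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker pip list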
