DEV Community



Let's build a simple MLOps workflow on AWS! #2 - Building infrastructure on AWS

About this post

This post is a sequel to the previous one below. Please refer to the earlier post before reading this one.

Let's build a simple MLOps workflow on AWS! #1 - ML model preparation - DEV Community


In the previous post, I showed how to implement a simple deep learning model. However, that code was intended for a local laptop environment and was purely experimental. By containerizing the application, you can ensure consistent and reproducible execution across different environments. This approach also enables the use of container orchestration tools like Kubernetes, which simplify managing, scaling, and orchestrating ML training jobs. Running machine learning tasks on a container orchestration tool is especially beneficial for training large ML models, as it allows for distributed training across multiple nodes and efficient resource utilization.
In this post, I'll explain how to run the training code as a Docker container on Amazon ECS. Additionally, I'll demonstrate how to automatically build and deploy the container when changes are made to the model.
Without further ado, let's first look at the overall architecture needed to implement this workflow!

Architecture of the system

(Architecture diagram of the system)

In this system, the following workflow will be executed:

  • A developer pushes an ML model to the GitHub repository
  • The training task, including the model, is automatically built as a Docker image and pushed to the ECR repository
  • EventBridge detects the push in the ECR repository and invokes a Lambda function
  • Lambda function invokes the ECS task
  • The pre-trained ML model gets automatically saved to an S3 bucket

To achieve this, we'll tackle the following tasks step-by-step:

  1. Preparing AWS resources to automate the deployment process of the training task.
  2. Building a CI/CD pipeline for the ML model to automatically push Docker images to the repository.
  3. Testing that the automated deployment process works properly.

In this post, I'll only explain how to implement the first step. Regarding building AWS resources, I chose Terraform so that we can test the code experimentally.

Preparing AWS resources with Terraform

There are a number of small resources needed to implement the whole system, but I'll focus on the core service settings required for the workflow.


To trigger the ECS task in an event-driven manner, you need to define an event pattern in EventBridge. I used an event pattern that detects push events in the ECR repository. After that, you set the Lambda function as a target of the event rule.
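To make the filtering concrete, here's a small Python sketch of how this rule's event pattern matches an incoming event. The sample event is illustrative (the repository name and tag are made up), but it mirrors the overall shape of the "ECR Image Action" events that ECR emits.

```python
# Sample event shaped like an "ECR Image Action" event from aws.ecr.
# Field values here are illustrative.
sample_event = {
    "source": "aws.ecr",
    "detail-type": "ECR Image Action",
    "detail": {
        "action-type": "PUSH",
        "result": "SUCCESS",
        "repository-name": "cifar10-repo",
        "image-tag": "latest",
    },
}

def matches_rule(event, repository_name):
    """Return True if the event would match the ECR push rule above."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.ecr"
        and event.get("detail-type") == "ECR Image Action"
        and detail.get("repository-name") == repository_name
        and detail.get("action-type") == "PUSH"
    )

print(matches_rule(sample_event, "cifar10-repo"))  # True
```

EventBridge does this matching for you server-side; the sketch just shows which fields the pattern constrains.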

# EventBridge
resource "aws_cloudwatch_event_rule" "ecr_push_rule" {
  name        = "${var.project_name}-run-ecs-task"
  description = "Trigger an ECS task when an image is pushed to ECR"

  event_pattern = jsonencode({
    "source" : ["aws.ecr"],
    "detail-type" : ["ECR Image Action"],
    "detail" : {
      "repository-name" : [aws_ecr_repository.main.name],
      "action-type" : ["PUSH"]
    }
  })
}

resource "aws_cloudwatch_event_target" "ecr_push_target" {
  rule      = aws_cloudwatch_event_rule.ecr_push_rule.name
  target_id = "run-index-py-function"
  arn       = aws_lambda_function.invoke_task.arn
}


We use a Lambda function to invoke the training task in ECS. The content of the Lambda function is as follows:

import logging
import os
import sys

import boto3

# Set up logging so messages go to stdout (captured by CloudWatch Logs)
logger = logging.getLogger()
logger.setLevel(logging.INFO)
for h in logger.handlers:
    logger.removeHandler(h)
h = logging.StreamHandler(sys.stdout)
FORMAT = "%(levelname)s [%(funcName)s] %(message)s"
h.setFormatter(logging.Formatter(FORMAT))
logger.addHandler(h)

ecs = boto3.client("ecs")


def run_ecs_task(cluster, task_definition, subnets, security_groups):
    """Function to run an ECS task.

    cluster (str): The name of the ECS cluster.
    task_definition (str): The ARN of the task definition.
    subnets (str): Comma-separated subnet IDs for the task.
    security_groups (str): Comma-separated security group IDs for the task.
    """
    try:
        response = ecs.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            launchType="FARGATE",
            count=1,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": subnets.split(","),
                    "securityGroups": security_groups.split(","),
                    "assignPublicIp": "ENABLED",
                }
            },
        )
        logger.info(f"Response: {response}")
        failures = response.get("failures", [])
        if failures:
            logger.error(f"Task failures: {failures}")
    except Exception as e:
        logger.error(f"Error running ECS task: {e}")


def lambda_handler(event, context):
    """AWS Lambda function handler.

    event (dict): The event data passed by the AWS Lambda service.
    context (LambdaContext): The context data passed by the AWS Lambda service.
    """
    try:
        # Get configuration from environment variables
        ECS_CLUSTER = os.environ["ECS_CLUSTER"]
        TASK_DEFINITION_ARN = os.environ["TASK_DEFINITION_ARN"]
        AWSVPC_CONF_SUBNETS = os.environ["AWSVPC_CONF_SUBNETS"]
        AWSVPC_CONF_SECURITY_GROUPS = os.environ["AWSVPC_CONF_SECURITY_GROUPS"]
        run_ecs_task(
            ECS_CLUSTER,
            TASK_DEFINITION_ARN,
            AWSVPC_CONF_SUBNETS,
            AWSVPC_CONF_SECURITY_GROUPS,
        )
    except Exception as e:
        logger.error(f"An error occurred while running ECS task: {e}")

Basically, it sends an API call to an ECS cluster to start the task using the AWS SDK (boto3). Please note that you need to specify some settings, such as the ECS cluster name, task definition ARN, VPC subnet, and security groups, to invoke the task. These settings are acquired through the environment variables embedded in the Lambda runtime.
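As a small illustration of how those settings flow from the environment into the API call, the sketch below turns comma-separated subnet and security group values into the networkConfiguration structure that run_task expects. The variable names and values here are examples, not the real ones from the deployed function.

```python
import os

# Hypothetical values; in the Lambda runtime these are injected by Terraform
os.environ["AWSVPC_CONF_SUBNETS"] = "subnet-aaa,subnet-bbb"
os.environ["AWSVPC_CONF_SECURITY_GROUPS"] = "sg-111"

def build_network_configuration():
    """Build the networkConfiguration argument for ecs.run_task
    from comma-separated environment variables."""
    return {
        "awsvpcConfiguration": {
            "subnets": os.environ["AWSVPC_CONF_SUBNETS"].split(","),
            "securityGroups": os.environ["AWSVPC_CONF_SECURITY_GROUPS"].split(","),
            "assignPublicIp": "ENABLED",
        }
    }

conf = build_network_configuration()
print(conf["awsvpcConfiguration"]["subnets"])  # ['subnet-aaa', 'subnet-bbb']
```

Storing the lists as comma-separated strings keeps the Lambda environment variables simple; splitting happens only at call time.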

To build this handler, we need to prepare the Lambda function in Terraform as shown below:

# Lambda function
resource "aws_lambda_function" "invoke_task" {
  # If the file is not in the current working directory you will need to include
  # path.module in the filename.
  filename         = data.archive_file.lambda.output_path
  function_name    = "${var.project_name}-invoke-task"
  role             = aws_iam_role.lambda_execution_role.arn
  handler          = "invoke_task.lambda_handler"
  source_code_hash = data.archive_file.lambda.output_base64sha256
  runtime          = "python3.9"
  environment {
    variables = {
      ECS_CLUSTER                 = aws_ecs_cluster.main.name
      TASK_DEFINITION_ARN         = aws_ecs_task_definition.main.arn
      AWSVPC_CONF_SUBNETS         = join(",", var.subnet_ids)
      AWSVPC_CONF_SECURITY_GROUPS = join(",", var.security_group_ids)
    }
  }
}

An important point here is setting the environment variables properly so that the Lambda function gets the information it needs to run the training task. Also, avoid hardcoding these values, for both security and operational efficiency. For sensitive configuration, I highly recommend using AWS Secrets Manager or AWS Systems Manager Parameter Store instead of plain environment variables.
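As a sketch of the Parameter Store approach, the helper below fetches a value with SSM's get_parameter API. The parameter name is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS access.

```python
def get_config(ssm_client, name):
    """Fetch a (possibly encrypted) configuration value from
    AWS Systems Manager Parameter Store."""
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]

# In the Lambda handler you would pass the real client:
#   import boto3
#   ecs_cluster = get_config(boto3.client("ssm"), "/cifar10/ecs-cluster")
```

Injecting the client also makes the helper easy to unit-test with a fake that returns a canned response.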

ECS cluster

# Task Definition
resource "aws_ecs_task_definition" "main" {
  family                   = "${var.project_name}-task"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "2048" # 2 vCPU
  memory                   = "8192" # 8GB RAM
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  execution_role_arn       = aws_iam_role.ecs_task_exec.arn

  container_definitions = jsonencode([
    {
      name      = "${var.project_name}-container"
      image     = "${aws_ecr_repository.main.repository_url}:latest"
      cpu       = 2048
      memory    = 4096
      essential = true
      portMappings = [
        {
          "containerPort" : 80,
          "hostPort" : 80
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-create-group"  = "true"
          "awslogs-region"        = "ap-northeast-1"
          "awslogs-group"         = "${var.project_name}-log-group"
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])
}

Training an ML model usually requires a GPU, but I chose CPU because this model doesn't demand that many computing resources. Also, GPUs are only supported for ECS on EC2, which requires a more complex setup.
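Note that Fargate only accepts certain CPU/memory pairings, which is why the task definition above pairs 2 vCPU (2048 CPU units) with 8 GB. The sketch below encodes the memory ranges for the common CPU sizes from the Fargate task-size table (it checks ranges only, not the exact 1 GB increments):

```python
# Allowed memory range (MiB) per Fargate CPU size (CPU units).
# Covers the common sizes only; values from the Fargate task-size table.
FARGATE_MEMORY_RANGES = {
    256: (512, 2048),
    512: (1024, 4096),
    1024: (2048, 8192),
    2048: (4096, 16384),
    4096: (8192, 30720),
}

def is_valid_task_size(cpu, memory):
    """Return True if the CPU/memory pair falls in Fargate's allowed range."""
    if cpu not in FARGATE_MEMORY_RANGES:
        return False
    lo, hi = FARGATE_MEMORY_RANGES[cpu]
    return lo <= memory <= hi

print(is_valid_task_size(2048, 8192))  # the sizing used above -> True
```

If you pick an invalid pairing, ECS rejects the task definition at registration time, so a quick sanity check like this can save a deploy cycle.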

There are a bunch of resources you need to define, but I won't cover all of them here to keep this post simple. If you're interested in the complete resource settings, please refer to the repository below:

hikarunakatani/cifar10-aws: Simple MLOps workflows

CI/CD of Infrastructure using GitHub Actions

As we defined the infrastructure using Terraform, we can apply CI/CD practices to the infrastructure as well. We use GitHub Actions to build the CI/CD pipeline. The definition of the workflow is as follows:

# Execute terraform apply when changes are merged to main branch

name: "Terraform Apply"

on:
  push:
    branches: main

env:
  TF_VERSION: 1.6.5
  AWS_REGION: ap-northeast-1

jobs:
  terraform:
    name: terraform
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: write
      pull-requests: write
      issues: write
      statuses: write
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - uses: aws-actions/configure-aws-credentials@v1 # Use OIDC token
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform setup
        uses: hashicorp/setup-terraform@v1
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Setup tfcmt
        env:
          TFCMT_VERSION: v3.4.1
        run: |
          wget "https://github.com/suzuki-shunsuke/tfcmt/releases/download/${TFCMT_VERSION}/tfcmt_linux_amd64.tar.gz" -O /tmp/tfcmt.tar.gz
          tar xzf /tmp/tfcmt.tar.gz -C /tmp
          mv /tmp/tfcmt /usr/local/bin
          tfcmt --version

      - name: Terraform init
        run: terraform init

      - name: Terraform fmt
        run: terraform fmt

      - name: Terraform apply
        id: apply
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        # Make apply results comment on commit
        run: tfcmt apply -- terraform apply -auto-approve -no-color -input=false

This is an example of a workflow for the apply process. I set the trigger for this workflow to activate on pushes to the main branch, so the terraform apply command runs whenever a pull request is merged into the repository.
When you want to manipulate AWS resources from GitHub, you obviously need AWS credentials. However, putting secrets directly in your repository poses security risks. Instead, you can use an OIDC token to obtain temporary AWS credentials. This way, you only need to store the ARN of the IAM role in your GitHub repository settings, which is much safer.

Once the workflows have executed properly, you can view the results in the "Actions" tab on GitHub, like this:

(Screenshots of the workflow results in the GitHub Actions tab)

If you see the output saying "Apply complete!", you can confirm that your infrastructure has been successfully deployed to the AWS environment.

In the next post, I'll explain how to integrate the training code we created in the first post into the system.
