AWS Batch is a scheduled executor built around a job queue. At my $job, we were evaluating it for cases where a container service is needed to execute dynamic workloads pulled from a queue. The general workflow looks a little like this:
The external worker submits a job -> the job is scheduled on a spot instance -> ECS takes over and executes the task -> results are logged.
Given that most applications communicate with different external systems, there would be a wide variety of IAM configurations and container scripts. For simplicity, and to provide a generic system, this batch job will copy /etc/motd to a parameterized S3 bucket.
First, define the VPC and the private subnets. In my use case, I predefine the VPC through a variable (vpc_id), and then dynamically look up the subnets through tags with the key/value of subnet/private.
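The tag-based lookup can be sketched with the filters that the EC2 `describe_subnets` call accepts. This is a minimal sketch rather than the original configuration: the function names are hypothetical, `ec2_client` is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`), and pagination is ignored.

```python
def private_subnet_filters(vpc_id):
    """Build describe_subnets filters for subnets tagged subnet=private in one VPC."""
    return [
        {"Name": "vpc-id", "Values": [vpc_id]},
        {"Name": "tag:subnet", "Values": ["private"]},
    ]


def lookup_private_subnet_ids(ec2_client, vpc_id):
    """Return the ids of the matching subnets.

    ec2_client is assumed to be a boto3 EC2 client; only the first page
    of results is read in this sketch.
    """
    response = ec2_client.describe_subnets(Filters=private_subnet_filters(vpc_id))
    return [subnet["SubnetId"] for subnet in response["Subnets"]]
```

Passing the client in as an argument keeps the lookup easy to exercise with a stub before pointing it at a real account.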
For this topology, I am using VPC endpoints so that my containers stay locked down on egress. AWS' Setting Up with AWS Batch guide, however, suggests that you can simply allow open outbound traffic.
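As a rough illustration of the endpoint approach, the sketch below builds the service names for a plausible set of endpoints. The set itself is an assumption on my part, not the list from the original setup, and the helper name is hypothetical; check which services your containers actually need to reach.

```python
# Assumed endpoint set for ECS-backed Batch in private subnets; adjust to your workload.
INTERFACE_SERVICES = ["ecr.api", "ecr.dkr", "ecs", "ecs-agent", "ecs-telemetry", "logs"]
GATEWAY_SERVICES = ["s3"]


def endpoint_service_names(region):
    """Return (service name, endpoint type) pairs for the assumed endpoint set."""
    names = [(f"com.amazonaws.{region}.{svc}", "Interface") for svc in INTERFACE_SERVICES]
    names += [(f"com.amazonaws.{region}.{svc}", "Gateway") for svc in GATEWAY_SERVICES]
    return names
```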
The principle of least privilege is important; for simplicity, however, I am relying largely on AWS managed policies. Batch with ECS requires two roles. The first is the Batch role, which allows the service to create EC2 instances, create and modify the auto-scaling group, and so on. The second is the ECS service role, which serves two purposes: it acts as both the task execution role (the permissions needed for your container) and the service role. It is worth noting that during my research I did not find a way in Batch to separate the task execution role from the task role, a distinction that ECS itself provides.
Batch Instances/Service Roles:
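A minimal sketch of what these roles might look like as plain policy documents follows. The role names are hypothetical; the service principals and managed policy ARNs are the standard AWS ones for the Batch service role and the ECS container instance role.

```python
def assume_role_policy(service_principal):
    """Trust policy letting one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Role names are placeholders; each role pairs a trust policy with an AWS managed policy.
ROLES = {
    "batch-service-role": {
        "trust": assume_role_policy("batch.amazonaws.com"),
        "managed_policy_arn": "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole",
    },
    "batch-instance-role": {
        "trust": assume_role_policy("ec2.amazonaws.com"),
        "managed_policy_arn": "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role",
    },
}
```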
Additional policies for uploading to S3:
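One way to scope the upload permission is a policy that allows only `s3:PutObject` on a single bucket. This is a sketch under that assumption; the function name is hypothetical and the bucket name is whatever you pass in.

```python
def s3_upload_policy(bucket_name):
    """Policy document allowing object uploads into one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level actions apply to keys inside the bucket, hence the /*.
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }
```

Serializing the result with `json.dumps` gives a document that can be attached to the instance role.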
Batch is composed of several pieces:
- Compute Environment
- Job Queue
- Job Definition
Batch lets you configure any mix of EC2 instance types you want. For this concept design, I went strictly with spot instances of the optimal type; for production workloads, however, it’s likely the environment won’t be as ephemeral and some on-demand instances might be required.
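A spot-only compute environment of this kind could be described with a request like the one below. The keys follow the keyword arguments of the Batch `create_compute_environment` call; the function name, the default vCPU ceiling, and the bid percentage are assumptions chosen for illustration.

```python
def spot_compute_environment(name, subnet_ids, security_group_ids,
                             service_role_arn, instance_profile_arn,
                             spot_fleet_role_arn, max_vcpus=16, bid_percentage=100):
    """Build a managed, spot-only compute environment request."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "serviceRole": service_role_arn,
        "computeResources": {
            "type": "SPOT",
            "minvCpus": 0,                   # scale to zero when the queue is empty
            "maxvCpus": max_vcpus,           # assumed ceiling for this sketch
            "instanceTypes": ["optimal"],    # let Batch pick the instance family
            "bidPercentage": bid_percentage,
            "subnets": list(subnet_ids),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_profile_arn,
            "spotIamFleetRole": spot_fleet_role_arn,
        },
    }
```

With a boto3 Batch client, the result would be passed as `client.create_compute_environment(**request)`.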
To keep the environment as secure as possible, I had to create a launch template to accomplish two main goals:
- Ensure the base volume is encrypted
Requesting a plain spot instance won’t encrypt your volumes at rest. To do so, you must define block_device_mappings and set encryption to true.
- Utilize Amazon Linux 2 over 1, to provide the latest patches.
By default, the optimized spot instances ship with Amazon Linux 1, which was last updated in March of 2018. Through the Parameter Store, AWS provides the ability to dynamically look up the image id:
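Both goals can be sketched as follows. The parameter path is the public one AWS publishes for the ECS-optimized Amazon Linux 2 AMI; everything else is an assumption: the function names are hypothetical, `ssm_client` is assumed to be a boto3 SSM client, and the device name, volume size, and volume type are illustrative defaults expressed as EC2 launch template data.

```python
ECS_AL2_AMI_PARAMETER = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"


def lookup_ecs_ami(ssm_client):
    """Fetch the recommended ECS-optimized Amazon Linux 2 image id.

    ssm_client is assumed to be a boto3 SSM client.
    """
    response = ssm_client.get_parameter(Name=ECS_AL2_AMI_PARAMETER)
    return response["Parameter"]["Value"]


def encrypted_launch_template_data(device_name="/dev/xvda", volume_size_gib=30):
    """Launch template data that encrypts the base volume at rest."""
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,  # assumed root device of the AMI
                "Ebs": {
                    "Encrypted": True,
                    "VolumeSize": volume_size_gib,
                    "VolumeType": "gp2",
                },
            }
        ]
    }
```

The looked-up image id can then be supplied through the `imageId` field of the compute environment's `computeResources`.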
The queue is where job definitions get associated with compute environments. In a production design, it’s possible to combine different types of compute fleets behind one queue.
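Combining fleets comes down to the ordered list of compute environments attached to the queue. The sketch below builds such a request using the keys of the Batch `create_job_queue` call; the function name and the ARNs in the usage note are placeholders.

```python
def job_queue_request(name, compute_environment_arns, priority=1):
    """Build a job queue request that tries compute environments in the given order."""
    return {
        "jobQueueName": name,
        "state": "ENABLED",
        "priority": priority,
        "computeEnvironmentOrder": [
            {"order": index, "computeEnvironment": arn}
            for index, arn in enumerate(compute_environment_arns, start=1)
        ],
    }
```

For example, `job_queue_request("queue", [spot_arn, on_demand_arn])` would prefer the spot environment and fall back to the on-demand one, assuming both ARNs refer to existing compute environments.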
This is where the meat of the operation happens. When you submit a job to the queue, you specify the definition you want to use, which then acts like a cookbook/playbook for the instance. This matters because the definition is where compute requirements, the entry point command, and so on are defined. Job definitions also allow for parameterization, so you can create dynamic workloads.
Below I break down the properties of the container, which highlight the parameterized implementation:
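A minimal sketch of such a definition, using the keys of the Batch `register_job_definition` call and its `Ref::` placeholder syntax, might look like this. The image, resource sizes, default bucket, and exact command are assumptions (the image would need the AWS CLI installed), and `resolve_command` is only a simplified local stand-in for the substitution Batch performs at submit time.

```python
def job_definition_request(name, image):
    """Build a container job definition with a BUCKET_NAME parameter."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        # Default value, overridden by --parameters at submit time.
        "parameters": {"BUCKET_NAME": "s3://example-default-bucket"},
        "containerProperties": {
            "image": image,
            "vcpus": 1,
            "memory": 512,
            "command": ["aws", "s3", "cp", "/etc/motd", "Ref::BUCKET_NAME"],
        },
    }


def resolve_command(command, defaults, overrides=None):
    """Replace whole-argument Ref::NAME placeholders with parameter values."""
    values = {**defaults, **(overrides or {})}
    resolved = []
    for arg in command:
        if arg.startswith("Ref::"):
            # Unknown placeholders are left untouched in this sketch.
            resolved.append(values.get(arg[len("Ref::"):], arg))
        else:
            resolved.append(arg)
    return resolved
```

Resolving the command with an override for BUCKET_NAME shows what the container would end up running for a given submission.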
To tie it all together, simply submit a job to the queue and watch the magic happen.
aws batch submit-job --job-name test --job-queue queue --job-definition batch-job-definition --parameters BUCKET_NAME=s3://quack-batch-testing