Angel Pizarro

Why you may want to use AWS Batch on top of ECS or EKS

Since AWS Batch is an overlay on top of other AWS container orchestration services (ECS and EKS), we sometimes get asked why you would use it at all.

Here are some reasons you may want to consider AWS Batch for handling batch-style and asynchronous workloads:

  1. The job queue. Having a durable place that holds all of your tasks and handles the API communication with ECS is a large value-add on its own.
  2. Fair share scheduling - if you have mixed workloads with different priorities or SLAs, a fair share job queue lets you control the order of job placement over time. See this blog post for more information.
  3. Array jobs - a single API request can submit up to 10K jobs that use the same job definition. As mentioned, Step Functions has a Map state, but underneath it would submit a separate Batch job or ECS task for each map index, and you may hit API rate limits. Batch array jobs are built specifically to handle this submission throughput, with exponential back-off and error handling around the ECS RunTask API.
  4. Smart scaling of EC2 resources - Batch creates an EC2 Auto Scaling group for the instances, but it is not as simple as that. Batch managed scaling sends specific instructions to the ASG about which instances to launch, based on the jobs in the queue. It also scales down nicely as you burn down your queued jobs, packing more jobs onto fewer instances at the tail end so the resources scale down faster.
  5. Job retry - you can set different retry conditions based on the exit code of your job. For example, if your job fails due to a runtime error, don't retry, since you know it will fail again. But if a job fails due to a Spot reclamation event, retry it.
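To make the array job and retry points concrete, here is a minimal sketch of a single SubmitJob request that fans out to 10,000 children and retries only on Spot interruptions. The queue and job definition names are placeholders, and the request is built as a plain dict rather than sent, so no AWS credentials are needed to follow along:

```python
# Sketch of one SubmitJob request payload combining an array job with
# evaluateOnExit retry conditions. Resource names are placeholders.

def build_submit_job_request(num_items: int) -> dict:
    return {
        "jobName": "process-files",
        "jobQueue": "my-spot-queue",          # placeholder queue
        "jobDefinition": "my-job-def:1",      # placeholder definition
        # One API call creates an array job with this many children;
        # each child reads its index from AWS_BATCH_JOB_ARRAY_INDEX.
        "arrayProperties": {"size": num_items},
        "retryStrategy": {
            "attempts": 3,
            "evaluateOnExit": [
                # Spot reclamation surfaces a status reason beginning
                # with "Host EC2" -- those are worth retrying.
                {"onStatusReason": "Host EC2*", "action": "RETRY"},
                # Anything else is treated as a real failure -- stop.
                {"onReason": "*", "action": "EXIT"},
            ],
        },
    }

request = build_submit_job_request(10_000)
# With credentials configured, you would submit it with boto3:
#   import boto3
#   boto3.client("batch").submit_job(**request)
print(request["arrayProperties"])  # {'size': 10000}
```

The ordering of the `evaluateOnExit` entries matters: Batch evaluates them top to bottom and applies the first match.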

The above is not a complete list, but just some highlights. The following are some things about Batch to be aware of if you are thinking of using it for your workloads:

  1. It is tuned for jobs with a minimum of 3 to 5 minutes of wall-clock runtime. If your individual work items run for less than a minute, pack multiple work items into a single job to increase the runtime. Example: "process these 10 files in S3".
  2. Sometimes a job at the head of the queue will block other jobs from running. There are a few reasons this can happen, such as an instance type being unavailable. Batch recently added CloudWatch Events for blocked job queues so you can react to the different blocking conditions. See this blog post for more information.
  3. Batch is not designed for realtime or interactive responses. This is related to the job runtime tuning. Batch is unique among batch systems in that its scheduler and scaling logic work together. Other job schedulers assume either a static compute resource at the time they make a placement decision, or agents standing ready to accept work. Batch instead runs a cycle: assess the job queue, place the jobs that can be placed, then scale resources for what remains in the queue. The challenge is that you don't want to over-scale. Since Batch has no insight into your jobs or how long they will take, it makes a call about which instances will most cost-effectively burn down the queue, then waits to see the result before making another scaling decision. That wait period is key for cost optimization, but it makes Batch suboptimal for realtime and interactive work. Could you make it work for these use cases? Maybe, but Batch was not designed for this, and there are better AWS services and open source projects you should turn to first for those requirements.
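To illustrate the packing advice from point 1 above: rather than one job per S3 object, you can chunk the object keys and hand each chunk to a single job, for instance through a container environment variable. This is a sketch with hypothetical key and resource names; the container image is assumed to read the variable and process every listed object in one run:

```python
# Pack short work items (individual S3 keys) into fewer, longer jobs.
# Queue, job definition, and key names are illustrative placeholders.

def pack_jobs(keys: list[str], per_job: int = 10) -> list[dict]:
    """Build one SubmitJob request payload per chunk of `per_job` keys."""
    requests = []
    for i in range(0, len(keys), per_job):
        chunk = keys[i:i + per_job]
        requests.append({
            "jobName": f"process-chunk-{i // per_job}",
            "jobQueue": "my-queue",           # placeholder
            "jobDefinition": "my-job-def:1",  # placeholder
            "containerOverrides": {
                "environment": [
                    # The container splits this list and processes all
                    # of the objects in one run, amortizing startup cost.
                    {"name": "S3_KEYS", "value": ",".join(chunk)},
                ],
            },
        })
    return requests

keys = [f"input/file-{n}.csv" for n in range(25)]
requests = pack_jobs(keys)  # 25 keys -> 3 jobs instead of 25
```

With sub-minute work items, this turns 25 container launches into 3, which keeps each job comfortably inside the 3-to-5-minute sweet spot described above.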

Again, not a complete list, but these represent some of the common challenges I've seen for new users.

Hope this list helps, and if you have any questions, leave a comment below.
