A common source of frustration with AWS Batch is jobs that get stuck in the RUNNABLE state and never progress. Here are some pointers you can use for troubleshooting this issue.
Read through the AWS Batch Troubleshooting guide and 'Why is my AWS Batch job stuck in RUNNABLE status'. If you work through these guides carefully, they should lead you to the source of the issue.
One thing to highlight is:
Container instances need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your container instances having public IP addresses.
If you are using a NAT Gateway, this shouldn't be a problem: a NAT Gateway is the standard way for resources in private subnets to reach the public internet.
However, I was configuring service access via PrivateLink, which keeps requests off the public internet entirely. Using PrivateLink brings additional steps for successfully running AWS Batch jobs.
The symptom: your Batch job reaches RUNNABLE, but you don't see any ECS container instances created (refer to the troubleshooting guide for more on this).
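One quick way to confirm this symptom is to ask ECS directly whether the compute environment has registered any container instances. A minimal sketch, assuming the cluster ARN below is a placeholder you would take from `aws batch describe-compute-environments`:

```shell
# List container instances registered in the Batch-managed ECS cluster.
# Replace the cluster ARN with your own.
aws ecs list-container-instances \
    --cluster arn:aws:ecs:us-east-1:111111111111:cluster/AWSBatch-example

# An empty "containerInstanceArns" list while the job sits in RUNNABLE
# suggests the instances cannot reach the ECS service endpoint.
```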
Refer to Creating the VPC Endpoints for Amazon ECS; your container instances need to be able to reach the endpoints listed there.
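The interface endpoints that guide lists can be created with the AWS CLI. A sketch, assuming the VPC, subnet, and security group IDs below are placeholders for your own, and `us-east-1` as the region:

```shell
# Create the three ECS interface endpoints named in the ECS guide
# (ecs-agent, ecs-telemetry, ecs). Substitute your own IDs and region.
for svc in ecs-agent ecs-telemetry ecs; do
  aws ec2 create-vpc-endpoint \
      --vpc-id vpc-0abc1234 \
      --vpc-endpoint-type Interface \
      --service-name "com.amazonaws.us-east-1.${svc}" \
      --subnet-ids subnet-0abc1234 \
      --security-group-ids sg-0abc1234 \
      --private-dns-enabled
done
```

Enabling private DNS means the instances can keep using the standard ECS endpoint hostnames without any configuration changes.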
The Security Group associated with each endpoint needs to allow 'all traffic' outbound - refer to the troubleshooting guide.
The Security Group associated with the endpoint(s) needs to be able to accept TCP requests on port 443.
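The inbound rule can be added with a single CLI call. A sketch, assuming the security group ID and VPC CIDR below are placeholders (outbound 'all traffic' is the default egress rule on a new security group, so it usually only needs adding back if it was removed):

```shell
# Allow inbound HTTPS (TCP 443) from the VPC CIDR to the
# endpoint's security group. Substitute your own IDs and CIDR.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0abc1234 \
    --protocol tcp \
    --port 443 \
    --cidr 10.0.0.0/16
```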
Refer to this guide for setting up VPC Endpoints to ECR.
The key point is that you will need two VPC Endpoints for ECR.
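The two ECR interface endpoints can be created the same way as the ECS ones. A sketch, again with placeholder IDs and `us-east-1` as the region; note that the ECR guide also calls for an S3 gateway endpoint, since image layers are served from S3:

```shell
# ECR API endpoint (authentication, repository operations).
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc1234 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.ecr.api \
    --subnet-ids subnet-0abc1234 \
    --security-group-ids sg-0abc1234 \
    --private-dns-enabled

# ECR Docker registry endpoint (docker pull traffic).
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc1234 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.ecr.dkr \
    --subnet-ids subnet-0abc1234 \
    --security-group-ids sg-0abc1234 \
    --private-dns-enabled
```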
I found that setting up AWS Batch with VPC Endpoints was not straightforward, but hopefully some of the pointers in this guide can save you time if you are stuck.