When you use Amazon Web Services (AWS) Elastic Container Service (ECS), you may encounter a situation where ECS tries to deploy one of your ECS services, but your tasks are unhealthy, so it tries again.
And again. And again. And again. And again.
ECS is very determined to start healthy tasks for you, and if you're not careful it may continue to try relentlessly, for days, weeks, months.
In the process, it may be creating log streams, allocating elastic network interfaces, downloading container images from the internet, incurring NAT gateway charges, making AWS Config recordings, and logging CloudTrail and EventBridge events.
In an extreme case, this can trigger a significant spike in billing that might take you hours to track down.
In the next couple of posts, I'm going to write up a few ways to avoid ECS restart loops and to detect them so that you can intervene, but in case this is unfamiliar territory, I'll start with some definitions and a demonstration.
Amazon Web Services (AWS) has a number of approaches to deploy and run software applications in a containerized model. Elastic Container Service (ECS) is one of the oldest and simplest AWS services to run software in a container, and it's usually what I'd recommend for someone getting into containers on AWS for the first time.
When you want ECS to run containers, you define an ECS task definition: essentially a deployable unit with metadata about the container(s) to run, how they should be networked, and what resources they need.
At that point you can ask ECS to run a task from that definition on an EC2 instance you control or on AWS Fargate, but more typically you'll wrap that task definition in an ECS Service, which makes it easy to run a number of tasks in parallel, often in different availability zones, load balance them, restart them if they fail health checks, and so on.
The AWS ECS documentation has an Amazon ECS Components section that describes the elements of ECS if you're new to all of this.
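To make the task definition idea concrete, here's a minimal sketch of one, written as the Python dict you'd pass to boto3's `register_task_definition` call. The family name, sizes, and image are illustrative, not from any real deployment.

```python
# A minimal Fargate task definition for a single web container.
# All names and values here are illustrative placeholders.
TASK_DEFINITION = {
    "family": "demo-web",
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",       # CPU units, as a string per the ECS API
    "memory": "512",    # MiB, as a string per the ECS API
    "containerDefinitions": [
        {
            "name": "web",
            "image": "public.ecr.aws/nginx/nginx:latest",
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            # essential=True means ECS stops the whole task if this
            # container exits -- which is what feeds a restart loop.
            "essential": True,
        }
    ],
}

# You'd register it with something like:
#   boto3.client("ecs").register_task_definition(**TASK_DEFINITION)
```

An ECS service then points at a revision of this definition and keeps some number of copies of it running.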
One of the capabilities that ECS services add is resilience. What happens if the EC2 instance your task is running on becomes unresponsive or is terminated? ECS can detect that and replace that task with another one running on another instance. What if your task has a memory leak and crashes? ECS can detect that your task is unhealthy, shut it down, and start another. Many of the ways that ECS can intervene to keep your service healthy and robust revolve around shutting down and starting up the containerized tasks you've defined.
But ECS doesn't know the internals of your application. It doesn't know why your task is shutting down or failing health checks. So if it starts up a task and that task also fails, all it can do is keep trying. If it tries repeatedly and the tasks fail repeatedly, that's a restart loop.
Why might a task fail and when? There are lots of scenarios:
- The cluster might not have the necessary resources to start the task, causing the task to fail during startup, never reaching the RUNNING state.
- The container might contain a crashing bug, causing the process to exit on startup.
- The container might contain a crashing bug that requires time or a particular scenario to occur, causing the process to exit after being healthy and functional for some time.
- The container might be healthy, but contain a faulty health check, causing ECS to believe the task is not healthy. The task will start, but ECS will terminate it after running the initial health checks.
- The container might be healthy and the health check might be correctly defined, but the security group might prevent the health check from reaching the container, causing ECS to terminate the task after health checks.
- The container's health check might depend on an external resource that isn't available. This could happen at any time, causing ECS to terminate the task. (Incidentally, this is a good reason to isolate health checks from the environment.)
- Something about the environment (VPC, networking, security groups, etc.) might change in a way that causes the container or the health check to fail. This could happen anytime.
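The point about isolating health checks from the environment can be sketched in code. Here's one hypothetical approach (the class and field names are mine, not from any framework): answer the health check from in-process signals, such as a heartbeat the work loop updates, rather than by pinging a database or downstream service.

```python
import time

class App:
    """Toy application state that answers health checks from in-process
    signals instead of calling external dependencies."""

    def __init__(self, heartbeat_timeout: float = 30.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = time.monotonic()

    def beat(self) -> None:
        # Called from the main work loop each time it completes an iteration.
        self.last_heartbeat = time.monotonic()

    def healthy(self) -> bool:
        # Healthy if the work loop has made progress recently. Note that we
        # deliberately do NOT check the database or any downstream service
        # here, so a dependency outage can't push this task into a restart
        # loop -- ECS restarts won't fix a broken dependency anyway.
        return time.monotonic() - self.last_heartbeat < self.heartbeat_timeout
```

The HTTP `/health` handler (or container health check command) would then just report `App.healthy()`.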
And what should ECS do when a task fails? If that task is part of a service, ECS is supposed to maintain a certain number of healthy tasks (as configured in the desiredCount parameter). If a task terminates, it should be replaced by starting a new task. If a task fails health checks, it should be terminated and replaced.
But what if a task is replaced and the replacement also fails immediately? Should ECS continue to try? For how long? There's no one right answer here:
- If there's a problem with the container or health check, the task will never succeed, and ECS is wasting cloud resources trying to "fix" a service that is fundamentally broken.
- If there's a temporary problem with the environment or the dependency, restarting the task may not be helpful, but the service may return to health when the temporary problem ends.
- If the task is flaky (occasionally crashes, memory leak, etc), restarting the task each time it fails is probably the best choice.
But should ECS detect that replacing tasks isn't working and change its behaviour? Perhaps. Until recently, ECS would just keep trying by terminating and replacing tasks indefinitely.
There was always a way to manually intervene -- even if you don't know what's causing the restart loop, you might decide that the restarts aren't helping. If you want it to stop, you can set the service's desiredCount to 0. When a task then fails, ECS doesn't need to replace it, ending the restart cycle.
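That intervention can also be scripted. A minimal sketch using boto3's `update_service` call (the cluster and service names are placeholders, and the client is passed in so you can supply `boto3.client("ecs")` yourself):

```python
def stop_params(cluster: str, service: str) -> dict:
    """Build the UpdateService request that tells ECS to run zero tasks."""
    return {"cluster": cluster, "service": service, "desiredCount": 0}

def stop_restart_loop(ecs_client, cluster: str, service: str) -> None:
    # ecs_client is a boto3 ECS client, e.g. boto3.client("ecs").
    # Setting desiredCount to 0 stops ECS from replacing failed tasks,
    # which ends the restart cycle until you raise the count again.
    ecs_client.update_service(**stop_params(cluster, service))
```

Remember to set desiredCount back to its original value once you've fixed the underlying problem.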
What does it look like when an ECS service goes into a restart loop? If you look at the event history of an ECS service in a restart loop, you'll see a repeating series of events to:
- start / stop tasks
- register / deregister targets with the load balancer
If the failure happened during a deployment, you might also see a deployment in progress.
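Those service events are available programmatically via the `events` field of the ECS DescribeServices API, which gives you a crude way to spot a loop: many "has started" events in a short window. A rough sketch (the threshold, window, and substring match are heuristics I've chosen for illustration; ECS event messages typically read like "(service web) has started 1 tasks: (task …)"):

```python
from datetime import datetime, timedelta, timezone

def looks_like_restart_loop(events, window_minutes=30, threshold=5) -> bool:
    """Heuristic: many 'has started N tasks' events in a short window
    suggests ECS is repeatedly replacing failing tasks.

    `events` is a list of dicts shaped like the `events` field returned by
    ECS DescribeServices: {"createdAt": datetime, "message": str}.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    recent_starts = [
        e for e in events
        if e["createdAt"] >= cutoff and "has started" in e["message"]
    ]
    return len(recent_starts) >= threshold
```

A healthy, stable service starts tasks rarely, so a burst of start events is usually worth a look even when it isn't a full-blown loop.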
So if you're having a problem with ECS services going into a restart loop or you're simply wondering how to protect yourself from it, what should you do?
In the second post, I'm going to go over some of the new features added to ECS in the last few years that can help deal with service restarts, including:
- service throttling
- circuit breakers
And in the third post, I'm going to discuss some of the options for monitoring ECS to detect service restarts.