Raphael Jambalos

Posted on Aug 18, 2019 • Edited on Nov 20, 2019

How we lost $800/mo with Amazon ECS Fargate

#aws #devops #docker #ecs

It is well known that containerizing an application can help reduce server costs. But if the setup is not designed properly, it can increase other costs, such as bandwidth. In this post, I’ll tell you how we racked up an $800 bandwidth charge when we first used ECS-Fargate.

But first, a little bit of context...

When you first deploy an ECS Service, the ECS agent fetches your Docker image from an image registry like Docker Hub or ECR. With this downloaded image, the agent spawns a Docker container via the command docker run your-docker-image. ECS then runs a health check to see if your application is running. If the container passes the health check, the load balancer routes traffic to it. If it fails several health checks, the container is killed. The ECS agent then attempts to start another container from that same image.

ECS has two types of services, and they differ on how they handle restart attempts.

(1) ECS-EC2

In ECS-EC2, you manage the fleet of EC2 instances that runs your containers. The number of containers you can run is limited by the CPU and memory capacity of your fleet. If an instance doesn't have the image, it downloads it once and stores it locally. Hence, after the first download, the image is already on the instance. When the agent then runs docker run, it uses the local copy instead of downloading the image again.

(2) ECS-Fargate

The underlying EC2 instance in which your container runs is abstracted from you. You don't have access to the EC2 instances running your containers. AWS manages these instances for you; hence, the service becomes serverless. For a bit of a premium, you are freed from the operational burden of managing a fleet of EC2 instances.

There is a high possibility that the EC2 instance that your container runs on the first time isn't the same as the one it runs on the second time. Hence, the agent has to fetch your image from ECR every time ECS attempts to spawn another container.
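The difference between the two launch types can be illustrated with a toy model. This is only a sketch: the host names are made up, and it assumes the extreme cases where every ECS-EC2 restart lands on the same instance and every ECS-Fargate restart lands on a fresh host with an empty image cache.

```python
def count_image_pulls(host_sequence):
    """Count registry downloads when each host keeps its own image cache."""
    hosts_with_image = set()
    pulls = 0
    for host in host_sequence:
        if host not in hosts_with_image:
            # The image is not cached on this host yet, so it must be downloaded.
            pulls += 1
            hosts_with_image.add(host)
    return pulls


restarts = 10

# ECS-EC2 (assumption): every restart lands on the same instance.
ec2_hosts = ["instance-a"] * restarts

# ECS-Fargate (assumption): every restart lands on a different, fresh host.
fargate_hosts = [f"fargate-host-{i}" for i in range(restarts)]

ec2_pulls = count_image_pulls(ec2_hosts)
fargate_pulls = count_image_pulls(fargate_hosts)
print(f"ECS-EC2 pulls: {ec2_pulls}, ECS-Fargate pulls: {fargate_pulls}")
```

Under these assumptions, ten restarts cost one download on ECS-EC2 and ten downloads on ECS-Fargate.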

This is where we got charged so much...

I was migrating our services from ECS-EC2 to ECS-Fargate. However, I was not able to set up one service properly, and I left it in a misconfigured state. Since the containers the service created were misconfigured, the application inside them never ran, so they kept failing health checks. After a few failures, a container gets destroyed and the service attempts to create another one. Since the instances underneath ECS-Fargate containers keep changing, the service had to download my 500MB Docker image every time a container restarted. Imagine how 500MB every 2-3 minutes easily got to 16TB in one month.
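To get a feel for how quickly this adds up, here is a back-of-the-envelope calculation. It is a rough sketch, assuming a 30-day month, decimal units (1 TB = 1000 GB), and a hypothetical `tasks` parameter for the number of tasks stuck in the restart loop at once, which the post does not state.

```python
IMAGE_SIZE_GB = 0.5                 # the 500MB image from the post
MINUTES_PER_MONTH = 30 * 24 * 60    # assumption: 30-day month


def monthly_pull_volume_tb(minutes_between_restarts, tasks=1):
    """Estimate TB downloaded per month by tasks stuck in a restart loop."""
    pulls = MINUTES_PER_MONTH / minutes_between_restarts * tasks
    return pulls * IMAGE_SIZE_GB / 1000


for minutes in (2, 2.5, 3):
    one_task = monthly_pull_volume_tb(minutes)
    two_tasks = monthly_pull_volume_tb(minutes, tasks=2)
    print(f"every {minutes} min: {one_task:.1f} TB (1 task), {two_tasks:.1f} TB (2 tasks)")
```

With these assumptions, one task restarting every 2-3 minutes pulls roughly 7 to 11 TB a month. A figure around 16 TB would fit two tasks restarting in parallel, but that is a guess rather than something the bill breakdown confirms.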

How we found out...

One of my pastimes at work is examining our AWS bill. I was expecting significant savings because we moved from having 10 m5.large EC2 instances to just 1 m5.large instance and several ECS-Fargate containers. But the savings were just under half of what I expected. So I dug deeper and found out that our NAT Gateway charges had increased sixfold. Our bandwidth consumption went from 38GB/mo to 16TB/mo!

The NAT Gateway is one entity through which resources in the private subnet access the internet. It charges $0.045 for every GB that flows through it.
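As a quick check on the numbers, the per-GB rate can be applied to the two traffic figures above. This sketch assumes 16TB means 16,000 GB and counts only the data processing charge; any hourly NAT Gateway charge is left out.

```python
NAT_PRICE_PER_GB = 0.045  # USD per GB, the rate quoted in the post


def nat_data_charge(gigabytes):
    """Data processing charge in USD for traffic through the NAT Gateway."""
    return gigabytes * NAT_PRICE_PER_GB


before = nat_data_charge(38)       # 38GB/mo before the migration
after = nat_data_charge(16_000)    # 16TB/mo, assuming 1 TB = 1000 GB

print(f"before: ${before:.2f}/mo, after: ${after:.2f}/mo")
```

That works out to about $1.71 a month before and about $720 a month after, which is in the neighborhood of the $800 in the title.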

A little experimentation

Since the cost increase coincided with the upgrade to ECS Fargate, I decided to turn off all our containers for a few minutes. The bandwidth consumption suddenly went down.

To narrow it down to a particular service, I decided to turn on everything except that one service that I had left misconfigured. The costs suddenly went back up again!! That’s when I discovered that leaving a service in a state of misconfiguration in ECS Fargate increases costs.

Moral of the story...

Never leave your ECS services in a state of misconfiguration. If you can't finish the setup, at least set the container count to zero so the service does not keep on spawning containers.
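Setting the count to zero can be done from the console, or scripted. The sketch below shows one way it might look with the boto3 ECS client's `update_service` call; the helper name `scale_service_to_zero` is made up for illustration, and the client is passed in so the function itself has no AWS dependency.

```python
def scale_service_to_zero(ecs_client, cluster, service):
    """Set an ECS service's desired task count to zero.

    ecs_client is expected to behave like boto3.client("ecs").
    Returns the desired count reported back in the response.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=0,  # stop the service from spawning new containers
    )
    return response["service"]["desiredCount"]
```

With boto3 installed and credentials configured, a call could look like `scale_service_to_zero(boto3.client("ecs"), "my-cluster", "my-half-finished-service")`, where both names are placeholders.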

Also, use your AWS bill as a feedback mechanism. A sudden, unintended cost in one aspect of your bill can mean something has gone wrong.

Special thanks to Allen, my editor, for helping this post become more coherent.

I'm happy to take your comments/feedback on this post. Just comment below, or message me!

Top comments (22)

Hans Christian Alsos • Jan 16 '20

Interesting report, but in my view this sounds more like a problem with your vpc setup and nat-gateway. Have you considered using vpc endpoints for ecr access?
This would allow you to read/write to ecr without going through your nat-gateway, and by doing so, reduce your cost related to the nat-gateway.

Raphael Jambalos • Jan 22 '20

Ohh, that's a great insight. I think that would be the most appropriate solution to this problem. I'll try that on my setup. Thanks Hans! :D

Ali • Apr 25 '20

I am looking to add VPC endpoints to avoid crazy NAT gateway bandwidth charges.

My understanding is that 'gateway' type of endpoints are free but 'Interface' costs money. ($0.01 per hour + $0.01 per GB at the time of writing)

S3 endpoints can be a 'gateway' type but ECR endpoints need to be 'interface' type.

so... I am not clear on these two things:

(a) Am I right to assume that, since ECR image storage is actually provided by S3, I would just need to have an S3 type of VPC endpoint to avoid these huge NAT gateway bandwidth charges?

(b) Do I need to have ECR and S3 endpoints together and my cost saving is going to be paying $0.01 per GB instead of $0.045 per GB (nat gateway price)?
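For question (b), the two prices can be compared with a rough sketch like the one below. It uses the rates quoted in this comment and in the post, and it rests on several assumptions: 16,000 GB of traffic a month (the figure from the post), 730 hours in a month, two interface endpoints, and every GB billed at the interface endpoint rate, which is a worst case if part of the traffic goes through a free gateway endpoint instead. Hourly NAT Gateway charges are ignored.

```python
NAT_PER_GB = 0.045         # USD per GB through the NAT Gateway (from the post)
ENDPOINT_PER_GB = 0.01     # USD per GB through an interface endpoint (from this comment)
ENDPOINT_PER_HOUR = 0.01   # USD per hour per interface endpoint (from this comment)
HOURS_PER_MONTH = 730      # assumption


def nat_cost(gigabytes):
    """Monthly NAT Gateway data charge in USD."""
    return gigabytes * NAT_PER_GB


def interface_endpoint_cost(gigabytes, endpoints=2):
    """Monthly interface endpoint charge in USD (data plus hourly fee)."""
    hourly = endpoints * ENDPOINT_PER_HOUR * HOURS_PER_MONTH
    return gigabytes * ENDPOINT_PER_GB + hourly


traffic_gb = 16_000  # assumption: the 16TB/mo from the post
print(f"NAT Gateway:         ${nat_cost(traffic_gb):.2f}")
print(f"Interface endpoints: ${interface_endpoint_cost(traffic_gb):.2f}")
```

Under these assumptions, the interface endpoints come to roughly $175 a month against roughly $720 for the NAT Gateway.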

Chayanika Khatua • Nov 16 '23

Hi! I was inspired to research this based on your question, the effect of the Gateway endpoint alone vs ECR endpoints: dev.to/chayanikaa/cost-optimisatio...

davis • May 23 '20 • Edited

I've been worried about exactly this after building something with Fargate recently. I noticed the 'restart loop' behavior if you push up a crashing version, and it scared me into looking closer at how they bill for that image transfer. Happy to come across your post and discussion in the comments, but hate it caused such a big bill for you all. That's scary.

I'm a bit confused in my case, though, because my image in ECR is like 1.5GB and I'm not seeing any sort of transfer charges for that data. I have Fargate charges already, but nothing related to the transfer of the image (it's been a week or two with hundreds of cold starts). Is it possible those charges are much more delayed than Fargate's?

If I'm not actually being charged for it, how is that happening? I actually have 0 NAT Gateways on my account (that I can tell), yet I'm able to use my image from ECR in Fargate tasks.

Should I expect a big surprise bill coming soon? Based on the phrasing of their docs, I would have expected a cost of roughly 10¢ per cold start in my case since the first one I do blows past the ECR free tier for transfer out.

davis • May 23 '20

Found this in the ECR pricing documentation:

Data transferred between Amazon Elastic Container Registry and Amazon EC2 within a single region is free of charge (i.e., $0.00 per GB).

I guess since EC2 is underlying Fargate and I'm using the same region, it's free in this case.

Question: What was the need for your NAT Gateway in your configuration? I don't have one and ECR <–> Fargate seem to be communicating fine.

Raphael Jambalos • May 25 '20

Do you have VPC endpoints set up for your network? That's probably why your Fargate instances can fetch images from ECR without significant charges.

If you don't have a NAT gateway set up (and no VPC endpoint), you're probably running your Fargate containers in a public subnet, where they use the Internet Gateway to fetch images from ECR. I'm not sure if Internet Gateways have a per-GB charge; I don't think they do. If this is the case, you probably don't have to worry about this problem.

davis • May 25 '20 • Edited

Thanks for the insights!

Jack • Feb 29 '20 • Edited

Great read! thanks for sharing your story.
I wish there was an easy way to monitor the cost.

I also found it very hard to calculate fargate cost.

Does anyone know if AWS provides any cost calculator?
I found this website fargate.org/ which does an OK job... but it's not complex enough for my needs

Chayanika Khatua • Nov 16 '23 • Edited

Excellent article! I was wondering what solution you ended up implementing for this, all the interface endpoints or just the Gateway one?

I looked into this a bit with NAT Gateway metrics and different compositions: dev.to/chayanikaa/cost-optimisatio...

Peter P • Jun 23 '20 • Edited

Ah, thank you for this article. I am glad to know I am not alone.

We got hit by this with our ScheduledTasks which run in Fargate. We were testing development out of ScheduledTasks in a new Fargate cluster and we set a scheduled task to run every minute. So every minute we were downloading the image through our NatGateway!! Ack! Our excess bill was in the $2k range :((

Indeed, it seems the solution is VPC endpoints. It's crazy that AWS doesn't make this mandatory. Why would anyone want to go out to the public internet for their container on ECR?

Raphael Jambalos • Jun 25 '20

AWS is so sneaky with this hidden charge. I feel crazy for missing it for months. I agree that VPC endpoints are the permanent fix for this. But I think a team should have someone reviewing their AWS bill every month to look into every charge so this kind of mistake can be caught early.

Romaric P. • Apr 25 '20

Hi Raphael, very interesting feedback. Your experience sounds so familiar to me... That's one of the reasons why I have built Qovery. To have all the advantages of containers and AWS, without the disadvantages.

jappyjan • Sep 15 '20

How does scaling work with Qovery? It’s not mentioned on their site...

Ram • Nov 23 '19

Hi Raphael, I came across a similar issue using AWS Fargate and got a very high bill a couple of months back. I raised the issue with AWS and got a refund.