Last month our AWS bill was high. We expected it to be high, but not THIS high. The culprit: "Data Transfer Out". I'm sure I'm not the only one who has had to deal with this.
We had a Node.js service running on EC2 in an Auto Scaling Group (ASG) behind an Application Load Balancer (ALB), built to handle a huge volume of requests. We were expecting a steady increase in traffic this month, so we had estimated how it would affect our costs. We expected a fair amount of data transfer out of the ALB, but the actual figure was consistently 4-5x our estimates. Time to investigate.
After filtering Cost Explorer data with tags, it was clear that the ALB was the source of these costs. I immediately went to CloudWatch and checked the "Processed Bytes" metric of the ALB.
Processed Bytes is supposed to be "the total number of bytes processed by the load balancer over IPv4 and IPv6 ... includes traffic to and from clients", so ideally Processed Bytes should be greater than the "Data Transfer Out". But it was not..?
Our stats showed Processed Bytes to be around what our estimates had predicted for this amount of traffic. But Cost Explorer was showing a number that was 4-5 times that. Time to contact support.
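For anyone who wants to pull the same number outside the CloudWatch console, here is a minimal sketch in Python. It assumes a boto3 CloudWatch client is passed in as `cloudwatch`. The function name and the `load_balancer_dimension` parameter are made up for illustration; the value is the `app/<name>/<id>` part of the load balancer ARN.

```python
from datetime import datetime, timedelta, timezone


def total_processed_bytes(cloudwatch, load_balancer_dimension, days=7):
    """Sum the ALB ProcessedBytes metric over the last `days` days.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    `load_balancer_dimension` is an assumption of this sketch: the
    "app/<name>/<id>" suffix of the load balancer ARN.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName="ProcessedBytes",
        Dimensions=[{"Name": "LoadBalancer", "Value": load_balancer_dimension}],
        StartTime=start,
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    # Datapoints are not guaranteed to be ordered, but a sum does not care.
    return sum(point["Sum"] for point in response["Datapoints"])
```

The result can then be compared with the Data Transfer Out usage reported by Cost Explorer for the same period.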
We raised a support request to AWS, but we didn't have high hopes because we were not on the premium support plan. But it was worth a shot.
Next up, we enabled ALB Access Logs to an S3 bucket and let them accumulate for a while. After checking the logs, we were certain that this wasn't a problem at the application level. The access logs showed response sizes consistent with our estimates, with no inflation from extra headers or anything similar.
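A check like this can be scripted. The sketch below sums the `sent_bytes` field of ALB access log entries, assuming the documented space-separated format in which `received_bytes` and `sent_bytes` are the 11th and 12th fields. The sample lines are shortened, made-up entries, not real logs from this service.

```python
import shlex

# Zero-based position of sent_bytes in an ALB access log entry
# (type, time, elb, client:port, target:port, three processing times,
#  elb_status_code, target_status_code, received_bytes, sent_bytes, ...).
SENT_BYTES_FIELD = 11

# Hypothetical, truncated entries for illustration only.
SAMPLE_LOG = [
    'https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 '
    '"GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2',
    'https 2018-07-02T22:23:01.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.40:2818 10.0.0.1:80 0.001 0.002 0.000 200 200 34 143 '
    '"GET https://www.example.com:443/ping HTTP/1.1" "curl/7.46.0" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2',
]


def sent_bytes_stats(lines):
    """Return (total_sent_bytes, request_count, average_sent_bytes)."""
    total = 0
    count = 0
    for line in lines:
        # shlex keeps quoted fields such as the request line together.
        fields = shlex.split(line)
        if len(fields) <= SENT_BYTES_FIELD:
            continue  # skip blank or malformed lines
        total += int(fields[SENT_BYTES_FIELD])
        count += 1
    average = total / count if count else 0.0
    return total, count, average


print(sent_bytes_stats(SAMPLE_LOG))
```

For HTTP requests, `sent_bytes` covers response headers and body, so the average is directly comparable to an estimated response size.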
We wondered if some other service might be communicating over an Elastic IP or public IP instead of a private IP. To check that, we enabled VPC Flow Logs to another S3 bucket and let them accumulate for a while.
We didn't find any services communicating over an Elastic IP, but we did find some services that could be moved into the same Availability Zone (AZ) to reduce Regional Data Transfer costs.
After eliminating every other possible source of data transfer, we still had no clue what was driving the cost, other than that it came from the ALB. This led to frantic googling and explaining the problem to various people.
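As a rough sketch of that kind of flow log check, the snippet below totals bytes sent from private addresses to non-private destinations. It assumes the default (version 2) flow log format, where `srcaddr`, `dstaddr` and `bytes` are the 4th, 5th and 10th fields. The sample records and the function name are made up for illustration.

```python
import ipaddress
from collections import Counter

# RFC 1918 ranges, which is what VPC private IPv4 addresses normally use.
PRIVATE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Hypothetical records in the default flow log format.
SAMPLE_FLOW_LOG = [
    "version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status",
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 203.0.113.12 443 49152 6 10 8000 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 203.0.113.12 443 49153 6 5 2000 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
]


def is_private(address):
    """True if the address falls inside one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(address)  # raises ValueError on "-" or headers
    return any(ip in network for network in PRIVATE_NETWORKS)


def public_egress_by_source(lines):
    """Total bytes per private source address sent to non-private destinations."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 14:
            continue
        src, dst, byte_count = fields[3], fields[4], fields[9]
        try:
            if is_private(src) and not is_private(dst):
                totals[src] += int(byte_count)
        except ValueError:
            continue  # header line, NODATA/SKIPDATA record, or bad value
    return totals


print(public_egress_by_source(SAMPLE_FLOW_LOG))
```

Sources with unexpectedly large totals are the ones worth a closer look.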
Then over the weekend, trying to think of keywords that would not lead me down to more documentation, I finally found this StackOverflow answer:
Q. I'm curious to know how much total outbound bytes an ELB generates. As far as I can tell, this will be something like
(size of http responses) + (size of SSL handshake transactions)
I can calculate the former by looking at my web server logs. However, I'm having a hard…
A. The handshake of a TLS connection is pretty much constant and do not depend on your application. So your function is really : number of connections * ( size of http responses + constant)
And it finally clicked. This was the missing piece of the puzzle. Since TLS was terminated at the ALB, the handshake bytes never showed up in the Access Logs, which only record the size of the HTTP requests and responses, and the instances never had to deal with the handshake either.
Here is an article that estimates the TLS handshake payload to average ~6.5 KB (depending on your certificate size). Our responses were really small, so the handshake could account for a major part of the bytes sent whenever a user visited the site for the first time. Since a lot of our traffic was new, that added up to a significant chunk.
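To see how quickly this adds up, here is a back-of-the-envelope sketch. The numbers in the example call are illustrative assumptions, not our actual traffic, and the model simplifies by counting the whole handshake as outbound bytes and charging one handshake per new connection.

```python
def egress_estimate(new_connections, requests, avg_response_bytes,
                    handshake_bytes=6500):
    """Estimate outbound bytes as HTTP responses plus TLS handshakes.

    handshake_bytes defaults to the ~6.5 KB average mentioned above;
    the real value depends on the certificate chain being served.
    """
    response_total = requests * avg_response_bytes
    handshake_total = new_connections * handshake_bytes
    total = response_total + handshake_total
    return {
        "response_bytes": response_total,
        "handshake_bytes": handshake_total,
        "total_bytes": total,
        # Fraction of all outbound bytes spent on handshakes.
        "handshake_share": handshake_total / total if total else 0.0,
        # How far off an estimate based on responses alone would be.
        "multiplier_vs_responses_only": (
            total / response_total if response_total else float("inf")
        ),
    }


# Hypothetical workload: every request arrives on a fresh connection
# and gets a 2,000-byte response.
estimate = egress_estimate(
    new_connections=1_000_000,
    requests=1_000_000,
    avg_response_bytes=2_000,
)
print(estimate)
```

With small responses and mostly new connections, the handshake term dominates, which is why an estimate built only from response sizes can be off by several multiples.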
That led to even more Googling on how HTTPS might affect AWS ELB costs and we came upon this very good article that also suggested increasing the default timeout for idle connections to 10 minutes for the ALB along with changing the certificate.
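The idle timeout is a load balancer attribute, so it can be changed without touching the listeners. A minimal sketch, assuming a boto3 `elbv2` client is passed in; the function name and ARN parameter are placeholders for illustration.

```python
def set_idle_timeout(elbv2, load_balancer_arn, seconds=600):
    """Set the ALB idle connection timeout (600 s = 10 minutes).

    `elbv2` is expected to behave like boto3.client("elbv2").
    """
    # ALBs accept an idle timeout between 1 and 4000 seconds.
    if not 1 <= seconds <= 4000:
        raise ValueError("ALB idle timeout must be between 1 and 4000 seconds")
    return elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=[
            # Attribute values are passed as strings.
            {"Key": "idle_timeout.timeout_seconds", "Value": str(seconds)},
        ],
    )
```

A longer idle timeout only helps if clients actually reuse their connections, so it is worth checking the effect on the bill rather than assuming it.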
At the time, we were using the default certificate generated by AWS Certificate Manager (ACM) with the Load Balancer. We measured the size of its TLS handshake with the Wormly Test SSL Tool and then replaced it with a certificate from Let's Encrypt. To be thorough, we also measured the TLS handshake sizes of different websites using different Certificate Authorities.
|Certificate Issuer|Certificate Type|SSL Handshake Size (Bytes)|
|---|---|---|
|ACM|Wildcard (RSA 2048 bits)|5971|
|Let's Encrypt|Single (RSA 2048 bits)|3753|
|Let's Encrypt|Wildcard (RSA 2048 bits)|3702|
|Let's Encrypt|Wildcard (ECC 256 bits)|3323|
|DigiCert|Single (ECC 256 bits)|3311|
|Sectigo|Wildcard (RSA 2048 bits)|6720|
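To put the measurements above in perspective, this small sketch computes each handshake's size relative to the ACM wildcard certificate we started with. The sizes are copied from the table; the helper name is just for illustration.

```python
# Handshake sizes in bytes, taken from the table above.
HANDSHAKE_BYTES = {
    ("ACM", "Wildcard (RSA 2048 bits)"): 5971,
    ("Let's Encrypt", "Single (RSA 2048 bits)"): 3753,
    ("Let's Encrypt", "Wildcard (RSA 2048 bits)"): 3702,
    ("Let's Encrypt", "Wildcard (ECC 256 bits)"): 3323,
    ("DigiCert", "Single (ECC 256 bits)"): 3311,
    ("Sectigo", "Wildcard (RSA 2048 bits)"): 6720,
}

BASELINE = HANDSHAKE_BYTES[("ACM", "Wildcard (RSA 2048 bits)")]


def reduction_vs_baseline(size, baseline=BASELINE):
    """Fractional reduction in handshake size (negative means larger)."""
    return (baseline - size) / baseline


# Print the certificates from smallest to largest handshake.
for (issuer, cert_type), size in sorted(
    HANDSHAKE_BYTES.items(), key=lambda item: item[1]
):
    change = reduction_vs_baseline(size)
    print(f"{issuer:14} {cert_type:26} {size:5d} B  {change:+.1%}")
```

A positive percentage means a smaller handshake than the ACM baseline; the saving applies to every new TLS connection.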
DigiCert was the best with respect to size, but Let's Encrypt was close enough (& free!). ECC certificates are not compatible with some older browsers, so we generated both the smaller ECC and RSA certificates through acme.sh and uploaded them through AWS CLI. The ALB will automatically select the best one depending on the client (read more here).
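We did the upload with the AWS CLI, but the same steps can be sketched with boto3. This version assumes the certificates are imported into ACM and then added to the listener's certificate list. The function and parameter names are placeholders, and the PEM contents are passed in as bytes.

```python
def import_and_attach(acm, elbv2, listener_arn, cert_pem, key_pem, chain_pem):
    """Import one certificate into ACM and add it to an ALB listener.

    `acm` and `elbv2` are expected to behave like boto3.client("acm")
    and boto3.client("elbv2"). The *_pem arguments are the PEM files'
    contents as bytes, for example the files produced by acme.sh.
    """
    imported = acm.import_certificate(
        Certificate=cert_pem,
        PrivateKey=key_pem,
        CertificateChain=chain_pem,
    )
    certificate_arn = imported["CertificateArn"]
    # This adds to the listener's certificate list; it does not replace
    # the listener's default certificate.
    elbv2.add_listener_certificates(
        ListenerArn=listener_arn,
        Certificates=[{"CertificateArn": certificate_arn}],
    )
    return certificate_arn
```

Calling it once for the ECC certificate and once for the RSA certificate puts both on the listener.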
After adding them to the ALB listener and letting Cost Explorer catch up for a day, we saw a significant decrease in our Data Transfer Out Costs for the same number of requests.
It still didn't match our estimates, but it never will, because our estimates never accounted for the TLS handshake size. AWS costs really are a mystery until you actually get the bill.