Flash traffic is one of the pitfalls of autoscaling. The most famous example of this kind of traffic is Black Friday. How do you tackle the scaling difficulties it brings? Let’s find out.
Compared to your normal traffic baseline, flash traffic is an outlier; it’s often many times higher than the amount of traffic you receive on average. Another characteristic is its steep curve: a surge in traffic usually starts the moment a campaign begins. These characteristics make flash traffic very hard to account for in an autoscaling algorithm. At DPG Media, we also encounter flash traffic, but we experience it in live news updates. Here’s our journey describing how we tackled the scaling difficulties regarding flash traffic.
This article about autoscaling is built around DPG Media’s (micro)service that serves live news updates for digital news media. For simplicity, let’s call this service “live-service” for the remainder of this article.
At its core, live-service provides live updates for various events: sports games, disasters, terrorist attacks, voting results, etc. From the list of events, you can probably already guess that the live-service serves very popular content most of the time. It’s both viral content with a high view count and live content through a constant stream of updates.
If you had taken a look under the hood of the live-service a while ago, you would have found a couple of ECS Fargate containers running a Spring Boot application. In front of the containers, an application load balancer acted as the origin for our Content Delivery Network (CDN).
In fact, the amount of traffic to the live-service is of such a magnitude that it wouldn’t survive without a CDN in front of it. And while putting a CDN in front of your content is most often straightforward, this was not the case for the live-service.
Our first lesson was that although we achieved a very high offload percentage to our CDN, the pressure on the live-service was too high every time the CDN cache was purged. It was an interesting pattern to watch: as long as the CDN served from cache, the backend felt no pressure. However, as soon as the CDN cache was invalidated, a massive swarm of requests hit the live-service like a flash flood. The surge arrived in a split second and left no time to scale or react at all, so the live-service just drowned. 😓
This is what traffic reaching the origin (live-service) looked like:
For live-service, every CDN cache purge felt like a DDoS attack unleashed against it. The problem was twofold: too many requests made it to the backend AND all requests were launched in the same split second.
It’s probably already clear by now that flash traffic is a terrible pattern for autoscaling. Add the unpredictable timing of news updates, and you may wonder if there is any room for autoscaling. There is! So let’s take you through it.
The first issue to tackle was the total failure in case of a very high traffic flood: live-service became completely unresponsive and could no longer serve cache updates. To overcome this, we implemented fail-fast patterns. In the end, all a CDN needs is a single response to refresh its cache; it’s okay if all other requests fail.
Failing fast alone, however, was not enough. It was about time to take a closer look at the features of our CDN, Akamai in this case. To help lower the load, we first enabled cache prefreshing, the first nifty trick to take the pressure off the origin.
Akamai Prefreshing allows you to always serve content from cache while making cache-refresh calls to the origin asynchronously. It prevents clients from getting slower responses when the cache has expired, because they no longer have to wait for a refresh call to be forwarded to the origin.
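To make the fail-fast idea concrete, here is a minimal sketch in Python rather than the Spring Boot code live-service actually ran. The `FailFastGate` class, the `max_in_flight` limit and the `(status, body)` return shape are illustrative assumptions, not part of the original setup. The point is only that a request which cannot get a slot is rejected immediately instead of queueing up behind the others.

```python
import threading


class FailFastGate:
    """Hypothetical load-shedding gate: reject work once the limit is reached."""

    def __init__(self, max_in_flight: int) -> None:
        # One semaphore slot per request we are willing to handle concurrently.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, render):
        # Non-blocking acquire: if no slot is free, fail right away
        # instead of letting the request wait and pile up.
        if not self._slots.acquire(blocking=False):
            return 503, "overloaded"
        try:
            return 200, render()
        finally:
            self._slots.release()
```

With a small limit, the few requests that get through are enough for the CDN to refresh its cache, while the rest fail cheaply and leave the service responsive.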
Here’s a Prefreshing example:
Suppose an object’s cache Time To Live (TTL) is set to 10 minutes. Once the cache is populated, all subsequent requests within the next 10 minutes are served from cache. However, when Prefreshing is set to 90% of the TTL, a request that arrives after minute 9 still gets the cached object, and Akamai also forwards a refresh request to the origin asynchronously to keep the cache warm for another 10 minutes.
We must say that asynchronous cache refresh doesn’t solve all problems. With popular content, it’s expected that numerous clients request the same content simultaneously; Prefreshing doesn’t solve that. Take the above example and imagine that thousands of concurrent requests arrive in minute 9. Akamai will serve all those requests from cache, but it will also launch a request to the origin for every single one of them. This is the traffic flash flood we mentioned earlier.
To avoid a flash flood, Akamai has a hidden feature called “make-public-early”. This feature, which only Akamai can turn on and only on request, forwards a single refresh call to the origin and tells the other requests to wait for that one answer. This technique is also known as request collapsing, and its purpose is to reduce trips to the origin.
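The timing in this example can be written down as a small decision function. This is only a sketch of the behaviour described here, not Akamai’s implementation or configuration; the `PrefreshPolicy` name and its fields are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefreshPolicy:
    """Hypothetical model of the TTL and prefresh timing in the example."""

    ttl_seconds: float = 600.0      # 10-minute TTL from the example
    prefresh_fraction: float = 0.9  # prefresh once 90% of the TTL has passed

    def decide(self, age_seconds: float) -> str:
        # Past the TTL the object is expired and the client has to wait.
        if age_seconds >= self.ttl_seconds:
            return "expired: fetch from origin, client waits"
        # Inside the prefresh window the client is still served from cache
        # while a refresh goes to the origin in the background.
        if age_seconds >= self.ttl_seconds * self.prefresh_fraction:
            return "serve from cache, refresh origin asynchronously"
        return "serve from cache"
```

For a request arriving at minute 5, `PrefreshPolicy().decide(300)` yields a plain cache hit. At minute 9.5, `decide(570)` falls into the prefresh window.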
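Request collapsing as a general technique can be sketched in a few lines. The code below is not how Akamai implements it; `CollapsingFetcher`, `fetch_origin` and the `collapsed` counter are hypothetical names, and error handling is left out on purpose (if the origin call raises, waiting callers simply receive `None`).

```python
import threading
from typing import Callable, Dict, Optional


class _Flight:
    """One in-progress origin fetch that other callers can wait on."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Optional[str] = None


class CollapsingFetcher:
    def __init__(self, fetch_origin: Callable[[str], str]) -> None:
        self._fetch_origin = fetch_origin
        self._lock = threading.Lock()
        self._in_flight: Dict[str, _Flight] = {}
        self.collapsed = 0  # requests that piggybacked on another fetch

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            flight = self._in_flight.get(key)
            leader = flight is None
            if leader:
                # First request for this key: it will go to the origin.
                flight = _Flight()
                self._in_flight[key] = flight
            else:
                self.collapsed += 1
        if leader:
            try:
                flight.result = self._fetch_origin(key)
            finally:
                with self._lock:
                    del self._in_flight[key]
                flight.done.set()  # wake up everyone waiting on this fetch
        else:
            flight.done.wait()  # reuse the leader's answer
        return flight.result
```

However many callers ask for the same key while a fetch is in progress, `fetch_origin` runs once for that key and every caller gets the same result.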
A little side note on cache tags: cache tags come in handy with live-service. Using cache tags, you can link objects together to apply the same cache logic to them. Imagine an article, a live blog, and a ticker, each publishing the current score for a particular soccer game. By adding the same cache tag (for example, the match ID) to all these items, you can cache them indefinitely and purge all of them from the cache with a single purge command when the score changes.
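The idea behind tag-based purging can be illustrated with a toy in-memory cache. This is a sketch of the concept only, not the Akamai purge API; the `TaggedCache` class, the URLs and the `match-42` tag are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class TaggedCache:
    """Toy cache that remembers which URLs carry which tag."""

    def __init__(self) -> None:
        self._objects: Dict[str, str] = {}
        self._urls_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def put(self, url: str, body: str, tags: Iterable[str]) -> None:
        # No TTL here: objects stay cached until a purge removes them.
        self._objects[url] = body
        for tag in tags:
            self._urls_by_tag[tag].add(url)

    def get(self, url: str) -> Optional[str]:
        return self._objects.get(url)

    def purge_tag(self, tag: str) -> int:
        # Drop every object linked to the tag and report how many were removed.
        urls = self._urls_by_tag.pop(tag, set())
        return sum(1 for url in urls if self._objects.pop(url, None) is not None)


cache = TaggedCache()
cache.put("/article/123", "1-0", tags=["match-42"])
cache.put("/liveblog/9", "1-0", tags=["match-42"])
cache.put("/ticker", "1-0", tags=["match-42"])
cache.put("/article/456", "election night", tags=["election"])
removed = cache.purge_tag("match-42")  # one purge clears all three score pages
```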
The CDN settings solved the issue, but with everything we learned along the way, we found an even better solution. If you look at live-service nowadays, you see it uses a different origin, requiring a different scaling approach. While most of our services scale using AWS autoscaling principles, live-service hands scaling over entirely to the CDN. In its current setup, live-service is only responsible for keeping the CDN’s origin fresh. To do so, it creates static assets on S3 whenever there’s a news update.
Once again, S3 feels like a Swiss Army knife. With live-service putting its static assets on S3, the CDN only has to point to S3 as its origin. The availability and durability of S3 ensure that we can sleep peacefully. And should that not be enough, there’s always stale serving from Akamai.
To wrap up, the live-service is a perfect example of why I started working for DPG Media. Even with quite a track record, I had never encountered the high traffic volume that hits DPG Media’s army of services. At DPG Media, I learned to cope with traffic on another level, in live-service and in every other aspect of IT, from big data and machine learning to database transactions and messaging. It’s fun and still very challenging. Because high traffic brings high impact, it can sometimes be very stressful as well.
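Publishing an update as a static asset could look roughly like the sketch below. It assumes a boto3-style S3 client is passed in (its `put_object` call takes `Bucket`, `Key`, `Body`, `ContentType` and `CacheControl`). The `publish_update` function, the `live/<event_id>.json` key layout, the JSON payload and the `max-age` value are all assumptions made for illustration, not details of the real live-service.

```python
import json
from typing import Any, Dict


def publish_update(s3_client: Any, bucket: str, event_id: str,
                   update: Dict[str, Any]) -> str:
    """Write one news update to S3 as a static JSON asset (hypothetical layout)."""
    key = f"live/{event_id}.json"  # assumed key scheme, one object per event
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(update).encode("utf-8"),
        ContentType="application/json",
        CacheControl="max-age=60",  # assumed value; tune to the CDN setup
    )
    return key
```

In this shape the service does work only when an editor publishes an update, so the cost of a news event no longer depends on how many readers are watching.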
Enjoy and until next time!