(NOTE: This was originally published in October 2022 🙂👻)
Halloween is around the corner. Buckle up for a spooky engineering ghost story.
A few years ago, I worked as a software engineer at a large company building a video streaming service. Our first customer was a major professional sports league that would use our service to livestream its games once a week to millions of viewers. The opportunity was both exciting and terrifying!
‍When we signed on to the project, our service didn’t actually exist yet. But the league’s broadcast schedule certainly did. 🙂 The launch date was rock solid, and the service had to be able to handle all traffic being sent to us.
‍Is this where the scary part of the story begins? Nope! We had a fantastic engineering team and an architecture design we believed in. The schedule was tight, but we were confident we’d be able to hit our launch date. We put our heads down and got to work.
‍A few weeks before the first broadcast, we were feeling pretty good. The service was built, sans some finishing touches. The team was in the home stretch of load testing to make sure the service would hold up to the traffic at game time, and everything was business as usual.
‍But then…
‍We got our first realistic sample data set from our customer, and we integrated it into our load tests. It did not go smoothly. Based on our budget and our estimates for how much data we would need to store, we had configured a maximum read and write capacity for DynamoDB. But during the load test, we found that we were dramatically exceeding that capacity and running into DynamoDB throttles. Our service failed. Hard.
Be afraid. Be very afraid.
Uh oh. It’s only a few weeks until our first broadcast, and we have a major problem. Our architecture design called for storing data for each individual viewer watching the broadcast, to keep track of where they were in the stream. We had decided to store this data in DynamoDB. After investigating the traffic that the broadcaster was sending us, we discovered the payload for each viewer might be up to 10x larger than our estimates. Because DynamoDB charges for throughput in proportion to item size, this required roughly 10x the write capacity, and 10x the cost!
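To see why a bigger payload hits the bill so directly, here is a rough sketch of the arithmetic. It relies on the DynamoDB rule that one write capacity unit covers one standard write per second for an item of up to 1 KB, with item sizes rounded up to the next KB. The item sizes and write rate below are made-up numbers for illustration, not the real figures from this project.

```python
import math


def write_capacity_units(item_size_bytes: int, writes_per_second: float) -> int:
    """Provisioned WCUs needed for standard (non-transactional) writes.

    One WCU covers one write per second for an item up to 1 KB;
    larger items consume one WCU per KB, rounded up.
    """
    kb_per_item = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(kb_per_item * writes_per_second)


# Hypothetical numbers, for illustration only.
ESTIMATED_ITEM_BYTES = 1024        # what the plan assumed
OBSERVED_ITEM_BYTES = 10 * 1024    # what the sample data looked like
WRITES_PER_SECOND = 100_000        # assumed aggregate write rate

planned = write_capacity_units(ESTIMATED_ITEM_BYTES, WRITES_PER_SECOND)
observed = write_capacity_units(OBSERVED_ITEM_BYTES, WRITES_PER_SECOND)

print(f"planned WCUs:  {planned:,}")
print(f"observed WCUs: {observed:,}")
print(f"multiplier:    {observed / planned:.0f}x")
```

Since provisioned write capacity is billed per unit, the multiplier on capacity is, to a first approximation, the multiplier on that part of the bill.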
‍Our workload was very write-heavy. Some napkin math based on the observed 10x increase in data made it clear that storing it in Dynamo would put us far over budget. These data were ephemeral, so we decided that we could move them out of DynamoDB and into a cache server. We did some quick research on our options and decided to move forward with a managed Redis solution.
‍Managed Redis services have some nice benefits in that you aren’t explicitly responsible for provisioning and operating the individual nodes in your cache cluster. But, you *are* explicitly responsible for determining how many nodes you need in your cache cluster, and how big they need to be.
‍The next step was to write code to simulate the load that we would put on the Redis cluster, and run it... over and over again. We tested different sizes of nodes. We tested different cluster sizes. We tested different replication configurations. We tested. A lot.
‍All this writing of synthetic load tests to size a caching cluster was not work that we had accounted for in our engineering plans. Experimenting with different sizes (and types) of cache nodes, monitoring them to ensure they weren’t overloaded during the test runs… These tasks were expensive and time consuming—and largely ancillary to the actual business logic of the service we were trying to build. None of them were especially unique to us. But we still had to allocate precious engineering resources to them.
‍After a week, we had nailed down the sizing and configuration for our cluster, still racing against the clock. After another week, we had completed the work to migrate that part of our code off of Dynamo onto the Redis cluster.
‍And the service was up and running again.
It’s alive! It’s aliiive!
We did it! The first broadcast went smoothly. As with any major software project, after observing it in action in the real world, we learned some lessons and found some things to improve, but the viewers had a good viewing experience. We rolled out some of those improvements during the subsequent weeks, and before we knew it, the season was well underway. Victory!
Until…
‍About a month into the season, we got our AWS bill. To say that it caused us a fright would be an understatement. The bill was… HUGE! What the heck happened?!
## It’s coming from inside the house!
Because of our architecture, we knew that the biggest chunk of our bill was going to come from DynamoDB. But we had done a reasonable job of estimating that cost based on our DDB capacity limits. So why was the AWS bill so high?
‍It turns out that the culprit was our Redis clusters. In retrospect, it was predictable, but we had been so busy just trying to make sure that things were operational in time to meet our deadlines, we hadn’t had time to do the math.
‍To meet the demands of our peak traffic during the games, we had been forced to create clusters with 90 nodes in them—in every region that we were broadcasting from. Plus, we needed each node to have enough RAM to store all the data we were pumping into them, which required very large instance types.
## Is this place haunted?
The very large instance types that provided the RAM we needed also happened to come with high vCPU counts. Redis executes commands on a single thread, meaning it can effectively take advantage of only one vCPU on each node in the cluster, leaving the remaining vCPUs almost 100% idle.
‍So there we were, paying for boatloads of big 16-vCPU instances, and we were guaranteed each one of them would never be using more than about 6% of the CPU it had available. Believe it or not, this wasn’t even the worst of it.
‍The peak traffic we would experience during the sports broadcasts dwarfed the traffic we were handling during any other window of time. So not only were we forced to pay for horsepower that we weren’t even fully utilizing during the games, but we were paying for these Redis clusters 24 hours a day, seven days a week, even though they were effectively at 0% utilization outside of the 3-hour window each week when we were broadcasting the sporting events.
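Putting those two facts together gives a sense of how little of the hardware was doing anything. This is a back-of-the-envelope sketch using only the numbers from the story (16 vCPUs per node, one usable by Redis, a 3-hour broadcast per week). It generously assumes the one usable vCPU was fully busy for the whole broadcast, so treat the result as an upper bound.

```python
VCPUS_PER_NODE = 16            # instance size from the story
VCPUS_REDIS_CAN_USE = 1        # single-threaded command execution
BROADCAST_HOURS_PER_WEEK = 3
HOURS_PER_WEEK = 24 * 7

# Best case while a game is on: one core busy, fifteen idle.
cpu_ceiling = VCPUS_REDIS_CAN_USE / VCPUS_PER_NODE

# Share of the week during which there is any real traffic at all.
busy_fraction = BROADCAST_HOURS_PER_WEEK / HOURS_PER_WEEK

# Upper bound on the share of paid-for CPU time that did useful work.
effective_utilization = cpu_ceiling * busy_fraction

print(f"CPU ceiling during a game: {cpu_ceiling:.2%}")            # 6.25%
print(f"Share of the week with traffic: {busy_fraction:.2%}")     # 1.79%
print(f"Effective CPU utilization: {effective_utilization:.2%}")  # 0.11%
```

In other words, in the best case roughly one-tenth of one percent of the CPU time on the bill was being used.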
‍And then the season ended and we had no more sports broadcasts for 6 months. So now those clusters were sitting at approximately 0% utilization 24-7.
‍Okay, fine. Problem identified. All we had to do was fix it and get our cloud bill under control!
## A horde of zombie… engineers!
Well, it turns out fixing our spend on our Redis clusters was much easier said than done. The managed Redis service didn’t have any easy, safe way to scale the clusters up and down. And because Redis clients handle key sharding on the client side, they have to be aware of the list of available servers at any given time, meaning that scaling the cluster in or out carries a high risk of impacting cache hit rate during the transition, and thus would need to be managed very carefully.
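To make the cache hit rate risk concrete, here is a small sketch of what happens to key placement when a cluster shrinks. It assumes the simplest possible client-side scheme, hashing the key and taking it modulo the node count. That is an illustrative assumption, not a description of the client we used; consistent hashing or hash-slot schemes move fewer keys, but they still move some. The `viewer:` key format is also made up.

```python
import hashlib


def node_for(key: str, node_count: int) -> int:
    """Pick a node index with naive modulo sharding (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def fraction_remapped(keys: list[str], old_count: int, new_count: int) -> float:
    """Fraction of keys whose node changes when the cluster is resized."""
    moved = sum(
        1 for key in keys
        if node_for(key, old_count) != node_for(key, new_count)
    )
    return moved / len(keys)


# Hypothetical key space: one entry per viewer.
keys = [f"viewer:{i}" for i in range(100_000)]

# Scale in from the 90-node peak cluster to a hypothetical 30-node one.
moved = fraction_remapped(keys, old_count=90, new_count=30)
print(f"{moved:.1%} of keys now map to a different node")
```

With this scheme, roughly two-thirds of the keys land on a different node after the resize. Every one of those is a cache miss until the data is rewritten, which is why the transition has to be handled so carefully.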
These were solvable problems. Throw enough engineers at something, and anything is possible, right? They could update all of the code so that it writes to both clusters during a scaling event and has reads fall back from the new cluster to the old one on cache misses during the transition. Then, they could scale down by adding a second, smaller Redis cluster alongside the giant one needed for peak traffic. They could definitely handle the work of meticulously monitoring the behavior of the new code while the new cluster was brought online, and they could decide when it was safe to begin tearing down the old cluster. Oh, and they could kick that off and meticulously monitor it to make sure it went smoothly, too.
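For a sense of what that code change involves, here is a minimal sketch of the dual-write idea. We never built this, so everything here is hypothetical: the `MigratingCache` wrapper, the `Cache` interface, and the in-memory stand-in that replaces a real cluster client so the example runs anywhere.

```python
from typing import Optional, Protocol


class Cache(Protocol):
    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, value: bytes) -> None: ...


class InMemoryCache:
    """Stand-in for a real cluster client (hypothetical, for the sketch)."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class MigratingCache:
    """Writes to both clusters; reads the new one, then falls back to the old."""

    def __init__(self, old: Cache, new: Cache) -> None:
        self.old = old
        self.new = new

    def set(self, key: str, value: bytes) -> None:
        # Dual write keeps both clusters current during the transition.
        self.new.set(key, value)
        self.old.set(key, value)

    def get(self, key: str) -> Optional[bytes]:
        value = self.new.get(key)
        if value is None:
            # Miss on the new cluster: the old one may still hold the entry.
            value = self.old.get(key)
        return value


old_cluster = InMemoryCache()
new_cluster = InMemoryCache()
old_cluster.set("viewer:1", b"position=42")   # written before the migration

cache = MigratingCache(old=old_cluster, new=new_cluster)
print(cache.get("viewer:1"))                  # served by the old cluster
cache.set("viewer:2", b"position=7")          # lands on both clusters
print(new_cluster.get("viewer:2"))
```

The wrapper itself is short. The expensive part is everything around it: error handling when one of the two writes fails, deciding when the old cluster is safe to remove, and watching the whole thing twice a week.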
‍So sure, our team was capable of doing that twice a week: once when we needed to scale up in preparation for the sports broadcast, and again when we needed to scale down to save costs after the event.
‍But that would be a ton of work. Now we were forced to do some math on how much we were paying those engineers vs. how much we were paying for the overprovisioned Redis clusters.
‍And then there’s the opportunity cost: none of this cluster scaling nonsense had any unique business value for us, and we had a limited number of engineers available to work on delivering features actually unique to our business and provide actual customer-facing value to our users.
‍I bet you can guess where we landed. Yep. We never reached a point where we felt like we could justify the engineering cost it would take to try to solve this problem when there were so many more valuable customer projects our engineers could be doing—projects which would actually move the business forward and win us new customers.
‍So we just kept paying. For something we weren’t using.
‍At a certain point, if our business was struggling, we might have been forced to allocate the engineering resources to solving this problem in order to reduce our spending and balance the budget. But this would have been a sign that we were in trouble.
‍And I don’t know how you feel about the cloud services your team spends money on, but I consider it pretty scary that a cloud service can make it so complicated for you to get a fair bill—a bill where you are paying a fair amount for what you are actually using, and not paying a ton of money for resources that are sitting idle—that you will only be able to make time for it if you’ve gotten into a desperate situation.
‍It’s a great business model for the cloud service provider. Not a great business model for the customer.
‍It doesn’t have to be this way.
## Momento Cache: All treat, no tricks!
The horrific tale you’ve just read was a large part of the inspiration for us to build Momento’s serverless caching product. One of the best things about serverless cloud services is the fair pricing model: pay for what you use and nothing more. Why should we settle for less with caching?
‍With Momento, you get a dead-simple pricing policy based strictly on how many bytes you send to and receive from your cache. We don’t think you should have to pay more if those bytes are all transferred within a 3-hour window or are evenly distributed over the course of a week or a month. As far as we’re concerned, you should be able to read and write your cache when you need it. That’s it. Plain and simple.
‍Of course, serverless doesn’t stop there. We manage all of the tricky stuff on the backend for you. If your traffic increases and your cache needs more capacity, that’s on us. If your traffic decreases, you shouldn’t have to pay the same amount of money for your low-traffic window as you did for your high-traffic window. And you most certainly shouldn’t have to pay for 15 idle CPU cores on a bunch of nodes in a caching cluster just because you needed more RAM.
‍So: stop letting cloud services trick you into paying for caching capacity that you aren’t using, and see what a treat it is to work with Momento today! You can create a cache—for free—in less than five minutes. If it takes more than five minutes, let us know, and we’ll send you some Halloween candy.
‍Visit our getting started guide to try it out, and check out our pricing page to see how we make sure you get what you pay for.
‍Happy Halloween! 👻