DEV Community

Discussion on: Why you should never use sticky sessions

Kalinda Pride (sciepsilon) • Edited

Thanks for the article, and congrats on making it into The Overflow newsletter, where I found you!

"A really LARGE app might need a really LARGE session"... this got me thinking about my own situation. So here's a case study! Now that I've written it, our "sticky session" system is more like a distributed cache, but it bears some similarities. Thoughts?

I work on an app that serves procedurally generated content. When a user makes a request, we put together a video based on their history and request parameters, then we stream the video to them. Many videos get reused, but we put together a never-before-seen video for nearly half of the requests. We save a description of the video to a database so we can re-create it if needed, but we stream the video directly from the server instance that produces it. The streaming is, I believe, a sticky session.

We've considered copying the video to a shared S3 bucket and streaming it from there, but the initial lag from creating and copying the video would be too long. Next, we experimented with switching the source of the stream: start streaming from the server where the video is being created, then copy the video to S3 and switch to streaming from S3 once the copy is complete. This wasn't actually better. We would have gained performance and saved money by putting the video into a shared cache and by freeing up disk space on the source server once the copy finished, but those gains were fully offset by the increased CPU load from copying the file, the need for more S3 storage, and the extra network usage and complexity required to pull it off.

Our setup suffers from the pitfalls of sticky sessions, but we've mitigated them somewhat.

  • When a server goes down, it interrupts all streams from that server. (Duh.) The user can get their video back, because we've saved the video description in our database, but it's kind of a pain. They have to request the same video again (which can take several clicks), then wait several seconds for us to re-create the video from its description.
  • Scaling could be better. When we change the number of servers, existing streams can continue uninterrupted, but any interruption (e.g. a flaky internet connection or redeploying a server) will result in the same annoying experience as a server going down, since the request will be redirected to a new server. Fortunately the app is just for recreational use. No one dies if we interrupt a video stream, and the interruptions from scaling are minor compared to the general flakiness of consumer internet.
  • DoS attacks are a concern, but our load balancing is opaque and random enough to make an attack on a single server unlikely. We balance the load with modular arithmetic on the video description's database ID (e.g. with 12 servers, IDs that are multiples of 12 go to server 0 and multiples of 12 plus 7 go to server 7). We don't expose the numerical ID, only a reversible keyed hash, so from the point of view of an attacker who hasn't cracked the hash, each request is assigned at random (there's a toy sketch of the scheme just after this list). If an attacker wanted to concentrate a DoS attack on server 0, they would either have to request the same video over and over (which is easy to counteract once we realize what's going on), or they'd have to know our hashing key and algorithm, plus how many servers we have running, so they could request the right video description IDs. They could try to request particularly CPU-intensive videos, but it wouldn't help much: the worst video description is only about 10 times as resource-intensive as normal, and the load balancing means they can't force new video descriptions to be assigned to a particular server.
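
For the curious, here's a toy Python sketch of that routing scheme. It's not our actual code: the server count and key values are made up, and the modular-multiplication cipher is just a stand-in for whatever reversible keyed hash you might use.

```python
# Toy illustration of modular load balancing plus a reversible keyed ID
# obfuscation. All constants are made up; the cipher is only a stand-in.

NUM_SERVERS = 12                      # assumed number of video-creation servers
PRIME = 2_147_483_647                 # prime larger than any video description ID
SECRET_MULTIPLIER = 387_420_489       # secret key, coprime with PRIME
INVERSE = pow(SECRET_MULTIPLIER, -1, PRIME)  # modular inverse, used to decode

def pick_server(video_id: int) -> int:
    """Modular load balancing: video description ID -> server index."""
    return video_id % NUM_SERVERS

def encode_id(video_id: int) -> int:
    """Reversible keyed obfuscation; this token is what clients see.
    Works for IDs below PRIME."""
    return (video_id * SECRET_MULTIPLIER) % PRIME

def decode_id(token: int) -> int:
    """Recover the real ID from a token (requires the secret key)."""
    return (token * INVERSE) % PRIME

if __name__ == "__main__":
    vid = 8_675_309
    token = encode_id(vid)
    assert decode_id(token) == vid
    print(f"video {vid} -> public token {token} -> server {pick_server(vid)}")
```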

So, are these true "sticky sessions"? Is the way we've handled the sticky-session problems widely useful, or good for our use case only? (Or could we do better?)

George Koniaris (gkoniaris)

Hey sciepsilon,

Thanks for providing such a detailed description! What you describe here is the same use case as having an open WebSocket connection. Of course, you are not going to renew the connection on every message just to avoid stickiness. Also, you can't have every video saved on every server. I haven't worked with video streaming, but a nice approach would be to have a master server for each video and at least one other server acting as a replica.

Based on your description, you are using AWS. I will make an assumption here (correct me if I am wrong): that you use EBS to store your videos. One thing I would consider is a mechanism that auto-mounts a failed server's EBS volume (the place where you store videos) onto a healthy server, and uses that server as the master for those videos until the old one comes back up. This would require a good amount of DevOps work, so it's just a rough idea from someone who doesn't know the internals of your web application.
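
To make it a bit more concrete, here is a very rough boto3 sketch of that failover. The region, instance IDs, volume ID, and device name are all placeholders, and the healthy instance still has to mount the filesystem itself.

```python
# Very rough sketch of the EBS failover idea with boto3 (placeholder IDs).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region

FAILED_INSTANCE = "i-0aaaaaaaaaaaaaaaa"    # instance that went down
HEALTHY_INSTANCE = "i-0bbbbbbbbbbbbbbbb"   # instance that takes over as master
VIDEO_VOLUME = "vol-0cccccccccccccccc"     # EBS volume holding the videos

# Force-detach the volume from the dead instance...
ec2.detach_volume(VolumeId=VIDEO_VOLUME, InstanceId=FAILED_INSTANCE, Force=True)
ec2.get_waiter("volume_available").wait(VolumeIds=[VIDEO_VOLUME])

# ...and attach it to the healthy one.
ec2.attach_volume(VolumeId=VIDEO_VOLUME, InstanceId=HEALTHY_INSTANCE, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VIDEO_VOLUME])

# The healthy instance still needs to mount the filesystem, e.g.
#   mount /dev/xvdf /var/videos
```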

Now, I don't consider streaming a video to be a true "sticky session". The real problem would occur if you didn't have a fallback mechanism in case of a server failure (even if it takes a minute or two to rebuild the videos). I would really like to hear your opinion on this, as it's a special case I had never thought of.

Kalinda Pride (sciepsilon)

I think you're right - my video-streaming example isn't really a sticky session. And yes, we're using EBS. :)

The "master and replica" idea is an interesting one. It's similar to copying the videos to S3 as they're created, but I'm assuming the replica would receive a copy of the request and generate its video independently, rather than receiving its video from the master. This would definitely increase reliability: when the master goes down or we redeploy to it, the replica can pick up right where it left off. With the right network architecture, I think we could even make the transition invisible, without the user having to make a new request or open a different socket connection.

Of course there's a cost too. Since we're generating each video twice, we would need double the amount of compute power for creating the videos. I don't think the tradeoff would be worth it for us, but it would be a clear win if we were a less error-tolerant application. For us, perhaps a hybrid solution is possible where we spin up replicas only during a deployment or when we're expecting trouble.

We've also taken a couple other steps to improve reliability that I didn't mention in my earlier comment. The biggest one is that we use dedicated servers for creating the videos, with all other tasks (including creating the video description) handled elsewhere. We deploy changes multiple times a day, but we only deploy to the video creation servers every few weeks, when there's a change that directly affects them. That separation, combined with the fact that our service is for recreation rather than, say, open-heart surgeries or high-speed trading, lets us be pretty lax about what happens during a deployment. :-P We also do some content caching with Cloudfront, but I don't think that really affects the issues we've been discussing.

I didn't know it was possible to mount a failed server's EBS storage on another server! I always assumed that a server was a physical machine with its CPU and storage in the same box. I don't think we'll actually do this, but I'd still like to learn more. Can I read about it somewhere?

George Koniaris (gkoniaris)

Hi again!

I think EBS storage can be mounted on another instance if it's first detached from the one it's currently mounted on. I don't know what happens when the server goes down, but if the failure is not in the EBS volume itself, I think it can be mounted on the new one. Also, you don't have to generate the video on every replica; you can just use scp or rsync to copy the file to the server that needs it. That would double the cost of your EBS storage, but it would greatly reduce the CPU load if you decided to use replicas. I think this is the easiest way to keep replicas of your videos, at the cost of some internal network traffic (as far as I know AWS has great internal networks, so that wouldn't be a problem).
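
Something as small as this could run right after a video finishes rendering (the host name and paths are hypothetical):

```python
# Minimal sketch of rsync-based replication: push a finished video to its
# replica server over the internal network. Host and paths are made up.
import subprocess

def replicate_video(path: str, replica_host: str = "replica-1.internal") -> None:
    """Copy a finished video file to the replica via rsync over SSH."""
    subprocess.run(
        ["rsync", "--archive", "--partial", path, f"{replica_host}:/var/videos/"],
        check=True,  # fail loudly so the copy can be retried or alerted on
    )

# replicate_video("/var/videos/12345.mp4")
```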

This is the first article I bumped into that explains how to mount EBS storage on a server:

devopscube.com/mount-ebs-volume-ec...

In the hypothetical case that you decide to mount the EBS storage on another server, you can also create a clone of the volume itself and mount the cloned version on the replica server. That way, when your master server comes back up, the original EBS volume will still be mounted on it. Unfortunately, an EBS volume is still billed even when it's not mounted to an instance, so you may have to delete the cloned volume after the problem has been resolved.
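
Again, just a sketch with placeholder IDs, zone, and device name; roughly what the clone-and-mount path looks like with boto3:

```python
# Sketch of the clone-then-mount variant with boto3 (placeholder IDs and zone).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshot the original volume, then create a new volume from the snapshot...
snap = ec2.create_snapshot(VolumeId="vol-0cccccccccccccccc",
                           Description="temporary clone of the video volume")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

clone = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                          AvailabilityZone="us-east-1a")
ec2.get_waiter("volume_available").wait(VolumeIds=[clone["VolumeId"]])

# ...and attach the clone to the replica, leaving the original on the master.
ec2.attach_volume(VolumeId=clone["VolumeId"],
                  InstanceId="i-0bbbbbbbbbbbbbbbb", Device="/dev/sdg")

# Remember to delete the cloned volume (and the snapshot) afterwards,
# since unattached EBS volumes are still billed.
```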

Personally, if I had to set up replication, I would start with rsync or scp, as they are the easiest option and don't require any extra DevOps work.

Kalinda Pride (sciepsilon)

Thanks!