Mike Talbot ⭐

Server Sent Events are still not production ready after a decade. A lesson for me, a warning for you!

TL;DR

  • I've had an incident with SSE that caused real client pain; I detail it below
  • Server-Sent Events are a long-established and recommended method of pushing data from the server to the client
  • Many articles proclaim their benefits: automatic reconnection (including catching up on missed messages), being easier and more reliable than sockets, etc.
  • Buried in the fine print of the spec is something that renders them completely unreliable in uncontrolled environments (apps and public websites that must be generally available)
  • Don't expect your events to be delivered any time soon on some corporate or older networks

Fire! Fire!

I hate feeling like an idiot. You know the score: I released a version of our software, automated and manual tests say it's fine, scalability tests say it's fine. Then 10 days later a client account manager says "XYZ Corp" are complaining that the software is "slow" to log in.

Ok, you think, slow to log in, let's take a look. Nope, nothing slow, no undue load, all the servers operating well. Hmmmm.

Client reports it's still "very slow" to log in. Ok, eventually I think to ask "how slow?" - 20 minutes - wooooo - 20 minutes isn't "slow" to log in, 20 minutes is basically utterly f**ked.

We look everywhere, everything is fine. Their account is fine. It must be a network thing and sure enough it is.

Since launch we've used Server-Sent Events for notifications to the client, but recently we moved to using them for more: basically, we send requests to the server, which returns immediately to say they are enqueued, and the results pitch up later via a server event. It was a major clean-up of our process, much faster and massively scalable. Except our events were never, ever getting delivered for a small number of clients.
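For a rough idea of the shape of that pattern, here's a minimal browser-side sketch (the endpoint names and payload fields are illustrative, not our real API): the request resolves as soon as the job is enqueued, and the actual result arrives later over the SSE channel keyed by job id.

```javascript
// Request/response split across a normal fetch and a server-sent event.
const pending = new Map(); // jobId -> resolve function

const events = new EventSource('/api/events');
events.onmessage = (e) => {
  const { jobId, payload } = JSON.parse(e.data);
  const resolve = pending.get(jobId);
  if (resolve) {
    resolve(payload);
    pending.delete(jobId);
  }
};

async function enqueue(command) {
  const res = await fetch('/api/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(command),
  });
  const { jobId } = await res.json(); // server answers "enqueued" straight away
  // ...and the result pitches up later via the SSE channel above
  return new Promise((resolve) => pending.set(jobId, resolve));
}
```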

"Our events were never getting delivered for some clients" - oh crap, architectural failure, a reason for sporadic bugs has suddenly escalated into a priority 1, my underwear is on fire, cock up.

What's happening is this: something between our servers and the client's computer is screwing up the events, holding them forever. The only reason it ever works is that every few minutes the connection is re-established, and the "reliability" of SSE means we then get the messages that were swallowed.

Cue a bunch of devs and devops scouring the internet for what is actually happening, what we forgot and what we need to set to make this work. The answer is bad news: there is nothing we can do!

The problem with SSE

The issue is this: SSE opens a stream to the client with no Content-Length and sends events down it as they become available. Here's the rub, though: it relies on chunked Transfer-Encoding, and Transfer-Encoding is hop-by-hop - it only governs delivery to the next node in the chain.

Any old proxy in the middle of the connection between your server and the client's device can legally just store all those packets up and wait for the stream to close before forwarding them. It will do that because it sees no Content-Length and thinks: my client wants to know how big this is. Maybe the code predates text/event-stream, which needs no Content-Length - who knows - but they are out there and they're gonna steal your lunch money.

Yep, this is all within the spec, perfectly legal, and there is no header you can send to disable it. You say "I wanna send it all as it comes" and the next node in the chain just overrides that and says "I think I'll buffer this until it's done".

Sure, you can disable buffering on NGINX (one hop from your server), but who knows what else out there just broke your app. Bottom line: if you don't control the network infrastructure, you can't rely on SSE.
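For reference, this is roughly what the server side and the usual "please don't buffer" hints look like - a minimal sketch in plain Node, not our actual code. Cache-Control: no-transform and X-Accel-Buffering: no help with proxies you control (NGINX honours the latter), but nothing here can force an arbitrary intermediary to pass events through as they're written:

```javascript
const http = require('http');

http.createServer((req, res) => {
  if (req.url !== '/events') {
    res.statusCode = 404;
    return res.end();
  }

  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no', // NGINX-specific hint: don't buffer this response
  });

  // Push an event every few seconds. A buffering intermediary is still free to
  // hold all of these until the connection closes - the failure described above.
  const timer = setInterval(() => {
    res.write(`data: ${JSON.stringify({ now: Date.now() })}\n\n`);
  }, 3000);

  req.on('close', () => clearInterval(timer));
}).listen(3000);
```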

Bummer.

Pushing to a client

Ok, so there are basically four ways of getting "uninitiated" messages from the server to the client:

Websockets
Bidirectional communication between client and server - an open "socket" on each end to receive information. Sockets are great when they work; getting them to work stably is difficult, and we can use libraries like socket.io to help, which fall back to a number of techniques (like Long Polling) when a socket isn't available.
  • Need help from libraries to work in the real world, due to all sorts of complicated handshaking and breakdown of operations in some environments (especially load balancing).
  • Libraries like socket.io use long polling to get an early connection and then try to upgrade.
  • Upgrading fails in lots of real-world circumstances, so you can only rely on the performance of long polling.
  • You need to do your own handling of lost messages and reconnection.

Server Sent Events
One-way push communication from the server to the client over an always-open connection.
  • Handles disconnect/reconnect and missed messages.
  • No need for special handshaking.
  • Network proxies and other hardware/software can totally break it.

Long Polling
The client opens a connection to the server, which waits until it has messages and then sends them; the client immediately opens a new connection. Feels like an always-available channel to push from the server (a minimal client loop is sketched just after this list).
  • Always works, although some proxies may close the connection quickly if no data starts to be sent - a performance overhead but not a failure.
  • Requires frequently re-establishing a connection, with the cost of header exchange etc.
  • Connection-specific resources and caching are harder, as each new block of messages could be routed to a different server without sticky sessions.

Polling
The naive method: the client requests events on a regular basis.
  • Latency - we only get messages when we ask, so we can't rely on it for high-speed messaging (e.g. chat).
  • We make requests even if there isn't any data - a potentially significant performance impact.
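To make the long-polling option concrete, here's a minimal client loop under assumed names (/api/poll, the cursor parameter and the 2-second back-off are illustrative, not our real API):

```javascript
// Minimal long-polling loop. The request hangs on the server until messages are
// available or it times out, then the client immediately reconnects and asks for
// anything newer than the cursor it last saw.
async function pollForever(onMessages) {
  let cursor = ''; // id of the last message we've processed, so the server can resume
  for (;;) {
    try {
      const res = await fetch(`/api/poll?after=${encodeURIComponent(cursor)}`);
      if (res.ok) {
        const { messages, nextCursor } = await res.json();
        if (messages.length) onMessages(messages);
        if (nextCursor) cursor = nextCursor;
      }
    } catch (err) {
      // Network hiccup: back off briefly before reconnecting
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }
  }
}
```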

Here is one of the articles we used when initially deciding on using SSE

Conclusion

We've now rewritten our layer to use Long Polling so we can have consistent performance in the normal environments our software operates in (the internet and some very old corporate and industrial networks). It works. I wish I'd known the limitations of SSE before, but I only found the one paragraph in the spec about this very, very late in the day.

Top comments (23)

Kévin Dunglas

Does the problem occur if the connection is using HTTPS? Most old proxies aren't able to do HTTPS decryption, and so cannot read the headers.
Also, instead of a custom long-polling system, you could continue to use SSE. It's perfectly valid for the SSE server to close the connection after every push (as in long polling). By using this workaround you still benefit from the EventSource class. You can even go one step further and detect whether the connection is using HTTP/1 or HTTP/2: if it is using HTTP/1, you can close the connection after every event for compatibility with old proxies, and continue to use a persistent connection with HTTP/2 (because AFAIK all modern proxies supporting HTTP/2 support SSE too).
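For anyone wanting to try this, a rough sketch of a close-after-every-push handler (Express-style; fetchEventsSince is a hypothetical store lookup, and the retry value is illustrative). EventSource reconnects on its own and sends a Last-Event-ID header, so the server can resume from where it left off:

```javascript
const express = require('express');
const app = express();

app.get('/events', async (req, res) => {
  res.set({
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
  });

  const lastId = req.get('Last-Event-ID') || '0';
  const events = await fetchEventsSince(lastId); // hypothetical: events newer than lastId

  for (const ev of events) {
    res.write(`id: ${ev.id}\n`);
    res.write('retry: 1000\n'); // ask the browser to reconnect quickly
    res.write(`data: ${JSON.stringify(ev.payload)}\n\n`);
  }

  // Closing the response forces any buffering intermediary to flush what it holds.
  res.end();
});

app.listen(3000);
```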

tares42

Thanks Mike for this great post and thanks Kévin for this very useful hint. Closing the connection helped me a lot in getting my application to work properly. The data is now received immediately even when some old proxies are in between. The SSE connection is re-established automatically after the reconnection delay.
SSE is production ready!
scrumpoker.works/

Mike Talbot ⭐

It's a very good point. We serve HTTP/2 (via AWS CloudFront). What the client browser is actually getting I don't know - yet. I'll report back if I find any more information there.

Sjoerd82

Did you get any insights on this? For my application SSE seems a very good fit, but your story put me on guard.

If forcing an HTTPS connection (which by now should be widely acceptable, if not flat-out the default for normal everyday applications) mitigates this issue, then that is valuable knowledge to add to the equation here.

Kenric Ashe

Sjoerd82, in your experience, assuming you force HTTPS, has SSE been stable for you?

Kenric Ashe

I would also like to know whether HTTPS solves the issue. Thank you.

Kenric Ashe

Hello Kévin, having just found this article and comments a year after they were published, I am curious if you have experienced any issues with your proposed workarounds. In your opinion, is SSE production ready? I am at the point where I am ready to load test my app and thus I hope the answer is yes. ;-)

Kévin Dunglas

Hi @kenricashe. Yes, in my opinion SSE is totally suitable for production. I've used it in prod for years on my own projects, and I also manage a SaaS product built on top of SSE and the Mercure protocol, which has many customers and serves a huge number of SSE connections every day without problems (mercure.rocks).

Sean Allin Newell

Excellent write up. I shall reap your hard earned experience.

Felix Terkhorn • Edited

Yeah, sharing this type of difficult journey with the community has a positive impact (as evidenced by the comments). I've spent some time with websockets, and the summary of their warts is spot-on.

I'll absolutely think twice before considering SSE for anything other than lab work!

Thanks!

Joakim Hedlund • Edited

Interesting discovery! As soon as I reached the "20 minutes" paragraph I assumed it was a proxy or router along the way that killed the connection due to inactivity. I didn't expect the lack of Content-length to be the culprit. Love the comparison table.

I disagree with the premise of the title, though. I get the impression that the application has been built to use client requests as a means to trigger SSE responses. SSE is great for pushing auxiliary real-time information, but the client should not receive responses to its own actions through the SSE channel; for e.g. logging in, it should still receive a response as part of its regular RESTful request.

I think Server-Sent Events are production ready, but I think their usage should be limited to messages like "oh hey, you have 2 new notifications", leaving the client to decide whether to fetch the notifications, instead of "Oh hey your Aunt May called and asked how the [...]". Much like getting an SMS letting you know that you've got a voicemail, but leaving you to decide whether to listen to it.

Mike Talbot ⭐

Hmmm. Well, they wouldn't work well for notifications in your last point, given the limitations. SSE is frequently described as a valid choice for something like a chat app - which it clearly isn't. I can find no documentation other than the spec that indicates otherwise. Many applications absolutely require back-end-initiated communication - anything which effectively relays. If that is not working, the entire principle of server-sent events is broken to my mind.

  • You can't write a chat app - because new messages won't be delivered - let alone what I'm using it for.

I guess I'd like to see, splattered all over the articles about it: this won't work through some routers or networks. Hence I wrote this. It's not production ready for a whole series of use cases that are documented by others.

Joakim Hedlund

I should have expanded on my notification example, sorry about that. What I meant was that the client has the ability to send a request to /voicemails and receive a response as part of that request, making the SSE a helpful nicety. Progressive enhancement, if you will :)

A chat app implies two-way communication though, so I think conceptually SSE is ill-suited for that purpose. If you choose to go that route - with what we've now learned in mind - I'd then implement some sort of handshake mechanism to test the connection and fall back on polling if the handshake does not complete in time.

As for corporate networks, I think it's safe to assume there will always be quirky setups that prohibit beyond-basic usage. My favorite pet-peeve example of corporate stupidity is password policies that restrict lengths or confine you to Latin alphanumeric characters.

Mike Talbot ⭐

Yes we see that all the time too on passwords.

That and the fact that 24% of my users are on IE11. Nice. It's the one thing as a developer of enterprise apps that always concerns me - caniuse.com uses browser stats for visitors - clearly not many devs are running IE11 on their development machine while browsing docs - so the stats always seem very low.

I know where you are coming from with your point on progressive enhancement and the use of SSE. My point is that the documentation says "it does this", many articles about it say "it does this". And then, way down in the bowels of the thing, there is a paragraph that says:

Authors are also cautioned that HTTP chunking can have unexpected negative effects on the reliability of this protocol. Where possible, chunking should be disabled for serving event streams unless the rate of messages is high enough for this not to matter.

"Where possible" is the killer :)

And then somewhere else entirely you can find a reference to the fact you can't actually disable chunking on a network you don't own.

John Peters

OMG Mike, what an excellent discovery! If the design requires sockets then it's most likely for speed (direct peer to peer), right? I'm wondering what something like RabbitMQ would have done in this situation.

Thank you for this Long Polling tip.

Mike Talbot ⭐

Hey John, we are using Bull and moving to Rabbit on the back end, and indeed that's what gives us the ability to easily rewind events on a reconnect. In this case I think it's either sockets - but I just hate the amount of code we have to write around socket.io - or long polling, which is basically working now; we had to do the reconnect stuff ourselves, but that is easier given just how simple long polling is compared to sockets. Performance seems to be holding up, but this is a live situation haha! Not done full scalability on it yet...

David Sugar • Edited

I had considered ZMQ / scalable socket services for p2p telephony signaling, but its bidirectional eventing is not all that strongly developed; mostly ZMQ is still about unidirectional messaging.

Kasey Speakman

Huge thanks for this article. Just found it while researching SSE. Very helpful as I also support clients with industrial networks.

Curious if you could still use SSE with clients/networks that support it. Have the server send a canary message right away upon connecting. If the client does not receive the canary message within a few seconds of connecting, then the client knows SSE is not safe to use and switches to long polling.

Mike Talbot ⭐ • Edited

Yes, we still do that. We have a fallback to Long Polling with a pretty simple layer over it. If we don't get messages from SSE in response to an initial "ping", then we have a layer that basically flushes the stream every time (with a small debounce delay) and reopens it. We send a command to the server that says "treat this stream as always close and reopen". Closing the stream does cause the proxies to forward on all of the data.
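For what it's worth, a minimal sketch of that kind of "ping test" on the client side (the event name, endpoint and timeout are illustrative, not our real code): open the EventSource, and if the initial ping doesn't arrive within a few seconds, assume something in the path is buffering and hand over to the fallback transport.

```javascript
// Open SSE and verify it actually delivers by waiting for an initial "ping" event.
// If the ping doesn't arrive in time, something is buffering the stream: close it
// and switch to the fallback (long polling, or a close-and-reopen mode).
function testSse({ url = '/api/events', timeoutMs = 5000, onMessage, onFallback }) {
  const source = new EventSource(url);

  const timer = setTimeout(() => {
    source.close();
    onFallback();
  }, timeoutMs);

  source.addEventListener('ping', () => clearTimeout(timer));
  source.onmessage = (e) => onMessage(JSON.parse(e.data));
  source.onerror = () => {
    clearTimeout(timer);
    source.close();
    onFallback();
  };
}
```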

Jim Montgomery

What are the specifics of the "old" proxy? Were the headers commonly used for related scenarios used in the problematic scenario? Related reading: stackoverflow.com/questions/136727...
stackoverflow.com/questions/610290...

David Sugar

I ran into this issue a lot when I was doing secure messaging applications last decade, and we looked at all of these then, too. In the end we went with a dedicated websocket on a side channel reserved just for client event notifications (such as a new message waiting) and used long polling to actually collect messages / as an alternative for when the websocket died. This problem still needs a better solution, though some think it will magically appear with HTTP/3.

Nuno Sousa

Great post. You mentioned running scalability tests. Could you give an example on what those look like?

Mike Talbot ⭐

So it basically means testing a "landing server" with an increasing number of connections until it breaks, spinning up "dummy" clients that perform basic operations. This tests the endpoint's robustness. We'd say X concurrent users per landing server is the minimum to pass a test, and we look to see if we've improved upon it with code changes.

Our architecture has landing servers which authenticate users and forward requests to queues. Queued jobs are picked up by nodes that can do singular things or lots of things - a kind of heterogeneous grid. Landing servers need to listen for and relay events to the user on job completion.
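For the curious, a rough sketch of what such a dummy-client ramp might look like (URLs, step sizes and the "basic operation" are illustrative; it assumes a runtime where fetch and EventSource are available, e.g. a browser context or an EventSource polyfill):

```javascript
// Open an increasing number of connections against a landing server, perform a
// basic operation on each, and watch where things start to fail.
async function ramp({ base = 'http://localhost:3000', step = 100, max = 5000 } = {}) {
  const clients = [];
  for (let target = step; target <= max; target += step) {
    while (clients.length < target) {
      const source = new EventSource(`${base}/api/events`);
      source.onerror = () => console.error(`connection error with ${clients.length} clients`);
      clients.push(source);

      // The "basic operation" each dummy client performs
      await fetch(`${base}/api/enqueue`, { method: 'POST', body: '{}' })
        .catch((err) => console.error('request failed', err));
    }
    console.log(`holding ${target} concurrent clients`);
    await new Promise((resolve) => setTimeout(resolve, 10000)); // let the servers settle
  }
}
```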