Babu Somasundaram

Posted on

Why and how we ditched Django Channels to build a scalable chat backend

Hello DEV community!

This is my first post, and I am sharing our experience of using Django Channels for a couple of our products. Both are live: one (a legacy product) still runs Django Channels, and the other is deployed on AWS API Gateway using the WebSocket API.

We had high hopes for Django Channels when it was added to Django's official GitHub. But the reality of using it wasn't that exciting.

Why?

In one sentence: Django Channels will not scale! I repeat, it will not scale no matter how hard you try.

  • It has a bunch of dependencies on other third-party packages like daphne, asgiref, channels-redis, etc. Though these packages might be maintained by the same developers, they don't always move along with Channels' release timeline.

  • It depends on Redis as the communication layer. If this Redis layer (single instance or cluster) doesn't scale, then you have already built a failed solution.

  • Deploying ASGI web servers like Uvicorn is complicated. There is no clean way to deploy an ASGI application behind a well-trusted proxy like nginx; keeping connections alive between the clients and the proxied ASGI app is always a mess, regardless of the configurations you try.

  • Even if you figure out all of the above, it is hard to deploy enough Django Channels app instances to scale your entire app. The cost of running those EC2 (or equivalent) instances plus Redis at that scale is not worth the yield, and they are hard to maintain.

  • And finally, fixing an issue in your codebase is like searching for a needle in a haystack. In most cases the issue lies within the core of Channels itself, since your codebase is mostly boilerplate copied from the Channels documentation.

How?

We solved the problem using a PaaS: AWS API Gateway. API Gateway has come a long way since its inception. With its WebSocket API and Lambda integrations, it became super easy to deploy a scalable chat application without worrying about deployment configurations.

Serverless Framework or AWS SAM

Of course, nothing comes without deployment configuration. But frameworks like Serverless or SAM (https://aws.amazon.com/serverless/sam/) make it simple. My personal choice is Serverless, as it is provider independent and leaves a lot of space to explore for other projects.

Here is an example tutorial for deploying a serverless chat application using Serverless and API Gateway. It is a bit outdated but definitely a good starter.

Advantages

  • Cost friendly: It is serverless, so you're billed only for what you use. Check the Lambda pricing to get an approximate quote for your usage.

  • Scalable and Maintainable: Scaling and maintenance are handled by AWS. If anything goes wrong, AWS is your direct point of contact to resolve the issue with their technical help.

  • Decoupled: This way the chat backend is decoupled from your API backend. We use JWT tokens to identify users/clients between our backends.

  • Ease of Lambda implementations: Think of Lambda functions as microservices that each do a single unit of work. For example, the send handler stores the message in the database, broadcasts it to the other connected clients, and terminates; a minimal sketch follows below.
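To make that "single unit of work" idea concrete, here is a minimal sketch of what such a send handler could look like in Python. The ChatMessages and Connections DynamoDB tables are assumed names for illustration (not necessarily what we run); the boto3 apigatewaymanagementapi client and post_to_connection call are the standard API Gateway management API.

```python
# Hedged sketch of a WebSocket "send" route handler.
# Assumes two hypothetical DynamoDB tables: ChatMessages (message history)
# and Connections (connection ids maintained by $connect/$disconnect handlers).
import json
import boto3

dynamodb = boto3.resource("dynamodb")
messages_table = dynamodb.Table("ChatMessages")    # hypothetical table name
connections_table = dynamodb.Table("Connections")  # hypothetical table name


def send_handler(event, context):
    body = json.loads(event["body"])
    message = body.get("message", "")
    sender = event["requestContext"]["connectionId"]

    # 1. Persist the message.
    messages_table.put_item(Item={"connectionId": sender, "message": message})

    # 2. Broadcast to every other connected client via the management API.
    domain = event["requestContext"]["domainName"]
    stage = event["requestContext"]["stage"]
    api = boto3.client(
        "apigatewaymanagementapi",
        endpoint_url=f"https://{domain}/{stage}",
    )
    payload = json.dumps({"from": sender, "message": message}).encode()
    for item in connections_table.scan()["Items"]:
        if item["connectionId"] != sender:
            api.post_to_connection(
                ConnectionId=item["connectionId"],
                Data=payload,
            )

    # 3. Return so the WebSocket integration completes quickly.
    return {"statusCode": 200}
```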

Limitations

This comes with a few limitations but they can be solved based on your business requirements.

  • Lambda quotas: By default, AWS Lambda has a limit of 1,000 concurrent executions per account per region. Beyond that, function executions are throttled internally by AWS. These limits can be increased by contacting support.

  • API Gateway limitations:

    • One major technical limitation is the way you broadcast a message to other connected clients: by iterating over the list of connected clients and making an HTTP POST to each connection id with the message payload. AWS should improve this, but with multiprocessing or multithreading it can be mitigated on your side to an extent (see the sketch after this list).
    • Another is the quota limit: only 10K requests/sec are allowed per account per region, and that includes every request that hits API Gateway in your current region. This limit can also be increased by contacting AWS support, but it is necessary to keep the timeout of the WebSocket-integrated Lambda functions under 1 second. This is critical, and the only way I found to solve it is to call another Lambda function, with a timeout of more than 1 second, to do the broadcast. I tried SQS + Lambda configurations instead of calling a Lambda directly from a Lambda (which many consider a bad practice), but SQS has its own limitations that are not suitable for real-time applications, and in the end my load test achieved the required throughput with the Lambda-to-Lambda approach.
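For illustration, here is a hedged sketch of both workarounds: handing the fan-out to a separate Lambda invoked asynchronously (so the WebSocket-integrated function returns well under the 1-second budget), and threading the post_to_connection calls inside that broadcast Lambda. The function name (broadcastFn) and the event fields (endpoint_url, message, connection_ids) are assumptions for the sketch, not our exact implementation.

```python
# Hedged sketch: async Lambda-to-Lambda broadcast plus a threaded fan-out.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")


def fire_and_forget_broadcast(payload):
    # Called from the send handler. InvocationType="Event" returns immediately,
    # so the WebSocket-integrated function stays under its short timeout.
    lambda_client.invoke(
        FunctionName="broadcastFn",        # hypothetical broadcast function
        InvocationType="Event",
        Payload=json.dumps(payload).encode(),
    )


def broadcast_handler(event, context):
    # Runs in the separate "broadcastFn" Lambda with a longer timeout.
    api = boto3.client(
        "apigatewaymanagementapi",
        endpoint_url=event["endpoint_url"],  # passed along by the send handler
    )
    data = json.dumps(event["message"]).encode()

    def post(connection_id):
        try:
            api.post_to_connection(ConnectionId=connection_id, Data=data)
        except api.exceptions.GoneException:
            pass  # client already disconnected; a real handler would prune it

    # Each post_to_connection is a blocking HTTP call, so fan out with threads.
    with ThreadPoolExecutor(max_workers=20) as pool:
        pool.map(post, event["connection_ids"])

    return {"statusCode": 200}
```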

That's all! This is only a brief justification of why Django Channels can't be used for real-time chats and how the API Gateway WebSocket API can be a better alternative. I haven't covered many of the tiny details that involve both technical and business aspects, but I would be happy to answer questions in the comments.

Cheers 🍻

Top comments (1)

McConnell

Interesting read. I'm currently working on a project and we're using channels for realtime chat and realtime location update.

I'll definitely look into AWS websocket api