Mohammed for AWS Community Builders

Posted on • Originally published at mohdizzy.Medium

Guarantee message deliveries for real-time WebSocket APIs with Serverless on AWS

Introduction

Real-time updates are crucial to modern applications, where maintaining a high-quality consumer experience is paramount. Traditionally, setting up WebSocket infrastructure is tedious and can get difficult when things begin to scale.

Fortunately, AWS recently rolled out a feature as part of its AppSync product offering that lets you create a purely Serverless WebSocket API for real-time communication, called AppSync Events. Its minimalist setup is quite appealing; however, there is one general problem with WebSocket communication: message delivery guarantees.

While a client is listening for events, there are situations, such as a mobile app losing data connectivity, where it goes offline and potentially misses messages sent during that period. When the subscriber is back online, we ideally want to ensure those messages get delivered. In this article, we'll explore one way of dealing with this problem in Serverless style on AWS.

Getting Started with Events API

The setup is quite simple. Provisioning the API itself involves only two parts: choosing the Authorization mode(s) and defining the channel namespaces.

Namespaces are dedicated channels you create as sub-paths; clients can then subscribe to a specific channel (e.g. /new-updates/sports/football) or to a subset using wildcards (e.g. /new-updates/sports/*).

Authorization can be defined at the API level and then overridden at the channel namespace level for publishers and subscribers.

Once the API is provisioned, the real-time endpoint is used for subscribing to one of the defined channels. The HTTP endpoint is used by the publisher for pushing events into a channel.
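
As a quick illustration, here is a minimal publish helper in TypeScript (Node 18+, using the built-in fetch), assuming API-key authorization. The endpoint and key are placeholders for your own API's values, and each event is sent as a stringified JSON document as the Events API expects.

```typescript
// Minimal publish sketch for the Events API HTTP endpoint (assumes API-key auth).
// HTTP_ENDPOINT and API_KEY are placeholders for your own API's values.
const HTTP_ENDPOINT = "https://<api-id>.appsync-api.<region>.amazonaws.com/event";
const API_KEY = "<api-key>";

export async function publish(channel: string, payload: unknown): Promise<void> {
  const res = await fetch(HTTP_ENDPOINT, {
    method: "POST",
    headers: { "content-type": "application/json", "x-api-key": API_KEY },
    // Each entry in "events" is a stringified JSON document
    body: JSON.stringify({ channel, events: [JSON.stringify(payload)] }),
  });
  if (!res.ok) throw new Error(`Publish failed with status ${res.status}`);
}

// e.g. await publish("/new-updates/sports/football", { score: "2-1" });
```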

Retry Message Delivery Flow

The publisher is only responsible for pushing messages to the specified channels. Whether a message reached some of the subscribers, or none of them, is not the publisher's concern. For mission-critical applications, not knowing whether a published message reached the intended recipient can be unsettling. Naturally, losing messages is not ideal, so one possible way to deal with this is to let your backend systems know when messages are successfully received by your subscribers.

The flow has two parts. The first is delivering the message; the second is the client calling an HTTP endpoint to acknowledge receipt of that message. If the acknowledgment is not received within a minute, we publish the same payload to the same channel and continue to do so for the next ten minutes, in the hope that the client comes back online and receives the message.

Here is a quick summary of the technical aspects involved in the flow:

  • Publish the payload with a UUID included as part of it. UUIDv7 has the advantage of being time-ordered, which can help in situations where message ordering is important.
  • Create a schedule using the EventBridge Scheduler service, with the name being the UUID itself. The target is another Lambda function (the retrier), with a 1-minute rate, an end date 10 minutes out, and auto-delete on completion enabled.
  • Store the published payload in S3 with the object key being the UUID itself. The reason for having S3 in the picture is larger payloads: most Serverless components within AWS have 256 KB payload limits, which can be restrictive in some situations. The Events API supports payload sizes up to 1.2 MB! (A publisher-side sketch of these first three steps follows this list.)
  • The client, after receiving the message, calls an HTTP endpoint with the UUID included in the request body. A Lambda uses that UUID to delete the schedule from EventBridge Scheduler, thereby preventing the retry flow from moving forward.
  • If the acknowledgment hasn't come through, the Scheduler invokes the retrier Lambda every minute. The schedule name is the UUID, which is also the S3 object key, so the Lambda can retrieve the payload and push it to the channel again. (Both the acknowledgment and retrier Lambdas are sketched after the list as well.)
  • The schedule is set to end after 10 minutes, as we don't want to keep publishing for an extended period.
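
To make the publisher-side steps concrete, here is a minimal TypeScript sketch using the AWS SDK v3 S3 and Scheduler clients. The bucket name and ARNs are placeholders (not from the repo), the publish helper is the one sketched earlier, and I've used the built-in UUID generator where a UUIDv7 library could be dropped in.

```typescript
// Publisher-side sketch: tag the payload with a UUID, park it in S3, schedule the
// retrier, then publish. Bucket name and ARNs are placeholders, not from the repo.
import { randomUUID } from "node:crypto"; // a UUIDv7 generator could be swapped in here
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { SchedulerClient, CreateScheduleCommand } from "@aws-sdk/client-scheduler";

declare function publish(channel: string, payload: unknown): Promise<void>; // HTTP publish helper sketched earlier

const s3 = new S3Client({});
const scheduler = new SchedulerClient({});

export async function publishWithRetry(channel: string, payload: Record<string, unknown>) {
  const messageId = randomUUID();
  const body = { ...payload, messageId };

  // 1. Store the payload under the UUID so the retrier can fetch it later
  await s3.send(new PutObjectCommand({
    Bucket: "retry-payload-bucket", // placeholder
    Key: messageId,
    Body: JSON.stringify(body),
  }));

  // 2. Create a 1-minute schedule that stops after 10 minutes and deletes itself
  await scheduler.send(new CreateScheduleCommand({
    Name: messageId,
    ScheduleExpression: "rate(1 minute)",
    FlexibleTimeWindow: { Mode: "OFF" },
    EndDate: new Date(Date.now() + 10 * 60 * 1000),
    ActionAfterCompletion: "DELETE",
    Target: {
      Arn: "arn:aws:lambda:<region>:<account>:function:retrier", // placeholder
      RoleArn: "arn:aws:iam::<account>:role/scheduler-invoke",   // placeholder
      Input: JSON.stringify({ messageId, channel }),
    },
  }));

  // 3. Publish the UUID-tagged payload to the channel
  await publish(channel, body);
}
```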
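
And a rough sketch of the other two pieces: the acknowledgment Lambda deletes the schedule so retries stop, and the retrier Lambda reloads the payload from S3 and republishes it. Again, the resource names are placeholders and error handling is kept minimal.

```typescript
// Acknowledgment and retrier sketches. The bucket name is a placeholder and must
// match the one the publisher wrote to; publish() is the helper sketched earlier.
import { SchedulerClient, DeleteScheduleCommand } from "@aws-sdk/client-scheduler";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

declare function publish(channel: string, payload: unknown): Promise<void>;

const scheduler = new SchedulerClient({});
const s3 = new S3Client({});

// ACK Lambda: the client POSTs { "messageId": "<uuid>" } once it has the message
export async function ackHandler(event: { body: string }) {
  const { messageId } = JSON.parse(event.body);
  try {
    await scheduler.send(new DeleteScheduleCommand({ Name: messageId }));
  } catch (err) {
    // The schedule may already be gone (duplicate ACK, or it expired); treat as success
    if ((err as Error).name !== "ResourceNotFoundException") throw err;
  }
  return { statusCode: 200, body: JSON.stringify({ acknowledged: messageId }) };
}

// Retrier Lambda: invoked every minute with the Input set when the schedule was created
export async function retryHandler(event: { messageId: string; channel: string }) {
  const obj = await s3.send(new GetObjectCommand({
    Bucket: "retry-payload-bucket", // placeholder
    Key: event.messageId,
  }));
  const payload = JSON.parse(await obj.Body!.transformToString());
  await publish(event.channel, payload);
}
```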

Here is the GitHub link to the complete setup. I've deployed the Events API infrastructure using the Serverless Framework with vanilla CloudFormation syntax. I assume AWS CDK support might currently be limited, given the service was only recently rolled out.

Closing thoughts

The above flow works fine when we have one subscriber for a specific channel. If there are multiple subscribers, ensuring delivery to all of them might require some adjustments, something along the lines of maintaining a schedule per subscriber, which implies the publisher needs to know all the relevant subscribers in advance.

If the published payload is less than 256 KB, an alternative way to approach the retry flow would be to use DynamoDB and SQS. DynamoDB would hold every published UUID along with an acknowledgment flag. Each published message would be pushed into a FIFO SQS queue with a delay of 1 minute or so. The Lambda consuming the queue would first check the acknowledgment flag against the UUID in DynamoDB; if it is false, it would publish the payload again. With SQS batch processing enabled and the partial batch failures feature, we could cap retries at 2 or any other number as needed.

Alternate flow using SQS and DynamoDB
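
A rough sketch of what that queue consumer could look like, assuming a hypothetical published-messages table with an acknowledged flag, and queue messages that carry the UUID, channel, and payload:

```typescript
// SQS consumer sketch for the DynamoDB + SQS alternative. Table and field names
// are illustrative; publish() is the HTTP helper sketched earlier.
import type { SQSEvent, SQSBatchResponse } from "aws-lambda";
import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";

declare function publish(channel: string, payload: unknown): Promise<void>;

const ddb = new DynamoDBClient({});

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const batchItemFailures: SQSBatchResponse["batchItemFailures"] = [];

  for (const record of event.Records) {
    const { messageId, channel, payload } = JSON.parse(record.body);

    const { Item } = await ddb.send(new GetItemCommand({
      TableName: "published-messages", // placeholder
      Key: { messageId: { S: messageId } },
    }));

    // Already acknowledged: let SQS delete the message
    if (Item?.acknowledged?.BOOL) continue;

    // Not acknowledged yet: republish and return the message to the queue so it
    // comes back after the visibility timeout
    await publish(channel, payload);
    batchItemFailures.push({ itemIdentifier: record.messageId });
  }

  return { batchItemFailures };
}
```

Reporting the unacknowledged message as a batch item failure sends it back to the queue, so the redrive policy's maxReceiveCount effectively caps how many retries each message gets.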

To allow the retrier Lambda to be reused across multiple APIs, the endpoint, namespace, and other metadata should be placed in the Scheduler payload, so that the retrier flow can act as a common piece across all services using the WebSocket flow.
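
For illustration, the Scheduler Target Input could carry something like the following shape (the field names are mine, not from the repo), so a single retrier Lambda can serve any number of Events APIs:

```typescript
// Illustrative shape for the Scheduler Target Input shared by all APIs
interface RetryJob {
  messageId: string;    // UUID, also the S3 object key
  channel: string;      // e.g. "/new-updates/sports/football"
  httpEndpoint: string; // the target API's HTTP publish endpoint
  authMode: string;     // how the retrier should authorize the publish (e.g. API key vs IAM)
  bucket: string;       // where the payload was parked
}
```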
