prasanth mathesh for AWS Community Builders

Amazon SQS and serverless Data Engineering workloads

Overview
Amazon SQS provides fully managed message queuing for microservices, distributed systems, and serverless applications. It is one of the earliest services Amazon launched, it is still widely used by many organizations, and it forms one of the core services of many SaaS / PaaS products built on top of the AWS cloud.

A variety of use cases exist for Amazon SQS in microservices architectures, but in data engineering, SQS is most commonly used to publish messages carrying dynamic configuration, which in turn trigger consumers to scale or parallelize workloads based on the message data. The reason is that SQS, by default, doesn't handle messages greater than 256 KB.
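As a minimal sketch of that pattern, a producer might publish a small JSON configuration message and guard against the 256 KB limit before sending. The job-configuration schema below (`job_name` / `input` / `partitions`) is illustrative, not a standard format:

```python
import json

# SQS rejects message bodies larger than 256 KB, so publish only a small
# configuration payload and keep the bulk data elsewhere (e.g. in S3).
SQS_MAX_BYTES = 256 * 1024

def build_config_message(job_name: str, s3_input_prefix: str, partitions: int) -> str:
    """Build a dynamic-configuration message for downstream consumers.

    The field names are hypothetical; any small, self-describing schema works.
    """
    body = json.dumps({
        "job_name": job_name,
        "input": s3_input_prefix,
        "partitions": partitions,  # lets each consumer claim a slice of work
    })
    if len(body.encode("utf-8")) > SQS_MAX_BYTES:
        raise ValueError("config message exceeds the SQS 256 KB limit")
    return body

msg = build_config_message("daily-load", "s3://my-bucket/raw/2024-01-01/", 8)
```

In a real pipeline this body would be passed to `send_message` via the AWS SDK; consumers parse it and scale out according to `partitions`.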

Not all SaaS applications support bulk load or query, due to API rate limits, performance factors, and so on. In an event-driven data architecture, especially when producers and consumers are SaaS applications, cloud web apps, etc., event ingestion is done using Amazon AppFlow, Amazon Kinesis, or custom batch query processes.

If a SaaS is customizable, then the team can develop their own API using the AWS SDK to support ingestion and queries. There are cases where GBs of processed data have to be uploaded in bulk mode to a SaaS or cloud web app's datastore at sub-minute intervals. Below are a few scenarios that fit serverless event-driven data architectures.

Data Engineering Workload
SQS can receive large payloads with the help of the Amazon SQS Extended Client Library. This capability has been available since 2015 but is less commonly used in data engineering workloads. The library supports payloads of up to 2 GB by storing them in Amazon S3: producers write messages larger than 256 KB to S3 and publish only a pointer message to SQS, and consumers follow the pointer to read and process the data. After processing is complete, any SQS consumer can write the result to the target object store or share it via an API.
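The pointer-passing idea behind the extended client can be sketched without any AWS dependency. In this sketch a plain dict stands in for S3, and the JSON field names are illustrative, not the library's actual wire format:

```python
import json
import uuid

THRESHOLD = 256 * 1024  # SQS message size limit

def prepare_message(payload: bytes, bucket: str, store: dict) -> str:
    """Producer side: payloads over 256 KB go to the object store,
    and only a pointer travels through SQS. `store` stands in for S3
    (it would be boto3's put_object in a real setup)."""
    if len(payload) <= THRESHOLD:
        return json.dumps({"inline": payload.decode("utf-8")})
    key = f"sqs-payloads/{uuid.uuid4()}"
    store[(bucket, key)] = payload  # s3.put_object(...) in practice
    return json.dumps({"s3_pointer": {"bucket": bucket, "key": key}})

def resolve_message(body: str, store: dict) -> bytes:
    """Consumer side: follow the pointer back to the object store."""
    msg = json.loads(body)
    if "inline" in msg:
        return msg["inline"].encode("utf-8")
    ptr = msg["s3_pointer"]
    return store[(ptr["bucket"], ptr["key"])]  # s3.get_object(...) in practice
```

The real library does this transparently inside the SQS client, including cleanup of the S3 object once the message is deleted.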

Consider an organisation using a customizable SaaS built on AWS services. The extended Java client library can be leveraged when a user needs to upload documents larger than 256 KB or when an event is triggered by a user action. These events can drive a decoupled data engineering workload that performs document parsing or tagging with services like Amazon Textract, AWS Glue, or EMR Serverless. A custom Spark DataSourceRegister implementation, combined with Spark Structured Streaming, can act as the SQS consumer for high-volume streams. The processed data can then be shared with a web app or SaaS through microservices, or loaded into OLAP or OLTP services.

Multipart files of less than 2 GB each can be written to S3 using a SQL UNLOAD-style command. SQS consumers such as AWS Lambda or AWS Glue can then be invoked to write the processed files to the target SaaS object store concurrently, provided the SaaS application supports a bulk-load API.
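The fan-out step can be sketched as follows, with illustrative message fields; the only SQS-specific fact used is that `SendMessageBatch` accepts at most 10 entries per call:

```python
import json

SQS_BATCH_SIZE = 10  # SendMessageBatch accepts at most 10 entries per call

def fan_out_part_messages(bucket: str, part_keys: list, target: str) -> list:
    """Emit one pointer message per unloaded part file, so each Lambda/Glue
    consumer can bulk-load its part independently, grouped into batches of
    10 for SendMessageBatch. The body fields here are hypothetical."""
    entries = [
        {"Id": str(i),
         "MessageBody": json.dumps({"bucket": bucket, "key": key, "target": target})}
        for i, key in enumerate(part_keys)
    ]
    return [entries[i:i + SQS_BATCH_SIZE]
            for i in range(0, len(entries), SQS_BATCH_SIZE)]

batches = fan_out_part_messages(
    "unload-bucket", [f"unload/part-{i:04d}" for i in range(25)], "saas-store")
```

Each batch would then go to one `send_message_batch` call, and Lambda's event-source mapping handles the concurrent invocations.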

Limitations
Unlike Kafka, which treats message payloads as opaque bytes, SQS only accepts message bodies made up of a restricted set of Unicode characters (valid XML characters); anything outside that set, including binary data, must be encoded (for example, Base64) before sending, which inflates the payload. So one has to consider the type of message, the complexity of transformation/aggregation, the message retention period (at most 14 days), etc. before deciding on the type and need of message brokers for data engineering workloads.
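The permitted character ranges, as documented for the SendMessage API, can be checked client-side before publishing; a small sketch:

```python
# Allowed Unicode ranges for SQS message bodies, per the SendMessage docs:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_ALLOWED = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)

def is_valid_sqs_body(text: str) -> bool:
    """Return True if every character falls in an SQS-permitted range."""
    return all(any(lo <= ord(c) <= hi for lo, hi in _ALLOWED) for c in text)

is_valid_sqs_body("hello")        # True
is_valid_sqs_body("bad\x00byte")  # False: NUL is outside the allowed ranges
```

Validating (or Base64-encoding) before `send_message` avoids hard-to-diagnose rejections at publish time.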

Conclusion
It is a common design pattern to use MQs or Kafka as the message broker, but Amazon SQS can also be leveraged to build loosely coupled data engineering pipelines for data generated by SaaS applications. Serverless AWS services are inherently scalable, and SQS helps achieve parallel data processing in an event-driven data architecture without introducing an additional technology stack.
