eelayoubi

AWS Text-To-Speech Serverless Application

In this article, we will go through deploying a Text-to-Speech serverless application that consists of two main flows, presented in the following section.

Architecture

New Post

  1. The user calls the API Gateway REST endpoint, which invokes the NewPost Lambda function.
  2. The NewPost Lambda function inserts information about the post into an Amazon DynamoDB table, where information about all posts is stored.
  3. Then, the NewPost Lambda function publishes the post to the SNS topic we create, so it can be converted asynchronously.
  4. The ConvertToAudio Lambda function is subscribed to the SNS topic and is triggered whenever a new message appears (which means that a new post should be converted into an audio file).
  5. The ConvertToAudio Lambda function uses Amazon Polly to convert the text into an audio file in the specified language (the same as the language of the text).
  6. The new MP3 file is saved in a dedicated S3 bucket.
  7. Information about the post is updated in the DynamoDB table (the URL of the audio file stored in the S3 bucket is saved alongside the previously stored data).

Get Post

  1. The user calls the API Gateway REST endpoint, which invokes the GetPost Lambda function containing the logic for retrieving the post data.
  2. The GetPost Lambda function retrieves information about the post (including the reference to Amazon S3) from the DynamoDB table and returns the information.

Deploying The Resources

You will find the code repository here.

To deploy the application, initialize Terraform and then apply:

terraform init
terraform apply -auto-approve

Breaking it down

We are using Terraform to deploy the application.

As you saw in the architecture section, we have two flows: in the first, a user submits a post through the API Gateway; in the second, the user requests one post or all posts.

In main.tf, we use the Terraform modules that we created to provision the following:

  • DynamoDB table for storing the posts' metadata
  • NewPost Lambda function
  • GetPost Lambda function
  • ConvertToAudio Lambda Function
  • The API Gateway
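
As a sketch, the module wiring in main.tf might look like the following (the module names, paths, and variables here are assumptions for illustration; see the repository for the real definitions):

```hcl
# Hypothetical module layout — names, paths, and variables are illustrative only.
module "posts_table" {
  source = "./modules/dynamodb"
  name   = "posts"
}

module "new_post" {
  source     = "./modules/lambda"
  name       = "new-post"
  table_name = module.posts_table.name
}

module "get_post" {
  source     = "./modules/lambda"
  name       = "get-post"
  table_name = module.posts_table.name
}
```

Passing the table name into each Lambda module keeps the functions loosely coupled to the storage layer.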

The iam.tf file is where we define the least-privilege permissions for the various Lambda functions we have.

The public S3 bucket used to store all the audio posts is declared in s3.tf.

The SNS topic used to decouple the application into an asynchronous flow is defined in sns.tf.

We create a topic called new_posts, with a Lambda subscription, so every time this topic receives a message, it invokes the ConvertToAudio Lambda function with that message as an event.

To allow SNS to invoke the ConvertToAudio Lambda function, we also create a Lambda permission resource.
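
In Terraform, that wiring can be sketched as follows (the resource names here are assumptions; the repository's actual definitions may differ):

```hcl
resource "aws_sns_topic" "new_posts" {
  name = "new_posts"
}

# Subscribe the ConvertToAudio function to the topic.
resource "aws_sns_topic_subscription" "convert_to_audio" {
  topic_arn = aws_sns_topic.new_posts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.convert_to_audio.arn # assumed resource name
}

# Grant SNS permission to invoke the function.
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.convert_to_audio.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.new_posts.arn
}
```

Without the aws_lambda_permission resource, SNS deliveries to the function would be rejected even though the subscription exists.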

NewPost Lambda Function

The NewPost Lambda function does two things:
1. Creates an item in the posts table with the following schema:

id: recordId,
text: string,
voice: string,
status: 'PROCESSING'

The status, as you can see, is PROCESSING, since the post still needs to be converted to audio.

2. Publishes the item id to the SNS topic (so it can be processed by ConvertToAudio).
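
These two steps can be sketched in Python as follows (the actual implementation in the repository may differ; the boto3 calls that write to DynamoDB and publish to SNS are omitted here):

```python
import uuid


def build_post_item(text: str, voice: str) -> dict:
    """Build the DynamoDB item for a new post, following the schema above."""
    return {
        "id": str(uuid.uuid4()),
        "text": text,
        "voice": voice,
        "status": "PROCESSING",  # not yet converted to audio
    }


def build_sns_message(item: dict) -> str:
    # Only the record id is published; ConvertToAudio fetches the
    # rest of the metadata from the posts table using this id.
    return item["id"]


item = build_post_item("Hello world", "Joanna")
message = build_sns_message(item)
```

Publishing only the id keeps the SNS message small and makes the posts table the single source of truth.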

ConvertToAudio Lambda Function

The ConvertToAudio lambda function does the following:

  1. Fetches the post metadata from the posts table
  2. Splits the text into chunks (if text is larger than 2600 characters)
  3. Invokes Polly's synthesizeSpeech for each chunk and saves the audio output to a local file
  4. Uploads the audio file to the S3 bucket
  5. Updates the post metadata in the posts table with the S3 audio URL and sets the status to UPDATED
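
Step 2, the chunking, can be sketched like this (a minimal version; the repository may split differently, for example on sentence boundaries):

```python
def chunk_text(text: str, limit: int = 2600) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    breaking on whitespace where possible to avoid cutting words."""
    chunks = []
    while len(text) > limit:
        # Find the last space before the limit so no word is split.
        cut = text.rfind(" ", 0, limit)
        if cut == -1:
            cut = limit  # no space found: hard cut
        chunks.append(text[:cut].strip())
        text = text[cut:].strip()
    if text:
        chunks.append(text)
    return chunks
```

Each chunk is then passed to Polly separately, and the resulting audio streams are concatenated into a single file.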

GetPost Lambda Function

The GetPost lambda function retrieves the post(s):

  • If postId (query string) is "*", the function returns all posts in the posts table.
  • If postId is a post id, the function returns that specific post (if it exists) or an empty array.
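
The routing logic can be sketched as follows (the helper names and the in-memory table are hypothetical; the real function uses boto3 DynamoDB scan/query calls):

```python
def get_posts(table, post_id: str) -> list:
    """Sketch of the GetPost routing: "*" returns everything,
    anything else returns the matching post or an empty list."""
    if post_id == "*":
        return table.scan()          # return every post
    return table.get_by_id(post_id)  # one post, or [] if it does not exist


# Minimal in-memory stand-in for the DynamoDB table, for illustration only.
class FakeTable:
    def __init__(self, items):
        self.items = items

    def scan(self):
        return self.items

    def get_by_id(self, pid):
        return [i for i in self.items if i["id"] == pid]
```

Returning an empty array for a missing post keeps the response shape consistent for API clients.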

Clean-up

Don't forget to clean everything up by running:

terraform destroy -auto-approve

Wrap-up

In this article, I presented the Text-to-Speech serverless application. In the next article, I will go through how we can handle errors that our Lambdas could encounter, and how to process them in the API Gateway.

Until then, thank you for reading this article. I hope you found it useful 👋🏻.
