Part 3 - Text to Speech

Introduction

In the previous parts of this series, we covered the high-level overview of our fully automated story generator. We also covered text and image generation using OpenAI, and how we represented the whole problem in simple data structures.

In this part, we will discuss text to speech (audio generation, or voice-over).

Text to Speech

Looking around for a Python library that performs text to speech, I found one called gTTS. However, I noticed the disclaimer below on its GitHub page.

This project is not affiliated with Google or Google Cloud. Breaking upstream changes can occur without notice. This project is leveraging the undocumented Google Translate speech functionality and is different from Google Cloud Text-to-Speech.

I also didn't really like the resulting voice, as it sounded too robotic (check this example). Still, I thought it would be good to integrate text to speech first, make the story generator work end to end, and then iterate on the separate parts.

Audio Generation Abstract Class

To enable easy experimentation with different text-to-speech technologies and to support multiple audio generators, I decided to create an abstract class that represents the audio generation functionality, use it everywhere we generate audio, and inject different implementations as needed.

For example, StoryManager, a high-level manager that combines several other components, takes an AbstractAudioGenerator as input and doesn't care how it's implemented.

def __init__(
    self,
    audio_generator: AbstractAudioGenerator,
    keywords_generator: KeywordsGenerator,
    page_processor: PageProcessor,
    pdf_processor: PdfProcessor,
    video_processor: VideoProcessor,
):
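
To illustrate why this is convenient, here is a minimal, self-contained sketch of the injection idea. The names SimpleAudioGenerator, SilentAudioGenerator and SimpleStoryManager are hypothetical stand-ins rather than the project's real classes, and the interface is simplified to a single text argument.

```python
from abc import ABC, abstractmethod


class SimpleAudioGenerator(ABC):
    """Hypothetical, simplified stand-in for AbstractAudioGenerator."""

    @abstractmethod
    def generate_audio(self, text: str) -> str:
        """Return a description of the generated audio for `text`."""


class SilentAudioGenerator(SimpleAudioGenerator):
    """Fake implementation: no speech synthesis, useful for dry runs."""

    def generate_audio(self, text: str) -> str:
        return f"silent:{text}"


class SimpleStoryManager:
    """Hypothetical manager that depends only on the abstract interface."""

    def __init__(self, audio_generator: SimpleAudioGenerator):
        self.audio_generator = audio_generator

    def narrate(self, sentences: list[str]) -> list[str]:
        # The manager never needs to know which implementation it was given.
        return [self.audio_generator.generate_audio(s) for s in sentences]


manager = SimpleStoryManager(audio_generator=SilentAudioGenerator())
narration = manager.narrate(["Once upon a time", "The end"])
```

Swapping in a different generator only changes the object passed to the constructor; the manager code stays the same.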

You can find the AbstractAudioGenerator here. As you can see below, it's very simple: it has one main public method called generate_audio and a couple of helper methods. We will see later how these are used.

import os
from abc import ABC, abstractmethod
from mutagen.mp3 import MP3
from data_models import AudioInfo, StoryPageContent


class AbstractAudioGenerator(ABC):
    @abstractmethod
    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        pass

    @staticmethod
    def _get_length_in_seconds(mp3_filepath: str) -> float:
        return MP3(mp3_filepath).info.length

    @staticmethod
    def _get_mp3_filepath(workdir: str, story_page_content: StoryPageContent) -> str:
        return os.path.join(workdir, f"audio_{story_page_content.page_number}.mp3")
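
The data_models module is not shown in this part. Judging only from the fields used in this article (page_number, sentence, mp3_file and length_in_seconds), a minimal sketch could look like the following; the real classes may well carry more fields. The snippet also shows the file name that the path helper produces.

```python
import os
from dataclasses import dataclass


@dataclass
class StoryPageContent:
    # Assumed fields: only the ones this article actually uses.
    page_number: int
    sentence: str


@dataclass
class AudioInfo:
    # Assumed fields, taken from the AudioInfo(...) call in this article.
    mp3_file: str
    length_in_seconds: float


def get_mp3_filepath(workdir: str, story_page_content: StoryPageContent) -> str:
    # Same naming scheme as _get_mp3_filepath in the abstract class.
    return os.path.join(workdir, f"audio_{story_page_content.page_number}.mp3")


page = StoryPageContent(page_number=3, sentence="The fox went home.")
mp3_filepath = get_mp3_filepath("workdir", page)
```

Naming files by page number keeps one audio file per page, which makes it easy to match audio to pages later.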

gTTS Implementation

gTTS is very simple: all we need to do is create a gTTS object and then call its save method to write the result to an mp3 file.

from gtts import gTTS

audio = gTTS(text="a sentence from the story", lang="en", slow=True)
audio.save("/my/fantastic/story/audio/page.mp3")

To work with our framework, we need to implement AbstractAudioGenerator and return an AudioInfo object. This is where we need _get_mp3_filepath and _get_length_in_seconds, the two helper methods we implemented in the abstract class.

class AudioGeneratorGtts(AbstractAudioGenerator):
    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        print(f"Generating audio for: {story_page_content.sentence}")
        audio = gTTS(text=story_page_content.sentence, lang="en", slow=True)
        mp3_filepath = self._get_mp3_filepath(workdir, story_page_content)
        audio.save(mp3_filepath)
        length_in_seconds = self._get_length_in_seconds(mp3_filepath)
        return AudioInfo(mp3_file=mp3_filepath, length_in_seconds=length_in_seconds)

AWS Polly Implementation

Amazon Polly uses deep learning technologies to synthesize natural-sounding human speech, so you can convert articles to speech. With dozens of lifelike voices across a broad set of languages, use Amazon Polly to build speech-activated applications.

You can find the full implementation here. Here are a few things to note.

First things first: you need an AWS account and AWS credentials with access to Polly. An AWS user with an access key and a secret key is one way to achieve this, but you can also use IAM roles.

Using boto3 we can create an AWS Polly client with pre-configured credentials like this:

from boto3 import Session

def __init__(self, aws_polly_credentials_provider: AwsPollyCredentialsProvider):
    self.session = Session(
        aws_access_key_id=aws_polly_credentials_provider.access_key,
        aws_secret_access_key=aws_polly_credentials_provider.secret_key,
    )
    self.polly = self.session.client("polly")

Now we can call synthesize_speech from that polly client.

return self.polly.synthesize_speech(
    Engine=self._ENGINE_NEUTRAL,
    TextType=self._TEXT_TYPE_SSML,
    Text=self._construct_ssml(text=text),
    OutputFormat=self._OUTPUT_FORMAT_MP3,
    VoiceId=self._VOICE_ID,
    LanguageCode=self._LANGUAGE_CODE_EN_US,
)
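
The _construct_ssml helper is not shown here. As a rough sketch of what such a helper might do, the function below wraps the text in a speak element with a slower speaking rate and escapes characters that are special in XML. The function name, the prosody tag and the rate value are assumptions for illustration, not the author's implementation.

```python
from xml.sax.saxutils import escape


def construct_ssml(text: str, rate: str = "slow") -> str:
    # Escape &, < and > so the story text cannot break the SSML markup.
    safe_text = escape(text)
    return f'<speak><prosody rate="{rate}">{safe_text}</prosody></speak>'


ssml = construct_ssml("Tom & Jerry went home.")
```

Slowing the speech down here mirrors the slow=True option used with gTTS earlier.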

The rest of the class is logic that prepares the input for Polly and processes its output. You can look at other examples online, or at the AWS documentation.
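
As one example of the output side, the response from synthesize_speech carries the audio in an "AudioStream" entry that can be read and written to disk. The sketch below assumes that shape; save_audio_stream is a hypothetical helper name, and a BytesIO object stands in for the real stream so the snippet runs without AWS access.

```python
import io
import os
import tempfile
from contextlib import closing


def save_audio_stream(response: dict, mp3_filepath: str) -> int:
    """Write the response's audio bytes to disk and return the byte count."""
    # closing() makes sure the stream is closed even if reading fails.
    with closing(response["AudioStream"]) as stream:
        audio_bytes = stream.read()
    with open(mp3_filepath, "wb") as mp3_file:
        mp3_file.write(audio_bytes)
    return len(audio_bytes)


# Stand-in for a real Polly response (assumption: not actual mp3 data).
fake_response = {"AudioStream": io.BytesIO(b"not-really-mp3-bytes")}
output_path = os.path.join(tempfile.mkdtemp(), "audio_1.mp3")
bytes_written = save_audio_stream(fake_response, output_path)
```

Once the file is on disk, the same _get_length_in_seconds helper from the abstract class can be used to fill in AudioInfo.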

Other Implementations

You can expand this to other implementations if you have other preferred technologies.

Conclusion

That's it for text to speech. We built a simple, extensible system that takes a StoryPageContent as input and returns an AudioInfo, and we saw two examples of different implementations.

Next, we will look at the most fun, artistic and complex component of this project: the page processor. This is where we learn about some basic image processing and manipulation using PIL.