Hatem Elseidy

Posted on Oct 1, 2023 • Updated on Oct 8, 2023

Part 2 - Problem Representation

Introduction

In part 1 of this series, we went through a high-level overview of our fully automated story generator. We also covered Content Generation, one of the 5 problems: we generated the story text, processed it and generated images.

In this part, we will discuss the problem representation and the data structures used throughout the code.

Reminder

The code can be found in this GitHub repo.
https://github.com/hatemfaheem/ai-story-generator

And in this YouTube channel, you can find lots of examples auto-generated from this codebase (with ZERO video/audio editing involved). Going back through the channel, you can see earlier versions that had less compelling results.

Example old version: https://www.youtube.com/watch?v=rM7l-B0wsx4
Example newer version: https://www.youtube.com/watch?v=xURG3wQ0Jtg

This shows how much we can improve by doing multiple iterations. Another good place to learn about the power of iterations is the commit history of this project.

https://github.com/hatemfaheem/ai-story-generator/commits/main

Why discuss this?

This is actually the most important section. It explains how we can represent a complex problem, like generating video from text, in simple-to-understand data structures. If we get this right, then every other component's job is to consume one or more of these data structures and produce another one.

All data structures used around the code can be found in [data_models.py](https://github.com/hatemfaheem/ai-story-generator/blob/main/data_models.py). As you can see, it's mostly native Python except for Image from PIL.

from PIL.Image import Image

Let's just go from top to bottom.

A. Content Generation Data Structures

Going back to the high-level architecture in part 1, there are 2 main parts. Let's start with the content generation side, then move on to the content processing data structures in the next section.

Story Size

Let's take the below example. There are 5 main dimensions:

[Red] Page Width (The width of the whole page).
[Green] Page Height (The height of the whole page).
[Purple] Text Part Width (The width of the text part of the page).
[Blue] Image Part Width (The width of the image part of the page).
The font size. After playing around with the numbers, I found it works better to adapt the font size to the image size.

Now, the biggest restriction we have is image generation. If you take a look at OpenAI image generation docs, you'll see that we can generate only square images of 3 different sizes. That's 256x256, 512x512 and 1024x1024. So, to avoid errors with image generation, let's design our stories around only these 3 sizes.

As I was designing this for YouTube videos, I learned that the best aspect ratio for YouTube videos is 16:9. Hence, we have to extend the square images with a text part that has a width of (page_width - image_width), where page_width / image_width = 1.777, which is the target aspect ratio (the image is square, so the page height equals the image width). With simple math, you get the following numbers for the page dimensions (height, width).

SIZE_256 = (256, 455)
SIZE_512 = (512, 910)
SIZE_1024 = (1024, 1820)
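
The arithmetic behind these numbers can be checked with a few lines of Python. This is a minimal sketch, not code from the repo: `page_width_for` is a hypothetical helper, and rounding down is an assumption that happens to reproduce the three widths above.

```python
# Target aspect ratio for the full page (width / height), about 1.777.
ASPECT_RATIO = 16 / 9


def page_width_for(image_side: int) -> int:
    """Page width in pixels for a square image, rounded down (assumed rounding)."""
    # The page height equals the image side, so width = height * 16 / 9.
    return int(image_side * ASPECT_RATIO)


for side in (256, 512, 1024):
    width = page_width_for(side)
    # Columns: page height, page width, width left over for the text part.
    print(side, width, width - side)
# prints:
# 256 455 199
# 512 910 398
# 1024 1820 796
```

The last column is the width left for the text part, which is what the enum below stores as text_part_width.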

Now, we can create an enum to calculate the missing values from these 2 numbers:

class StorySize(Enum):
    """The sizing configuration of the story."""

    SIZE_256 = (256, 455)
    SIZE_512 = (512, 910)
    SIZE_1024 = (1024, 1820)

    def __init__(self, image_part_size: int, page_width: int):
        self.page_width: int = page_width
        self.page_height: int = image_part_size
        self.image_part_size: str = f"{image_part_size}x{image_part_size}"
        self.text_part_width: int = page_width - image_part_size
        self.text_part_height: int = image_part_size
        self.font_size: int = self._get_font_size(image_part_size)

Font size is just trial and error. For each input size, I hard-coded the following numbers:

def _get_font_size(size: int) -> int:
    return {256: 16, 512: 38, 1024: 58}[size]

Finally, we want to make the command line interface simple (that's the main interface for now). Hence, I implemented a method that maps the 3 main numbers to the enum. So, when we specify input size, we just specify 256, 512 or 1024 as input.

def get_size_from_str(size: str):
    return {
        "256": StorySize.SIZE_256,
        "512": StorySize.SIZE_512,
        "1024": StorySize.SIZE_1024,
    }[size]

So, by running this method, you get an object with all the different dimensions for the story. All inclusive.
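
To see all the pieces working together, here is a self-contained sketch that can be run as is. It follows the snippets above, with one simplification that is my own assumption rather than the repo's layout: the font sizes live in a module-level dictionary (`_FONT_SIZES`) and `get_size_from_str` is a plain function.

```python
from enum import Enum

# Font sizes from the article, keyed by image side (found by trial and error).
# Keeping them in a module-level dict is a simplification made for this sketch.
_FONT_SIZES = {256: 16, 512: 38, 1024: 58}


class StorySize(Enum):
    """The sizing configuration of the story."""

    SIZE_256 = (256, 455)
    SIZE_512 = (512, 910)
    SIZE_1024 = (1024, 1820)

    def __init__(self, image_part_size: int, page_width: int):
        # Enum unpacks each member's tuple value into these two arguments.
        self.page_width = page_width
        self.page_height = image_part_size
        self.image_part_size = f"{image_part_size}x{image_part_size}"
        self.text_part_width = page_width - image_part_size
        self.text_part_height = image_part_size
        self.font_size = _FONT_SIZES[image_part_size]


def get_size_from_str(size: str) -> StorySize:
    """Map the command line value ("256", "512" or "1024") to the enum."""
    return {
        "256": StorySize.SIZE_256,
        "512": StorySize.SIZE_512,
        "1024": StorySize.SIZE_1024,
    }[size]


size = get_size_from_str("512")
print(size.page_width, size.page_height, size.image_part_size)
print(size.text_part_width, size.font_size)
# prints:
# 910 512 512x512
# 398 38
```

Any other input string raises a KeyError, which is a reasonable failure mode for a command line tool that supports only three sizes.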

Story Content

To represent the contents of the story, we created 3 data structures StoryText, StoryPageContent and StoryContent.

StoryText is simple: it contains the raw text from OpenAI and the tokenized sentences, as discussed in part 1.

class StoryText:
    raw_text: str
    processed_sentences: List[str]

StoryPageContent represents the contents of a single page: the text of this page (sentence), the actual image of the page, the path of the image on local disk, and finally the page number.

class StoryPageContent:
    sentence: str
    image: Image
    image_path: str
    page_number: str

StoryContent represents the contents of the story as a whole. story_seed is the input sentence (title) of the story, raw_text is again the full raw text of the story, page_contents is a list of StoryPageContent and story_size is the object that contains all the dimension info discussed above.

class StoryContent:
    story_seed: str
    raw_text: str
    page_contents: List[StoryPageContent]
    story_size: StorySize

Now, you can see that given this StoryContent object, you know pretty much everything about the generated story, including its seed title, text, images and size. Remember that the generated images have a specific size, which is why we couple the size with the content.

You could explore generating the images once and resizing them based on the input size to save OpenAI calls. That way we can decouple the size from the contents.

Note: If you look at story_utils.py, you will see that we are saving and loading the StoryContent object. This allows us to save the contents after the relatively expensive OpenAI calls and avoid regenerating the contents if a later step fails.
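
The save-and-load idea can be sketched in a few lines. This is not the code from story_utils.py: the JSON format, the `PageRecord` and `ContentRecord` classes and both function names are assumptions made for illustration. The sketch stores only the image path rather than the image itself, and uses a plain string for the size.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class PageRecord:
    """Hypothetical on-disk form of a page: the image is referenced by path."""
    sentence: str
    image_path: str
    page_number: str


@dataclass
class ContentRecord:
    """Hypothetical on-disk form of the story content."""
    story_seed: str
    raw_text: str
    pages: List[PageRecord]
    size: str  # "256", "512" or "1024"


def save_content(record: ContentRecord, path: str) -> None:
    # asdict() converts nested dataclasses into plain dicts and lists.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(record), f)


def load_content(path: str) -> ContentRecord:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    pages = [PageRecord(**page) for page in data.pop("pages")]
    return ContentRecord(pages=pages, **data)


record = ContentRecord(
    story_seed="A brave cat",
    raw_text="Once upon a time.",
    pages=[PageRecord("Once upon a time.", "page_0.png", "0")],
    size="512",
)
path = os.path.join(tempfile.mkdtemp(), "story_content.json")
save_content(record, path)
print(load_content(path) == record)  # prints: True
```

With a checkpoint like this on disk, a failure in a later step only requires reloading the file, not calling OpenAI again.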

B. Content Processing Data Structures

Once we have the contents of the story (images and text), we want to process them into a nice, compelling video. That includes background music, voice over, etc.

AudioInfo

As simple as shown below: a string mp3_file that points to an mp3 file on local disk, and the length of this audio as length_in_seconds.

class AudioInfo:
    mp3_file: str
    length_in_seconds: float

StoryPage

StoryPage builds on top of StoryPageContent. It adds the final image of the full page (text + generated image), the path of that image on disk, and the audio information for the voice over of that page.

class StoryPage:
    page_content: StoryPageContent
    page_image: Image
    page_filepath: str
    audio: AudioInfo

Story

That's the final all-inclusive story. The main reason for this class is to get everything in one place. You may disagree with this approach, but I find it simpler for prototyping and fast iteration. Nothing really new here; you can guess what each field represents from its name.

class Story:
    story_seed: str
    story_raw_text: str
    pages: List[StoryPage]
    start_page_filepath: str
    end_page_filepath: str
    keywords: List[str]
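
As a small illustration of why having everything in one place is convenient, the sketch below walks the nested structures to add up the narration time of a story. The `narration_length` helper is hypothetical (it is not in the repo), and the classes are trimmed to the fields this example needs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioInfo:
    mp3_file: str
    length_in_seconds: float


@dataclass
class StoryPage:
    # page_content and page_image are left out to keep the sketch short.
    page_filepath: str
    audio: AudioInfo


@dataclass
class Story:
    story_seed: str
    story_raw_text: str
    pages: List[StoryPage]
    start_page_filepath: str
    end_page_filepath: str
    keywords: List[str]


def narration_length(story: Story) -> float:
    """Hypothetical helper: total voice-over time across all pages."""
    return sum(page.audio.length_in_seconds for page in story.pages)


story = Story(
    story_seed="A brave cat",
    story_raw_text="Once upon a time. The end.",
    pages=[
        StoryPage("page_0.png", AudioInfo("page_0.mp3", 2.5)),
        StoryPage("page_1.png", AudioInfo("page_1.mp3", 4.0)),
    ],
    start_page_filepath="start.png",
    end_page_filepath="end.png",
    keywords=["cat"],
)
print(narration_length(story))  # prints: 6.5
```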

How does this help writing the code?

Let's look at two examples.

1. Audio Generation

When we think about voice over, or more specifically text-to-speech, we know that a human being would look at the page and read out loud what is on that page. And this is exactly how we designed it here. As you can see, it takes a StoryPageContent as input and returns an AudioInfo. When implementing this method, you don't really need to think about what's happening in other areas like image generation or page processing. Simple, isn't it?

@abstractmethod
def generate_audio(
    self, workdir: str, story_page_content: StoryPageContent
) -> AudioInfo:
    """Perform text to speech and generate an audio file for the given story page

    Args:
        workdir: The workdir where to save the audio files
        story_page_content: The content of a single page from the story

    Returns: AudioInfo object with filepath and length.
    """
    pass
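
To make the point concrete, here is a sketch of an implementation that satisfies this interface without touching any other part of the pipeline. Everything beyond the method signature is an assumption: the `AudioGenerator` base class name, the `EstimatingAudioGenerator` stand-in, the file naming and the speaking rate are all hypothetical, and no speech is actually synthesized.

```python
import os
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class StoryPageContent:
    # The real class also holds the PIL image; it is not needed for audio.
    sentence: str
    image_path: str
    page_number: str


@dataclass
class AudioInfo:
    mp3_file: str
    length_in_seconds: float


class AudioGenerator(ABC):  # hypothetical name for the abstract base class
    @abstractmethod
    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        """Perform text to speech for one page and describe the result."""


class EstimatingAudioGenerator(AudioGenerator):
    """Stand-in for tests: writes no file, estimates length from word count."""

    WORDS_PER_SECOND = 2.5  # assumed speaking rate, for illustration only

    def generate_audio(
        self, workdir: str, story_page_content: StoryPageContent
    ) -> AudioInfo:
        filename = f"audio_{story_page_content.page_number}.mp3"
        word_count = len(story_page_content.sentence.split())
        return AudioInfo(
            mp3_file=os.path.join(workdir, filename),
            length_in_seconds=word_count / self.WORDS_PER_SECOND,
        )


page = StoryPageContent("The cat sat on the mat", "page_1.png", "1")
info = EstimatingAudioGenerator().generate_audio("workdir", page)
print(os.path.basename(info.mp3_file), info.length_in_seconds)
```

A real text-to-speech backend would replace only the body of generate_audio; callers that depend on AudioInfo would not change.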

2. Page Processing

As we will see in later parts, page processing is the process of generating an image from the story content. It combines everything from the contents into a nice looking page. As you can see below, it consumes StoryPageContent, AudioInfo, StorySize and produces StoryPage.

def create_page(
    self,
    workdir: str,
    story_page_content: StoryPageContent,
    audio: AudioInfo,
    story_size: StorySize,
) -> StoryPage: