PL Sergent

Posted on Mar 30, 2023 • Edited on Apr 3, 2023

🐍Music video clip but every word is a Google image🎵

#python #opensource #googlecloud #ai

OWOI_AudioToClip

Python module used for the school project OWOI (One Word One Image)

Installation

After git cloning the repository, you can install the dependencies with the following command:

poetry install

Credentials

Please provide your credentials in the following environment variables:

export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
export GOOGLE_IMAGES_SEARCH_TOKEN="token"
export GOOGLE_SEARCH_ID="id"

Classes

TranscriptFactory

This class is used to create a transcript from a text file. It will create a list of words and a list of timestamps.

from owoi_audio_to_clip.TranscriptFactory import TranscriptFactory

transcript_factory = TranscriptFactory(gcs_uri="gs://bucket/file.mp3")

Methods:

transcribe_audio_to_text() -> list[dict]: transcribes the audio at gcs_uri to text and returns a list of dicts with the following keys: "word", "start_time" and "end_time"
get_word_timestamps() -> list[dict]: returns a list of dicts with the following keys: "word", "start_time" and "end_time"

This Class should be used to create a transcript from a text…


A cool idea 💡

Everything started when I saw this video:

Sicko Mode but every word is a Google image

I just loved the idea of having an image for every word in the lyrics. A few years ago I started thinking about automating the process.

The project 📁

In my engineering school we had to do a project for the semester and so we decided to give it a try.

The idea was to build a website where people could upload songs of their choice and generate similar video clips based on the same principle: each word becomes an image.

Here is the repo for the front end but we won't dive into this in this article.

Project OneWordOneImage (OWOI) | Equipe 7

Introduction

OneWordOneImage is a tool that lets users create clips made of images synchronized with the lyrics of a song.

Follow this link to see the source code: GitHub

Execution

requirements

Poetry
Python ^3.10
PostgreSQL
Web Browser

setup

Configure and launch the poetry environment

Documentation : Poetry

# Launch poetry
$ poetry shell

# First installation
# Check the Python version
$ poetry env info

# First installation
# Change the Python version (if lower than 3.10)
$ poetry env use <path_python_^3.10>

# Install the libraries of the environment
$ poetry install

Install and build npm libraries

# Install the npm libraries
# path : app/frontend
$ npm install --legacy-peer-deps

# Build the frontend
$ npm run build

Install Docker and setup the…


The tools 🔨

In order to automate the creation of clips, we needed tools to:

Recognize the lyrics of any song, with a precise timestamp for each word
Fetch the images from Google Images
Concatenate the images to create a video clip

Lyrics recognition

We looked at different tools. The obviously better option would have been an API that gives us the lyrics directly (for instance Musixmatch), but we needed the timestamp of each word in order to match the pictures with the song.

That's why we decided to use Google Speech-to-Text.

This AI-powered API was not originally made to recognize singing voices, which limits the range of songs we can handle. But with it we could get the lyrics with pretty good accuracy (again, depending on the music) as well as the timestamps.
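As a rough sketch of what asking Speech-to-Text for word-level timestamps can look like with the google-cloud-speech client. This is not necessarily how TranscriptFactory does it: the function names, the "en-US" language code, the timeout and the assumption of a FLAC or WAV file (whose encoding is read from the file header) are all mine.

```python
def words_from_response(response) -> list[dict]:
    """Flatten a recognition response into word/start_time/end_time dicts."""
    words = []
    for result in response.results:
        if not result.alternatives:
            continue
        # The first alternative is the most likely transcription.
        for info in result.alternatives[0].words:
            words.append(
                {
                    "word": info.word,
                    "start_time": info.start_time.total_seconds(),
                    "end_time": info.end_time.total_seconds(),
                }
            )
    return words


def transcribe(gcs_uri: str, language_code: str = "en-US") -> list[dict]:
    """Run long-running recognition on an audio file stored in a bucket."""
    # Imported here so the helper above can be used without the client library.
    from google.cloud import speech

    client = speech.SpeechClient()  # reads GOOGLE_APPLICATION_CREDENTIALS
    config = speech.RecognitionConfig(
        language_code=language_code,
        enable_word_time_offsets=True,  # ask for per-word timestamps
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    operation = client.long_running_recognize(config=config, audio=audio)
    # Assumed timeout of 10 minutes; long songs may need more.
    return words_from_response(operation.result(timeout=600))
```

The output has the same shape as the list of dicts described in the module README ("word", "start_time", "end_time"), with times expressed in seconds.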

Google images

To fetch images from Google Images we simply used the Python package Google-Images-Search.

Tricky part for me:

You'll need to create a search engine here: https://programmablesearchengine.google.com/
In the package documentation they mention the project_cx, which is actually your search engine ID and looks like this: 234903464501239304239:ccxz234er

Video clip editing

That one was pretty much a no-brainer: we used the MoviePy package.

Even though you might struggle a bit to debug issues related to ImageMagick, the documentation is pretty well made.

This link saved me when MoviePy couldn't find ImageMagick: https://github.com/Zulko/moviepy/issues/693#issuecomment-355587113

Also be careful when using subtitles: you'll need to have the selected font installed on your system.
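To give an idea of the MoviePy side, here is a small sketch that turns a list of (image path, duration) pairs into a clip with the song on top. It assumes the MoviePy 1.x API (moviepy.editor, set_duration, set_audio); the function name, the 24 fps value and the input format are my own choices, not the ones from ClipMakerFactory.

```python
def build_clip(images: list[tuple[str, float]], audio_path: str, output_path: str) -> None:
    """Show each image for its duration, add the song, and write the video."""
    # MoviePy 1.x import path; imported lazily so this module loads without it.
    from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

    clips = [ImageClip(path).set_duration(duration) for path, duration in images]
    # "compose" tolerates images that do not all have the same size.
    video = concatenate_videoclips(clips, method="compose")

    audio = AudioFileClip(audio_path)
    # Trim the audio so it never runs past the end of the images.
    audio = audio.subclip(0, min(audio.duration, video.duration))
    video.set_audio(audio).write_videofile(output_path, fps=24)
```

Subtitles are left out on purpose here, since they are the part that depends on ImageMagick and on the installed fonts.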

The dev 🧑‍💻

Without going into the details, you'll find these two main classes:

TranscriptFactory: used to transcribe the song into lyrics
ClipMakerFactory: used to create the video clip

I eventually added a few useful functions to upload things to a GCP bucket, delete locally downloaded images and more importantly the ability to get an audio file from a YouTube link, which makes the module easier to use.

And finally here is the complete process to create a video:

Use a YouTube link => extract audio file and upload to the bucket
Extract the lyrics and timestamps with Google speech-to-text
Iterate over the words to get images from Google Images
Create an individual temporary image clip for each word
Concatenate the clips and add the music
Upload video to the bucket
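One detail hidden in these steps is how long each image stays on screen. A possible rule, sketched below, is to show each image from the start of its word until the start of the next one, with the first image also covering the intro and the last one running to the end of the audio. This timing rule and the function name are assumptions of mine for illustration; the module may handle it differently.

```python
def image_durations(words: list[dict], audio_length: float) -> list[tuple[str, float]]:
    """Pair each word with how long its image stays on screen, in seconds.

    Each image lasts until the next word starts, so the durations add up to
    the audio length and the pictures stay in sync with the song.
    """
    # The first image starts at 0 so the intro is covered too.
    starts = [0.0] + [word["start_time"] for word in words[1:]]
    ends = starts[1:] + [audio_length]
    return [
        (word["word"], max(end - start, 0.0))
        for word, start, end in zip(words, starts, ends)
    ]
```

The input is the list of dicts returned by the transcript step ("word", "start_time", "end_time"); the output can be matched with the downloaded image paths before building the clip.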

Example of usage

Full code: example usage on GitHub

Result 📼

The results are pretty convincing and quite funny 😄.

It's sometimes hard to see why some images have been selected. It's important to understand that the program first looks for high-resolution images and takes the first result. Also, here are some params I used:

_search_params = {
  "fileType": "jpg",
  "safe": "medium",
  "imgColorType": "color",
}

So those specific parameters sometimes produce interesting results to say the least.

There are a lot of things to take into account in order not to hurt people. The safe parameter does a good job of keeping the images "friendly", but sometimes it's more delicate. For example, a few times for the word "down" the program used a picture of someone with Down syndrome, and for the word "Turkey" it showed the recent earthquake that happened in the country.

There are a lot of political and moral aspects to consider, and the program is not capable of taking a critical point of view on them.

Improvement 🚀

I'm pretty happy with the result, but it's not perfect by any means.

The AI speech-to-text struggles a lot with anything other than rap music, which makes sense considering how the artists sing. The first improvement would be to find a different way to get the lyrics and timestamps from the audio. With the recent upgrades in the AI world I wouldn't be surprised if that became possible soon (or if someone has the solution in the comments lol).

The processing time is very long (about 4-5 minutes for a 20-30 second clip). The longest part is the audio recognition.

And finally, as I said before, we would need a way to prevent any hurtful pictures from appearing in the video clips.