
PL Sergent

🐍Music video clip but every word is a Google image🎵

PLsergent / OWOI_AudioToClip

Python module used for the school project OWOI (One Word One Image)


After git cloning the repository, you can install the dependencies with the following command:

poetry install


Please provide your credentials in the following environment variables:

export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
export GOOGLE_SEARCH_ID="id"



This class is used to create a transcript from an audio file stored in a GCS bucket. It will create a list of words and a list of timestamps.

from owoi_audio_to_clip.TranscriptFactory import TranscriptFactory
transcript_factory = TranscriptFactory(gcs_uri="gs://bucket/file.mp3")


  • transcribe_audio_to_text() -> list[dict]: transcribes the audio at gcs_uri and returns a list of dicts with the keys "word", "start_time" and "end_time"
  • get_word_timestamps() -> list[dict]: returns a list of dicts with the keys "word", "start_time" and "end_time"

This class should be used to create a transcript from a text…

A cool idea 💡

Everything started when I saw this video:

Sicko Mode but every word is a Google image

I just loved the idea of having an image for every word in the lyrics, and I had been thinking about automating the process for a few years.

The project 📁

In my engineering school we had to do a project for the semester, so we decided to give it a try.

The idea was to build a website where people could upload a song of their choice and generate a similar video clip, following the same idea: each word becomes an image.

Here is the repo for the front end, but we won't dive into it in this article.

layfredrc / OneWord_OneImage

Cross-disciplinary project, Team 7

Project OneWordOneImage (OWOI) | Team 7

Badr TADJER | Frédéric LAY | Pierre-Louis SERGENT | Leo TRAN | Meo BIENFAIT | Younes BOUCHAKOUR


OneWordOneImage is a tool that lets users create clips out of images synchronized with a song's lyrics.

Follow this link to see the source code: GitHub



  • Poetry
  • Python ^3.10
  • PostgreSQL
  • Web Browser


  1. Configure and launch the poetry environment

    Documentation : Poetry

    # Launch poetry
    $ poetry shell
    # First installation: check the Python version
    $ poetry env info
    # First installation: change the Python version (if below 3.10)
    $ poetry env use <path_to_python_3.10>
    # Install the libraries of the environment
    $ poetry install
  2. Install and build npm libraries

    # Install the npm libraries
    # path : app/frontend
    $ npm install --legacy-peer-deps
    # Build the frontend
    $ npm run build
  3. Install Docker and setup the…

The tools 🔨

In order to automate the creation of clips we would need tools to:

  • Recognize the lyrics of any song, with a precise timestamp for each word
  • Fetch the images from Google Images
  • Concatenate the images into a video clip

Lyrics recognition

We looked at different tools; the obvious option was to use an API that would give us the lyrics directly (for instance Musixmatch). But the issue was that we needed the timestamp of each word in order to match the pictures with the song.

Google cloud

That's why we decided to use Google Speech-to-Text.

This AI-powered API was not originally made to recognize singing voices, which will limit the range of usable songs later on. But with it we could get the lyrics with pretty good accuracy (again, depending on the music) as well as the timestamps.
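As a sketch of how the word-level timestamps come out: with the google-cloud-speech library, setting `enable_word_time_offsets=True` in the `RecognitionConfig` makes each result carry per-word start/end offsets. The extraction helper below is my own illustration, and the stand-in objects only mirror the response shape so the sketch runs without credentials:

```python
from datetime import timedelta
from types import SimpleNamespace

def extract_word_timestamps(results):
    """Flatten Speech-to-Text results into {"word", "start_time", "end_time"} dicts.

    Each result's top alternative carries a `words` list when
    enable_word_time_offsets=True was set in the RecognitionConfig.
    """
    timestamps = []
    for result in results:
        for info in result.alternatives[0].words:
            timestamps.append({
                "word": info.word,
                "start_time": info.start_time.total_seconds(),
                "end_time": info.end_time.total_seconds(),
            })
    return timestamps

# The real call needs credentials and runs asynchronously, roughly:
#   client = speech.SpeechClient()
#   config = speech.RecognitionConfig(enable_word_time_offsets=True, ...)
#   audio = speech.RecognitionAudio(uri="gs://bucket/file.mp3")
#   results = client.long_running_recognize(config=config, audio=audio).result().results

# Stand-in response so the sketch runs without the API:
fake_results = [SimpleNamespace(alternatives=[SimpleNamespace(words=[
    SimpleNamespace(word="sicko",
                    start_time=timedelta(seconds=0),
                    end_time=timedelta(seconds=1)),
])])]
print(extract_word_timestamps(fake_results))
# [{'word': 'sicko', 'start_time': 0.0, 'end_time': 1.0}]
```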

Google images

To fetch images from Google Image we simply used the Python package Google-Images-Search.

Tricky part for me:

  • You'll need to create a custom search engine (a Google Programmable Search Engine) in your Google account
  • In the package documentation they mention the project_cx, which is actually your search engine id and looks like this: 234903464501239304239:ccxz234er
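To make the per-word query concrete, here is a small sketch of how the package's `search_params` dict could be built for each word; the `build_search_params` helper name is my own, but the keys are the ones the Google-Images-Search package accepts:

```python
def build_search_params(word: str, num: int = 1) -> dict:
    """Build a search_params dict for one word (helper name is illustrative)."""
    return {
        "q": word,             # the word taken from the transcript
        "num": num,            # only the first result is kept
        "fileType": "jpg",
        "safe": "medium",
        "imgColorType": "color",
    }

# With the package itself (needs an API key and the search engine id):
#   from google_images_search import GoogleImagesSearch
#   gis = GoogleImagesSearch(api_key, search_engine_id)
#   gis.search(search_params=build_search_params("turkey"))
#   for image in gis.results():
#       image.download("/tmp/images")
print(build_search_params("turkey"))
```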

Video clip editing

That's pretty much a no-brainer: we used the package MoviePy.

Even though you might struggle a bit to debug issues related to ImageMagick, the documentation is pretty well made.

This link saved me when MoviePy couldn't find ImageMagick:

Also be careful when using subtitles: you'll need to have the selected font installed on your system.
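The assembly itself boils down to one ImageClip per word, concatenated, with the song laid back on top. The offsets helper below is my own sketch for placing a subtitle over each word at the right moment; the MoviePy calls in the comments are the usual ones:

```python
def concat_offsets(durations):
    """Start time of each sub-clip once they are concatenated back to back.

    Handy for positioning a TextClip subtitle over each word (sketch helper,
    not part of the module's API).
    """
    offsets, t = [], 0.0
    for d in durations:
        offsets.append(t)
        t += d
    return offsets

# Typical MoviePy assembly (sketch; TextClip needs ImageMagick and the font installed):
#   from moviepy.editor import ImageClip, AudioFileClip, concatenate_videoclips
#   clips = [ImageClip(path).set_duration(d) for path, d in images_with_durations]
#   video = concatenate_videoclips(clips, method="compose").set_audio(AudioFileClip("song.mp3"))
#   video.write_videofile("clip.mp4", fps=24)
print(concat_offsets([1.0, 2.0, 0.5]))  # [0.0, 1.0, 3.0]
```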

The dev 🧑‍💻


Without going into the details, you'll find these two main classes:

  • TranscriptFactory: used to transcribe the song into lyrics
  • ClipMakerFactory: used to create the video clip

I eventually added a few useful functions: uploading files to a GCP bucket, deleting locally downloaded images and, more importantly, getting an audio file from a YouTube link, which makes the module easier to use.

Overall schema

And finally here is the complete process to create a video:

  • Take a YouTube link => extract the audio file and upload it to the bucket
  • Extract the lyrics and timestamps with Google Speech-to-Text
  • Iterate over the words to get images from Google Images
  • Create an individual temporary image clip for each word
  • Concatenate the clips and add the music
  • Upload the video to the bucket
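The middle of that pipeline (words → images → timed clips) can be sketched generically. The two callables stand in for the Google Images download and the MoviePy clip construction, and `assemble` is my own illustrative name, not the module's API:

```python
def assemble(words, fetch_image, make_image_clip):
    """Steps 3-5 of the process above: for each transcribed word, fetch one
    image and turn it into a clip lasting exactly as long as the word is sung.

    `words` is the list of {"word", "start_time", "end_time"} dicts from the
    transcript; the two callables are stand-ins for the real API calls.
    """
    return [
        make_image_clip(fetch_image(w["word"]),
                        w["end_time"] - w["start_time"])
        for w in words
    ]

# Stand-ins just to show the flow:
clips = assemble(
    [{"word": "mode", "start_time": 0.0, "end_time": 1.0}],
    fetch_image=lambda word: f"/tmp/{word}.jpg",
    make_image_clip=lambda path, duration: (path, duration),
)
print(clips)  # [('/tmp/mode.jpg', 1.0)]
```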

Example of usage

Full code: example usage on GitHub


Result 📼

The results are pretty convincing and quite funny 😄.

It's sometimes hard to see why some images were selected. It's important to understand that the program first looks for high-resolution images and takes the first result. Also, here are some parameters I used:

_search_params = {
  "fileType": "jpg",
  "safe": "medium",
  "imgColorType": "color",
}

So those specific parameters sometimes produce interesting results to say the least.

There are a lot of things to take into account in order not to hurt people. The safe parameter does a good job of keeping the images "friendly", but sometimes it's more delicate: a few times, for the word "down", the program picked a picture of someone with Down syndrome, and for the word "Turkey" it showed the recent earthquake that hit the country.

There are many political and moral aspects to consider, and the program is not capable of taking a critical point of view on them.

Improvement 🚀

I'm pretty happy with the result, but it's not perfect by any means.

The AI speech-to-text struggles a lot with anything other than rap music, which makes sense considering how artists sing. The first improvement would be to find a different way to get the lyrics and timestamps from the audio. With the recent upgrades in the AI world I wouldn't be surprised if that became possible soon (or if someone has the solution in the comments lol).

The processing time is very long (about 4-5 minutes for a 20-30 second clip), and the longest part is the audio recognition.

And finally, as I said before, we would need a way to prevent hurtful pictures from appearing in the video clips.

Top comments (2)

kaikina

Nice fun project, good job

PL Sergent

Thanks :)