DEV Community

Stokry

Export text from the video with Python

In today's post, I will show you how to export text from a video. We are going to use SpeechRecognition, a library for performing speech recognition; in this post we use it with the Google Speech Recognition API.
We will also use the moviepy library. MoviePy is a Python library for video editing: cutting, concatenation, title insertion, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects. MoviePy can read and write all the most common audio and video formats, including GIF, and runs on Windows/Mac/Linux, with Python 2.7+ and 3 (or only Python 3.4+ from v.1.0).
Let's start

import speech_recognition as sr
import moviepy.editor as me

We need to specify the video file, the output audio file, and the output text file:

VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"

The concept is this: the script converts the mp4 file to a wav file, and then produces a text file from that audio.
Let's do that. First, we extract the audio from the video:

video_clip = me.VideoFileClip(r"{}".format(VIDEO_FILE))
video_clip.audio.write_audiofile(r"{}".format(OUTPUT_AUDIO_FILE))
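
The extraction step can also be wrapped in a small function. The following is only a sketch under a few assumptions: `audio_path_for` and `extract_audio` are hypothetical helper names (they are not part of MoviePy), and the fallback import assumes that newer MoviePy releases (2.x) expose `VideoFileClip` at the package top level instead of in `moviepy.editor`. It also checks for a video that has no audio track and closes the clip when done.

```python
from pathlib import Path


def audio_path_for(video_path, suffix=".wav"):
    """Hypothetical helper: same name as the video, but with a new suffix."""
    return str(Path(video_path).with_suffix(suffix))


def extract_audio(video_path, audio_path=None):
    """Write the audio track of video_path to a file and return its path."""
    # Imported lazily so the path helper above works without MoviePy installed.
    try:
        from moviepy.editor import VideoFileClip  # MoviePy 1.x layout
    except ImportError:
        from moviepy import VideoFileClip  # assumed MoviePy 2.x layout

    audio_path = audio_path or audio_path_for(video_path)
    clip = VideoFileClip(video_path)
    try:
        # clip.audio is None when the video has no audio track
        if clip.audio is None:
            raise ValueError("{} has no audio track".format(video_path))
        clip.audio.write_audiofile(audio_path)
    finally:
        # Release the file handles held by the clip
        clip.close()
    return audio_path
```

With this in place, `extract_audio("video.mp4")` would write `video.wav` next to the video, assuming MoviePy is installed and the file exists.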

The next thing we need to do is define the recognizer.

recognizer = sr.Recognizer()

Next, we load the audio file for recognition:

audio_clip = sr.AudioFile("{}".format(OUTPUT_AUDIO_FILE))

Now the magic begins: we start the conversion to text.

try:
    with audio_clip as source:
        # Read the whole wav file into memory as audio data
        audio_file = recognizer.record(source)
    print("Please wait ...")

    # Send the audio to the Google Speech Recognition API
    result = recognizer.recognize_google(audio_file)

    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
        print("Speech to text conversion successful.")

except Exception as e:
    print("Attempt failed -- ", e)
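
As a variation on the step above, the recognition errors can be handled separately instead of catching every `Exception`. This is a minimal sketch, not the only way to do it: `save_transcript` and `transcribe_to_file` are hypothetical helper names, the `language` value is an assumed default, and writing the file as UTF-8 is my own choice so that non-ASCII text is saved the same way on every platform. `sr.UnknownValueError` and `sr.RequestError` are the exceptions SpeechRecognition raises for unintelligible audio and for a failed API request.

```python
from pathlib import Path


def save_transcript(text, text_path):
    """Hypothetical helper: write the transcript as UTF-8 and return its path."""
    text_path = Path(text_path)
    text_path.write_text(text, encoding="utf-8")
    return text_path


def transcribe_to_file(wav_path, text_path, language="en-US"):
    """Recognize a wav file and save the text; return None if it fails."""
    # Imported lazily so save_transcript works without the library installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The service could not make out any speech in the audio
        print("Speech was not understood.")
        return None
    except sr.RequestError as error:
        # Network problem or the API rejected the request
        print("Request to the recognition service failed:", error)
        return None
    save_transcript(text, text_path)
    return text
```

A call such as `transcribe_to_file("converted.wav", "recognized.txt")` would then return the recognized text, or `None` after printing the reason for the failure.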

This is the whole code:

import speech_recognition as sr
import moviepy.editor as me

VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"
try:
    # Extract the audio track from the video and save it as a wav file
    video_clip = me.VideoFileClip(r"{}".format(VIDEO_FILE))
    video_clip.audio.write_audiofile(r"{}".format(OUTPUT_AUDIO_FILE))
    # Load the wav file and read it into memory as audio data
    recognizer = sr.Recognizer()
    audio_clip = sr.AudioFile("{}".format(OUTPUT_AUDIO_FILE))
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")
    # Send the audio to the Google Speech Recognition API
    result = recognizer.recognize_google(audio_file)
    # Save the recognized text
    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
        print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)

Note
For longer videos, you can split the audio data into chunks and recognize each chunk separately.
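
One possible way to do the chunking is sketched below. It relies on the `offset` and `duration` arguments of `recognizer.record`, and it assumes the wav file is plain PCM so that the standard `wave` module can read its length. The function names and the 30-second chunk size are my own illustrative choices, not something the libraries prescribe.

```python
import wave


def wav_duration_seconds(path):
    """Length of a PCM wav file in seconds (assumes uncompressed PCM)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def chunk_offsets(total_seconds, chunk_seconds=30.0):
    """Return (offset, duration) pairs that cover total_seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    offsets = []
    start = 0.0
    while start < total_seconds:
        # The last chunk may be shorter than chunk_seconds
        offsets.append((start, min(chunk_seconds, total_seconds - start)))
        start += chunk_seconds
    return offsets


def transcribe_in_chunks(wav_path, chunk_seconds=30.0):
    """Return a list of (offset_seconds, text) pairs, one per chunk."""
    # Imported lazily so the helpers above work without the library installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    results = []
    total = wav_duration_seconds(wav_path)
    for offset, duration in chunk_offsets(total, chunk_seconds):
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source, offset=offset, duration=duration)
        try:
            text = recognizer.recognize_google(audio)
        except sr.UnknownValueError:
            text = ""  # nothing intelligible in this chunk
        results.append((offset, text))
    return results
```

Joining the texts with `" ".join(text for _, text in results)` gives the full transcript. Note that a fixed-size split can cut a word in half at a chunk boundary, so the result may differ slightly from recognizing the whole file at once.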

This is the video that I used for testing: video.
The video was originally uploaded to YouTube and you can find it here: Youtube link.

Thank you all.

Top comments (1)

DSNR

Hey! awesome post. Works brilliantly and helped clear some things up for me with how it works.

How would I track where each word is by some timestamp, to the nearest second?

I would like to return live timestamps for each word along with the transcription.

For clarity: my end goal is the ability to search for a word, find all instances of it within a clip, and then output them selectively. Essentially, that would give me 5 files of the word in audio as individual clips, labelled accordingly, etc.

Thanks for the great post!