haha512
Write a video translation and voiceover tool in Python

Recently, I've been researching video translation: taking a video in one language and producing a version with subtitles and dubbed audio in another. I achieved this effect with the following project:

Open source code on GitHub

This project translates and dubs videos from one language into another. Speech recognition is based on the offline openai-whisper model, text translation uses Google Translate, text-to-speech synthesis uses Microsoft Edge TTS, and background audio separation is handled by Spleeter. No commercial APIs need to be purchased.

project demo

Initially, my intention was only to convert speech to text and generate subtitles. For the dubbing step, however, I couldn't find a text-to-speech solution that met the criteria of natural voice quality, high accuracy, and easy installation.

For instance, I explored options such as Facebook's seamless_communication, Mozilla TTS, and various models from https://huggingface.co/. Unfortunately, the results were unsatisfactory: everything short of training a dedicated model yielded poor quality and was hard to use.

Then an idea struck me: the Edge browser has a built-in text-to-speech feature, and since most Windows 10 and Windows 11 users already have Edge installed, I thought of leveraging the "Edge TTS" API. I wasted no time searching GitHub for Edge TTS-related projects, studied them, and resumed the unfinished dubbing task.

Here is the overall approach:

Technology stack: Python 3.10 + FFmpeg + OpenAI-Whisper offline model + Spleeter

Extract the audio from the video and split it on silence into segments for easier recognition, using the "pydub" library (install with pip install pydub):

from pydub import silence

# Returns a list of [start_ms, end_ms] pairs marking the silent ranges,
# for example: [[0, 5000], [6000, 10000]]
silence.detect_silence(
    normalized_sound,
    min_silence_len=min_silence
)
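Note that detect_silence returns the silent ranges, not the speech itself; the speech segments are the gaps between them. A minimal sketch of that inversion (pure Python; the helper name and variables are my own, not part of the project):

```python
def silence_to_speech(silent_ranges, total_ms):
    """Invert [start_ms, end_ms] silent ranges into the speech segments between them."""
    segments = []
    cursor = 0
    for start, end in silent_ranges:
        if start > cursor:           # speech before this silent range
            segments.append([cursor, start])
        cursor = end
    if cursor < total_ms:            # trailing speech after the last silence
        segments.append([cursor, total_ms])
    return segments

print(silence_to_speech([[0, 5000], [6000, 10000]], 12000))
# -> [[5000, 6000], [10000, 12000]]
```

Each resulting [start, end] pair can then be exported as its own audio chunk and fed to the recognizer.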

Use the "openai-whisper" speech recognition library:

import speech_recognition

r = speech_recognition.Recognizer()
text = r.recognize_whisper(audio_listened, language="en")

"text" contains the recognized text.

Translate the recognized text into the desired language via Google Translate. I use the "requests" library to fetch the Google Translate mobile page directly and extract the result:

import os
import re
import urllib.parse

import requests


def googletrans(text, src, dest):
    url = (
        "https://translate.google.com/m?"
        f"sl={urllib.parse.quote(src)}&tl={urllib.parse.quote(dest)}"
        f"&hl={urllib.parse.quote(dest)}&q={urllib.parse.quote(text)}"
    )
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    proxies = None
    if "http_proxy" in os.environ:
        proxies = {
            'http': os.environ['http_proxy'],
            'https': os.environ.get('https_proxy', os.environ['http_proxy'])
        }
    try:
        response = requests.get(url, proxies=proxies, headers=headers, timeout=40)
        if response.status_code != 200:
            return f"error translation code={response.status_code}"
        re_result = re.findall(
            r'(?s)class="(?:t0|result-container)">(.*?)<', response.text)
    except requests.RequestException:
        return "[error google api] Please check the connectivity of the proxy or consider changing the IP address."
    return "error on translation" if len(re_result) < 1 else re_result[0]
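Since the text travels in the URL's query string, very long inputs are impractical to send in one request. A sketch of batching subtitle lines under a character budget before translating (this helper is my own, not part of the project):

```python
def batch_lines(lines, limit=1500):
    """Group subtitle lines into batches whose joined length stays under `limit`."""
    batches, current, size = [], [], 0
    for line in lines:
        # +1 accounts for the newline used to join lines within a batch
        if current and size + len(line) + 1 > limit:
            batches.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        batches.append("\n".join(current))
    return batches

# Each batch can then be passed to googletrans() as one request,
# and the translated result split back on newlines.
batches = batch_lines(["a" * 900, "b" * 900, "c" * 100], limit=1500)
print(len(batches))  # 2
```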

Install the "srt" library (pip install srt) to merge the translated text into SRT subtitle format:

sub = srt.Subtitle(index=index, start=start, end=end, content=text)
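Under the hood, an SRT entry is just an index, a start --> end timestamp pair, and the text. A hand-rolled sketch of the same formatting using only the standard library, in case you want to see what the srt library produces (the function names are mine):

```python
from datetime import timedelta

def srt_timestamp(td: timedelta) -> str:
    """Format a timedelta as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(td.total_seconds() * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

def compose_entry(index: int, start: timedelta, end: timedelta, content: str) -> str:
    """Render one SRT block: index line, timing line, subtitle text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{content}\n"

print(compose_entry(1, timedelta(0), timedelta(minutes=1), "Hello"))
# 1
# 00:00:00,000 --> 00:01:00,000
# Hello
```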

Resulting merged SRT file:

1
00:00:00,000 --> 00:01:00,000
Nowadays, the voice experience has become a big deal in the business world. To achieve a good experience, you need real-time and accurate transcription as the foundation. However, most Automatic Speech Recognition Services (ASRS) are built on technology that has been around for over 50 years.

2
00:01:00,000 --> 00:   

Encountered Challenges

  1. Placing FFmpeg directly in the project directory and locating it by prepending that directory to os.environ['PATH'].

  2. The issue of audio and video misalignment.

    The same sentence takes a different amount of time to speak in Chinese than in English, so the dubbed clip's duration rarely matches the original: a 10-second clip may shrink to 5 seconds, or vice versa. I therefore added an option to raise or lower the speech rate as needed, plus an automatic adjustment: if the translated audio is longer than the original, it is played back at an accelerated speed until the durations align.
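One way to implement that speed-up is FFmpeg's atempo audio filter. A single atempo stage only accepts factors between 0.5 and 2.0, so larger ratios have to be chained; here is a sketch of building such a chain (my own helper, the project may do this differently):

```python
def atempo_chain(ratio: float) -> str:
    """Build an ffmpeg atempo filter string for an arbitrary speed ratio.

    A single atempo stage only accepts 0.5..2.0, so ratios outside that
    range are split into a chain of stages.
    """
    stages = []
    while ratio > 2.0:
        stages.append("atempo=2.0")
        ratio /= 2.0
    while ratio < 0.5:
        stages.append("atempo=0.5")
        ratio /= 0.5
    stages.append(f"atempo={ratio:.4f}")
    return ",".join(stages)

# Translated clip is 2.5x the original duration -> speed up 2.5x:
print(atempo_chain(2.5))  # atempo=2.0,atempo=1.2500
```

The resulting string can be passed to FFmpeg as `-af "<chain>"` when re-encoding the dubbed audio.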

Wrapping a GUI Interface with tkinter

  1. The interface uses Python's built-in standard GUI library, tkinter. For easier layout, the PySimpleGUI library, a wrapper around tkinter, is used on top of it.
  2. The application can be packaged as an executable with pyinstaller by executing pyinstaller -w sp.py.

GUI Interface

GUI demo

subtitle

CLI Mode

The application now includes a CLI mode. After deploying the source code, you can execute python cli.py to perform translations via the command line.

Supported Parameters:

--source_mp4: [Required] Path of the video to be translated, ending with .mp4.
--target_dir: Translation output directory for the video. By default, it is stored in the "_video_out" folder within the source video directory.

--source_language: Video language code, default is en ( zh-cn | zh-tw | en | fr | de | ja | ko | ru | es | th | it | pt | vi | ar )
--target_language: Target language code, default is zh-cn ( zh-cn | zh-tw | en | fr | de | ja | ko | ru | es | th | it | pt | vi | ar )

--proxy: Fill in the HTTP proxy address. By default, it is None. If Google cannot be accessed in your region, you need to provide the proxy address. For example: http://127.0.0.1:10809.

--voice_replace: Provide the corresponding voice role name based on the target language code. The first two letters of the role name should match the first two letters of the target language code. If you're unsure how to fill this parameter, execute python cli.py show_voice to display available character names for each language.

--voice_rate: Adjust the speech rate. Negative values slow down the speech, while positive values accelerate it. The default value is 10, indicating acceleration.
--remove_background: Determines if the background sound should be removed. Passing this parameter indicates background removal.

--voice_silence: Enter a number between 100 and 2000 to represent the minimum duration (in milliseconds) of silence. The default value is 300.

--voice_autorate: Pass this flag to play the translated audio at an accelerated speed whenever its duration exceeds the original, so that the two durations align.

--whisper_model: Default is "base". Other options include "small", "medium", and "large". A larger model gives better recognition accuracy but slower processing speed.
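The flags above map naturally onto Python's argparse. A minimal sketch of how cli.py might parse them (hypothetical, reconstructed from the parameter list; the actual script may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Translate and dub an mp4 video.")
    p.add_argument("--source_mp4", required=True,
                   help="Path of the video to translate, ending with .mp4")
    p.add_argument("--target_dir", default=None,
                   help="Output directory (default: _video_out next to the source)")
    p.add_argument("--source_language", default="en")
    p.add_argument("--target_language", default="zh-cn")
    p.add_argument("--proxy", default=None,
                   help="HTTP proxy address, e.g. http://127.0.0.1:10809")
    p.add_argument("--voice_replace", default=None,
                   help="Edge TTS voice role name for the target language")
    p.add_argument("--voice_rate", type=int, default=10,
                   help="Speech rate offset; negative slows down, positive speeds up")
    p.add_argument("--remove_background", action="store_true")
    p.add_argument("--voice_silence", type=int, default=300,
                   help="Minimum silence length in ms (100-2000)")
    p.add_argument("--voice_autorate", action="store_true")
    p.add_argument("--whisper_model", default="base",
                   choices=["base", "small", "medium", "large"])
    return p

args = build_parser().parse_args(
    ["--source_mp4", "ex.mp4", "--voice_autorate", "--whisper_model", "small"]
)
print(args.whisper_model)  # small
```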

CLI Examples

cli demo

python cli.py --source_mp4 "D:/video/ex.mp4" --source_language en --target_language zh-cn --proxy "http://127.0.0.1:10809" --voice_replace zh-CN-XiaoxiaoNeural

The above command translates the video located at "D:/video/ex.mp4" from English to Chinese, using the voice role "zh-CN-XiaoxiaoNeural" with the specified proxy address "http://127.0.0.1:10809".

python cli.py --source_mp4 "D:/video/ex.mp4" --source_language zh-cn --target_language en --proxy "http://127.0.0.1:10809" --voice_replace en-US-AriaNeural --voice_autorate --whisper_model small

The above command translates the video located at "D:/video/ex.mp4" from Chinese to English, using the voice role "en-US-AriaNeural", and automatically accelerates the translated audio if its duration exceeds the original duration. The whisper model used for speech recognition is "small".

Results

Click here to view demo comparison results

preview

GitHub Repository

https://github.com/jianchang512/pyvideotrans

Referenced Open Source Projects

https://github.com/jiaaro/pydub

https://github.com/rany2/edge-tts

https://github.com/facebookresearch/seamless_communication

https://github.com/coqui-ai/TTS

https://github.com/deezer/spleeter

https://github.com/openai/whisper
