Introduction
In the digital age, creating engaging multimedia content is more accessible than ever. One interesting application is combining text-to-speech (TTS) technology with background music to produce dynamic audio narratives. In this article, we’ll walk through a Python script that does exactly that, leveraging the pydub and gtts libraries to merge spoken text with music. This method is ideal for producing polished audio files perfect for podcasts, audiobooks, or other multimedia projects.
To see a practical example of this technique in action, check out Storyteller4uuu, a YouTube channel that uses a similar approach to create captivating audio stories and narratives.
Getting Started
Before diving into the code, ensure you have the necessary tools installed. You’ll need pydub for audio processing and gtts for converting text to speech, plus ffmpeg, which pydub relies on to read and write formats such as MP3. Install the two Python packages with:
pip install pydub gtts
You’ll also need ffmpeg, which pip does not install. Download it from FFmpeg's official site and make sure it is accessible from your system PATH.
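Before running the script, it can help to confirm that ffmpeg really is reachable from PATH. The following is a minimal sketch using only the standard library; the helper name find_ffmpeg is an assumption made for illustration and is not part of pydub or gtts.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg was not found on PATH; MP3 loading and export will likely fail")
    else:
        print(f"ffmpeg found at {path}")
```

If this reports that ffmpeg is missing, fix the PATH setup first, since the later steps depend on it for MP3 files.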
Step 1: Converting Text to Speech
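The first part of our script converts text from a file into audio using Google Text-to-Speech (gtts), which requires an internet connection. The text is split on periods and each sentence is converted separately, so that a silent pause can be inserted between sentences, with extra silence at the start and end of the recording.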
from gtts import gTTS
from pydub import AudioSegment
import os

# Uppercase Russian alphabet, used for a simple language check
CYRILLIC_UPPER = 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'

def detect_lang(text):
    # Return 'ru' if the text contains any Russian letter (either case), otherwise 'en'
    return 'ru' if any(c.upper() in CYRILLIC_UPPER for c in text) else 'en'

def text_to_speech(input_file_path, output_file_path, silence_duration_ms=1000, start_silence_ms=2000, end_silence_ms=2000):
    try:
        # Read the text from the input file
        with open(input_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Create silence segments
        silence_segment = AudioSegment.silent(duration=silence_duration_ms)
        start_silence = AudioSegment.silent(duration=start_silence_ms)
        end_silence = AudioSegment.silent(duration=end_silence_ms)

        # Split the text by periods and drop empty pieces
        segments = [s.strip() for s in text.split('.') if s.strip()]

        # Create and combine audio segments
        audio_segments = []
        for i, segment in enumerate(segments):
            # Convert one sentence to speech via a temporary MP3 file
            segment_tts = gTTS(segment, lang=detect_lang(segment))
            temp_segment_file = f'temp_segment_{i}.mp3'
            segment_tts.save(temp_segment_file)
            segment_audio = AudioSegment.from_mp3(temp_segment_file)
            os.remove(temp_segment_file)
            audio_segments.append(segment_audio)
            # Add a pause after every sentence except the last one
            if i < len(segments) - 1:
                audio_segments.append(silence_segment)

        # Combine all segments
        final_audio = start_silence + sum(audio_segments, AudioSegment.empty()) + end_silence
        final_audio.export(output_file_path, format='mp3')
        print(f"Speech successfully saved to {output_file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")
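Splitting on periods alone ignores sentences that end with a question mark or an exclamation mark, and it also breaks decimal numbers such as 3.14 in two. One possible alternative is sketched below, using a regular expression that splits only on whitespace following '.', '!' or '?'. The function name split_sentences is an assumption for illustration, and the approach still mis-splits abbreviations such as "Dr. Smith".

```python
import re


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation on each one.

    A split happens only at whitespace that directly follows '.', '!' or '?',
    so a period inside a number like 3.14 does not start a new sentence.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings, which appear when the input is empty or whitespace only
    return [p for p in parts if p]


if __name__ == "__main__":
    for sentence in split_sentences("Hello world. How are you? Fine!"):
        print(sentence)
```

In the script above, the list returned by split_sentences(text) could stand in for text.split('.'); whether keeping the punctuation changes how the speech sounds is worth checking by ear.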
Step 2: Mixing Speech with Background Music
Once the speech audio is prepared, we mix it with background music. The pydub library helps us handle this task effectively, allowing us to overlay audio files and adjust their properties.
from pydub import AudioSegment
import os

def mix_audio_with_music(speech_file_path, music_file_path, output_file_path, volume_reduction_percent=50, fade_duration_ms=3000):
    try:
        # Load the speech and music files
        speech = AudioSegment.from_mp3(speech_file_path)
        music = AudioSegment.from_mp3(music_file_path)

        # Lower the music volume. Subtracting a number from an AudioSegment
        # reduces its gain in dB, so 0-100 percent maps linearly to 0-10 dB here
        volume_reduction = volume_reduction_percent / 100.0
        music = music - (10 * volume_reduction)

        # Loop the music if it is shorter than the speech
        if len(music) < len(speech):
            repeats = len(speech) // len(music) + 1
            music = music * repeats

        # Trim the music to the speech length, then fade it out at the end
        music = music[:len(speech)]
        music = music.fade_out(fade_duration_ms)

        # Mix the audio files
        mixed_audio = speech.overlay(music)
        mixed_audio.export(output_file_path, format='mp3')
        print(f"Mixed audio saved to {output_file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")
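The volume_reduction_percent parameter is turned into a decibel value with a simple linear rule, so 50 percent means a 5 dB cut, not half the amplitude. If a literal percentage of amplitude is wanted, the usual conversion is 20 * log10 of the remaining fraction. The sketch below shows that arithmetic; the helper name percent_to_db is an assumption and is not part of pydub.

```python
import math


def percent_to_db(volume_reduction_percent):
    """Convert an amplitude reduction in percent to a gain change in dB.

    The result is zero or negative: 0 percent gives 0 dB, and 50 percent
    (half the amplitude) gives roughly -6 dB.
    """
    if not 0 <= volume_reduction_percent < 100:
        raise ValueError("volume_reduction_percent must be in the range [0, 100)")
    remaining = 1.0 - volume_reduction_percent / 100.0
    return 20.0 * math.log10(remaining)


if __name__ == "__main__":
    for percent in (0, 25, 50, 90):
        print(f"{percent}% quieter is {percent_to_db(percent):.2f} dB")
```

Since adding a number to an AudioSegment applies that many dB of gain, something like music = music + percent_to_db(60) should slot into the mixing function in place of the linear rule.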
Putting It All Together
Finally, in the main section of your script, define the paths for the input text file, output speech file, and background music. The script then performs the TTS conversion and mixes the resulting speech with the selected music.
if __name__ == "__main__":
    # Define file paths
    input_file = 'stories/story_2.txt'
    output_file = 'output/converted_speech.mp3'

    # Create output directory if it doesn't exist
    os.makedirs(os.path.dirname(output_file), exist_ok=True)

    # Convert text to speech
    text_to_speech(input_file, output_file, silence_duration_ms=2000, start_silence_ms=3000, end_silence_ms=3000)

    speech_file = output_file
    music_file = 'music/FIVE_OF_A_KIND_Density_Time.mp3'
    final_output_file = 'output/converted_speech_with_music.mp3'

    # Mix the audio with music
    mix_audio_with_music(speech_file, music_file, final_output_file, volume_reduction_percent=60, fade_duration_ms=3000)
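Hard-coded paths are fine for a first run, but they mean editing the script for every new story. As a rough sketch, the same settings could be read from the command line with argparse. All argument names and defaults below are assumptions chosen to mirror the values used above; they are not part of the original script.

```python
import argparse


def build_parser():
    """Build a command-line parser for the hypothetical script options."""
    parser = argparse.ArgumentParser(
        description="Convert a text file to speech and mix it with background music"
    )
    parser.add_argument("input_file", help="path to the text file to narrate")
    parser.add_argument("music_file", help="path to the background music MP3")
    parser.add_argument("--output-dir", default="output",
                        help="directory for the generated MP3 files")
    parser.add_argument("--silence-ms", type=int, default=2000,
                        help="pause between sentences in milliseconds")
    parser.add_argument("--volume-reduction", type=int, default=60,
                        help="how much to lower the music, in percent")
    parser.add_argument("--fade-ms", type=int, default=3000,
                        help="length of the music fade-out in milliseconds")
    return parser
```

Calling build_parser().parse_args() in the main section would then supply values such as args.input_file and args.volume_reduction to text_to_speech and mix_audio_with_music.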
Example in Action
To see a practical example of combining TTS and music, check out Storyteller4uuu on YouTube. This channel effectively uses similar techniques to create engaging and immersive audio stories, demonstrating the potential of this approach.
Conclusion
This Python script demonstrates how to combine text-to-speech and background music to create professional-sounding audio content. By leveraging libraries like pydub and gtts, you can automate the process of generating engaging audio for various applications. Experiment with different parameters and files to tailor the results to your specific needs and enjoy the creative possibilities of multimedia content creation!