Introduction
In this tutorial, I will build a basic app that recognizes audio files and converts them to text. With the OpenAI API, the Python library pydub for audio manipulation, and python-dotenv for loading environment variables, it's easy to do. The code is lightweight, and the detailed explanation makes it easy to follow and apply to your daily tasks.
Let's get our hands dirty.
Clone the repository:
git clone https://github.com/ivansing/audio-to-text-app.git
cd audio-to-text-app
You should get the sample files from the repository's assets folder and copy them into your own assets folder.
Setting Up the Environment
Prerequisites:
- Basic Python language
- Code Editor
- Basic command line
Step 1: Setting Up the Project
- Install Python from python.org. It is straightforward; just follow the (recommended) prompts.
- I will use VS Code, since managing a project and its development there is relatively easy.
- In VS Code, open the Terminal menu in the top bar, select New Terminal, and type the following bash command:
mkdir audio-text-app
Then move into the directory we just created:
cd audio-text-app
Your project would then be located in a path like /home/your-username/Projects/my_project (Linux) or /Users/your-username/Projects/my_project (Mac).
In your audio-text-app folder, create the following files:
touch audio-to-text.py .env
The file audio-to-text.py holds this small script app's main functionality and is its entry point. The .env file is where I will save the API key from OpenAI; I will use it in the following steps.
For Windows using Linux subsystem WSL
- Open VS Code, press F1, and select "Connect to WSL."
- Follow the previous steps from Linux/Mac.
Step 2: Install Required Libraries and Create Folders
- In your terminal window, type the following command:
pip install openai pydub python-dotenv
- Install FFmpeg (pydub relies on it to decode and encode audio):
- On macOS (using Homebrew): brew install ffmpeg
- On Ubuntu: sudo apt install ffmpeg
- On Windows: Download and install from ffmpeg.org
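Since pydub shells out to the ffmpeg binary, it's worth verifying the install before running the script. A minimal stdlib-only check (the helper name is my own, not part of the tutorial's code):

```python
import shutil

def ffmpeg_available():
    """Return True if the ffmpeg binary is on the PATH (pydub needs it)."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found - install it before running the app")
```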
Make another directory named assets inside the audio-text-app folder. I will use it to store the .wav audio files for testing:
mkdir assets
Step 3: Setup your OpenAI API Key:
- Go to OpenAI and sign up to generate your openai-api-key.
- Then create a new secret key; this is the openai-api-key.
- Follow the steps in the popup modal window, and press "Create secret key."
- Finally, save the generated key in a notepad or safe place, always hidden from the public, and press "Done."
Now that we have generated our precious hidden key, paste it into the .env file:
OPENAI_API_KEY=<YOUR-API-SECRET-KEY> # Paste the key exactly as you copied it. Don't change or add anything!
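For intuition, python-dotenv essentially reads KEY=VALUE lines from .env and puts them into the process environment. A minimal sketch of that parsing idea (the helper below is illustrative only, not python-dotenv's actual API):

```python
def parse_env_line(line):
    """Parse one KEY=VALUE line from a .env file; return None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

print(parse_env_line("OPENAI_API_KEY=<YOUR-API-SECRET-KEY>"))
```

In the real app you never parse .env yourself; load_dotenv() does this for you, and os.getenv("OPENAI_API_KEY") retrieves the value.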
Step 4: Write the code
- Import the libraries:
import openai
from pydub import AudioSegment
import os
import uuid
from dotenv import load_dotenv

# Load the .env file and set the API key for the openai client
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
- openai: For interacting with the OpenAI Whisper API to transcribe audio.
- pydub: Handles audio file manipulation, such as converting to a single channel (mono) and resampling, which makes the file smaller and less taxing on the CPU.
- os: For building file paths and removing redundant output files.
- uuid: To generate unique file names for the processed audio.
- dotenv: To load environment variables from the .env file, which securely stores the API key.
Functions
convert_to_mono_16k
def convert_to_mono_16k(audio_file_path, output_dir="assets"):
    """Converts audio to mono and 16kHz, returns the path to the converted audio."""
    sound = AudioSegment.from_file(audio_file_path)
    sound = sound.set_channels(1)        # Mono
    sound = sound.set_frame_rate(16000)  # 16kHz

    # Generate a unique filename for the mono version
    converted_file_name = f"{uuid.uuid4()}.wav"
    converted_file_path = os.path.join(output_dir, converted_file_name)

    # Export the converted audio file
    sound.export(converted_file_path, format="wav")

    return converted_file_path
This function takes an audio file, converts it to mono (1 audio channel), and resamples it to 16kHz, a format that works well for transcription with Whisper.
- The audio file is loaded using AudioSegment.
- It is converted to mono with set_channels(1).
- The sample rate is set to 16kHz using set_frame_rate(16000).
- A unique file name is generated using uuid to avoid filename conflicts.
- The processed audio file is exported to the specified output directory (assets by default).
- This function returns the file path of the converted audio, which will be used later for transcription.
transcribe_audio
def transcribe_audio(audio_file_path, clean_up=True):
    """Transcribes audio to text using OpenAI's Whisper."""
    # Convert audio to mono and 16kHz
    mono_audio_path = convert_to_mono_16k(audio_file_path)

    # Transcribe audio using OpenAI's Whisper
    with open(mono_audio_path, "rb") as audio_file:
        transcript = openai.Audio.transcribe("whisper-1", audio_file)

    # Clean up the converted file if needed
    if clean_up:
        os.remove(mono_audio_path)

    return transcript['text']
This function transcribes an audio file into text using the OpenAI Whisper API.
- It calls the convert_to_mono_16k function to ensure the audio is in the correct format (mono, 16kHz).
- The converted file is opened in binary mode ("rb") and passed to the Whisper API for transcription.
- The function optionally cleans up (deletes) the temporary audio file after transcription, controlled by the clean_up argument.
- The function returns the transcription text extracted from the Whisper API's response.
Test code
# Example usage
audio_file = "assets/jackhammer.wav"
transcription = transcribe_audio(audio_file)
print("Transcription:", transcription)
This section demonstrates how to use the transcribe_audio function. Besides the samples stored in the assets folder, you can add more .wav files to test it.
Test the code with the following command:
python3 audio-to-text.py
Now check the output text from the audio file in the terminal:
- The audio_file variable specifies the audio file to be transcribed.
- The transcribe_audio function is called with the audio file path.
- The transcription result is printed to the console.
Summary
This was an ideal tutorial for learning the basics of using various Python libraries. We learned about the OpenAI Whisper API, a neural-network-based model trained for speech recognition, and used pydub to manipulate the audio. I also used the native Python libraries os, for path handling, and uuid, to rename the mono output file.
Conclusion
Python is a vast universe used across general software construction, and you can use this tool as part of a small software package. It is a minimal program that would need many more things in production, starting with more test cases than you can imagine. Ideally, you would also write the output to a text file, but for this short tutorial, I didn't want to add more complexity. If you want to add more features, look at the Python docs; you will be amazed at what you can build, especially with the help of APIs, which let separate programs communicate.
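If you do want the text-file output mentioned above, a small extension could look like this (the function and folder names are my own suggestions, not from the tutorial):

```python
from pathlib import Path

def save_transcript(text, audio_path, out_dir="transcripts"):
    """Write a transcription to <out_dir>/<audio-stem>.txt and return the path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / (Path(audio_path).stem + ".txt")
    target.write_text(text, encoding="utf-8")
    return target

# Hypothetical usage:
# save_transcript(transcribe_audio("assets/jackhammer.wav"), "assets/jackhammer.wav")
```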
About the Author
Ivan Duarte is a backend developer with freelance experience. He is passionate about web development and artificial intelligence and enjoys sharing his knowledge through tutorials and articles. Follow him on X, GitHub, and LinkedIn for more insights and updates.