Last week we talked about FastAPI. If you missed it, you can check that out here!
Introduction
I decided to write about the audio transcription changes this week, as I finally got some code in place to give users an alternative method. Previously, audio-to-text went through a service called AssemblyAI; transcribing 15-20 hours of audio was costing ~$8-15 per month. This code gives users the option to do it for free, though it takes much longer.
SpeechRecognition
SpeechRecognition is a Python library for performing speech recognition through multiple backends. It supports both online and offline engines, which makes it pretty powerful. For our use case, I used its OpenAI Whisper integration (`recognize_whisper`).
Here's the full code:
```python
import os
import shutil
from tempfile import NamedTemporaryFile

import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence


def transcribe_audio_free(file_object):
    # get the extension of the uploaded file
    filename, file_extension = os.path.splitext(file_object.name)
    # write the uploaded bytes to a temp file (delete=False keeps it on disk)
    with NamedTemporaryFile(suffix=file_extension, delete=False) as temp:
        temp.write(file_object.getvalue())
    # split the file into chunks on silence
    audio = AudioSegment.from_file(temp.name)
    audio_chunks = split_on_silence(
        audio,
        # experiment with this value for your target audio file
        min_silence_len=3000,
        # adjust this per requirement
        silence_thresh=audio.dBFS - 30,
        # keep 100 ms of silence at the chunk boundaries, adjustable as well
        keep_silence=100,
    )
    # create a directory to store the audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    r = sr.Recognizer()
    # process each chunk
    for i, audio_chunk in enumerate(audio_chunks, start=1):
        # export the audio chunk and save it in the `folder_name` directory
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        try:
            # audio to text
            uploaded_chunk = sr.AudioFile(chunk_filename)
            with uploaded_chunk as source:
                chunk_audio = r.record(source)
            text = r.recognize_whisper(chunk_audio, "medium")
        except sr.UnknownValueError as e:
            print("Error:", str(e))
        else:
            text = f"{text.capitalize()}. "
            print(chunk_filename, ":", text)
            whole_text += text
    # delete the temp file
    os.unlink(temp.name)
    # clean up the audio-chunks folder
    shutil.rmtree(folder_name)
    # return the text for all detected chunks
    return whole_text
```
There's really not much here, so I'll quickly step through the process:
- Creates a temporary file
- Loads the temporary file into pydub and splits it into smaller chunks
- Creates a directory to store the chunks
- Iterates through the chunks:
  - Exports each chunk into the directory
  - Performs speech-to-text via Whisper
  - Appends the transcribed text to the `whole_text` variable
- Deletes the temporary file
- Removes the directory
- Returns `whole_text`
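One detail worth calling out: the temporary file is created with `delete=False`, which is why the function has to unlink it manually at the end. Here's a minimal stdlib-only sketch of that lifecycle, with hypothetical bytes standing in for the uploaded audio:

```python
import os
from tempfile import NamedTemporaryFile

# Hypothetical stand-in for the uploaded audio bytes.
data = b"fake audio bytes"

# delete=False keeps the file on disk after the with-block closes it,
# so other libraries (pydub, in our case) can reopen it by name.
with NamedTemporaryFile(suffix=".wav", delete=False) as temp:
    temp.write(data)

print(os.path.exists(temp.name))  # True: the file survives the with-block

os.unlink(temp.name)  # manual cleanup, mirroring the end of the function
print(os.path.exists(temp.name))  # False
```

Without the manual `os.unlink`, every transcription run would leave a stray file in the temp directory.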
NOTE:
You may have to tune the values passed to `split_on_silence()` to better suit your file. 3000, -30, and 100 were the sweet spot during my testing.
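Since `silence_thresh` is just an offset from the clip's average loudness, the tuning logic is easy to isolate. Here's a tiny hypothetical helper (not part of the original function) that mirrors the `audio.dBFS - 30` expression:

```python
def silence_threshold(average_dbfs: float, margin_db: float = 30.0) -> float:
    """Anything quieter than this (in dBFS) is treated as silence.

    Mirrors the `audio.dBFS - 30` expression in the function above:
    raise margin_db to split less aggressively, lower it to split more.
    """
    return average_dbfs - margin_db

# e.g. a clip averaging -20 dBFS gets a -50 dBFS silence threshold
print(silence_threshold(-20.0))  # -50.0
```

Keeping the margin relative to the clip's own loudness means quiet recordings and loud recordings both split sensibly without hand-picking an absolute threshold.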
You may also want a different Whisper model. You can change "medium" in `text = r.recognize_whisper(chunk_audio, "medium")` to whichever model best fits your use case (e.g. "tiny", "base", "small", or "large").
Closing Remarks
Please note that the paid method takes significantly less time and, in my opinion, is generally worth using. I may add a progress bar to the free method at some point. Regardless, both methods will be available for use.
Next week, I will begin showing off Veverbot and the mechanisms in place to get him to work. I promise.
Check out the GitHub repo below. You can also find my Twitch account in the socials link, where I will be actively working on this during the week while interacting with whoever is hanging out!
Happy Coding,
Joe