DEV Community

David Mezzetti for NeuML

Transcribe audio to text

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

This article covers the transcription of audio files to text using models provided by Hugging Face.

Install dependencies

Install txtai and all dependencies. Because this article uses an optional pipeline, we also need to install the pipeline extras package.

pip install txtai[pipeline]

# Get test data
wget -N
tar -xvzf tests.tar.gz

Create a Transcription instance

The Transcription instance is the main entry point for transcribing audio to text. The pipeline reduces transcription to a one-line call!

The pipeline executes logic to read audio files into memory, run the data through a machine learning model and output the results to text.

from txtai.pipeline import Transcription

# Create transcription model
transcribe = Transcription("facebook/wav2vec2-large-960h")
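
The first of the steps described above, reading an audio file into memory, can be sketched with the Python standard library alone. This is only an illustration and not how txtai loads audio internally; `read_wav` is a hypothetical helper, and it assumes 16-bit PCM WAV input. Speech models such as wav2vec2 generally expect mono floating-point samples, so the sketch mixes channels down and scales to the range [-1.0, 1.0).

```python
import array
import sys
import wave


def read_wav(path):
    """Read a 16-bit PCM WAV file and return (samples, sample_rate).

    Hypothetical helper for illustration; samples are mono floats in [-1.0, 1.0).
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    # Interpret the raw bytes as signed 16-bit integers
    pcm = array.array("h")
    pcm.frombytes(raw)

    # WAV data is little-endian; swap bytes on big-endian machines
    if sys.byteorder == "big":
        pcm.byteswap()

    # Average the interleaved channels of each frame down to one mono sample
    frames = len(pcm) // channels
    mono = [sum(pcm[i * channels:(i + 1) * channels]) / channels for i in range(frames)]

    # Scale integer samples to floats
    return [sample / 32768.0 for sample in mono], rate
```

The returned sample rate matters because a model trained on audio at one rate will usually need input resampled to that same rate before inference.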

Transcribe audio to text

The example below transcribes a list of audio files to text, playing each file and printing its result.

from IPython.display import Audio, display

files = ["Beijing_mobilises.wav", "Canadas_last_fully.wav", "Maine_man_wins_1_mil.wav", "Make_huge_profits.wav", "The_National_Park.wav", "US_tops_5_million.wav"]
files = ["txtai/%s" % x for x in files]

for x, text in enumerate(transcribe(files)):
    # Play the source audio, then print its transcription
    display(Audio(files[x]))
    print(text)
Baging mobilizes invasion craft along coast as tiwan tensions escalates
Canada's last fully intact ice shelf has suddenly collapsed forming a manhatten sized ice berg
Main man wins from lottery ticket
Make huge profits without working make up to one hundred thousand dollars a day
National park service warns against sacrificing slower friends in a bare attack
U s virus cases top a million
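
One way to put a number on how close transcripts like these are to the spoken text is word error rate (WER): the word-level edit distance between a reference and the transcript, divided by the number of reference words. The sketch below is an illustration rather than part of txtai. `word_error_rate` is a hypothetical helper, and the reference sentence is an assumed ground truth written for this example, not taken from the test data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edits needed to turn the reference words seen
    # so far into the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


# ASSUMPTION: this reference is a guess at the spoken sentence, for illustration only
reference = "Beijing mobilizes invasion craft along coast as Taiwan tensions escalate"

# First transcript from the output above
hypothesis = "Baging mobilizes invasion craft along coast as tiwan tensions escalates"

print(word_error_rate(reference, hypothesis))
```

Under that assumed reference, three of the ten words differ, so the score is 0.3. Note that WER counts a near miss such as "tiwan" as a full error, which is why phonetically close output can still score poorly.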

Overall the results are solid. Each result sounds phonetically like the audio, even where the spelling is off (for example, "Baging" and "bare attack"). There is an open task with the Hugging Face models to use a language model when decoding the model outputs, which should further improve accuracy.
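
To make "decoding the model outputs" concrete: wav2vec2 models of this kind are typically trained with CTC and emit a score per character for each short audio frame. The simplest decoder, sketched below, takes the best character per frame, merges repeats, and drops blank tokens. Because each frame is decided on its own with no knowledge of words, phonetically plausible misspellings slip through, which is the gap a language model would help close. This is a minimal sketch, not the Hugging Face implementation; `greedy_ctc_decode`, the toy vocabulary, and the frame scores are made up for illustration, with `<pad>` assumed as the blank token and `|` as the word delimiter.

```python
def greedy_ctc_decode(frame_scores, vocabulary, blank="<pad>", delimiter="|"):
    """Greedy CTC decoding: best token per frame, merge repeats, drop blanks."""
    tokens = []
    previous = None
    for scores in frame_scores:
        # Index of the highest-scoring token for this frame
        best = max(range(len(scores)), key=scores.__getitem__)
        token = vocabulary[best]

        # A repeated token counts once unless a blank separates the repeats
        if token != previous and token != blank:
            tokens.append(" " if token == delimiter else token)
        previous = token

    return "".join(tokens).strip()


# Toy vocabulary and frame-by-frame best tokens (hypothetical data)
vocabulary = ["<pad>", "|", "A", "B", "E"]
path = ["B", "B", "<pad>", "E", "<pad>", "E", "|", "|", "A", "A"]

# One-hot scores standing in for real model outputs
frame_scores = [[1.0 if token == best else 0.0 for token in vocabulary] for best in path]

print(greedy_ctc_decode(frame_scores, vocabulary))  # BEE A
```

The blank between the two "E" frames is what lets the decoder keep a genuine double letter while still collapsing the stretched-out "B" and "A" frames.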

Keep an eye out for those updated models!
