Transcribe audio to text

David Mezzetti
Founder/CEO at NeuML — applying machine learning to solve everyday problems. Previously co-founded and built Data Works into a successful IT services company.

This article is part of a tutorial series on txtai, an AI-powered search engine.

This article covers the transcription of audio files to text using models provided by Hugging Face.

Install dependencies

Install txtai and all dependencies.

pip install txtai

# Get test data
wget -N https://github.com/neuml/txtai/releases/download/v2.0.0/tests.tar.gz
tar -xvzf tests.tar.gz

Create a Transcription instance

The Transcription instance is the main entry point for transcribing audio to text. The pipeline reduces transcription to a one-line call.

The pipeline reads audio files into memory, runs the data through a machine learning model and outputs the results as text.

from txtai.pipeline import Transcription

# Create transcription model
transcribe = Transcription("facebook/wav2vec2-large-960h")
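To make the first pipeline step (reading audio into memory) more concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not txtai's implementation: the helpers `write_tone` and `read_wav` are hypothetical names, the sketch only handles mono 16-bit PCM WAV files, and the generated tone merely stands in for a real recording.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone(path, rate=16000, seconds=1.0, freq=440.0):
    """Write a mono 16-bit PCM sine wave so the example has a file to read."""
    count = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(count)
    )
    with wave.open(path, "wb") as out:
        out.setnchannels(1)   # mono
        out.setsampwidth(2)   # 16-bit samples
        out.setframerate(rate)
        out.writeframes(frames)


def read_wav(path):
    """Read a mono 16-bit PCM WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # Unpack little-endian signed 16-bit integers and scale them to floats
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples], rate


# 16 kHz is the sampling rate the wav2vec 2.0 model card asks for
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_tone(path)
samples, rate = read_wav(path)
print(rate, len(samples))
```

A list of floats like `samples` is roughly the kind of input a speech model consumes. The Transcription pipeline handles this step itself, so none of this code is needed to use it.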

Transcribe audio to text

The example below transcribes a list of audio files to text. Each clip is displayed as an audio player in the notebook, followed by its transcription.

from IPython.display import Audio, display

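# Test audio files extracted from tests.tar.gz into the txtai directory
files = [
    "Beijing_mobilises.wav", "Canadas_last_fully.wav", "Maine_man_wins_1_mil.wav",
    "Make_huge_profits.wav", "The_National_Park.wav", "US_tops_5_million.wav",
]
files = [f"txtai/{name}" for name in files]

# Show an audio player for each clip, then print its transcription
for path, text in zip(files, transcribe(files)):
    display(Audio(path))
    print(text)
    print()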
Output:
Baging mobilizes invasion craft along coast as tiwan tensions escalates
Canada's last fully intact ice shelf has suddenly collapsed forming a manhatten sized ice berg
Main man wins from lottery ticket
Make huge profits without working make up to one hundred thousand dollars a day
National park service warns against sacrificing slower friends in a bare attack
U s virus cases top a million

Overall, the results are solid. Each result sounds phonetically like the audio, even where the spelling is off (for example, "Baging" for Beijing and "bare attack" for bear attack). There is an open task with the Hugging Face models to decode the model outputs with a language model, which should further improve accuracy.

Keep an eye out for those updated models!
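For context on the decoding step mentioned above: wav2vec 2.0 models such as facebook/wav2vec2-large-960h are trained with CTC, and without a language model the usual approach is greedy decoding. The sketch below shows that idea on hand-made data. The function name `greedy_ctc_decode` and the `frames` list are made up for illustration, and the `<pad>` (blank) and `|` (word separator) tokens are assumed to follow that model's tokenizer conventions. It is not the code txtai or Hugging Face runs.

```python
from itertools import groupby

BLANK = "<pad>"  # CTC blank token (assumed name)
SPACE = "|"      # word separator token (assumed name)


def greedy_ctc_decode(best_tokens):
    """Turn the most likely token at each time step into text.

    Consecutive repeats are collapsed first, then blank tokens are dropped.
    A blank between two identical characters keeps both of them.
    """
    collapsed = [token for token, _ in groupby(best_tokens)]
    chars = [" " if token == SPACE else token for token in collapsed if token != BLANK]
    return "".join(chars).strip()


# Hand-made per-frame predictions, not real model output
frames = ["B", "B", "<pad>", "A", "R", "R", "E", "|", "|",
          "A", "T", "<pad>", "T", "A", "C", "K", "K"]
print(greedy_ctc_decode(frames))  # BARE ATTACK
```

Because each character is picked on its own, nothing checks whether the result is the word a reader would expect, which is one way to end up with phonetically plausible spellings like those above. A language model in the decoding step could prefer "bear attack" given the surrounding words.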
