Last week we talked about FastAPI. If you missed it, you can check that out here!
Introduction
I decided to write about the audio transcription changes this week, as I finally got some code in place to give users an alternative method. Previously, audio-to-text went through a service called AssemblyAI; transcribing 15-20 hours of audio was costing ~$8-15 per month. This code gives users the option to do it for free, though it takes much longer.
SpeechRecognition
SpeechRecognition is a Python library for performing speech recognition through multiple backends. It supports both online and offline engines, which makes it pretty powerful. For our use case, I used its OpenAI Whisper integration (`recognize_whisper`).
Here's the full code:
```python
import os
import shutil
from tempfile import NamedTemporaryFile

import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence


def transcribe_audio_free(file_object):
    # get the extension of the uploaded file
    filename, file_extension = os.path.splitext(file_object.name)
    # write the uploaded bytes to a temp file (delete=False keeps it on disk)
    with NamedTemporaryFile(suffix=file_extension, delete=False) as temp:
        temp.write(file_object.getvalue())
    # split the file into chunks on silence
    audio = AudioSegment.from_file(temp.name)
    audio_chunks = split_on_silence(
        audio,
        # experiment with this value for your target audio file
        min_silence_len=3000,
        # adjust this per requirement
        silence_thresh=audio.dBFS - 30,
        # keep 100 ms of silence at the chunk boundaries, adjustable as well
        keep_silence=100,
    )
    # create a directory to store the audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    r = sr.Recognizer()
    # process each chunk
    for i, audio_chunk in enumerate(audio_chunks, start=1):
        # export the audio chunk and save it in the `folder_name` directory
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        try:
            # audio to text
            uploaded_chunk = sr.AudioFile(chunk_filename)
            with uploaded_chunk as source:
                chunk_audio = r.record(source)
            text = r.recognize_whisper(chunk_audio, "medium")
        except sr.UnknownValueError as e:
            print("Error:", str(e))
        else:
            text = f"{text.capitalize()}. "
            print(chunk_filename, ":", text)
            whole_text += text
    # delete the temp file
    os.unlink(temp.name)
    # clean up the audio-chunks folder
    shutil.rmtree(folder_name)
    # return the text for all detected chunks
    return whole_text
```
There's really not much here, so I'll quickly step through the process:
- Creates a temporary file
- Loads the temporary file into pydub and splits it into smaller chunks
- Creates a directory to store the chunks
- Iterates through the chunks:
  - Exports each chunk into the directory
  - Performs speech-to-text via Whisper
  - Appends the transcribed text to the `whole_text` variable
- Deletes the temporary file
- Removes the directory
- Returns `whole_text`
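One detail worth calling out: the temporary file is created with `delete=False`, which is why the function has to unlink it manually at the end. Here's a minimal stdlib-only sketch of that lifecycle, with hypothetical bytes standing in for the uploaded audio:

```python
import os
from tempfile import NamedTemporaryFile

# Hypothetical stand-in for the uploaded audio bytes.
data = b"fake audio bytes"

# delete=False keeps the file on disk after the with-block closes it,
# so other libraries (pydub, in our case) can reopen it by name.
with NamedTemporaryFile(suffix=".wav", delete=False) as temp:
    temp.write(data)

print(os.path.exists(temp.name))  # True: the file survives the with-block

os.unlink(temp.name)  # manual cleanup, mirroring the end of the function
print(os.path.exists(temp.name))  # False
```

Without the manual `os.unlink`, every transcription run would leave a stray file in the temp directory.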
NOTE:
You may have to tune the values passed to `split_on_silence()` to better suit your file. 3000, -30, and 100 were the sweet spot during my testing.
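Since `silence_thresh` is just an offset from the clip's average loudness, the tuning logic is easy to isolate. Here's a tiny hypothetical helper (not part of the original function) that mirrors the `audio.dBFS - 30` expression:

```python
def silence_threshold(average_dbfs: float, margin_db: float = 30.0) -> float:
    """Anything quieter than this (in dBFS) is treated as silence.

    Mirrors the `audio.dBFS - 30` expression in the function above:
    raise margin_db to split less aggressively, lower it to split more.
    """
    return average_dbfs - margin_db

# e.g. a clip averaging -20 dBFS gets a -50 dBFS silence threshold
print(silence_threshold(-20.0))  # -50.0
```

Keeping the margin relative to the clip's own loudness means quiet recordings and loud recordings both split sensibly without hand-picking an absolute threshold.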
You may also want a different Whisper model. You can change "medium" in `text = r.recognize_whisper(chunk_audio, "medium")` to whichever model best fits your use case (e.g. "tiny", "base", "small", or "large").
Closing Remarks
Please note that the paid method takes significantly less time and, in my opinion, is generally worth using. I may add a progress bar to the free method at some point. Regardless, both methods will be available for use.
Next week, I will begin showing off Veverbot and the mechanisms in place to get him to work. I promise.
Check out the GitHub repo below. You can also find my Twitch account in the socials link, where I will be actively working on this during the week while interacting with whoever is hanging out!
Happy Coding,
Joe