The 20th Day of the 100 Days of Code challenge

Today I got a working demo of my real-time transcription app running. It uses Whisper for transcription and does speaker diarization as well.

For ages I'd been stuck on transmitting valid audio from the client in a form I could save to a file. That was quickly solved by taking inspiration from another project that also transcribes audio in real time, just with Google's API.

That project, by Sahar Mor, saved me a lot of time because it already had the general logic an application like this needs when multiple clients might be using it.

A quick look at the backend outlined the process:

Client identification by Session ID, allowing us to easily access and store its transcription data
A separate transcription thread, so incoming audio keeps flowing while earlier chunks are being transcribed
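
To make those two points concrete, here is a minimal sketch of how they could fit together. It is not the code from the project I borrowed from: SessionStore, fake_transcribe and transcription_worker are hypothetical names I made up for illustration, and the "transcription" is just a stand-in that decodes bytes.

```python
import queue
import threading


class SessionStore:
    """Keeps per-client transcription data, keyed by session ID (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._transcripts = {}  # session_id -> list of transcribed chunks

    def append(self, session_id, text):
        with self._lock:
            self._transcripts.setdefault(session_id, []).append(text)

    def get(self, session_id):
        with self._lock:
            return list(self._transcripts.get(session_id, []))


def fake_transcribe(audio_chunk):
    # Stand-in for the real model call; it only decodes the bytes.
    return audio_chunk.decode("utf-8")


def transcription_worker(jobs, store, transcribe=fake_transcribe):
    """Pulls (session_id, audio_chunk) jobs off the queue until it sees None."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            break
        session_id, audio_chunk = job
        store.append(session_id, transcribe(audio_chunk))


def run_demo():
    jobs = queue.Queue()
    store = SessionStore()
    worker = threading.Thread(target=transcription_worker, args=(jobs, store))
    worker.start()

    # Two clients sending chunks; request handlers would only enqueue work.
    jobs.put(("session-a", b"hello"))
    jobs.put(("session-b", b"hi there"))
    jobs.put(("session-a", b"world"))
    jobs.put(None)
    worker.join()
    return store
```

The point of the queue is that the code receiving audio never waits on the model; it only enqueues, and the worker fills in each session's transcript at its own pace.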

Here are the current challenges:

First challenge:
I don't want to single out any library, but I feel that extracting speaker embeddings with pyannote's model was a poor choice. The embeddings are the most important part (there's no need to diarize the recordings separately, since Whisper already provides segments when transcribing), and the segments can be very short, so the library that extracts them has to be extremely accurate. I'm aiming to switch to a library that is backed by a larger dataset and is better known for its accuracy.
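
To show why embedding quality matters so much, here is a rough sketch of one common way to turn per-segment embeddings into speaker labels: compare each new embedding to the speakers seen so far with cosine similarity. This is an assumption about the approach, not my actual code, and assign_speaker and the 0.7 threshold are made up for illustration. With noisy embeddings from very short segments, the similarity scores get unreliable and segments land on the wrong speaker.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_speaker(embedding, known_speakers, threshold=0.7):
    """Return the label of the closest known speaker, or register a new one.

    known_speakers maps label -> reference embedding and is updated in place.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    best_label, best_score = None, -1.0
    for label, reference in known_speakers.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label
    new_label = f"speaker_{len(known_speakers) + 1}"
    known_speakers[new_label] = list(embedding)
    return new_label
```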

Second challenge:
Transcription accuracy.
Whisper's transcription accuracy is about as good as it gets, but it depends on the model. For some reason, transcribing an audio file only a few seconds long takes as much as 35 seconds with the large model. However, a friend said he managed to get under 2 seconds when running the server in Google Colab. Looking into that is on the agenda for tomorrow.

I initialize the model when the server starts, so I have no idea why it takes so long when running from PyCharm. I suspect I've done something wrong in how I call it.
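
One thing worth checking first (a guess on my part, nothing confirmed yet) is whether the model runs on the CPU locally while Colab provides a GPU. Below is a small sketch for measuring where the time goes. pick_device and timed are hypothetical helpers, and the commented usage assumes the openai-whisper package and a placeholder file name.

```python
import time


def pick_device():
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Possible usage, assuming openai-whisper is installed and "sample.wav"
# (a placeholder name) exists:
#
#   import whisper
#   model, load_seconds = timed(whisper.load_model, "large", device=pick_device())
#   result, run_seconds = timed(model.transcribe, "sample.wav")
#   print(pick_device(), load_seconds, run_seconds)
```

Timing the load and the transcription separately should show whether the 35 seconds is spent in the transcribe call itself or somewhere else around it.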

The current demo isn't very promising at first sight, but the fixes don't appear to be too demanding.

The accuracy of the Speaker Diarization will improve when the approach improves, and the same goes for the transcription accuracy.

That's it for today,
Happy coding everyone!