For my tutorial videos I want to provide high-quality subtitles, but I do not want to write them myself as this is a tedious task.
Luckily there is WhisperAI that can help me, or so it promises 😂 Time to give it a shot for my current project.
Install WhisperAI
I followed the installation guide on their GitHub repository page. It is a Python tool, so the first step was setting up a virtual environment (my currently installed Python version is 3.9.6 on macOS). Here are all the commands I ran:
## I installed python with brew as far as I remember ;-)
## Initialize and activate the virtualenv
python3 -m venv venv
source venv/bin/activate
## Install latest WhisperAI from Github
pip install git+https://github.com/openai/whisper.git
## Install ffmpeg
brew install ffmpeg
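Before moving on, I like to check that everything is wired up. This is just a small sanity check I would run inside the activated virtualenv, not something from the official Whisper docs:
import shutil

import whisper  # this import fails if the pip install did not work

# ffmpeg has to be on the PATH, otherwise transcribing will fail later
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

# List the model sizes this Whisper version ships with
print("available models:", whisper.available_models())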
Create Util-Script for Easier Handling
With that set up, I created a little script based on this Gist. It was written for an older version of WhisperAI, so I had to make some changes. Save it as createSrt.py or whatever name you like 😋:
import sys

import whisper
from whisper.utils import get_writer


def run(input_path: str, output_name: str = "", output_directory: str = "./") -> None:
    # Load the Whisper model (see "Adjustments" below for other sizes)
    model = whisper.load_model("medium")
    # Transcribe the audio/video file
    result = model.transcribe(input_path)
    # Write the result as an .srt file into the output directory
    writer = get_writer("srt", str(output_directory))
    writer(result, output_name)


def main() -> None:
    if len(sys.argv) != 4:
        print(
            "Error: Invalid number of arguments.\n"
            "Usage: python createSrt.py <input-path> <output-name> <output-directory>\n"
            "Example: python createSrt.py ./video.mp4 transcribed ./"
        )
        sys.exit(1)
    run(input_path=sys.argv[1], output_name=sys.argv[2], output_directory=sys.argv[3])


if __name__ == "__main__":
    main()
Usage of Util-Script
You can call it like this, and it can handle .wav and also .mp4 files. So you do not even have to export your videos in another format to use it:
python createSrt.py <path to your mp4/wav file> <name of the srt file> <path to where to save the srt file>
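For example, a call for a tutorial recording could look like this. The file names are made up, and I assume the target folder already exists:
## input video, name of the subtitle file, and target folder (all hypothetical)
python createSrt.py ./my-tutorial.mp4 my-tutorial ./subtitles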
Adjustments
If you want to use another model instead of medium, you have to change the following line and replace medium with one of the models documented here:
model = whisper.load_model("medium")
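For example, switching to the small model could look like the sketch below. Which size works best depends on your hardware and how accurate you need the result to be:
# "tiny", "base", "small", "medium" and "large" are the main model sizes;
# smaller ones are faster but less accurate, "large" needs the most memory
model = whisper.load_model("small")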
If you want to change the output format, you can use one of the following instead of srt: vtt, tsv, json, txt. Change it in the following line:
writer = get_writer("srt", str(output_directory))
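As a sketch, writing WebVTT instead of SubRip would look like this; the rest of the script stays the same:
# Use the WebVTT writer; "tsv", "json" and "txt" work the same way
writer = get_writer("vtt", str(output_directory))
writer(result, output_name)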
Happy transcribing 🦄
Comments (2)
It is interesting, but I have a question: how do you fine-tune the timings?
The models are pretty accurate, but I've noticed they also include the times when no one is speaking.
For example, in one of the audio files I've tried, the person only started speaking after a few seconds, but Whisper logged it in the SRT file from 00:00:00. That means the text appeared in the video well before any voice came out.
I have not found a way around going over the transcripts myself and editing some mistakes.
For example, company names are usually not recognized as they should be.
Whisper, and in my experience every AI tool I have tried so far, gets you maybe 95% of the way to where you want to be. If you are OK with that, use it as it is. If not, you have to invest in the last 5% yourself ;-)