Transcribing YouTube Videos using OpenAI’s Whisper📽️🗣️

Although YouTube has become the standard for video sharing and information gathering, not everyone has the time or capacity to watch a video through to the end. A tool for transcribing these videos can be useful in these situations. Today, we'll look at how to use AI to create your own YouTube transcriber.

We'll also look at how to use natural language processing to summarise the finished transcription, and how Replicate can be used to offload and scale up the transcription process.

What is OpenAI's Whisper? 🗣🤖

Whisper is an automatic speech recognition system created by OpenAI and trained on multilingual, multitask supervised data. It uses deep learning models to transcribe speech from audio and video with high accuracy, making it practical to extract information from large amounts of spoken data.

Whisper has a wide range of potential uses, but we'll be using it specifically to transcribe the audio from YouTube videos.

Getting started 👶🏻

These examples use Python 3, because Whisper is distributed as a Python package.

Virtual Environment Setup 🏞

Generally speaking, it's a good idea to isolate your package installations when starting a new Python project. We can do this by creating a virtual environment:

python3 -m venv venv

This will create your virtual environment in a folder called venv. From here, we can then activate it:

. venv/bin/activate
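
If you want to confirm that the environment is really active, one quick check is to compare sys.prefix with sys.base_prefix from inside Python; the two differ when a venv is in use. This is only a small sketch, and the function name in_virtualenv is made up for illustration:

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```

Running this right after activating should report True; running it in a fresh terminal without activation should report False.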

Installing Dependencies 📦

We'll use pip to install the packages needed:

pip install openai-whisper openai yt-dlp

  1. openai-whisper: Whisper model and API

  2. openai: GPT-3 interface for natural language processing

  3. yt-dlp: library for extracting YouTube data
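
One thing pip does not install is the ffmpeg command-line tool, which both Whisper and yt-dlp's audio extraction rely on. The sketch below checks that the Python modules and ffmpeg can be found before you go further. The function name missing_prerequisites is hypothetical, and the default module names are simply the import names of the packages above:

```python
import importlib.util
import shutil

def missing_prerequisites(modules=("whisper", "openai", "yt_dlp"), tools=("ffmpeg",)):
    """Return the names of Python modules and command-line tools that cannot be found."""
    # find_spec() returns None when a top-level module is not installed.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which() returns None when an executable is not on the PATH.
    missing += [t for t in tools if shutil.which(t) is None]
    return missing

if __name__ == "__main__":
    problems = missing_prerequisites()
    print("Missing:", ", ".join(problems) if problems else "nothing")
```

If ffmpeg shows up as missing, install it with your system's package manager before continuing.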

Fetching the YouTube Audio Stream 📽

To give us something to work with, I've embedded a short TED-Ed video below as an example.

Using the video ID, we can then fetch the video's streams and extract the audio:

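import yt_dlp

def download(video_id: str) -> str:
    video_url = f'{video_id}'
    ydl_opts = {
        # Prefer an m4a audio stream, falling back to the best available audio.
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        # Use ffmpeg to extract the audio track as m4a.
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        # download() takes a list of URLs and returns a non-zero code on failure.
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.m4a'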

def main():
    # The video ID of the embedded video above. 
    file_path = download('bFIVYRfyb3E')

This will download the above video as audio/bFIVYRfyb3E.m4a
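
The download function expects a video ID, but in practice you will often start from a full link copied out of the browser. Here is a minimal sketch of pulling the ID out of the two most common link shapes (youtube.com/watch?v=... and youtu.be/...). The helper extract_video_id is hypothetical and does not cover every URL form YouTube uses:

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url_or_id: str) -> str:
    """Return the video ID from a common YouTube URL, or the input itself if it is not a URL."""
    parsed = urlparse(url_or_id)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0]
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard watch links carry the ID in the `v` query parameter.
        ids = parse_qs(parsed.query).get("v")
        if ids:
            return ids[0]
        raise ValueError(f"No video ID found in {url_or_id!r}")
    # No recognised host: assume the caller already passed a bare ID.
    return url_or_id
```

You could then call download(extract_video_id(link)) with whatever the user pasted in.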

Transcribing the Audio File 🤖

Now that we have the audio file on hand, we can simply feed it into Whisper:

import whisper
# You can adjust the model used here. Model choice is typically a tradeoff between accuracy and speed.
# The available models are listed in the Whisper README.
whisper_model = whisper.load_model("base.en")

def transcribe(file_path: str) -> str:
    # `fp16` defaults to `True`, which requests half-precision inference and is intended for GPUs.
    # For local demonstration purposes, we'll run this on the CPU, so we set it to `False`.
    transcription = whisper_model.transcribe(file_path, fp16=False)
    return transcription['text']

def main():
    transcript = transcribe('audio/bFIVYRfyb3E.m4a')

This will generate the full transcript for the video:

Check the transcript here: Video to text
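
Besides the 'text' field, the dictionary returned by Whisper's transcribe also contains a 'segments' list, where each segment has 'start' and 'end' times in seconds alongside its 'text'. If you would like a transcript with timestamps, a small formatter along these lines may help. The function names and the sample data are illustrative only, not real transcription output:

```python
def format_timestamp(seconds: float) -> str:
    """Convert a number of seconds into HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def format_segments(segments) -> str:
    """Render one line per segment, prefixed with its start time."""
    lines = []
    for seg in segments:
        lines.append(f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}")
    return "\n".join(lines)

# Stand-in data shaped like Whisper's `segments` list (made up for this example).
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Hello there."},
    {"start": 64.5, "end": 70.0, "text": " Welcome back."},
]

if __name__ == "__main__":
    print(format_segments(sample_segments))
```

With a real result, you would pass in the 'segments' value from whisper_model.transcribe(...) instead of the sample list.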

Generating a Transcript Summary 📃

In their videos, many YouTube creators incorporate sponsorships, adverts, and filler content. With the aid of natural language processing, we can create a transcript summary that condenses the transcript into a more manageable form. For this example, we will create these summaries using the widely used gpt-3.5-turbo model.

To create an API key, you must have an OpenAI account. You will be given some free usage as a new user to try out the API.

import openai
# Note: `openai.ChatCompletion` is the interface of openai-python versions before 1.0.
openai.api_key = "<YOUR_OPENAI_API_KEY>"

def generate_summary(transcript: str) -> str:
    # Generate a summary of the transcript using OpenAI's gpt-3.5-turbo model.
    resp = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f'Summarize this: {transcript}'},
        ],
    )
    return resp['choices'][0]['message']['content']

def main():
    transcript = transcribe('audio/bFIVYRfyb3E.m4a')
    summary = generate_summary(transcript)

Results will vary from run to run, but here is an example of what to expect:

Check the summary here: Summary

Have fun! You can adjust the prompt in a variety of ways to tailor the summary to your needs.
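
One adjustment worth considering: chat models only accept a limited amount of text per request, so the transcript of a long video may not fit in a single call. A simple workaround is to split the transcript into chunks, summarise each chunk with generate_summary, and then summarise the combined results. The sketch below covers only the splitting step. It uses word count as a rough stand-in for the model's token count, and both the helper name chunk_transcript and the default of 1500 words are assumptions you should tune:

```python
from typing import List

def chunk_transcript(transcript: str, max_words: int = 1500) -> List[str]:
    """Split a transcript into pieces of at most `max_words` words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = transcript.split()
    # Step through the word list in strides of `max_words`.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

if __name__ == "__main__":
    for piece in chunk_transcript("one two three four five", max_words=2):
        print(piece)
```

Splitting on words can cut a sentence in half, so for better summaries you may prefer to split on sentence boundaries instead.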

Optional: Scale with Replicate

Replicate lets us run open-source models in the cloud. Depending on your use case, this can be a valuable tool for scaling your application.

To use it, create a Replicate account to get an API token. Then install the Replicate client using pip:

pip install replicate

With Replicate set up, we can adjust the transcribe function above so that it can optionally run Whisper through Replicate instead of only on the local CPU:

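import replicate

def transcribe(file_path: str, use_replicate: bool = False) -> str:
    if use_replicate:
        client = replicate.Client(api_token='xxxxx')
        output = client.run(
            'openai/whisper:30414ee7c4fffc37e260fcab7842b5be470b9b840f2b608f5baa9bbef9a259ed',
            # Model inputs (audio, language, model size) all go in `input`.
            input={'audio': open(file_path, 'rb'), 'language': 'en', 'model': 'base'},
        )
        # Assumption: this model version returns a dict with a 'transcription' key.
        transcription = output['transcription']
    else:
        transcription = whisper_model.transcribe(file_path, fp16=False)['text']

    return transcription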
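
Rather than pasting API tokens directly into the source, it is usually safer to read them from environment variables so they never end up in version control. Here is a minimal sketch of that idea. The helper require_env is hypothetical, and REPLICATE_API_TOKEN and OPENAI_API_KEY are the conventional variable names for the two clients, used here as an assumption about how you have set up your shell:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value

# Example usage (commented out so the snippet runs without credentials):
# client = replicate.Client(api_token=require_env("REPLICATE_API_TOKEN"))
# openai.api_key = require_env("OPENAI_API_KEY")
```

Failing early with a readable error is friendlier than an authentication failure halfway through a long transcription job.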

Conclusion 💭

Whisper is a powerful tool for building transcribers that can extract insights from audio and video sources. It can help you streamline your workflow and get more out of your content, whether you're a content creator looking to reuse your videos, a researcher analysing data from video interviews, or anybody else who deals with spoken data.

