
Niharika Singh ⛓


Learn how to create a video summarizer using Whisper from OpenAI and BART from Meta on Hugging Face

Originally posted on: https://writings.niharika.me/how-to-improve-youtube-with-llms

Last update: Wednesday, September 13, 2023

The way humans interact with technology has come a long way. In the early days of computing, users had to input commands using punch cards. If you have to pause and consciously think about what "punch cards" are, then I have proved my point quite effectively.

Humans have indeed come a long, long way in how we interact with and navigate the digital world.

The story of the metaverse is shaping up rapidly, with interaction becoming increasingly natural and intuitive. We are already leveraging touch, voice, and gestures, and probably direct brain connections in the near future.

Lightning-fast advances in Natural Language Processing (NLP) have enabled us to engineer even better user experiences. Just when you think there is no more room left for optimization, there is. NLP enables machines to understand and generate human language, which has a significant impact on how we communicate with technology.

🎯 The purpose of this blog is to explore how YouTube can be improved by capitalizing on the latest groundbreaking advancements in LLMs and to create a video summarizer using Whisper from OpenAI and BART from Meta.

How I would improve YouTube

As a product manager, it is key for me to identify challenge areas in a product and strategically devise solutions that not only address these issues but also align with the larger product vision and goals. The product in question here is YouTube.

User groups

I don't think I need to define what YouTube is. So I can skip this part and get to defining the key user segments of YouTube. Broadly speaking, they fall into two categories:

  1. Creators

  2. Consumers

For the purpose of this blog, I will focus on the consumers. Without consumers, creators lose their value. Therefore, it is key to enhance the consumer's experience at every step of the way.

Potential pain points / challenges

  • Misleading thumbnails and titles: Clickbait is extremely frustrating for viewers seeking relevant content.
  • Inappropriate content: Despite YouTube's content moderation efforts, viewers may still come across offensive, harmful, or otherwise inappropriate videos.
  • Content duplication: Different creators upload the same video repeatedly, which clutters the search results and makes it difficult to find the original.
  • Limited content discovery: If YouTube's recommendation algorithm doesn't align with the viewer's interests, it can be very difficult to discover new and relevant content.
  • Limited content accessibility: Imagine not being able to understand a video just because you don't know the language it is in. Most of the content by top channels on YouTube is in English.

Ideas that may help improve the UX for consumers

  • Natural language interaction with the video: Viewers can read a text summary before committing to a video. Let's say the video is a 50-minute-long panel discussion. A summary saves the viewer a lot of time and helps them make a more informed choice about the content they consume. The user can also ask questions and get answers in text without necessarily watching the video.

  • Detect and flag inappropriate + duplicate content

  • Enhanced recommendation algorithms

  • Multilingual support and accessibility: Viewers can watch the video in any language they like, in the same voice as the narrator. This will extend the reach of every video by bringing down language barriers.

Time to create a video summarizer

Personally, I find clickbait annoying, and before committing to a long video I'd sometimes like to judge it based on a text summary. At the moment, I rely on the comments on the video. However, there are instances when there are not many helpful comments, or commenting has been turned off.

I will address this pain point by using LLMs to generate a text summary of any YouTube video I give as input.

In a nutshell...

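The app downloads the audio track of the video with pytube, transcribes it with Whisper, and sends the transcript to BART for summarization.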

Web interface


The user enters the link to the YouTube video and clicks the 'Summarize Video' button.

Let me enter this YouTube video and see what comes out.

Here's the output:


Summary:

"The original iPod was a pioneer in terms of the simple design and easy-to-use functionality. Apple credited the idea of the iPod to an obscure man by the name of Ken Kramer. The rise of the MP3 file format caused a lot of commotion in the music industry."

Quite good, no?

Here's the Python script I'm using to do this magic:

import pytube
import requests
import os
from dotenv import find_dotenv, load_dotenv
import openai
import streamlit as st

st.title('YouTube Video Summarization')

load_dotenv(find_dotenv())
openai.api_key = os.getenv("OPENAI_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACE_API_TOKEN")

# Extract audio from YouTube
def get_audio(video_url):
    # Create a PyTube object for the video.
    youtube_video = pytube.YouTube(video_url)

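    # Get the audio-only streams and pick the first one.
    audio_stream = youtube_video.streams.filter(only_audio=True).first()

    # Show the video title (available directly on the YouTube object).
    st.write("Now summarizing: ", youtube_video.title)

    # Download the audio stream to a file.
    # pytube does not convert formats, so this is MP4/WebM audio saved under an .mp3 name.
    audio_stream.download(output_path="audios", filename="audio.mp3")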

# Audio to text
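def get_text(filename):
    # Open the file in a context manager so it is closed after the upload.
    with open(filename, "rb") as audio_file:
        # openai.Audio.transcribe is the pre-1.0 interface of the openai package.
        transcript = openai.Audio.transcribe("whisper-1", audio_file)
    return transcript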

# Summarize
def summarize(transcript):
    API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
    headers = {"Authorization": f"Bearer {HUGGINGFACEHUB_API_TOKEN}"}

    payload = {
        "inputs": transcript
    }

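    response = requests.post(API_URL, headers=headers, json=payload, timeout=120)
    # Fail loudly on HTTP errors instead of passing an error payload downstream.
    response.raise_for_status()
    return response.json()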


# Use a placeholder so the hint text is never submitted as if it were a URL.
youtube_link = st.text_input('YouTube Video:', placeholder='Enter URL here')

def process_input(youtube_link):
    get_audio(youtube_link)
    video_text = get_text("audios/audio.mp3")
    video_text = video_text.text
    summary = summarize(video_text)
    # The API returns a list like [{"summary_text": "..."}]; return just the text.
    return summary[0]["summary_text"]

# Create a button to trigger the function
if st.button("Summarize Video"):
    result = process_input(youtube_link)
    st.write("Summary", result)
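
One thing to watch for with long videos: the hosted Whisper API only accepts uploads up to 25 MB, and the audio track of something like a 50-minute panel discussion can easily be larger than that. Here is a minimal sketch of a check that could run before get_text. The helper name fits_whisper_limit and the constant WHISPER_LIMIT_BYTES are hypothetical and not part of the script above.

```python
import os

# Assumed upload cap for the hosted Whisper API (25 MB).
WHISPER_LIMIT_BYTES = 25 * 1024 * 1024


def fits_whisper_limit(path: str, limit_bytes: int = WHISPER_LIMIT_BYTES) -> bool:
    """Return True if the file at `path` is small enough to upload in one request."""
    return os.path.getsize(path) <= limit_bytes
```

If the check fails, the app could show a warning instead of calling the API, or the audio could be split into smaller parts first.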
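
A caveat on the summarization step: facebook/bart-large-cnn only reads roughly the first 1,024 tokens of its input, so for a long transcript the summary will likely cover just the beginning of the video (or the request may be rejected, depending on how the endpoint handles over-long input). One common workaround is to split the transcript into chunks, summarize each chunk, and join the results. The sketch below assumes a 400-word chunk size as a rough stand-in for the token budget; chunk_words, summarize_long, and summarize_fn are hypothetical names, where summarize_fn is any function that takes text and returns a summary string.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(
    text: str,
    summarize_fn: Callable[[str], str],
    max_words: int = 400,
) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    partial_summaries = [summarize_fn(chunk) for chunk in chunk_words(text, max_words)]
    return " ".join(partial_summaries)
```

With the script above, summarize_fn could be a small wrapper around summarize that pulls the summary text out of the API response. Splitting on word count can cut a sentence in half at a chunk boundary, so splitting on sentence boundaries would be a natural refinement.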

Circling back

Injecting LLMs into YouTube to enhance the UX and revolutionize the way we engage with videos on literally any platform is beyond awesome.

By providing video summaries, improving search and discovery, and enhancing accessibility and convenience, YouTube can become an even more user-friendly and informative platform.

As a product manager, I believe that exploring these possibilities and addressing the challenges will lead to a more enriched YouTube experience for everyone. It's time to harness the power of LLMs and take YouTube to the next level of user satisfaction and engagement.
