Intro
Most of us have probably already watched 1000+ videos on YouTube and hopefully gained some knowledge from them. In this series I would like to explore the possibilities of extracting data from my YouTube history and ending up with a personal YT query engine, where through Gemini and embeddings I could converse with all the videos I have watched.
Getting data
You can export your YouTube history either with the Google Data Portability API, if you want to download your data more often and more conveniently, or manually using Google Takeout. The formatting should be the same in both cases if you select JSON.
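To give an idea of what we are working with, here is a minimal sketch of pulling the video IDs out of the export. It assumes the usual watch-history.json layout where each entry has a titleUrl pointing at the watched video; your export may differ slightly:

import json
from urllib.parse import urlparse, parse_qs

# Load the watch history exported via Google Takeout (JSON format).
with open("watch-history.json", encoding="utf-8") as f:
    history = json.load(f)

video_ids = []
for entry in history:
    # Each entry usually has a titleUrl like https://www.youtube.com/watch?v=VIDEO_ID
    url = entry.get("titleUrl")
    if not url:
        continue  # some entries (e.g. removed videos) have no URL
    query = parse_qs(urlparse(url).query)
    if "v" in query:
        video_ids.append(query["v"][0])

print(f"Found {len(video_ids)} watched videos")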
Subtitles
Since downloading all the videos and transcribing them with better models (as far as I know, YouTube doesn't recreate the subtitles with updated STT models) would probably be against the ToS, we have to use the subtitles that YouTube provides as the content of each video. YouTube has an official API that we can use to download captions for our use case.
Our data export contains YouTube video IDs, so we can use the list endpoint to get all subtitles available for a video. Sometimes manually uploaded subtitles are available, which is a better option than the STT machine transcription, as they separate sentences and speakers better, which yields improved results.
Example request:
curl \
'https://youtube.googleapis.com/youtube/v3/captions?part=id%2Csnippet&videoId=yWBzsBaU-Os&key=[YOUR_API_KEY]' \
--header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
--header 'Accept: application/json' \
--compressed
Example response:
{
  "kind": "youtube#captionListResponse",
  "etag": "09LrB4i7CoUwNzOLNJE6CCffKPU",
  "items": [
    {
      "kind": "youtube#caption",
      "etag": "oO2GJpAhG7JcHcqx3d2xYnMumA8",
      "id": "AUieDabcbEXa0PfMheqVCKM2A_H-JMA8hrRDlOWDlhTfZnBcA7k",
      "snippet": {
        "videoId": "yWBzsBaU-Os",
        "lastUpdated": "2024-02-25T01:43:01.37359Z",
        "trackKind": "asr",
        "language": "en",
        "name": "",
        "audioTrackType": "unknown",
        "isCC": false,
        "isLarge": false,
        "isEasyReader": false,
        "isDraft": false,
        "isAutoSynced": false,
        "status": "serving"
      }
    }
  ]
}
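Based on the trackKind field we can prefer a manually uploaded track over the asr one when both are present. A minimal sketch, assuming items is the items array from the parsed list response above:

def pick_caption_track(items):
    # Prefer manually created tracks over the machine transcription ("asr").
    manual = [item for item in items if item["snippet"]["trackKind"].lower() != "asr"]
    preferred = manual or items
    return preferred[0]["id"] if preferred else None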
Once we get the caption ID, we can use the Download endpoint to get the subtitle file.
curl \
'https://youtube.googleapis.com/youtube/v3/captions/AUieDabcbEXa0PfMheqVCKM2A_H-JMA8hrRDlOWDlhTfZnBcA7k?key=[YOUR_API_KEY]' \
--header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
--header 'Accept: application/json' \
--compressed
We repeat the process for all the videos and store the results.
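A rough sketch of that loop, assuming access_token is a valid OAuth token, video_ids comes from the export parsed earlier, and pick_caption_track is the helper from above:

import requests

CAPTIONS_API = "https://youtube.googleapis.com/youtube/v3/captions"
headers = {"Authorization": f"Bearer {access_token}", "Accept": "application/json"}

subtitles = {}
for video_id in video_ids:
    # List the caption tracks available for this video.
    listing = requests.get(
        CAPTIONS_API,
        headers=headers,
        params={"part": "id,snippet", "videoId": video_id},
    ).json()
    caption_id = pick_caption_track(listing.get("items", []))
    if caption_id is None:
        continue
    # Download the chosen caption track and keep it for later processing.
    track = requests.get(f"{CAPTIONS_API}/{caption_id}", headers=headers)
    subtitles[video_id] = track.text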
Summarization
We can use Gemini to create a summary from the title and subtitles, which we can then store alongside the embeddings and the subtitles themselves. Let's do it!
Google Gemini via Vertex AI
Gemini can run in a few different environments; today we will use Gemini via Google Vertex AI, as it's easily accessible even in Europe. If you are interested in learning about the differences between the versions of Gemini, check out this article by fellow GDE Allen Firstenberg.
Prompt that we will use:
I will give you a video transcription, please return a 2-4 sentence summary of the text and few tags that represent the video well in JSON format with fields summary and tags as array of strings.
Title: ${VIDEO_TITLE}
Transcription: ${TRANSCRIPTION}
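Filling in the placeholders can be as simple as this hypothetical helper (not part of any SDK):

def build_prompt(video_title: str, transcription: str) -> str:
    # Fill the prompt template with the video title and its subtitles.
    return (
        "I will give you a video transcription, please return a 2-4 sentence summary "
        "of the text and few tags that represent the video well in JSON format with "
        "fields summary and tags as array of strings.\n"
        f"Title: {video_title}\n"
        f"Transcription: {transcription}"
    )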
The getting started guide for Gemini describes the details pretty well; building on it, let's put together an example request:
Note: The curl command requires a GCP project and gcloud installed. Follow the starter guide if you need to set those up.
curl --request POST \
--url https://us-central1-aiplatform.googleapis.com/v1/projects/{YOUR_GOOGLE_PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-1.0-pro:streamGenerateContent \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header 'Content-Type: application/json; charset=utf-8' \
--data '{
  "contents": {
    "role": "user",
    "parts": {
      "text": "I will give you a video transcription, please return a 2-4 sentence summary of the text and few tags that represent the video well in JSON format with fields summary and tags as array of strings. \nTitle: Turning Old Sawmill Blades into Knives | How It’s Made | Science Channel\nTranscription: a sawmill blade lasts 5 to 10 years but when its Jagged edges wear thin it can'\''t cut logs anymore luckily the carbon steel can be salvaged a computerized high-press water tool cuts into a stack of three Sawmill blades in doing so it carves out numerous knife blade blanks the computerized tool also Cuts holes for the handles and a tab for attaching the knife sheath next the blades are transferred to a vibratory tumbler the tumbler is filled with triangular ceramic stones and a soapy solution for several hours the vibrating Stones smooth and clean the blades once complete a technician secures the blades in a fixture with screws the screws hold the blades down flat and lock them in position for the next step"
    }
  },
  "generation_config": {
    "temperature": 0.2,
    "topP": 0.8,
    "topK": 40
  }
}'
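If you prefer Python over curl, the Vertex AI SDK offers roughly the same call. A minimal sketch, assuming the google-cloud-aiplatform package (which provides the vertexai module) is installed and video_title and transcription are already loaded:

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="YOUR_GOOGLE_PROJECT_ID", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")

response = model.generate_content(
    build_prompt(video_title, transcription),  # prompt helper from the sketch above
    generation_config={"temperature": 0.2, "top_p": 0.8, "top_k": 40},
)
print(response.text)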
After merging the array of streamed responses, we get a result like this:
{
  "summary": "Old sawmill blades are repurposed into knives by cutting them into blanks using a high-press water tool. The blanks are then tumbled in a vibratory tumbler with ceramic stones to smooth and clean them. Finally, the blades are secured in a fixture and screws are used to hold them in place for the next step in the knife-making process.",
  "tags": [
    "Sawmill blades",
    "Knives",
    "High-press water tool",
    "Vibratory tumbler",
    "Ceramic stones",
    "Fixture",
    "Screws"
  ]
}
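For completeness, merging the streamed chunks is mostly a matter of concatenating the text parts. A minimal sketch, assuming the raw array returned by the curl call was saved to response.json:

import json

# streamGenerateContent returns a JSON array of partial responses.
with open("response.json", encoding="utf-8") as f:
    chunks = json.load(f)

merged_text = "".join(
    part["text"]
    for chunk in chunks
    for candidate in chunk.get("candidates", [])
    for part in candidate.get("content", {}).get("parts", [])
)

# Assumes the model returned plain JSON, as in the example above.
result = json.loads(merged_text)
print(result["summary"], result["tags"])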
Here we can see that Gemini managed to parse the raw transcription quite well, even though it didn't even have separated sentences. We can save the summary and tags for further processing.
Watch out when running this on a large number of videos or on longer videos (15 min+), as it can consume tokens very quickly. I have used this YT Shorts video as it fits nicely into the article.
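If you want a feel for how many tokens a transcription will cost before sending it, the countTokens method of the same model can be called first. A rough sketch with requests, assuming access_token holds the output of gcloud auth print-access-token and build_prompt, video_title and transcription come from the sketches above:

import requests

project_id = "YOUR_GOOGLE_PROJECT_ID"
url = (
    f"https://us-central1-aiplatform.googleapis.com/v1/projects/{project_id}"
    "/locations/us-central1/publishers/google/models/gemini-1.0-pro:countTokens"
)
body = {
    "contents": [
        {"role": "user", "parts": [{"text": build_prompt(video_title, transcription)}]}
    ]
}
resp = requests.post(url, headers={"Authorization": f"Bearer {access_token}"}, json=body)
print(resp.json())  # reports totalTokens and totalBillableCharacters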
Thank you for reading, and next time we will look at how we can use Google Gemini to create embeddings from the subtitles.
Disclaimer: Google Cloud credits are provided for this project #GeminiSprint