Introduction
In one of my recent articles, I discussed Speech Recognition and how to implement it in Python. In today’s article we will go a step further and explore how to perform topic detection over audio and video files.
As an example, let’s consider podcasts, which are becoming more and more popular over time. Imagine how many podcasts are created on a daily basis. Wouldn’t it be useful for recommendation engines on different platforms (such as Spotify, YouTube or Apple Podcasts) to somehow categorise all these podcasts based on the content discussed?
Performing Speech-to-Text and Topic Detection with Python
In this tutorial, we will be using the AssemblyAI API in order to label topics that are spoken in audio and video files. Therefore, if you want to follow along, you first need to obtain an AssemblyAI access token (which is absolutely free) that we will be using when calling the API.
Now that we have an access token, let’s start by preparing the headers that we will be using when sending requests to the various AssemblyAI endpoints.
import requests

API_KEY = '<your AssemblyAI API key goes here>'

# Create the headers for the request
headers = {
    'authorization': API_KEY,
    'content-type': 'application/json'
}
Going forward, we then need to upload our audio (or video) file to the hosting service of AssemblyAI. The endpoint is going to return the URL of the uploaded file, which we will be using in subsequent requests.
AUDIO_FILE = '/path/to/your/audio/file.mp3'
UPLOAD_ENDPOINT = 'https://api.assemblyai.com/v2/upload'

def read_audio_file(file):
    """Helper method that reads in audio files in chunks"""
    with open(file, 'rb') as f:
        while True:
            data = f.read(5242880)  # 5,242,880 bytes = 5 MB per chunk
            if not data:
                break
            yield data

res_upload = requests.post(
    UPLOAD_ENDPOINT,
    headers=headers,
    data=read_audio_file(AUDIO_FILE)
)
upload_url = res_upload.json()['upload_url']
"""
Example response from AssemblyAI upload endpoint
pprint(res_upload.json())
{'upload_url': 'https://cdn.assemblyai.com/upload/b017e8c0-b31a-4d09-9dc2-8dee0ee0d3c8'}
"""
Now the next step is the most interesting part, where we will be performing Speech-to-Text over the uploaded audio file. All we need to pass in the POST request is the audio_url that we received from the previous step, along with the iab_categories parameter set to True. The latter is going to trigger topic detection over the text transcription. An example response from the TRANSCRIPT_ENDPOINT is also shown as a comment at the end of the code block below.
TRANSCRIPT_ENDPOINT = 'https://api.assemblyai.com/v2/transcript'

res_transcript = requests.post(
    TRANSCRIPT_ENDPOINT,
    headers=headers,
    json={
        'audio_url': upload_url,
        'iab_categories': True,
    },
)
res_transcript_json = res_transcript.json()
"""
Example response from transcript endpoint
print(res_transcript_json)
{
'id': 'w9w13r544-8459-4b06-8d7a-1e9accee4b61',
'language_model': 'assemblyai_default',
'acoustic_model': 'assemblyai_default',
'language_code': 'en_us',
'status': 'queued',
'audio_url': 'https://cdn.assemblyai.com/upload/4eb47686-249d-4d48-9b79-62aea715d735',
'text': None,
'words': None,
'utterances': None,
'confidence': None,
'audio_duration': None,
'punctuate': True,
'format_text': True,
'dual_channel': None,
'webhook_url': None,
'webhook_status_code': None,
'speed_boost': False,
'auto_highlights_result': None,
'auto_highlights': False,
'audio_start_from': None,
'audio_end_at': None,
'word_boost': [],
'boost_param': None,
'filter_profanity': False,
'redact_pii': False,
'redact_pii_audio': False,
'redact_pii_audio_quality': None,
'redact_pii_policies': None,
'redact_pii_sub': None,
'speaker_labels': False,
'content_safety': False,
'iab_categories': True,
'content_safety_labels': {},
'iab_categories_result': {}
}
"""
Now, in order to get the transcription result (along with the topic detection results) we need to make one more request. This is because the transcription is asynchronous: when a file is submitted for transcription, it will need some time until we can access the result (typically about 15–30% of the overall audio file duration).
Therefore, we need to make a few GET requests until we get a success (or failure) response, as illustrated below.
import sys
from time import sleep

# Poll the transcript endpoint every 10 seconds until the
# transcription either completes or fails.
status = ''
while status != 'completed':
    res_result = requests.get(
        # Build the URL with an f-string (os.path.join would
        # produce backslashes on Windows and break the URL).
        f"{TRANSCRIPT_ENDPOINT}/{res_transcript_json['id']}",
        headers=headers
    )
    status = res_result.json()['status']
    print(f'Status: {status}')

    if status == 'error':
        sys.exit('Audio file failed to process.')
    elif status != 'completed':
        sleep(10)
Finally, let’s write the received result into a text file so that it will be easier for us to inspect the output and interpret the response received from the transcription endpoint:
OUTPUT_TRANSCRIPT_FILE = 'speech-to-text-tutorial.txt'

with open(OUTPUT_TRANSCRIPT_FILE, 'w') as f:
    f.write(res_result.json()['text'])

print(f'Transcript file saved under {OUTPUT_TRANSCRIPT_FILE}')
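Since we are mostly interested in the topics, we can also persist the iab_categories_result portion of the response as JSON. Here is a minimal sketch (the file name topic-detection-results.json is just an illustrative choice):

import json

# Save the topic detection results next to the transcript so they
# can be inspected separately ('topic-detection-results.json' is a
# hypothetical file name).
with open('topic-detection-results.json', 'w') as f:
    json.dump(res_result.json()['iab_categories_result'], f, indent=4)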
Interpreting the response
An example response from the transcription endpoint is shown below:
{
    ...
    "id": "audio-transcription-id",
    "status": "completed",
    "text": "Ted Talks are recorded live at Ted Conference...",
    "iab_categories_result": {
        "status": "success",
        "results": [
            {
                "text": "Ted Talks are recorded live at Ted Conference...",
                "labels": [
                    {
                        "relevance": 0.02298230677843094,
                        "label": "Technology&Computing>Computing>ComputerSoftwareAndApplications>WebConferencing"
                    },
                    {
                        "relevance": 0.00561910355463624,
                        "label": "Education>OnlineEducation"
                    },
                    {
                        "relevance": 0.00465833256021142,
                        "label": "MusicAndAudio>TalkRadio"
                    },
                    {
                        "relevance": 0.002487020567059517,
                        "label": "Hobbies&Interests>ContentProduction"
                    },
                    {
                        "relevance": 0.0012438222765922546,
                        "label": "BusinessAndFinance>Business>ExecutiveLeadership&Management"
                    },
                    {
                        "relevance": 0.0010610689641907811,
                        "label": "Technology&Computing>Computing>Internet>SocialNetworking"
                    },
                    {
                        "relevance": 0.0008706427761353552,
                        "label": "Careers>RemoteWorking"
                    },
                    {
                        "relevance": 0.0005944414297118783,
                        "label": "Religion&Spirituality>Spirituality"
                    },
                    {
                        "relevance": 0.00039072768413461745,
                        "label": "Television>RealityTV"
                    },
                    {
                        "relevance": 0.00036419558455236256,
                        "label": "MusicAndAudio>TalkRadio>EducationalRadio"
                    }
                ],
                "timestamp": {
                    "start": 8630,
                    "end": 32990
                }
            },
            ...
        ],
        "summary": {
            "MedicalHealth>DiseasesAndConditions>BrainAndNervousSystemDisorders": 1.0,
            "FamilyAndRelationships>Dating": 0.7614801526069641,
            "Shopping>LotteriesAndScratchcards": 0.6330153346061707,
            "Hobbies&Interests>ArtsAndCrafts>Photography": 0.6305723786354065,
            "Style&Fashion>Beauty": 0.5269057750701904,
            "Education>EducationalAssessment": 0.49798518419265747,
            "BooksAndLiterature>ArtAndPhotographyBooks": 0.45763808488845825,
            "FamilyAndRelationships>Bereavement": 0.45646440982818604,
            "FineArt>FineArtPhotography": 0.3921416699886322,
            "NewsAndPolitics>Politics>Elections": 0.3911418318748474,
            "Technology&Computing>ConsumerElectronics>CamerasAndCamcorders": 0.37802764773368835,
            "Technology&Computing>ArtificialIntelligence": 0.3659703731536865,
            "PopCulture>CelebrityScandal": 0.30767935514450073,
            "FamilyAndRelationships": 0.30298155546188354,
            "Education>EducationalAssessment>StandardizedTesting": 0.2812648415565491,
            "Sports>Bodybuilding": 0.2398379147052765,
            "Education>HomeworkAndStudy": 0.20159155130386353,
            "Style&Fashion>BodyArt": 0.19066567718982697,
            "NewsAndPolitics>Politics>PoliticalIssues": 0.18915779888629913,
            "FamilyAndRelationships>SingleLife": 0.15354971587657928
        }
    }
}
The outer text key contains the result of the text transcription over the input audio file. But let’s focus more on the content of iab_categories_result, which contains the information relevant to the Topic Detection result. A short sketch after the list below shows how to consume these fields.

- status: Contains the status of the topic detection. Normally, this will be success. If for any reason the Topic Detection model has failed, the value will be unavailable.
- results: This key includes the list of topics that were detected over the input audio file, including the precise text that influenced the prediction and triggered the model to make this decision. Additionally, it includes some metadata about relevance and timestamps. We will discuss both below.
- results.text: This key includes the precise transcription text for the portion of audio that has been classified with a particular topic label.
- results.timestamp: This key indicates the starting and ending time (recorded in milliseconds) for where the results.text was spoken in the input audio file.
- results.labels: This is a list containing all the labels that were predicted by the Topic Detection model for the portion of text in results.text. The relevance key corresponds to a score that can take any value between 0 and 1.0 and indicates how relevant each predicted label is in relation to results.text.
- summary: For every unique label detected by the Topic Detection model in the results array, the summary key will include its relevancy across the entire length of the input audio file. For example, if the Science>Environment label is detected only once in a 60-minute audio file, the summary key will include a relatively low relevancy score for that label, since the entire transcription was not found to be consistently relevant to that topic label.
In order to see the full list of topic labels that the Topic Detection model is capable of predicting, make sure to check the relevant section in the official documentation.
Full Code
The full code used as part of this tutorial is shared below:
import sys
import requests
from time import sleep

API_KEY = '<your AssemblyAI API key goes here>'

AUDIO_FILE = '/path/to/your/audio/file.mp3'
UPLOAD_ENDPOINT = 'https://api.assemblyai.com/v2/upload'
TRANSCRIPT_ENDPOINT = 'https://api.assemblyai.com/v2/transcript'
OUTPUT_TRANSCRIPT_FILE = 'speech-to-text-tutorial.txt'

def read_audio_file(file):
    """Helper method that reads in audio files in chunks"""
    with open(file, 'rb') as f:
        while True:
            data = f.read(5242880)  # 5 MB per chunk
            if not data:
                break
            yield data

# Create the headers for the request
headers = {
    'authorization': API_KEY,
    'content-type': 'application/json'
}

# Upload the local audio file to AssemblyAI's hosting service
res_upload = requests.post(
    UPLOAD_ENDPOINT,
    headers=headers,
    data=read_audio_file(AUDIO_FILE)
)
upload_url = res_upload.json()['upload_url']

# Submit the uploaded file for transcription with topic detection enabled
res_transcript = requests.post(
    TRANSCRIPT_ENDPOINT,
    headers=headers,
    json={
        'audio_url': upload_url,
        'iab_categories': True,
    },
)
res_transcript_json = res_transcript.json()

# Poll the transcript endpoint until the transcription completes or fails
status = ''
while status != 'completed':
    res_result = requests.get(
        f"{TRANSCRIPT_ENDPOINT}/{res_transcript_json['id']}",
        headers=headers
    )
    status = res_result.json()['status']
    print(f'Status: {status}')

    if status == 'error':
        sys.exit('Audio file failed to process.')
    elif status != 'completed':
        sleep(10)

# Write the transcription text to a local file
with open(OUTPUT_TRANSCRIPT_FILE, 'w') as f:
    f.write(res_result.json()['text'])

print(f'Transcript file saved under {OUTPUT_TRANSCRIPT_FILE}')
Final Thoughts
In today’s article we explored how to perform Speech-to-Text and Topic Detection over the generated text transcription using Python and the AssemblyAI API. We went through a step-by-step guide and explained in detail how to use the various API endpoints in order to perform topic detection over audio and video files.
Cover Image Credits: Photo by Volodymyr Hryshchenko on Unsplash