Introduction
In one of my recent articles, I discussed Speech Recognition and how to implement it in Python. In today’s article we will go a step further and explore how to perform topic detection over audio and video files.
As an example, let’s consider podcasts, which are becoming more and more popular over time. Imagine how many podcasts are created on a daily basis. Wouldn’t it be useful for recommendation engines on different platforms (such as Spotify, YouTube or Apple Podcasts) to somehow categorise all these podcasts based on the content discussed?
Performing Speech-to-Text and Topic Detection with Python
In this tutorial, we will be using the AssemblyAI API in order to label topics that are spoken in audio and video files. Therefore, if you want to follow along, you first need to obtain an AssemblyAI access token (which is absolutely free) that we will be using when calling the API.
Now that we have an access token, let’s start by preparing the headers that we will be using when sending requests to the various AssemblyAI endpoints.
import requests

API_KEY = '<your AssemblyAI API key goes here>'

# Create the headers for the request
headers = {
    'authorization': API_KEY,
    'content-type': 'application/json'
}
Going forward, we then need to upload our audio (or video) file to the hosting service of AssemblyAI. The endpoint is going to return the URL of the uploaded file, which we will be using in subsequent requests.
AUDIO_FILE = '/path/to/your/audio/file.mp3'
UPLOAD_ENDPOINT = 'https://api.assemblyai.com/v2/upload'

def read_audio_file(file):
    """Helper method that reads in audio files in chunks"""
    with open(file, 'rb') as f:
        while True:
            data = f.read(5242880)  # 5,242,880 bytes = 5 MB per chunk
            if not data:
                break
            yield data

res_upload = requests.post(
    UPLOAD_ENDPOINT,
    headers=headers,
    data=read_audio_file(AUDIO_FILE)
)
upload_url = res_upload.json()['upload_url']
"""
Example response from AssemblyAI upload endpoint
pprint(res_upload.json())
{'upload_url': 'https://cdn.assemblyai.com/upload/b017e8c0-b31a-4d09-9dc2-8dee0ee0d3c8'}
"""
Now the next step is the most interesting part, where we will be performing Speech-to-Text over the uploaded audio file. All we need to pass in the POST request is the audio_url that we received from the previous step, along with the iab_categories parameter set to True. The latter is going to trigger topic detection over the text transcription. An example response from the TRANSCRIPT_ENDPOINT is also shown as a comment at the end of the code block below.
TRANSCRIPT_ENDPOINT = 'https://api.assemblyai.com/v2/transcript'

res_transcript = requests.post(
    TRANSCRIPT_ENDPOINT,
    headers=headers,
    json={
        'audio_url': upload_url,
        'iab_categories': True,
    },
)
res_transcript_json = res_transcript.json()
"""
Example response from transcript endpoint
print(res_transcript_json)
{
'id': 'w9w13r544-8459-4b06-8d7a-1e9accee4b61',
'language_model': 'assemblyai_default',
'acoustic_model': 'assemblyai_default',
'language_code': 'en_us',
'status': 'queued',
'audio_url': 'https://cdn.assemblyai.com/upload/4eb47686-249d-4d48-9b79-62aea715d735',
'text': None,
'words': None,
'utterances': None,
'confidence': None,
'audio_duration': None,
'punctuate': True,
'format_text': True,
'dual_channel': None,
'webhook_url': None,
'webhook_status_code': None,
'speed_boost': False,
'auto_highlights_result': None,
'auto_highlights': False,
'audio_start_from': None,
'audio_end_at': None,
'word_boost': [],
'boost_param': None,
'filter_profanity': False,
'redact_pii': False,
'redact_pii_audio': False,
'redact_pii_audio_quality': None,
'redact_pii_policies': None,
'redact_pii_sub': None,
'speaker_labels': False,
'content_safety': False,
'iab_categories': True,
'content_safety_labels': {},
'iab_categories_result': {}
}
"""
Now, in order to get the transcription result (along with the topic detection results) we need to make one more request. This is because the transcription is asynchronous: when a file is submitted for transcription, it will need some time until we can access the result (typically about 15–30% of the overall audio file duration).
Therefore, we need to make a few GET requests until we get a success (or failure) response, as illustrated below.
import sys
from time import sleep

# Poll the transcript endpoint every 10 seconds until the
# transcription either completes or fails.
status = ''
while status != 'completed':
    res_result = requests.get(
        # Build the URL with an f-string (os.path.join would
        # produce backslashes on Windows and break the URL).
        f"{TRANSCRIPT_ENDPOINT}/{res_transcript_json['id']}",
        headers=headers
    )
    status = res_result.json()['status']
    print(f'Status: {status}')

    if status == 'error':
        sys.exit('Audio file failed to process.')
    elif status != 'completed':
        sleep(10)
Finally, let’s write the received result into a text file so that it will be easier for us to inspect the output and interpret the response received from the transcription endpoint:
OUTPUT_TRANSCRIPT_FILE = 'speech-to-text-tutorial.txt'

with open(OUTPUT_TRANSCRIPT_FILE, 'w') as f:
    f.write(res_result.json()['text'])

print(f'Transcript file saved under {OUTPUT_TRANSCRIPT_FILE}')
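Since we are mostly interested in the topics, we can also persist the iab_categories_result portion of the response as JSON. Here is a minimal sketch (the file name topic-detection-results.json is just an illustrative choice):

import json

# Save the topic detection results next to the transcript so they
# can be inspected separately ('topic-detection-results.json' is a
# hypothetical file name).
with open('topic-detection-results.json', 'w') as f:
    json.dump(res_result.json()['iab_categories_result'], f, indent=4)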
Interpreting the response
An example response from the transcription endpoint is shown below:
{
    ...
    "id": "audio-transcription-id",
    "status": "completed",
    "text": "Ted Talks are recorded live at Ted Conference...",
    "iab_categories_result": {
        "status": "success",
        "results": [
            {
                "text": "Ted Talks are recorded live at Ted Conference...",
                "labels": [
                    {
                        "relevance": 0.02298230677843094,
                        "label": "Technology&Computing>Computing>ComputerSoftwareAndApplications>WebConferencing"
                    },
                    {
                        "relevance": 0.00561910355463624,
                        "label": "Education>OnlineEducation"
                    },
                    {
                        "relevance": 0.00465833256021142,
                        "label": "MusicAndAudio>TalkRadio"
                    },
                    {
                        "relevance": 0.002487020567059517,
                        "label": "Hobbies&Interests>ContentProduction"
                    },
                    {
                        "relevance": 0.0012438222765922546,
                        "label": "BusinessAndFinance>Business>ExecutiveLeadership&Management"
                    },
                    {
                        "relevance": 0.0010610689641907811,
                        "label": "Technology&Computing>Computing>Internet>SocialNetworking"
                    },
                    {
                        "relevance": 0.0008706427761353552,
                        "label": "Careers>RemoteWorking"
                    },
                    {
                        "relevance": 0.0005944414297118783,
                        "label": "Religion&Spirituality>Spirituality"
                    },
                    {
                        "relevance": 0.00039072768413461745,
                        "label": "Television>RealityTV"
                    },
                    {
                        "relevance": 0.00036419558455236256,
                        "label": "MusicAndAudio>TalkRadio>EducationalRadio"
                    }
                ],
                "timestamp": {
                    "start": 8630,
                    "end": 32990
                }
            },
            ...
        ],
        "summary": {
            "MedicalHealth>DiseasesAndConditions>BrainAndNervousSystemDisorders": 1.0,
            "FamilyAndRelationships>Dating": 0.7614801526069641,
            "Shopping>LotteriesAndScratchcards": 0.6330153346061707,
            "Hobbies&Interests>ArtsAndCrafts>Photography": 0.6305723786354065,
            "Style&Fashion>Beauty": 0.5269057750701904,
            "Education>EducationalAssessment": 0.49798518419265747,
            "BooksAndLiterature>ArtAndPhotographyBooks": 0.45763808488845825,
            "FamilyAndRelationships>Bereavement": 0.45646440982818604,
            "FineArt>FineArtPhotography": 0.3921416699886322,
            "NewsAndPolitics>Politics>Elections": 0.3911418318748474,
            "Technology&Computing>ConsumerElectronics>CamerasAndCamcorders": 0.37802764773368835,
            "Technology&Computing>ArtificialIntelligence": 0.3659703731536865,
            "PopCulture>CelebrityScandal": 0.30767935514450073,
            "FamilyAndRelationships": 0.30298155546188354,
            "Education>EducationalAssessment>StandardizedTesting": 0.2812648415565491,
            "Sports>Bodybuilding": 0.2398379147052765,
            "Education>HomeworkAndStudy": 0.20159155130386353,
            "Style&Fashion>BodyArt": 0.19066567718982697,
            "NewsAndPolitics>Politics>PoliticalIssues": 0.18915779888629913,
            "FamilyAndRelationships>SingleLife": 0.15354971587657928
        }
    }
}
The outer text key contains the result of the text transcription over the input audio file. But let’s focus more on the content of iab_categories_result, which contains the information relevant to the Topic Detection result. A short sketch after the list below shows how to consume these fields.

- status: Contains the status of the topic detection. Normally, this will be success. If for any reason the Topic Detection model has failed, the value will be unavailable.
- results: This key includes the list of topics that were detected over the input audio file, including the precise text that influenced the prediction and triggered the model to make this decision. Additionally, it includes some metadata about relevance and timestamps. We will discuss both below.
- results.text: This key includes the precise transcription text for the portion of audio that has been classified with a particular topic label.
- results.timestamp: This key indicates the starting and ending time (recorded in milliseconds) for where the results.text was spoken in the input audio file.
- results.labels: This is a list containing all the labels that were predicted by the Topic Detection model for the portion of text in results.text. The relevance key corresponds to a score that can take any value between 0 and 1.0 and indicates how relevant each predicted label is in relation to results.text.
- summary: For every unique label detected by the Topic Detection model in the results array, the summary key will include its relevancy across the entire length of the input audio file. For example, if the Science>Environment label is detected only once in a 60-minute audio file, the summary key will include a relatively low relevancy score for that label, since the entire transcription was not found to be consistently relevant to that topic label.
In order to see the full list of topic labels that the Topic Detection model is capable of predicting, make sure to check the relevant section in the official documentation.
Full Code
The full code used as part of this tutorial is shared below:
import sys
import requests
from time import sleep

API_KEY = '<your AssemblyAI API key goes here>'

AUDIO_FILE = '/path/to/your/audio/file.mp3'
UPLOAD_ENDPOINT = 'https://api.assemblyai.com/v2/upload'
TRANSCRIPT_ENDPOINT = 'https://api.assemblyai.com/v2/transcript'
OUTPUT_TRANSCRIPT_FILE = 'speech-to-text-tutorial.txt'

def read_audio_file(file):
    """Helper method that reads in audio files in chunks"""
    with open(file, 'rb') as f:
        while True:
            data = f.read(5242880)  # 5 MB per chunk
            if not data:
                break
            yield data

# Create the headers for the request
headers = {
    'authorization': API_KEY,
    'content-type': 'application/json'
}

# Upload the local audio file to AssemblyAI's hosting service
res_upload = requests.post(
    UPLOAD_ENDPOINT,
    headers=headers,
    data=read_audio_file(AUDIO_FILE)
)
upload_url = res_upload.json()['upload_url']

# Submit the uploaded file for transcription with topic detection enabled
res_transcript = requests.post(
    TRANSCRIPT_ENDPOINT,
    headers=headers,
    json={
        'audio_url': upload_url,
        'iab_categories': True,
    },
)
res_transcript_json = res_transcript.json()

# Poll the transcript endpoint until the transcription completes or fails
status = ''
while status != 'completed':
    res_result = requests.get(
        f"{TRANSCRIPT_ENDPOINT}/{res_transcript_json['id']}",
        headers=headers
    )
    status = res_result.json()['status']
    print(f'Status: {status}')

    if status == 'error':
        sys.exit('Audio file failed to process.')
    elif status != 'completed':
        sleep(10)

# Write the transcription text to a local file
with open(OUTPUT_TRANSCRIPT_FILE, 'w') as f:
    f.write(res_result.json()['text'])

print(f'Transcript file saved under {OUTPUT_TRANSCRIPT_FILE}')
Final Thoughts
In today’s article we explored how to perform Speech-to-Text and Topic Detection over the generated text transcription using Python and the AssemblyAI API. We went through a step-by-step guide and explained in detail how to use the various API endpoints in order to perform topic detection over audio and video files.
Cover Image Credits: Photo by Volodymyr Hryshchenko on Unsplash