Matt Grofsky

Posted on Nov 9, 2020

Analyze Your Call Recordings With Google AI

#machinelearning #googlecloud

For most companies, the story usually goes like this.

A customer calls in to complain, praise, or ask for assistance.
The call is recorded for further training or evaluation.
The recording is typically picked at random, listened to by someone, and reviewed with the customer service representative.

This process can take anywhere from an hour to a week after a customer hangs up. During this time, a lot can go wrong. Compliance issues and poor service could leave you with some unhappy customers.

I’ll show you how to work smarter, not harder, and identify problems as soon as they occur. What most developers don’t realize is that the intricate pieces are already pre-built inside Google Cloud Platform.

There are three essential items you will want to look for when evaluating a call.

Identity: Separate the individuals on the call distinctly.
Sentiment: Are these individuals generally positive or negative in the interaction?
Trigger Words: Were any words or phrases said that warrant further review?

Let’s complicate this a bit and evaluate single-channel audio phone calls. That means we are dealing not only with phone-quality audio, but also with audio in which both callers are mixed into a single channel. A single channel makes it much harder to distinguish who is talking and when.

A Google Cloud Function is the easiest way to trigger code execution at scale when a file is uploaded to Cloud Storage. Setting up a Cloud Function for this purpose is straightforward.

Let’s first start with the requirements.txt file and imports.

Requirements.txt

google-cloud-speech==1.3.2
google-cloud-storage==1.27.0
requests

Imports

In this example, I will be using diarization to distinguish and separate the audio between the two callers. Diarization is:

The process of partitioning an input audio stream into homogeneous segments according to the speaker identity

This process requires the Cloud Speech beta module, speech_v1p1beta1.

import os
import requests
import json
import sys
import time
import uuid
from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1 import enums
from google.cloud import storage

Identifying the created file

As the Cloud Function is triggered by a google.storage.object.finalize event inside GCS, a dictionary with data specific to this type of event is sent.

Grabbing the path of the file is as easy as pulling file['name'] out of the [dictionary](https://cloud.google.com/functions/docs/calling/storage). Knowing all this information, we can build out a gs:// URI that can be used for various Google AI services.

BucketName = 'gcs-bucket'

def transcribe_audio(event, context):
    file = event
    now = time.time()
    FileName = file['name']
    storage_uri = 'gs://' + BucketName + '/' + FileName
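The bucket name does not have to be hard-coded, because the Cloud Storage event dictionary also carries a bucket key next to name. Below is a minimal sketch under that assumption; build_storage_uri is a hypothetical helper and the sample values are made up.

```python
def build_storage_uri(event):
    """Build a gs:// URI from a Cloud Storage event dictionary.

    Assumes the event carries 'bucket' and 'name' keys, as in
    google.storage.object.finalize payloads.
    """
    return "gs://{}/{}".format(event["bucket"], event["name"])


# Example event, trimmed to the two fields used here (values are made up)
sample_event = {"bucket": "gcs-bucket", "name": "calls/recording-001.mp3"}
print(build_storage_uri(sample_event))
```

Reading the bucket from the event keeps the function reusable if the same code is later deployed against a different bucket.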

Transcribing the Audio

Before transcribing the audio, I first want to make sure it is an actual audio file. In this example, I am only going to deal with mp3 audio. There are a tremendous number of configuration options to choose from, and I will highlight a few. First, the sample rate is essential and, more often than not, is 8000 Hz for phone recordings. Second, phone audio differs from other audio: Google has a separate machine learning model for phone calls (phone_call) that produces a better transcription overall. Finally, make sure to enable diarization and set the appropriate number of speakers on the call. If required, add speech contexts with specific proper nouns, business names, or phrases that can show up in conversation.

    # Process only mp3 files
    if storage_uri.endswith(".mp3"):
        client = speech_v1p1beta1.SpeechClient()

        # Sample rate in Hertz of the audio data sent
        sample_rate_hertz = 8000

        # The language of the supplied audio and the phone call model
        language_code = "en-US"
        model = "phone_call"

        # Encoding of audio data sent. This sample sets this explicitly.
        # This field is optional for FLAC and WAV audio formats.
        encoding = enums.RecognitionConfig.AudioEncoding.MP3
        config = {
            "sample_rate_hertz": sample_rate_hertz,
            "language_code": language_code,
            "encoding": encoding,
            "model": model,
            "use_enhanced": True,
            "enable_automatic_punctuation": True,
            "enable_speaker_diarization": True,
            "diarization_speaker_count": 2,
            "speech_contexts": [{
                "phrases": [
                    "Thank you for calling ABC",
                    "Thank you for contacting ABC",
                    "Welcome to ABC",
                    "ABC customer service",
                    "Thank you for calling ABC customer support."
                ]
            }]
        }
        audio = {"uri": storage_uri}

        operation = client.long_running_recognize(config, audio)

        # Block until the asynchronous recognition finishes
        response = operation.result()
        transcriptw = ""
        speaker = ""

        # Rebuild the transcript word by word, starting a new block
        # whenever the speaker changes
        for result in response.results:
            words_info = result.alternatives[0].words
            for word_info in words_info:
                # Skip words that carry no speaker tag (tag 0)
                if str(word_info.speaker_tag) != "0":
                    if str(word_info.speaker_tag) != str(speaker):
                        speaker = str(word_info.speaker_tag)
                        transcriptw = transcriptw + "\n-------\n*Speaker " + speaker + ":* " + word_info.word
                    else:
                        transcriptw = transcriptw + " " + word_info.word

        sendtrans = False
        keyword = "Empty Audio"
        print(transcriptw)

        if transcriptw.strip() == "":
            transcriptw = "*No Sound*"
            sendtrans = True
        else:
            trigger_words = ["bitcoin", "payment", "invoice", "bill", "utilities", "utility", "electricity", "credit card", "package", "testing", "kits", "financial", "supplies", "mask", "symptoms", "isolate", "oxygen", "ventilator", "social security", "government", "internal revenue", "covid", "world health", "national institute", "virus", "corona", "quarantine", "stimulus", "relief", "cdc", "disease", "pandemic", "epidemic", "sickness"]
            # Stop at the first trigger word found in the transcript
            for word in trigger_words:
                if word.lower() in transcriptw.lower():
                    keyword = word.lower()
                    sendtrans = True
                    break

        if sendtrans:
            print(f"Sending to Slack: {file['name']}.")
            filename = file['name']
            # Send the diarized transcript that was built above
            send_slack(transcriptw.strip(), filename, keyword)

For longer audio, such as entire phone conversations, the best practice is to use the client.long_running_recognize(config, audio) method, which performs asynchronous speech recognition.
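The speaker-grouping loop in transcribe_audio can also be pulled out into a pure function, which makes it easy to test without calling the Speech API. The sketch below assumes the words have already been reduced to (word, speaker_tag) pairs; format_diarized_words is a hypothetical helper, and unlike the loop above it does not emit a separator before the first speaker.

```python
def format_diarized_words(words):
    """Group (word, speaker_tag) pairs into speaker-labelled blocks.

    Words tagged 0 carry no speaker assignment and are skipped,
    mirroring the check in transcribe_audio.
    """
    blocks = []
    current_speaker = None
    for word, speaker_tag in words:
        if speaker_tag == 0:
            continue
        if speaker_tag != current_speaker:
            # Speaker changed: start a new labelled block
            current_speaker = speaker_tag
            blocks.append("*Speaker {}:* {}".format(speaker_tag, word))
        else:
            # Same speaker: keep appending to the current block
            blocks[-1] += " " + word
    return "\n-------\n".join(blocks)


# Made-up sample: one untagged word, then two speakers taking turns
sample = [("Hello", 0), ("Hello", 1), ("there", 1), ("Hi", 2), ("Thanks", 1)]
print(format_diarized_words(sample))
```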

After transcribing, I check the transcript for any keyword triggers and, if any match, send the transcription to Slack for immediate notification.
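The keyword check uses plain substring matching, so a trigger such as "bill" also fires on "billion". If that produces too many false alerts, one possible alternative is matching on word boundaries. This is a sketch of that swapped-in technique, not the approach used in the function above; find_trigger is a hypothetical helper.

```python
import re


def find_trigger(transcript, trigger_words):
    """Return the first trigger word or phrase found as a whole word, else None."""
    lowered = transcript.lower()
    for phrase in trigger_words:
        # \b anchors the match to word boundaries on both sides
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        if re.search(pattern, lowered):
            return phrase.lower()
    return None


triggers = ["bill", "credit card", "mask"]
print(find_trigger("I will pay one billion dollars", triggers))
print(find_trigger("Please read me your credit card number", triggers))
```

The trade-off is that whole-word matching no longer catches plurals such as "bills", which substring matching does, so the trigger list may need those forms added explicitly.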

Below is the Slack function.

def send_slack(transcript, filename, keyword):
    try:
        # Post the audio link and transcript to a Slack incoming webhook
        response = requests.post(
            url="https://hooks.slack.com/services/ABCDEFG/123456/ABC123",
            headers={
                "Content-Type": "application/json",
            },
            data=json.dumps({
                "text": "*Audio:* https://storage.cloud.google.com/" + BucketName + "/" + filename + "\n*Transcription:*\n" + transcript
            })
        )
        print('Response HTTP Status Code: {status_code}'.format(
            status_code=response.status_code))
        print('Response HTTP Response Body: {content}'.format(
            content=response.content))
    except requests.exceptions.RequestException:
        print('HTTP Request failed')
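Note that send_slack accepts a keyword argument but never uses it. If the matched trigger word should appear in the alert, the message body could be built in a separate helper, as in the sketch below. build_slack_payload and the *Keyword:* line are assumptions for illustration and not part of the original function.

```python
import json


def build_slack_payload(bucket_name, filename, transcript, keyword):
    """Return the JSON body for a Slack incoming webhook message."""
    text = (
        "*Audio:* https://storage.cloud.google.com/" + bucket_name + "/" + filename
        + "\n*Keyword:* " + keyword
        + "\n*Transcription:*\n" + transcript
    )
    return json.dumps({"text": text})


# Made-up sample values
payload = build_slack_payload("gcs-bucket", "call.mp3", "*Speaker 1:* hello", "bill")
print(payload)
```

The returned string could then be passed as the data argument of requests.post, and keeping it separate from the HTTP call makes the message format testable on its own.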

An open-source and simplified example of the above code is in one of Ytel’s public GitLab repositories.

Telecom companies quickly needed to identify and report certain types of scam-oriented communications when the COVID-19 outbreak started.
