Mahmoud Harmouch

Let's Make Python Listen - Part 1.

Hello, fellow human being. In this series of articles, we are going to unravel the mysterious world of speech recognition systems and put Deepgram's services to use along the way. The subject is a timely one: voice assistants such as Amazon's Alexa, Google's Assistant, and Apple's Siri are competing to become the dominant smart speaker, and they all rely on different types of deep neural networks (feedforward and feedback networks). Deep neural networks, in their modern form, were introduced in 2006 [0] by the godfather himself: Geoffrey Hinton [1].


What is Speech?

๐Ÿ” Go To TOC.

The human voice is a physical phenomenon that we cannot see. The brain initiates speech by triggering the muscles of the mouth and throat to produce sound [3], and the shape of the back of the throat, together with its vibration, determines the speech sound that comes out [2]. For example, when someone says the word "Hello," they articulate it with their lips and tongue while their vocal cords vibrate and air passes between them. When a microphone picks up these sounds, it converts them into an electrical signal that can be transmitted over a wired or wireless connection to software on your computer, speakers, or a voice-recognition device.

What is Speech Recognition?

๐Ÿ” Go To TOC.

Speech recognition is the process of converting spoken words into text. In some cases, it can be used in conjunction with other technologies to provide computer input or replace the keyboard and mouse. The technology has been around since the 1950s, but it has advanced significantly in recent years. Speech recognition relies on digital signal processing (DSP) techniques to process and analyze audio signals [4].

Speech recognition is often used as a stand-alone application or as part of a larger software package that includes other features, such as dictation. It allows the user to control a computer or other device by speaking. It also goes by a variety of other names, such as voice recognition, voice-to-text, and speech-to-text.

After this brief introduction, let's take a look at the exciting history of speech recognition, which, surprisingly enough, is roughly seven decades old, dating back to the 1950s, as mentioned above.

Speech Recognition History

๐Ÿ” Go To TOC.


Bell-Laboratories-invented-Audrey [5]

In the early 1950s, scientists at Bell Laboratories created Audrey (Automatic Digit Recognizer), a machine with three main components:

  • A microphone that captures human speech.
  • A piece of hardware that was programmed to do the actual transcription.
  • A display that shows the number being spoken into the microphone (right-hand side of the image).

As the name suggests, this machine could recognize spoken digits (0-9).


The Shoebox [6]

In 1962, IBM released a device called Shoebox [7] that recognized spoken words. It could recognize the ten digits and six arithmetic command words (e.g., plus, minus, etc.). For example, if someone said "2 plus 2" into the microphone, Shoebox would trigger an addition function to calculate and display the result.

These technologies worked by transforming voice signals into electrical impulses and then splitting each word into small phonetic units. For example, the word "hello" would be divided into something like 'he l oh'.

In the 1970s, the US Department of Defense stepped in to financially support research. DARPA (Defense Advanced Research Projects Agency, the same agency that was allegedly exposed for facilitating biological experiments related to SARS-CoV-2 [8]) funded one of the most significant speech recognition projects of the time. The resulting system could recognize more than a thousand words.

In 1982, the SAM synthesizer [9] became the first commercial speech synthesis software, giving a voice to the Commodore 64 computer.

A significant milestone was achieved in the late 1980s with the introduction of statistical models (e.g., the hidden Markov model), which could recognize approximately five thousand words.


A hidden Markov model for speech recognition. [10]

It works by assigning each letter to a node, while each edge carries the probability of moving on to the following letter in the word. As you can see in the example below, the term 'potato' can be pronounced in various ways, such as 'p oh t ah t oh', 'p ah t ay t oh', and others.


Speech Recognition and Statistical Modeling. [11]
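
To make the idea of nodes and weighted edges concrete, here is a minimal sketch of a pronunciation graph for 'potato'. It is a plain Markov chain rather than a full hidden Markov model, the state names (oh1, t1, and so on) are hypothetical labels that keep repeated sounds apart, and every probability is invented purely for illustration.

```python
# Toy pronunciation graph for "potato".
# Each key is a node (a sound unit); each inner dict holds the outgoing edges
# and their transition probabilities. All numbers are made up for illustration.
TRANSITIONS = {
    "start": {"p": 1.0},
    "p": {"oh1": 0.6, "ah1": 0.4},
    "oh1": {"t1": 1.0},
    "ah1": {"t1": 1.0},
    "t1": {"ah2": 0.3, "ay2": 0.7},
    "ah2": {"t2": 1.0},
    "ay2": {"t2": 1.0},
    "t2": {"oh2": 1.0},
}


def path_probability(path):
    """Multiply the edge probabilities along a path that begins at 'start'."""
    probability = 1.0
    state = "start"
    for next_state in path:
        # An edge that does not exist in the graph contributes probability 0.
        probability *= TRANSITIONS.get(state, {}).get(next_state, 0.0)
        state = next_state
    return probability


p_oh_t_ah_t_oh = path_probability(["p", "oh1", "t1", "ah2", "t2", "oh2"])
p_ah_t_ay_t_oh = path_probability(["p", "ah1", "t1", "ay2", "t2", "oh2"])
print(p_oh_t_ah_t_oh, p_ah_t_ay_t_oh)
```

A real hidden Markov model goes one step further: each node also carries a probability of producing the acoustic features that were actually observed, and the recognizer searches for the most likely path given the audio.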

The downside of these early systems is that they only recognized discrete speech, so you could not speak naturally; unfortunately, you needed to pause between words.

In the 1990s, the first commercial product became available to the masses when Dragon launched Dragon Dictate, which was capable of recognizing approximately 60k words.

In the 2000s, Google released its voice search app for the iPhone [12]. The app processed voice requests in Google's cloud data centers, matching them against a large pool of human-speech recordings and learning from queries collected from users (230 billion words), with models trained using the neural networks that were introduced in 2006, as mentioned at the beginning of the article.

I think that is enough history for today; it will presumably be continued in future posts about speech recognition. Now let's move on to the next section and explore Deepgram's transcription services.

What is Deepgram?

๐Ÿ” Go To TOC.

Deepgram is a promising new AI-powered transcription tool that uses deep learning to transcribe audio recordings by detecting the words and phrases that occur within them. In simple terms, it is a speech recognition service that takes recordings and converts them into text. But it is much more than that.

Deepgram has many use cases. For example, it can be used as a transcription service for meetings and phone calls, as a speech-to-text service for videos, or as an automated transcript generator for audio files. Detailed information is available on their website [13].

Deepgram's Unique Features

๐Ÿ” Go To TOC.


High Accuracy for Better Speech Analysis. [14]

Deepgram has been shown [15] to offer significantly higher accuracy rates (90%+ accuracy) than other transcription systems out there. In addition, it provides a much higher transcription speed than other systems (3 seconds to transcribe hour-long recordings) and lower costs ($0.78/hour), which makes it an attractive option for businesses that need to transcribe large quantities of content regularly.

At the time of writing, the service can identify and transcribe audio across 16 languages [16], covering a large variety of accents and dialects.

The cool part about Deepgram is that it offers a free trial that anyone can use. Moreover, Deepgram provides open-source SDKs and free speech recognition tools that can be integrated into any application or system.

With the help of Deepgram, we don't have to reinvent the wheel and build a machine learning model from the bottom up (that would be a fantastic project to work on in the future). Instead, we will use the Python SDK, which allows us to interact with the various Deepgram API endpoints that use state-of-the-art machine learning models to perform speech transcription.

In essence, Deepgram's transcription services are easy to use, accurate, and fast. They can help you save time, money, and resources while still providing high-quality content.

Now, let's jump into the technical stuff.

Speech Recognition from a Live Microphone

๐Ÿ” Go To TOC.

In this section, we will learn how to convert real-time speech into human-readable text. To accomplish this, we will use the deepgram-sdk package along with the PyAudio package.

Install deepgram-sdk, pyaudio

๐Ÿ” Go To TOC.

Python has a handy built-in module called wave, but it only reads and writes WAV files; it does not support recording. To record audio data, we can turn to a third-party package called PyAudio. The official website is a good starting point on how to install and use this library on various platforms.

However, PyAudio depends on another library called PortAudio, which is not installed by default on most Linux distributions. To install it on a Debian-based system, issue the following command in your terminal:

$ sudo apt-get install portaudio19-dev

If the above command runs successfully, you can download and install PyAudio on your system. Because we previously used Poetry instead of pip for dependency management, we can run the following command to add PyAudio to our project:

$ poetry add pyaudio

If the installation was successful, you can look up the PortAudio version by running:

$ python3 -c 'import pyaudio as p; print(p.get_portaudio_version())'
1246720

To install the Deepgram SDK on your machine, you can follow the instructions in its GitHub repo. Likewise, to add it to our project with Poetry, simply run:

$ poetry add deepgram-sdk

If the installation was successful, you can look up the SDK version by running:

$ python3 -c 'import deepgram; print(deepgram._version.__version__)'
0.2.5
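
The one-liners above reach into each package's own attributes. As an alternative, here is a minimal sketch that asks the standard library's importlib.metadata for the installed version instead. The helper name installed_version is hypothetical, and the distribution names "PyAudio" and "deepgram-sdk" are assumed to match what Poetry installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names; adjust them if your environment differs.
    for name in ("PyAudio", "deepgram-sdk"):
        print(name, installed_version(name) or "not installed")
```

This approach does not import the packages themselves, so it also works when a package fails to load, for example because PortAudio is missing.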

Now, it is time to play with these modules. To do so, make sure your microphone is on by default and not muted.

Input and Output Devices

๐Ÿ” Go To TOC.

Now, let's open up a REPL and test things out.

We will begin by importing the pyaudio module and then instantiating the PyAudio class.

>>> import pyaudio
>>> py_audio = pyaudio.PyAudio()

If you are on Linux, you may run into the following warnings:

ALSA lib pcm_dmix.c:1089:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_dmix.c:1089:(snd_pcm_dmix_open) unable to open slave

Let's ignore these warnings for now.

py_audio has a lot of valuable attributes that you can use to get information about your input and output devices.

>>> for attr in dir(py_audio):
...   if not attr.startswith("_"):
...     print(attr)
... 
close
get_default_host_api_info
get_default_input_device_info
get_default_output_device_info
get_device_count
get_device_info_by_host_api_device_index
get_device_info_by_index
get_format_from_width
get_host_api_count
get_host_api_info_by_index
get_host_api_info_by_type
get_sample_size
is_format_supported
open
terminate

For instance, to look up details about the default input device, you can call the following method:

>>> py_audio.get_default_input_device_info()
{
    'index': 9,
    'structVersion': 2,
    'name': 'default',
    'hostApi': 0,
    'maxInputChannels': 32,
    'maxOutputChannels': 32,
    'defaultLowInputLatency': 0.008684807256235827,
    'defaultLowOutputLatency': 0.008684807256235827,
    'defaultHighInputLatency': 0.034807256235827665,
    'defaultHighOutputLatency': 0.034807256235827665,
    'defaultSampleRate': 44100.0
}

Keep in mind the value of the defaultSampleRate key. We are going to use it when recording audio from the microphone.

Similarly, to get information about your default output device, you can call the following method:

>>> py_audio.get_default_output_device_info()
{
    'index': 9,
    'structVersion': 2,
    'name': 'default',
    'hostApi': 0,
    'maxInputChannels': 32,
    'maxOutputChannels': 32,
    'defaultLowInputLatency': 0.008684807256235827,
    'defaultLowOutputLatency': 0.008684807256235827,
    'defaultHighInputLatency': 0.034807256235827665,
    'defaultHighOutputLatency': 0.034807256235827665,
    'defaultSampleRate': 44100.0
}

If you want to check the details of every I/O device on your machine, you can execute the following code:

>>> for index in range(py_audio.get_device_count()):
...   device_info = py_audio.get_device_info_by_index(index)
...   for key, value in device_info.items():
...     print(key, value, sep=": ")
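
Since we only care about devices that can capture audio, it can help to filter that listing. The sketch below keeps the entries whose maxInputChannels value is greater than zero; the helper name input_devices and the sample entries are hypothetical, shaped like the dictionaries shown above so the snippet runs without any audio hardware.

```python
from typing import Any, Dict, Iterable, List


def input_devices(devices: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the devices that have at least one input channel."""
    return [d for d in devices if d.get("maxInputChannels", 0) > 0]


# Hypothetical entries shaped like PyAudio's device-info dictionaries.
sample_devices = [
    {"index": 0, "name": "HDMI output", "maxInputChannels": 0, "maxOutputChannels": 8},
    {"index": 1, "name": "USB microphone", "maxInputChannels": 1, "maxOutputChannels": 0},
    {"index": 9, "name": "default", "maxInputChannels": 32, "maxOutputChannels": 32},
]

for device in input_devices(sample_devices):
    print(device["index"], device["name"])
```

With real hardware, you would pass in the dictionaries returned by py_audio.get_device_info_by_index(index) for every index in range(py_audio.get_device_count()).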

Audio Recording & Wave Files

๐Ÿ” Go To TOC.

Experimentations

To record audio data from the microphone, you need to call the open method:

>>> from rich import inspect
>>> inspect(py_audio.open)
╭─ <bound method PyAudio.open of <pyaudio.PyAudio object at 0x7f6c8bed5180>> ─╮
│ def PyAudio.open(*args, **kwargs):                                          │
│                                                                             │
│ Open a new stream. See constructor for                                      │
│ :py:func:`Stream.__init__` for parameter details.                           │
│                                                                             │
│ 27 attribute(s) not shown. Run inspect(inspect) for options.                │
╰─────────────────────────────────────────────────────────────────────────────╯

We are going to use rich for proper message display. Now, let's create a stream object for recording purposes:

>>> # open a stream object for input only
>>> audio_stream = py_audio.open(
...     rate=44100,             # sampling rate: frames per second
...     channels=1,             # mono, change to 2 if you want stereo
...     format=pyaudio.paInt16, # sample format: 16-bit signed int (2 bytes per sample)
...     input=True,             # input device flag
...     output=False,           # output device flag; if True, you can play back the audio
...     frames_per_buffer=1024  # 1024 frames per buffer
... )

Now, you can take a look at the available attributes of this stream object.

>>> for attr in dir(audio_stream):
...   if not attr.startswith("_"):
...     print(attr)
... 
close
get_cpu_load
get_input_latency
get_output_latency
get_read_available
get_time
get_write_available
is_active
is_stopped
read
start_stream
stop_stream
write

The read and write functions are the most useful functions for this tutorial. We can call the read function to record audio samples in terms of frames.

>>> inspect(audio_stream.read)
╭─ <bound method Stream.read of <pyaudio.Stream object at 0x7f8310a41180>> ─╮
│ def Stream.read(num_frames, exception_on_overflow=True):                  │
│                                                                           │
│ Read samples from the stream.  Do not call when using                     │
│ *non-blocking* mode.                                                      │
│                                                                           │
│ 27 attribute(s) not shown. Run inspect(inspect) for options.              │
╰───────────────────────────────────────────────────────────────────────────╯

As you can see, the read method accepts a number of frames rather than a duration. We will read the audio in chunks of 1024 samples, which this tutorial loosely calls frames, so we need to convert a duration, the period of time we want to record, into a number of such chunks. The following formula will do the trick:

num_frames = int(rate / samples_per_frame * duration)

We can make sure that the above formula is correct using dimensional analysis. The units of measurement are:

  • rate: samples/second
  • samples_per_frame: samples/frame
  • duration: seconds

The value on the left-hand side of the equation, num_frames, should be expressed in frames, and that is indeed the case: (samples/second) / (samples/frame) * seconds = frames. Now we can iterate through all the frames and read 1024 samples per frame. The int function truncates the result, rounding it down to a whole number of frames.

>>> frames = []
>>> for _ in range(int(44100 / 1024 * 3)):
...   data = audio_stream.read(1024)
...   frames.append(data)
... 
>>> len(frames)
129
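
Because int drops the fractional part, the recording ends up slightly shorter than the requested duration. The following sketch wraps the formula in a function and computes the length that is actually captured; the names chunk_count and recorded_seconds are hypothetical helpers, not part of PyAudio.

```python
def chunk_count(rate: int, frames_per_buffer: int, duration: float) -> int:
    """Number of full buffers needed to cover `duration` seconds of audio."""
    return int(rate / frames_per_buffer * duration)


def recorded_seconds(rate: int, frames_per_buffer: int, chunks: int) -> float:
    """Actual length of the recording once the partial buffer is dropped."""
    return chunks * frames_per_buffer / rate


chunks = chunk_count(44100, 1024, 3)
print(chunks)                                  # same count as len(frames) above
print(recorded_seconds(44100, 1024, chunks))   # slightly under 3 seconds
```

If you need the full duration, one option is to round the chunk count up with math.ceil and accept a recording that is a few milliseconds too long instead.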

Each frame being added is a stream of bytes:

>>> type(frames[0])
<class 'bytes'>
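
Those bytes are simply the raw samples laid end to end. As a rough sketch, assuming mono 16-bit audio in the machine's native byte order, the standard array module can turn one chunk back into integers. The chunk below is a hypothetical stand-in filled with silence, so the snippet runs without a microphone.

```python
import array

# Hypothetical stand-in for one chunk returned by audio_stream.read(1024):
# 1024 mono samples of silence, 2 bytes each.
chunk = bytes(1024 * 2)

samples = array.array("h")  # "h" = signed 16-bit integers
samples.frombytes(chunk)

print(len(chunk))    # number of bytes in the chunk
print(len(samples))  # number of samples in the chunk
print(max(abs(sample) for sample in samples))  # peak amplitude, 0 for silence
```

With a real recording, replacing chunk with frames[0] gives a quick way to check that the microphone is picking up something other than silence.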

Now, let's store these frames in a WAV file to confirm that we indeed have 3 seconds' worth of recording. To do so, let's import the built-in wave module:

>>> import wave

Let's see what the available attributes of this module are:

>>> for attr in dir(wave):
...   if not attr.startswith("_"):
...     print(attr)
... 
Chunk
Error
WAVE_FORMAT_PCM
Wave_read
Wave_write
audioop
builtins
namedtuple
open
struct
sys

As you may guess, we are going to use the open function to open a file in write mode.

>>> wave_file = wave.open("sound.wav", "wb")

Similarly, let's see all the attributes for this object:

>>> for attr in dir(wave_file):
...   if not attr.startswith("_"):
...     print(attr)
... 
close
getcompname
getcomptype
getframerate
getmark
getmarkers
getnchannels
getnframes
getparams
getsampwidth
initfp
setcomptype
setframerate
setmark
setnchannels
setnframes
setparams
setsampwidth
tell
writeframes
writeframesraw

Since we are going to write into a file, we have to use either writeframes or writeframesraw. If you go to the official documentation, you will see that writeframes has more logic involved than writeframesraw: it also makes sure that the number of frames recorded in the file header is correct. Thus, we will use writeframes for this tutorial.

But first, we need to set some parameters for the wave_file object:

>>> wave_file.setnchannels(1)  # must match the channels value used when opening the stream
>>> wave_file.setsampwidth(py_audio.get_sample_size(pyaudio.paInt16))
>>> wave_file.setframerate(44100)

Now, everything is set up; you can write the stream of data into the file:

>>> wave_file.writeframes(b"".join(frames))
>>> wave_file.close()
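
To actually confirm the length of the recording, we can read the file back and divide the number of frames by the frame rate. The sketch below is self-contained: it writes 3 seconds of silence into an in-memory buffer instead of touching sound.wav, and wav_duration is a hypothetical helper name.

```python
import io
import wave


def wav_duration(source) -> float:
    """Return the length of a WAV file in seconds (frames / frame rate)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Demo data: 3 seconds of mono 16-bit silence written into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(44100)
    wav.writeframes(bytes(44100 * 3 * 2))

buffer.seek(0)  # rewind before reading the data back
print(wav_duration(buffer))
```

For the file recorded in this section, wav_duration("sound.wav") should report a value just under 3 seconds, provided the channel count written to the header matches the one used by the stream.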

Having experimented with the wave and pyaudio modules, let's put it all together.

Putting it All Together

๐Ÿ” Go To TOC.

There are two ways to bundle the previous code together: a functional programming approach or an object-oriented one.

Functional Programming
๐Ÿ” Go To TOC.
import wave
from typing import IO, Union

import pyaudio  # type: ignore

RATE = 44100  # sampling rate: frames per second
CHANNELS = 2  # stereo, change to 1 if you want mono
FRAMES_PER_BUFFER = 1024  # frames read per call to `read`
SAMPLE_FORMAT = pyaudio.paInt16  # 16-bit signed int: 2 bytes per sample


def init_recording(
    file_name: Union[str, IO[bytes]] = "sound.wav", mode: str = "wb"
) -> wave.Wave_write:
    """Open a WAV file whose parameters match the audio stream."""
    wave_file = wave.open(file_name, mode)
    wave_file.setnchannels(CHANNELS)
    wave_file.setsampwidth(2)  # 2 bytes = 16 bits, matches paInt16
    wave_file.setframerate(RATE)
    return wave_file


def record(wave_file: wave.Wave_write, duration: int = 3) -> None:
    """Record `duration` seconds from the default microphone into `wave_file`."""
    py_audio = pyaudio.PyAudio()
    audio_stream = py_audio.open(
        rate=RATE,
        channels=CHANNELS,
        format=SAMPLE_FORMAT,
        input=True,  # input device flag
        frames_per_buffer=FRAMES_PER_BUFFER,
    )
    frames = []
    # Use the `duration` argument instead of a hard-coded 3 seconds.
    for _ in range(int(RATE / FRAMES_PER_BUFFER * duration)):
        frames.append(audio_stream.read(FRAMES_PER_BUFFER))
    wave_file.writeframes(b"".join(frames))
    audio_stream.close()
    py_audio.terminate()  # release PortAudio resources


if __name__ == "__main__":
    wave_file = init_recording()
    record(wave_file)
    wave_file.close()
Object Oriented Approach
๐Ÿ” Go To TOC.

As described in the docstrings below, I assumed that each field of the AudioRecorder class is private by default and only accessible through getters and setters. In Python, it is not mandatory to use getters and setters, but I like this approach because I used to code in statically typed languages, mainly C# and Java.

Notice the use of the magic __attrs_post_init__ method, which sets the wave_file attribute at instantiation time, right after __init__ is called. I also used type hinting, as you can tell. In Python, you are not required to do any of this, but it is still an option. The __init__ method is automatically generated by the attrs module (notice the define decorator on the class and the field call on each attribute).

This snippet of code was adapted from the audio_record module of the deepwordle project.

import os
import wave
from os import PathLike
from typing import IO, List, Optional, TypeVar, Union

import pyaudio  # type: ignore
from attrs import define, field

WaveWrite = TypeVar("WaveWrite", bound=wave.Wave_write)
BASE_DIR = os.path.dirname(os.path.abspath(__file__))


@define
class AudioRecorder:
    """
    A brief encapsulation of an audio recorder object attributes and methods.
    All fields are assumed to be private by default, and only accessible through
    getters/setters, but someone still could hack his/her way around it!
    Attrs:
        frames_per_buffer: An integer indicating the number of frames per buffer;
            1024 frames/buffer by default.
        audio_format: An integer that represents the number of bits per sample
            stored as 16-bit signed int.
        channels: An integer indicating how many channels a microphone has.
        rate: An integer indicating how many samples per second: frequency.
        py_audio: pyaudio instance.
        data_stream: stream object to get data from microphone.
        wave_file: wave class instance.
        mode: file object mode.
        file_name: file name to store audio data in it.
    """

    _frames_per_buffer: int = field(init=True, default=1024)
    _audio_format: int = field(init=True, default=pyaudio.paInt16)
    _channels: int = field(init=True, default=1)
    _rate: int = field(init=True, default=44100)
    # factory: build a fresh PyAudio object per recorder instead of sharing one created at import time.
    _py_audio: pyaudio.PyAudio = field(init=False, factory=pyaudio.PyAudio)
    _data_stream: IO[bytes] = field(init=False, default=None)
    _wave_file: wave.Wave_write = field(init=False, default=None)
    _mode: str = field(init=True, default="wb")
    _file_name: Union[str, PathLike[str]] = field(init=True, default="sound.wav")

    @property
    def frames_per_buffer(self) -> int:
        """
        A getter method that returns the value of the `frames_per_buffer` attribute.
        :param self: Instance of the class.
        :return: An integer that represents the value of the `frames_per_buffer` attribute.
        """
        if not hasattr(self, "_frames_per_buffer"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named frames_per_buffer."
            )
        return self._frames_per_buffer

    @frames_per_buffer.setter
    def frames_per_buffer(self, value: int) -> None:
        """
        A setter method that changes the value of the `frames_per_buffer` attribute.
        :param value: An integer that represents the value of the `frames_per_buffer` attribute.
        :return: NoReturn.
        """
        setattr(self, "_frames_per_buffer", value)

    @property
    def audio_format(self) -> int:
        """
        A getter method that returns the value of the `audio_format` attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `audio_format` attribute.
        """
        if not hasattr(self, "_audio_format"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named audio_format."
            )
        return self._audio_format

    @audio_format.setter
    def audio_format(self, value: int) -> None:
        """
        A setter method that changes the value of the `audio_format` attribute.
        :param value: An integer that represents the value of the `audio_format` attribute.
        :return: NoReturn.
        """
        setattr(self, "_audio_format", value)

    @property
    def channels(self) -> int:
        """
        A getter method that returns the value of the `channels` attribute.
        :param self: Instance of the class.
        :return: An integer that represents the value of the `channels` attribute.
        """
        if not hasattr(self, "_channels"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named channels."
            )
        return self._channels

    @channels.setter
    def channels(self, value: int) -> None:
        """
        A setter method that changes the value of the `channels` attribute.
        :param value: An integer that represents the value of the `channels` attribute.
        :return: NoReturn.
        """
        setattr(self, "_channels", value)

    @property
    def rate(self) -> int:
        """
        A getter method that returns the value of the `rate`attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `rate` attribute.
        """
        if not hasattr(self, "_rate"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named rate."
            )
        return self._rate

    @rate.setter
    def rate(self, value: int) -> None:
        """
        A setter method that changes the value of the `rate` attribute.
        :param value: An integer that represents the value of the `rate` attribute.
        :return: NoReturn.
        """
        setattr(self, "_rate", value)

    @property
    def py_audio(self) -> pyaudio.PyAudio:
        """
        A getter method that returns the value of the `py_audio`attribute.
        :param self: Instance of the class.
        :return: A PyAudio object that represents the value of the `py_audio` attribute.
        """
        if not hasattr(self, "_py_audio"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named py_audio."
            )
        return self._py_audio

    @py_audio.setter
    def py_audio(self, value: pyaudio.PyAudio) -> None:
        """
        A setter method that changes the value of the `py_audio` attribute.
        :param value: A PyAudio object that represents the value of the `py_audio` attribute.
        :return: NoReturn.
        """
        setattr(self, "_py_audio", value)

    @property
    def data_stream(self) -> IO[bytes]:
        """
        A getter method that returns the value of the `data_stream`attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `data_stream` attribute.
        """
        if not hasattr(self, "_data_stream"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named data_stream."
            )
        return self._data_stream

    @data_stream.setter
    def data_stream(self, value: IO[bytes]) -> None:
        """
        A setter method that changes the value of the `data_stream` attribute.
        :param value: A string that represents the value of the `data_stream` attribute.
        :return: NoReturn.
        """
        setattr(self, "_data_stream", value)

    @property
    def wave_file(self) -> wave.Wave_write:
        """
        A getter method that returns the value of the `wave_file`attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `wave_file` attribute.
        """
        if not hasattr(self, "_wave_file"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named wave_file."
            )
        return self._wave_file

    @wave_file.setter
    def wave_file(self, value: wave.Wave_write) -> None:
        """
        A setter method that changes the value of the `wave_file` attribute.
        :param value: A string that represents the value of the `wave_file` attribute.
        :return: NoReturn.
        """
        setattr(self, "_wave_file", value)

    @property
    def file_name(self) -> Union[str, PathLike[str]]:
        """
        A getter method that returns the value of the `file_name`attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `file_name` attribute.
        """
        if not hasattr(self, "_file_name"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named file_name."
            )
        return self._file_name

    @file_name.setter
    def file_name(self, value: Union[str, PathLike[str]]) -> None:
        """
        A setter method that changes the value of the `file_name` attribute.
        :param value: A string that represents the value of the `file_name` attribute.
        :return: NoReturn.
        """
        setattr(self, "_file_name", value)

    @property
    def mode(self) -> str:
        """
        A getter method that returns the value of the `mode`attribute.
        :param self: Instance of the class.
        :return: A string that represents the value of the `mode` attribute.
        """
        if not hasattr(self, "_mode"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named mode."
            )
        return self._mode

    @mode.setter
    def mode(self, value: str) -> None:
        """
        A setter method that changes the value of the `mode` attribute.
        :param value: A string that represents the value of the `mode` attribute.
        :return: NoReturn.
        """
        setattr(self, "_mode", value)

    def __repr__(self) -> str:
        attrs: dict = {
            "frames_per_buffer": self.frames_per_buffer,
            "audio_format": self.audio_format,
            "channels": self.channels,
            "rate": self.rate,
            "py_audio": repr(self.py_audio),
            "data_stream": self.data_stream,
            "wave_file": repr(self.wave_file),
            "mode": self.mode,
            "file_name": self.file_name,
        }
        return f"{self.__class__.__name__}({attrs})"

    def __attrs_post_init__(self) -> None:
        wave_file = wave.open(os.path.join(BASE_DIR, self.file_name), self.mode)
        wave_file.setnchannels(self.channels)
        wave_file.setsampwidth(self.py_audio.get_sample_size(self.audio_format))
        wave_file.setframerate(self.rate)
        self.wave_file = wave_file
        del wave_file

    def record(self, duration: int = 3) -> None:
        self.data_stream = self.py_audio.open(
            format=self.audio_format,
            channels=self.channels,
            rate=self.rate,
            input=True,  # capture only; no playback stream is needed
            frames_per_buffer=self.frames_per_buffer,
        )
        frames: List[bytes] = []
        # Number of buffers to read to cover roughly `duration` seconds.
        num_buffers: int = int(self.rate / self.frames_per_buffer * duration)
        for _ in range(num_buffers):
            data = self.data_stream.read(self.frames_per_buffer)
            frames.append(data)
        self.wave_file.writeframes(b"".join(frames))

    def stop_recording(self) -> None:
        if self.data_stream:
            self.data_stream.close()
            self.py_audio.terminate()
            self.wave_file.close()


if __name__ == "__main__":
    rec = AudioRecorder()
    print(rec)
    rec.record()
    rec.stop_recording()


Deepgram Python SDK


Let's go back to our REPL and start playing with the Deepgram SDK.

We will start by importing the Deepgram class and listing its public attributes.

>>> from deepgram import Deepgram
>>> for attr in dir(Deepgram):
...   if not attr.startswith("_"):
...     print(attr)
... 
keys
projects
transcription
usage

As you can see, the Deepgram class exposes four main attributes. Using Deepgram, you can transcribe pre-recorded audio or live audio streams such as BBC radio. You can follow the README file for information on setting up a Deepgram account and getting started. With a secret API key, you can interact with the API to do the transcription. Once you have the key, store it in an environment variable so that the code below can run successfully:

$ export DEEPGRAM_API_KEY="XXXXXXXXX"

With the key exported, the following script transcribes a pre-recorded WAV file:

from deepgram import Deepgram  # type: ignore
import asyncio
import os
from os import PathLike
from typing import Union

async def transcribe(file_name: Union[str, bytes, PathLike[str], PathLike[bytes], int]):
    with open(file_name, "rb") as audio:
        source = {"buffer": audio, "mimetype": "audio/wav"}
        response = await deepgram.transcription.prerecorded(source)
        return response["results"]["channels"][0]["alternatives"][0]["words"]

if __name__ == "__main__":
    try:
        deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        words = loop.run_until_complete(transcribe("sound.wav"))
        string_words = " ".join(word_dict.get("word") for word_dict in words if "word" in word_dict)
        print(f"You said: {string_words}!")
        loop.close()

    except AttributeError:
        print("Please provide a valid `DEEPGRAM_API_KEY`.")

The above script prints the following if the audio file contains only the words "hello" and "world":

You said: hello world!
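The word-joining step above can be pulled into a small helper so it can be tried without calling the API. The sketch below is only an illustration: words_to_text and sample_response are hypothetical names, and the sample dictionary mimics just the fields the script reads (results, channels, alternatives, words), not the full Deepgram response.

```python
from typing import Any, Dict, List


def words_to_text(response: Dict[str, Any]) -> str:
    """Join the `word` entries of the first alternative of the first channel."""
    words: List[Dict[str, Any]] = response["results"]["channels"][0]["alternatives"][0]["words"]
    # Skip entries without a "word" key, mirroring the filter in the script above.
    return " ".join(item["word"] for item in words if "word" in item)


# Hypothetical response fragment: only the keys the article's script accesses.
sample_response = {
    "results": {
        "channels": [
            {"alternatives": [{"words": [{"word": "hello"}, {"word": "world"}]}]}
        ]
    }
}

print(f"You said: {words_to_text(sample_response)}!")
```

Keeping this logic in a plain function makes it easy to check the formatting separately from the network call.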

Connecting PyAudio and Deepgram

import wave
from typing import List, Optional, TypeVar, Union, IO
import pyaudio #  type: ignore
from deepgram import Deepgram # type:  ignore
import asyncio
import os
from os import PathLike


WaveWrite = TypeVar("WaveWrite", bound=wave.Wave_write)


def init_recording(
    file_name: Union[str, IO[bytes]] = "sound.wav", mode: Optional[str] = "wb"
) -> WaveWrite:
    wave_file = wave.open(file_name, mode)
    wave_file.setnchannels(2)
    wave_file.setsampwidth(2)
    wave_file.setframerate(44100)
    return wave_file

def record(wave_file: WaveWrite, duration: int = 3) -> None:
    py_audio = pyaudio.PyAudio()
    audio_stream = py_audio.open(
        rate=44100,  # frames per second
        channels=2,  # stereo, change to 1 if you want mono
        format=pyaudio.paInt16,  # 16-bit samples (2 bytes), matching setsampwidth(2)
        input=True,  # input device flag
        frames_per_buffer=1024,  # 1024 frames per buffer
    )
    frames = []
    # Read enough buffers to cover roughly `duration` seconds.
    for _ in range(int(44100 / 1024 * duration)):
        data = audio_stream.read(1024)
        frames.append(data)
    wave_file.writeframes(b"".join(frames))
    audio_stream.close()
    py_audio.terminate()  # release PortAudio resources

async def transcribe(file_name: Union[str, bytes, PathLike[str], PathLike[bytes], int]):
    with open(file_name, "rb") as audio:
        source = {"buffer": audio, "mimetype": "audio/wav"}
        response = await deepgram.transcription.prerecorded(source)
        return response["results"]["channels"][0]["alternatives"][0]["words"]

if __name__ == "__main__":
    # start recording
    print("Python is listening...")
    wave_file = init_recording() #  type: ignore
    record(wave_file)
    wave_file.close()
    # start transcribing
    deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    words = loop.run_until_complete(transcribe("sound.wav"))
    string_words = " ".join(word_dict.get("word") for word_dict in words if "word" in word_dict)
    print(f"You said: {string_words}!")
    loop.close()
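The loop bound in record is worth a quick check. Assuming the same settings as above (44100 Hz, 1024 frames per buffer, 2 channels, 2-byte samples), the sketch below works out how many buffers a 3-second recording reads and how many bytes end up in the file. buffers_needed and recorded_bytes are hypothetical helper names used only for this illustration.

```python
def buffers_needed(rate: int, frames_per_buffer: int, duration: float) -> int:
    """Number of whole buffers to read; any partial final buffer is dropped."""
    return int(rate / frames_per_buffer * duration)


def recorded_bytes(buffers: int, frames_per_buffer: int, channels: int, sample_width: int) -> int:
    """Each frame holds one sample per channel, each `sample_width` bytes wide."""
    return buffers * frames_per_buffer * channels * sample_width


buffers = buffers_needed(rate=44100, frames_per_buffer=1024, duration=3)
size = recorded_bytes(buffers, frames_per_buffer=1024, channels=2, sample_width=2)
seconds = buffers * 1024 / 44100

print(buffers)            # 129
print(size)               # 528384
print(round(seconds, 3))  # 2.995
```

Because the buffer count is truncated to a whole number, the recording comes out slightly shorter than the requested duration.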

Handle Exceptions


Now we need to handle errors to make our app more user-friendly, using a try/except block to catch expected exceptions instead of letting the program crash. The first error occurs when your DEEPGRAM_API_KEY is incorrect, which causes the request to fail with an Unauthorized error. Note that catching Exception is broad: other failures, such as a missing microphone, end up in the same handler.

    try:
        # start recording
        print("Python is listening...")
        wave_file = init_recording() #  type: ignore
        record(wave_file)
        wave_file.close()
        # start transcribing
        deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        words = loop.run_until_complete(transcribe("sound.wav"))
        string_words = " ".join(word_dict.get("word") for word_dict in words if "word" in word_dict)
        print(f"You said: {string_words}!")
        loop.close()
    except Exception as error:
        # Broad catch: an invalid key is the most likely cause, but not the only one.
        print(f"Request failed ({error}). Please provide a valid `DEEPGRAM_API_KEY` value")
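One way to tell a missing key apart from other failures is to check the environment variable before creating the client. The following is a minimal sketch, not part of the original script: load_api_key is a hypothetical helper, and it only verifies that the variable is set, not that Deepgram accepts the key.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ, name: str = "DEEPGRAM_API_KEY") -> str:
    """Return the API key from `env`, or raise a clear error if it is unset or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return key


# Usage with an explicit empty mapping, so the example runs without a real key.
try:
    load_api_key(env={})
except RuntimeError as error:
    print(error)
```

The returned value could then be passed to Deepgram(...) in place of the direct os.environ.get call.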

We can build a loop to record speech indefinitely until a condition is satisfied.

    try:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        while True:
            wave_file = init_recording() #  type: ignore
            print("Python is listening...")
            record(wave_file)
            wave_file.close()
            # start transcribing
            deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
            words = loop.run_until_complete(transcribe("sound.wav"))
            string_words = " ".join(word_dict.get("word") for word_dict in words if "word" in word_dict)
            print(f"You said: {string_words}!")
            if string_words == "stop":
                print("Goodbye!")
                break
        loop.close()
    except Exception as error:
        # Broad catch: an invalid key is the most likely cause, but not the only one.
        print(f"Request failed ({error}). Please provide a valid `DEEPGRAM_API_KEY` value")
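The exact comparison string_words == "stop" only matches when the transcript is precisely that one lowercase word. A slightly more forgiving check might normalize the text first. This is a sketch under that assumption: should_stop is a hypothetical helper, and the normalization rules are illustrative choices.

```python
import string


def should_stop(transcript: str, stop_word: str = "stop") -> bool:
    """Return True if `stop_word` appears as a whole word in the transcript."""
    # Lowercase and strip punctuation so "Stop." and "please stop!" both match.
    cleaned = transcript.lower().translate(str.maketrans("", "", string.punctuation))
    return stop_word in cleaned.split()


print(should_stop("Stop."))         # True
print(should_stop("please stop!"))  # True
print(should_stop("unstoppable"))   # False
```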

This program is I/O-bound: most of its time is spent waiting on the microphone, the disk, and the network. We will improve this version in the upcoming articles on speech recognition.

Wrapping Up


In this article, we explored the history of speech recognition and learned how to use the Deepgram Python SDK for speech recognition and PyAudio for audio recording. There is a lot more you can do with these libraries, which is beyond the scope of this article. Keep in mind that we can improve our project by sending audio directly from the microphone over WebSockets, without writing to a WAV file first; that is the work of future articles. We can also build a voice-controlled search engine on top of this. I suggest playing around with the webbrowser module to find even more exciting implementation ideas. We will be working on these kinds of projects throughout the upcoming articles in this series.
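
As a starting point for the voice-controlled search idea, a transcript can be turned into a search URL and handed to the standard-library webbrowser module. This is only a sketch: build_search_url and open_search are hypothetical helpers, and the DuckDuckGo query URL is an assumption chosen for illustration; any search engine URL would do.

```python
import webbrowser
from urllib.parse import quote_plus

SEARCH_URL = "https://duckduckgo.com/?q="  # assumed search endpoint, swap as needed


def build_search_url(transcript: str) -> str:
    """URL-encode the transcript and append it to the search endpoint."""
    return SEARCH_URL + quote_plus(transcript.strip())


def open_search(transcript: str) -> bool:
    """Open the search in the default browser; returns False if none is available."""
    return webbrowser.open(build_search_url(transcript))


print(build_search_url("hello world"))  # https://duckduckgo.com/?q=hello+world
```

Calling open_search with the transcribed text would then launch the query in the default browser.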

As always, this article is a gift to you, and you can share it with whomever you like or use it in any way that would be beneficial to your personal and professional development. Thank you in advance for your ultimate support!

Happy Coding, folks; see you in the next one.

References

[0] Wikipedia. Deep learning.

[1] Wikipedia. Geoffrey Hinton.

[2] William F. Katz, 2016. What Produces Speech: Your Speech Anatomy, Phonetics For Dummies.

[3] Jacquelyn Cafasso, 2019. What Part of the Brain Controls Speech?, healthline.

[4] Steven W. Smith, in Digital Signal Processing: A Practical Guide for Engineers and Scientists, 2003.

[5] Sam Lawson, 2018, Bell-Laboratories-invented-Audrey, ClickZ.

[6] Pioneering Speech Recognition, IBM.

[7] IBM Cloud Education, 2020, What is Speech Recognition.

[8] Project Veritas, 2022, Military Documents about Gain of Function contradict Fauci testimony under oath, Youtube.

[9] Sebastian Macke, Software Automatic Mouth - Tiny Speech Synthesizer, Github.

[10] Dimitrakakis, Christos & Bengio, Samy. (2011). Phoneme and Sentence-Level Ensembles for Speech Recognition EURASIP J. Audio, Speech and Music Processing. 2011. 10.1155/2011/426792.

[11] Ed Grabianowski, How Speech Recognition Works.

[12] News from Google, 2008, New Version of Google Mobile App for iPhone, now with Voice Search.

[13] Deepgram, Different Environments Call for Different Speech Recognition Models.

[14] Deepgram, High Accuracy for Better Speech Analysis.

[15] Deepgram, WHY DEEPGRAM: Enterprise audio is complex. Your ASR doesn't have to be.

[16] Deepgram, Every customer. Heard and understood.
