Audio Embeddings: Understanding the basics

#ai #machinelearning #nlp #computerscience

In the age of Artificial Intelligence, Large Language Models, and Natural Language Processing (NLP), there's a lot of buzz around embeddings and semantic search. These Large Language Models have harnessed the potential of high-dimensional word embeddings, a process that translates text into mathematical vector representations. This methodology is very useful for NLP applications and is a cornerstone in the creation of modern, highly accurate search engines.

However, the generation of embeddings isn't restricted solely to text. It is indeed feasible to generate embeddings from a diverse range of media formats, including video, which will be the central topic of this article. By generating embeddings from audios you can perform Query-By-Example (QBE), audio pattern recognition, audio analysis and audio semantic search.

Applications of audio embeddings

You might be scratching your head and asking, "Why should I care about audio embeddings? What's the big deal?" Well, if you're jamming to tunes on Spotify or binging videos on YouTube, guess what? Audio embeddings are already working behind the scenes for you. These little math wizards help Spotify and YouTube tailor the music and recommendations specifically to your taste. Plus, they play a big part in their search system, helping you find that one specific track you've got stuck in your head.

Have you ever used the app "Shazam"? It's a handy tool that helps you identify the title of a song you enjoy but don't know the name of. This happens often when we're listening to the radio. Shazam accomplishes this by capturing a short clip of the song and then searching for a match in its database. While it doesn't use audio embeddings directly, it utilizes a technique called audio fingerprinting, which is quite similar.

How to generate embeddings from an audio?

You can generate audio embeddings in several ways, but it will involve artificial neural networks in most times. It's a pretty recent field, hence so far (2023) there isn't a standard and "better" way to do this, you need to analyse all the algorithms and decide which one best fits to your needs. In this article we are going to review 5 different algorithms:

Mel-frequency cepstral coefficients (MFCCs)
Convolutional Neural Networks (CNNs)
Pretrained Audio Neural Networks (PANNs)
Recurrent Neural Networks (RNNs)
Transformers Architecture

If you are not familiar with the concepts of semantic search or vector embeddings, I recommend you to read this articles first: What is a vector embedding? and Vector Databases

Mel-frequency cepstral coefficients (MFCCs)

Imagine you're listening to a song. The song is made up of different pitches, volumes, and tones, which all change over time. The human ear picks up on these changes and our brain interprets them as music.

MFCCs try to mimic this process in a mathematical way. They take a sound, break it down into its different frequencies (like bass or treble), and then represent these frequencies in a way that's similar to how our ear hears them. This creates a set of numbers, or "coefficients", which can be used as the embeddings of the song.

Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNNs) are a class of deep learning models highly effective for analyzing sequential data like audio. Instead of processing grid-like structures, as in images, they process audio signals to capture temporal dependencies and patterns. By employing convolutional layers, CNNs apply filters to local segments of audio data, learning from simple features like waveforms in initial layers to more intricate sound patterns or motifs in deeper layers.

Pretrained Audio Neural Networks (PANNs)

PANNs are deep learning models that have been trained on extensive audio data. They have the capacity to recognize a wide array of acoustic patterns, which equips them to be functional in various audio-related tasks. A PANN is quite versatile because it can extract patterns from all types of audio, including noises. If you require something more specific, the model can be fine-tuned to cater to your needs.

Recurrent Neural Networks (RNNs)

RNNs are another type of deep learning model that has been widely used for audio tasks. Unlike CNNs and PANNs, which process an entire audio signal at once, RNNs process the audio signal in a sequential manner. This makes them especially well-suited to tasks involving time series data, like audio signals. Audio embeddings can be created with an RNN by passing the audio signal through the network and using the output from the last time step as the embedding.

Transformers Architecture

Transformers are the same technology used in ChatGPT and Google Bard, they are pretty recent, yet very relevant and popular. Transformers use a mechanism called attention to weigh the importance of different parts of the input data. This allows them to capture long-range dependencies in the data, which is particularly useful for audio signals. Audio embeddings can be created with a Transformer by passing the audio signal through the network and using the output from the last layer as the embedding.