TL;DR
Sound is vibrations in the air that are converted into electrical signals—either through the ear to the brain or via a microphone to a computer. Sound has several characteristics, such as frequency and amplitude, which can be used to identify patterns. Spectrograms are a way to visualize which frequencies are present and when during a sound. They can help determine when an event starts and ends, analyze what happens during the event, or assess the quality of the event. Use cases include healthcare, where sound can evaluate the quality of a patient’s exercise, or environmental monitoring, where sound can identify which birds are present in an area.
Introduction
If a tree falls in the forest and no one is around to hear it, does it make a sound? It is a riddle rendered largely irrelevant by AI and sound event detection (SED): with advanced audio analysis, sounds can be detected and classified in all kinds of environments, even when no humans are present.
We can uncover patterns in sound that are imperceptible to the human ear. These patterns can be used to identify machine malfunctions, monitor animals in large ecosystems, or assess the quality of a patient's exercises.
Spectrograms are among the tools used for this today, while techniques like transformers and transfer learning point to where the field is heading.
Sound and Its Digital Representation
Sound consists of vibrations in the air known as sound waves. These waves are captured by the ear as they impact the membrane in the ear, commonly known as the eardrum. From there, the sound is amplified by small bones and passed into a fluid-filled structure called the cochlea. Inside the cochlea, tiny hair cells convert these vibrations into electrical signals, which are sent to the brain via the auditory nerve.
When we work with sound digitally, it must first be converted into a digital format. This is usually done through a microphone, which transforms sound waves into electrical signals. Much like the human ear, a microphone uses a membrane that reacts to sound. The membrane’s movement pushes a coil back and forth near a magnet, generating electrical signals that a computer can process.
If you’ve taken high school physics, you might remember the study of waves—and perhaps found it less than thrilling. But if you listened, you may recall that waves have various characteristics, such as frequency, intensity, and wavelength. These characteristics form the foundation for how AI processes sound.
One example of these characteristics is frequency: the number of times a wave oscillates per second, measured in hertz (Hz). Frequency determines whether a sound is perceived as a deep bass or a high-pitched tone. A high frequency corresponds to a higher-pitched tone, while a low frequency results in a deeper sound.
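To make the idea concrete, the small NumPy sketch below generates two pure tones; played back, the 110 Hz tone sounds like a deep bass note and the 880 Hz tone like a high-pitched one (the sample rate and file names are arbitrary choices for this sketch):

```python
import numpy as np
import soundfile as sf

sample_rate = 16_000                     # samples per second (arbitrary choice)
duration = 1.0                           # seconds
t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)

# A wave that completes 110 cycles per second is heard as a deep tone,
# while one that completes 880 cycles per second is heard as a high-pitched tone.
low_tone = np.sin(2 * np.pi * 110 * t)
high_tone = np.sin(2 * np.pi * 880 * t)

sf.write("low_tone.wav", low_tone, sample_rate)    # listen to compare the two
sf.write("high_tone.wav", high_tone, sample_rate)
```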
Sound in AI
When processing sound with AI, we aim to identify patterns in raw audio or in its characteristics. Two of the most popular tools, spectrograms and mel-frequency cepstral coefficients (MFCCs), focus on analyzing frequencies.
To translate sound into frequencies, a mathematical tool called the Fourier transform (FT) is used. It identifies which frequencies are present in an audio clip and how strong each of them is. The result can be visualized as a graph with frequency on the x-axis and magnitude on the y-axis, which can then be used for both filtering and pattern recognition.
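As a rough sketch of what this looks like in code, assuming a mono NumPy signal and its sample rate, the FFT below returns the frequencies present in the signal and their magnitudes:

```python
import numpy as np

def frequency_spectrum(audio: np.ndarray, sample_rate: int):
    """Return the frequencies (Hz) present in `audio` and their magnitudes."""
    spectrum = np.fft.rfft(audio)                             # FFT of a real-valued signal
    magnitudes = np.abs(spectrum)                             # strength of each frequency
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)  # frequency axis in Hz
    return freqs, magnitudes

# Example: a pure 440 Hz tone should produce a clear peak at 440 Hz.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)
freqs, magnitudes = frequency_spectrum(audio, sample_rate)
print(freqs[np.argmax(magnitudes)])  # ~440.0
```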
An example could be a recording made in a room with wall outlets and wiring in the ceiling. In such a scenario, a significant amount of sound is often detected at 50 Hz, caused by the electrical current from the power grid oscillating at that frequency. If the sound of interest occurs above 50 Hz, a high-pass filter can be applied to remove the lower frequencies.
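A minimal sketch of such a high-pass filter using SciPy (the fourth-order Butterworth design and the 60 Hz cutoff are illustrative choices, not a prescription):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def remove_mains_hum(audio: np.ndarray, sample_rate: int, cutoff_hz: float = 60.0) -> np.ndarray:
    """High-pass filter that attenuates the 50 Hz mains hum and everything below the cutoff."""
    sos = butter(N=4, Wn=cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)

# Example on one second of synthetic audio: a 50 Hz hum plus a 500 Hz tone of interest.
sample_rate = 8_000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 500 * t)
cleaned = remove_mains_hum(audio, sample_rate)
```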
The Fourier transform is also used to create spectrograms, which allow us to analyze sound in both the frequency and time domains. To create a spectrogram, the audio is divided into smaller segments, and the FT is applied to each segment. Filter banks are then applied to group the frequencies into bands. Finally, the segments are combined into a plot where time runs along the x-axis, the frequency bands along the y-axis, and the magnitude of each frequency is shown as color. In the resulting spectrogram of an example clip, four distinct events can already be identified without any detailed analysis.
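A sketch of this pipeline using librosa, where the frame length, hop size, number of mel bands, and input file are illustrative assumptions rather than settings from any particular project:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio, sample_rate = librosa.load("recording.wav", sr=None, mono=True)  # placeholder file

# Short-time Fourier transform: the FT applied to small segments of the audio,
# followed by mel filter banks that group the frequencies into bands.
mel = librosa.feature.melspectrogram(
    y=audio, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=64
)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale is easier to read

# Plot: time on the x-axis, frequency bands on the y-axis, magnitude as color.
librosa.display.specshow(mel_db, sr=sample_rate, hop_length=256, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.show()
```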
Use Case: Sonohaler
At Convai, we have collaborated with the company Sonohaler, which uses sound in the healthcare sector. Their mission is to make quality tools accessible to everyone by utilizing mobile devices as a replacement for traditional electronic measuring equipment. Their app incorporates AI that analyzes sound and tracks the user's progress. Convai’s role has been to work on event detection and quality measurement of the events.
For event detection, our task was to identify the start and end times of an event. We approached this as a classification problem, analyzing small segments of sound to determine whether they contained the event.
Our process began with collecting audio data containing the events of interest and annotating their start and end times. The data was then divided into small segments, each labeled as either "event" or "non-event." For each segment, a spectrogram was generated and used to train and test the model before it was deployed in the app. This resulted in a model capable of analyzing an audio file and detecting where it believes an event occurs. Below is the result for the spectrogram shown earlier, with the green bars indicating where the model predicts an event.
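As a minimal sketch of such a segment classifier, a small convolutional network in PyTorch could look like the following; the layer sizes and input dimensions are illustrative assumptions, not the configuration used in the app:

```python
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    """Binary classifier: does this spectrogram segment contain the event?"""

    def __init__(self, n_mels: int = 64, n_frames: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_mels // 4) * (n_frames // 4), 2),  # logits: non-event / event
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames), one spectrogram segment per example
        return self.classifier(self.features(x))

model = SegmentClassifier()
segments = torch.randn(8, 1, 64, 32)       # a batch of 8 spectrogram segments
logits = model(segments)
predicted = logits.argmax(dim=1)           # 0 = non-event, 1 = event
```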
For quality measurement, our task was to estimate the intensity of an event. We approached this as a regression problem. Once again, we utilized spectrograms, as there is often a correlation between the intensity of an event and the frequency content of the sound. As in the classification approach, the data was broken into smaller segments and transformed into spectrograms.
The key difference is that instead of assigning each segment a label (e.g., "event" or "non-event"), the target is a continuous value representing the intensity. This intensity can range from 0 to 100, and it is up to the model to predict the appropriate value.
The result is that the sound is converted into intensity values, which can later be used for analysis within the app or by a professional using the app.
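A minimal sketch of the regression variant, mirroring the hypothetical classifier above but with a single continuous output and a mean-squared-error loss (again, all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class IntensityRegressor(nn.Module):
    """Predicts a continuous intensity (0-100) for one spectrogram segment."""

    def __init__(self, n_mels: int = 64, n_frames: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_mels // 4) * (n_frames // 4), 1),  # one continuous value
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x)).squeeze(-1)

model = IntensityRegressor()
segments = torch.randn(8, 1, 64, 32)           # batch of spectrogram segments
targets = torch.rand(8) * 100                  # annotated intensities between 0 and 100
loss = nn.MSELoss()(model(segments), targets)  # regression loss instead of classification
```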
The Potential of Sound
With Sonohaler, we have explored just a small portion of what sound can do for us. We utilized spectrograms for both classification and regression, but technologies like transformers, which are used in large language models, can also be applied to sound—and they are continually improving. Transformers have the potential to train on much larger datasets and to work directly with raw audio instead of relying on spectrograms.
One of the challenges of working with sound is that analysis often needs to be performed on small devices, such as phones or sensors connected to microprocessors. This makes the size and efficiency of AI models and feature extraction techniques crucial. It requires creative thinking to distill the process down to its fundamental components.
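As one illustration of what shrinking a model can look like (a generic technique, not necessarily what is used in the Sonohaler app), PyTorch's dynamic quantization stores the weights of selected layers as 8-bit integers:

```python
import torch
import torch.nn as nn

# A stand-in for a trained float32 model (e.g. the classifier sketched earlier).
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 32, 2))

# Convert the linear layers' weights to 8-bit integers: a smaller model
# that is often faster on CPU, at a small cost in accuracy.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
```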
Beyond the healthcare sector, the environmental sector is another area where AI has been widely applied to sound. A classic example is identifying bird species in natural habitats. Birds can be difficult to spot, making image recognition impractical. However, each bird has a unique sound, both as a species and as an individual. By recording audio in an area and allowing AI to analyze the sound, it is possible to identify which birds are present in the audio clip.
In the industrial sector, sound is also used for anomaly detection, which involves identifying unusual sounds that are not normally present. This capability allows for the detection of machinery with reduced or faulty performance, helping to identify and fix issues before they become critical.
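One common recipe, sketched below under the assumption that spectrogram segments are used as input, is to train an autoencoder on sound from normal operation and flag recordings that it reconstructs poorly:

```python
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    """Learns to reconstruct spectrograms of *normal* machine sound."""

    def __init__(self, n_features: int = 64 * 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(n_features, 32), nn.ReLU())
        self.decoder = nn.Linear(32, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x)).view_as(x)

def anomaly_score(model: SpectrogramAutoencoder, spectrograms: torch.Tensor) -> torch.Tensor:
    """High reconstruction error means the sound does not resemble normal operation."""
    with torch.no_grad():
        reconstruction = model(spectrograms)
    return ((reconstruction - spectrograms) ** 2).mean(dim=(1, 2, 3))

model = SpectrogramAutoencoder()
batch = torch.randn(4, 1, 64, 32)          # spectrogram segments recorded from a machine
scores = anomaly_score(model, batch)       # compare against a threshold chosen on normal data
```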
Some of the large-scale sound models freely available online are highly comprehensive classification models trained on thousands of YouTube videos. These models can distinguish between more than 200 different sound events, including people talking, cars driving by, gunshots, and other everyday occurrences. They can also serve as a starting point for further training, via transfer learning, on data tailored to your specific problem.
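One example of such a model is Google's YAMNet, trained on the AudioSet collection of YouTube clips. A minimal sketch of running it through TensorFlow Hub follows; the audio file name is a placeholder, and the recording is assumed to already be 16 kHz mono:

```python
import numpy as np
import soundfile as sf
import tensorflow_hub as hub

# YAMNet: a pretrained classifier covering hundreds of everyday sound events.
model = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects mono float32 audio at 16 kHz in the range [-1, 1].
waveform, sample_rate = sf.read("street_recording.wav", dtype="float32")  # placeholder file
assert sample_rate == 16000, "resample to 16 kHz before running the model"
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)  # collapse stereo to mono

scores, embeddings, spectrogram = model(waveform)   # scores: (frames, num_classes)
top_class = np.argmax(scores.numpy().mean(axis=0))  # most likely class across the clip
print(top_class)                                    # index into the model's class map
```

The class index can be mapped to a human-readable label via the class map that ships with the model, and the per-frame embeddings can be reused as features when fine-tuning on your own data.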
Conclusion
Sound contains a wealth of information that can be used for far more than just playing music or facilitating conversations. By digitizing and analyzing sound, we can gain deep insights in fields such as healthcare and environmental monitoring.
Tools like spectrograms have been indispensable in recent years, but technologies like transformers may pave the way for even greater advancements. They have the potential to enable new models that perform better while remaining accessible on devices like smartphones.
We are only at the beginning of exploring what sound can offer us. With the rapid advancements in AI and machine learning, the future of sound analysis is both exciting and full of possibilities.