We strive to close the gap between the experience of face-to-face and online communications. Bringing years of research in human perception to bear on the human experience of connection while physically separated is at the heart of what we do. At a recent technical conference, Chief Architect Paul Boustead provided insights from the research in voice communications on how spatial rendering can improve intelligibility of video conferencing.
A nuisance refers to any audio perceived in a communications context that is unwanted and distracting. A police siren, the hum of a fan or air conditioner, the barking of a dog, etc. are all examples of a nuisance. In order to reduce background noise like this, many media servers take the simple approach of sending only the few loudest audio streams detected.
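The loudest-streams approach can be sketched in a few lines. This is a minimal illustration, not any particular media server's implementation: the `StreamLevel` type, the `select_loudest` function, the participant names, and the dBFS values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamLevel:
    """Hypothetical per-participant loudness measurement for one audio frame."""
    participant: str
    level_dbfs: float  # values closer to 0 are louder


def select_loudest(streams: list[StreamLevel], max_streams: int = 3) -> list[str]:
    """Return the participants whose streams would be forwarded, loudest first."""
    ranked = sorted(streams, key=lambda s: s.level_dbfs, reverse=True)
    return [s.participant for s in ranked[:max_streams]]


# Made-up levels for a single frame of a four-source call.
frame = [
    StreamLevel("presenter", -18.0),
    StreamLevel("colleague", -22.0),
    StreamLevel("listener_uh_huh", -35.0),
    StreamLevel("fan_hum", -45.0),
]
forwarded = select_loudest(frame, max_streams=2)
print(forwarded)
```

With these made-up levels only the two loudest talkers are forwarded. The fan hum is dropped as intended, but so is the quiet reaction from the listener, because the selection looks only at loudness.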
This approach has the unwanted side effect of removing affirmations. An affirmation is a backchannel sound that lets the speaker know how others are reacting. If you tell somebody a joke, you hope for laughter, which can be contagious: when one person laughs, others join in. If you don’t hear an affirmation such as a laugh, an “um”, or an “uh-huh”, you may wonder what’s going on.
In a 2001 study, Shriberg et al. (1) found that in business meetings with 4-8 participants, the majority of the conversation is made up of overlapping talk spurts. A significant share of that overlap consists of verbal affirmations.
By contrast, in gaming communications such as MMOs, players rarely speak: less than 5% of the time. When they do speak, though, it is often overlapping speech during game action, with several concurrent talkers. This is one of the original use cases that led to the work at Dolby on rendering audio spatially based on the location of players in a virtual world.
In the real world, we handle overlapping audio well. Although our ears pick up the linear sum of all the sounds we’re hearing at any given time, our brain can pull them apart so that we can concentrate on one and listen. This ability to concentrate on a voice and filter out other sounds is referred to as Spatial Release from Masking.
Through auditory scene analysis, we’re able to separate voices from other sounds. By picking up on the inter-aural time and volume differences between our two ears, we are able to recognize key words and phrases. The cocktail party effect of hearing our name from across the room is an example of this. Human perception is adept at understanding overlapping audio.
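To give a sense of scale for the inter-aural time differences mentioned above, the sketch below uses Woodworth's spherical-head approximation, a common textbook model. It is an assumption brought in for illustration and is not taken from the talk; the head radius and speed of sound are typical round values, and the function name is hypothetical.

```python
import math


def woodworth_itd_seconds(azimuth_deg: float,
                          head_radius_m: float = 0.0875,
                          speed_of_sound_m_s: float = 343.0) -> float:
    """Approximate inter-aural time difference for a distant source.

    Woodworth's spherical-head model: ITD = (r / c) * (theta + sin(theta)),
    with theta the azimuth in radians, intended for 0 to 90 degrees
    (0 = straight ahead, 90 = directly to one side).
    """
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))


for azimuth in (0, 15, 45, 90):
    itd_us = woodworth_itd_seconds(azimuth) * 1e6
    print(f"{azimuth:>2} degrees: {itd_us:.0f} microseconds")
```

Under these assumptions a source 15 degrees off center arrives at the far ear roughly 0.13 ms later than at the near ear, and a source directly to one side roughly 0.66 ms later.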
Dolby’s Head Related Transfer Function (HRTF) models how a sound in a particular location would sound when it hits your ear canal, taking into account the shape of the head and ears, reverb in the room, etc. The implementation uses an accurate ML-based Voice Activity Detector (VAD) to identify which streams contain speech, and renders all of those streams for a good experience with headphones.
This incorporates key findings from the research. For example, the larger the separation between speakers, the better, although even 15 degrees is sufficient. The layout of speakers can be done automatically, placing the first in front and adding others alternately to the left and right. Most people find somebody talking from behind disconcerting.
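The placement rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Dolby's actual layout algorithm: the function name, the fixed 15-degree step, the sign convention (negative = left), and the clamp to the frontal half-plane are all hypothetical choices.

```python
def layout_azimuths(num_talkers: int,
                    separation_deg: float = 15.0,
                    max_azimuth_deg: float = 90.0) -> list[float]:
    """Assign an azimuth in degrees to each talker.

    0 is straight ahead, negative is to the left, positive to the right.
    The first talker goes in front; later talkers alternate left and right,
    each pair one separation step further out.
    """
    azimuths = []
    for index in range(num_talkers):
        step = (index + 1) // 2        # 0, 1, 1, 2, 2, ...
        side = -1 if index % 2 else 1  # odd indices left, even indices right
        angle = side * step * separation_deg
        # Clamp to the frontal half-plane so nobody is placed behind the listener.
        angle = max(-max_azimuth_deg, min(max_azimuth_deg, angle))
        azimuths.append(angle)
    return azimuths


print(layout_azimuths(5))
```

For five talkers this yields positions at 0, -15, 15, -30, and 30 degrees. One limitation of this simple version is that in very large calls the clamp stacks the outermost talkers at the same angle; a fuller implementation would need a policy for that case.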
By incorporating conferencing directly into your applications with solutions like this, you benefit from many years of psycho-acoustic research deployed at scale with large video conferencing providers. Rendering voices spatially allows us to better understand natural communications, including overlapping speech and sounds, for the best end-user experience.
- Shriberg, Elizabeth, et al. “Observations on overlap: findings and implications for automatic processing of multi-party conversation”, INTERSPEECH, 2001.