(cross-post from my personal blog)
When making VR experiences of all stripes, the most important consideration is "how does this help or hurt immersion?" If sound design takes a user out of the space, it is bad sound design. Unfortunately, this can mean different things for different experiences, and a focus on realism above everything else is not a panacea. Instead, the best method of building a soundscape is asking yourself "does this fulfill expectations?" If you add a fire to a game and a recording of an actual fire sounds wrong next to a recording of someone crumpling a paper bag, it's more important to pick the audio that sounds right than the one that is technically accurate [1]. As a result, before releasing your sound design into the world, take a good long critical listen to all of your sound elements and make sure they sound good in general, in the context they are played in, and in relation to the other sounds in the environment.
A major difficulty in 3D audio is that everyone has a weirdly shaped head, with weirdly shaped ears, and a body that occupies space weirdly, so it's entirely possible that something that sounds completely natural to you, with your normal-shaped head and ears and body, will sound wrong to someone else. There's not a huge amount you can do about this short of replacing the entire human population with clones of yourself, so it's important to get lots of people to listen to what you're building regularly. And if you realize you've just spent hours dropping a physics object on the floor, unsure if it sounds right, remember it's entirely possible that it's you who has the weirdly shaped head, body, etc. To build a somewhat general model of what something sounds like from the perspective of a human being, we use a Head-Related Transfer Function.
A Head-Related Transfer Function (or HRTF) is a function used to help simulate the 3D positioning of a sound by describing how sound arriving from a given direction bounces off our bodies and into our eardrums. How sound reflects off inanimate objects in a scene, like walls or tables, is a separate problem handled by reverb and room modeling. An HRTF takes into account the shape of the ears, head, and body, and lets you calculate how a given audio clip will sound from a particular location in a way that is natural to humans.
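To make that a little more concrete, here is a minimal sketch of how an HRTF usually gets applied: in its time-domain form (a head-related impulse response, or HRIR) it is convolved with a mono clip, once per ear. The function names and the tiny impulse responses below are made up for illustration. Real HRIRs are measured, hundreds of samples long, and there is a different pair for every direction.

```python
from typing import List, Sequence, Tuple


def convolve(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Direct-form convolution; output has len(signal) + len(kernel) - 1 samples."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(
    mono: Sequence[float],
    hrir_left: Sequence[float],
    hrir_right: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Filter one mono clip through a per-ear impulse response pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy (made-up) impulse responses for a source on the listener's left:
# the near ear hears it immediately at full level, the far ear hears it
# two samples later and quieter.
TOY_HRIR_LEFT = [1.0, 0.0, 0.0]
TOY_HRIR_RIGHT = [0.0, 0.0, 0.5]

left, right = render_binaural([1.0, 0.5], TOY_HRIR_LEFT, TOY_HRIR_RIGHT)
```

Production spatializers do the same thing with FFT-based convolution and interpolate between measured directions, but the idea is the same: one mono input, two filtered outputs.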
For the longest time, game sound has been limited to a player interacting with the 2D plane of a screen and getting sound information out. That form factor opens the door to hacks that optimize for performance. Background music and ambiences don't need to have a source in the world, and a perfectly convincing soundscape can be built with nothing more than panning and attenuation. The position and orientation of a player's head are extremely limited compared to VR experiences, where the player can be just about anywhere at any given time. Not only that, but sound emitters themselves can also move, either programmatically or through player interaction.
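As a rough illustration of the panning-and-attenuation hack, here is a small sketch using a constant-power pan law and an inverse-distance rolloff. Both are common choices rather than the only ones, and the function names and the `min_distance` parameter are my own for this example.

```python
import math
from typing import Tuple


def constant_power_pan(pan: float) -> Tuple[float, float]:
    """Return (left_gain, right_gain) for pan in [-1, 1].

    -1 is hard left, 0 is centre, 1 is hard right. The squared gains
    always sum to 1, so perceived loudness stays roughly constant.
    """
    pan = max(-1.0, min(1.0, pan))
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to 0..pi/2
    return math.cos(angle), math.sin(angle)


def distance_gain(distance: float, min_distance: float = 1.0) -> float:
    """Inverse-distance rolloff, clamped to full volume inside min_distance."""
    return min_distance / max(distance, min_distance)


def stereo_gains(pan: float, distance: float) -> Tuple[float, float]:
    """Combine pan and distance attenuation into per-channel gains."""
    left, right = constant_power_pan(pan)
    gain = distance_gain(distance)
    return left * gain, right * gain


# A sound slightly to the right, four units away.
left_gain, right_gain = stereo_gains(0.5, 4.0)
```

This is cheap because it is two multiplies per sample, and it works as long as the listener's head stays put in front of a screen.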
Stepping up from games experienced on 2D planes are 360 degree videos and experiences that are not roomscale. Panning audio to the left and the right is no longer sufficient, as the pitch and roll of a player's head must be taken into account. To solve this problem, a technology first developed in the 1970s has started seeing more widespread use -- ambisonics. Ambisonic recordings are made with a special array of (usually) four microphones positioned to capture sound from as close to every direction as possible. Using ambisonics, it is possible to get a full representation of where sounds were emitted from and then play them back to a listener. Unlike surround sound, ambisonics does not assign particular clips of audio to specific speakers; the recording describes the sound field itself, and a decoder figures out what each speaker should play at playback time. Perhaps the main limitation of ambisonics is that it has a small sweet spot and requires a general knowledge of where the listener is -- roomscale isn't an option, because as soon as a player leaves the defined centrepoint of the audio experience, things fall apart.
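For a feel of why ambisonics copes with head rotation so well, here is a small sketch of first-order encoding and yaw compensation. It assumes the traditional B-format convention (W scaled by 1/sqrt(2), azimuth measured counter-clockwise from straight ahead); other conventions such as AmbiX order and scale the channels differently. The function names are mine.

```python
import math
from typing import Tuple

BFormat = Tuple[float, float, float, float]  # (W, X, Y, Z)


def encode(sample: float, azimuth: float, elevation: float = 0.0) -> BFormat:
    """Encode one mono sample arriving from (azimuth, elevation), in radians."""
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth) * math.cos(elevation)
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    return (w, x, y, z)


def compensate_head_yaw(b: BFormat, head_yaw: float) -> BFormat:
    """Rotate the sound field by -head_yaw so it stays fixed in the world."""
    w, x, y, z = b
    c, s = math.cos(-head_yaw), math.sin(-head_yaw)
    return (w, x * c - y * s, x * s + y * c, z)


def virtual_cardioid(b: BFormat, azimuth: float) -> float:
    """Decode by pointing a virtual cardioid microphone at a horizontal azimuth."""
    w, x, y, _ = b
    return 0.5 * (math.sqrt(2.0) * w + x * math.cos(azimuth) + y * math.sin(azimuth))


# A source straight ahead; the listener then turns their head 90 degrees left,
# so relative to the head the source should now sit on the right.
front = encode(1.0, 0.0)
turned = compensate_head_yaw(front, math.pi / 2)
right_pickup = virtual_cardioid(turned, -math.pi / 2)
left_pickup = virtual_cardioid(turned, math.pi / 2)
```

Rotating the whole sound field is just a small matrix multiply on the channels, no matter how many sources were mixed into the recording. Moving the listener's position is the part this representation can't do, which is the sweet spot problem described above.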
So instead of trying to emulate the 3D environment we occupy, we try to represent the sound as it hits your ears -- this is where binaural audio comes in. To record audio that emulates the way you hear sounds, two microphones are positioned on either side of a dummy head (as your brother might be called), as opposed to the array of four microphones that ambisonic recordings use. Binaural recordings attempt to capture sounds as they are heard instead of how they are emitted, and as a result they only really work on headphones.
For VR you're generally not recording sounds using either of these methods, but you do draw on the concept behind binaural recording -- which is to say, you attempt to recreate audio as we hear it. Building out a single mix and calling it a day simply isn't flexible enough when you have all these pesky humans listening to it. Instead we use mono sounds and code to generate binaural sounds on the fly. This is why audio can be CPU intensive: you have to calculate where all the sound waves are bouncing from, all the time.
There are three main factors in building a convincing audio mix for VR -- reverb, occlusion, and spatialization. Reverb is using echoes to add a sense of space to a scene, occlusion is what happens when there is an object between a sound emitter and the listener, and spatialization is the use of filters, delay, and attenuation to trick your ears into thinking a sound is coming from a particular location.
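To give a flavour of what those filters and delays look like, here is a rough sketch of two textbook approximations: the Woodworth spherical-head formula for the delay between the ears, and a one-pole low-pass filter as a cheap stand-in for occlusion. The head radius, cutoff, and gain values are illustrative assumptions, and the function names are mine. None of this is how any particular plugin does it.

```python
import math
from typing import List, Sequence

SPEED_OF_SOUND = 343.0  # metres per second, in air at about 20 C
HEAD_RADIUS = 0.0875    # metres; a commonly used average, not your head


def interaural_time_difference(azimuth: float) -> float:
    """Woodworth approximation of the arrival-time gap between ears, in seconds.

    azimuth is in radians, 0 straight ahead, positive to the right. The result
    is positive when the sound reaches the right ear first. The formula is
    meant for frontal angles, so the input is clamped to +/- 90 degrees.
    """
    theta = min(abs(azimuth), math.pi / 2)
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))
    return math.copysign(itd, azimuth)


def occlude(
    samples: Sequence[float],
    cutoff_hz: float = 800.0,
    sample_rate: float = 48000.0,
    gain: float = 0.5,
) -> List[float]:
    """Muffle a signal with a one-pole low-pass filter and turn it down."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out: List[float] = []
    state = 0.0
    for x in samples:
        state += alpha * (x - state)  # smooth away fast changes (high frequencies)
        out.append(state * gain)
    return out


# A source hard right arrives at the left ear roughly 0.66 ms late.
itd_hard_right = interaural_time_difference(math.pi / 2)
muffled = occlude([1.0] * 2000)
```

In a real mix the occlusion amount would come from a raycast or similar check between the emitter and the listener, and the cutoff and gain would vary with how much stuff is in the way.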
Spatialized audio can be very computationally expensive, as reflections are for the most part calculated at run time. I'm sure hand-coding these calculations is completely reasonable for superhumans, but in most cases you'll want to use some kind of library or plugin to handle the math. There are three main projects that do this: Oculus' spatialization plugin, Steam Audio, and Google Resonance. There is a guy who has done an amazing comparison of these three options, and if you're trying to decide which to use for your project I'd urge you to listen to his samples here.
Audio persists in being the unsung hero of immersion in VR. It can be used actively to draw attention to things or passively to fill out a room. I hope I've given you a launchpad of words and concepts for audio in VR so you can go forth and make things that sound cool.
[1] We can go ahead and blame Hollywood for messing with our expectations.