Occlusion: the Problem of Putting Us in the Picture with Augmented Reality

Iflexion ・12 min read

In 2016, the phenomenon of Pokémon Go gave augmented reality (AR) its first “killer app”—something that virtual reality (VR) had been seeking in vain for 25 years. In spite of Go's whimsical nature, this was the first time that the public was able to interact in the real world, and in real time, with computer-generated elements that were integrated into a real-life experience.
Pokémon Go was able to accomplish this while working within the technical constraints of common mobile devices, rather than tantalizing consumers with YouTube video demonstrations of impressive feats destined to remain vaporware, at least in terms of an actual RTM product. In the end, pole position went to the software that actually shipped.
Unfortunately, from a commercial point of view, the free-roaming nature of Go set such ambitious goals for Mixed Reality (MR) that the state-of-the-art was soon to hit an implacable roadblock.

VR and AR Converge

Historically, VR was never able to get away from coin-fed mall machines, specialist urban experiences, or a hamstrung series of over-priced, over-tethered, or underpowered home devices over the course of the last thirty years.
The computational and graphical demands of full-view, high-resolution interactive environments are so exhausting that no leap in technology has been able to significantly shrink the bulky headsets or the anchoring support systems they often need, such as a PC. Neither could it remove the sense that this is an isolating, expensive, and anti-social technology at odds with an age beguiled by lightweight wearables and obsessed with digital social interaction.
Augmented reality seemed to provide a more modern approach. First defined in 1990 by Boeing researcher Tom Caudell for a proposed industrial application, AR overlays the real world with artificial elements, a concept that harks back to WWII and that first hit the car market in the late 1980s.
Current commercial interest in AR has been fueled by the development of wearable visors that are much more discreet than VR headsets and that can superimpose interactive digital elements in this way, instead of replacing our view completely with a digital environment as VR does. Market offerings at the moment include Microsoft HoloLens, Magic Leap, Google Glass Enterprise Edition, Vuzix Blade, the (struggling) Meta 2, Optinvent's ORA 2, and the Varia Vision, amongst others.
At this stage, the distinction between AR and VR is becoming academic. Since AR can completely obscure the user's vision on demand, and VR systems increasingly make real-world views available, development is headed by consensus towards the “Mixed Reality” experience.
The Pokémon Go phenomenon has fueled consumer enthusiasm for flexible AR experiences that take place in non-controlled environments: streets, parks and open spaces. Advertisers want discoverable creations that dazzle and amaze; gaming companies want virtual worlds overlaid on public worlds. If we were going to stay home or enjoy AR only in the prescribed spaces of special events or fixed attractions, we might as well be back in the 1990s watching the last Virtual Reality bubble deflate under its technical and geographical limitations.
All this means that virtual elements in AR need to behave as if they were real objects—to be able to hide under tables, to disappear behind real structures such as columns, doorways, walls, cars, lampposts, people; even skyscrapers, if the scenario should call for it. In computer vision, this is called occlusion, and it's essential for the future of genuinely immersive augmented reality systems.
However, it's going to be a bit of a problem.

Three Common Approaches to AR Mapping

An AR system needs to understand the geometry of the environment it’s in: the tables, chairs, walls, alcoves, and any other object which could potentially need to appear to be in front of a virtual element during the AR session.

Furniture mapped by the Microsoft HoloLens


It also needs to be able to recognize and extract real people who might also be in the AR space with you, since they may end up standing in front of virtual objects or structures.
If the system can’t do this, it can't mask off the sections of a virtual element which logically should not be visible, such as the lower legs of a virtual robot that is standing behind a real table in front of you. Nor, in the case of real participants, can it “cut them out” of any virtual objects that they may be standing in front of.

A real person sandwiched between two non-real elements: the colorful background and the foreground objects.

If the system can extract these “mattes,” but can’t do it fast enough or to an acceptable quality, either the virtual elements will lag behind the real elements, or the matte borders will be indistinct or unconvincing.
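The masking described above comes down to a per-pixel depth comparison. The following is a minimal sketch of that idea, not any particular AR framework's implementation: the function name `composite`, the use of strings as stand-in pixels, and the convention that `None` means "no depth data" are all assumptions made for illustration.

```python
from typing import List, Optional

Pixel = str                # stand-in for a colour value
Depth = Optional[float]    # metres from the camera; None means "no data"


def composite(
    real_rgb: List[List[Pixel]],
    real_depth: List[List[Depth]],
    virt_rgb: List[List[Pixel]],
    virt_depth: List[List[Depth]],
) -> List[List[Pixel]]:
    """Show a virtual pixel only where it is nearer than the real scene."""
    out = []
    for y, row in enumerate(real_rgb):
        out_row = []
        for x, real_px in enumerate(row):
            v_d = virt_depth[y][x]
            r_d = real_depth[y][x]
            if v_d is None:
                # No virtual content at this pixel: keep the camera image.
                out_row.append(real_px)
            elif r_d is None or v_d < r_d:
                # Virtual object is in front (or real depth is unknown).
                out_row.append(virt_rgb[y][x])
            else:
                # A real object is nearer, so it occludes the virtual one.
                out_row.append(real_px)
        out.append(out_row)
    return out


# One image row: a real table at 1 m, a far wall at 3 m, and one unmapped pixel.
real_rgb = [["R", "R", "R", "R"]]
real_depth = [[1.0, 1.0, 3.0, None]]
virt_rgb = [["V", "V", "V", "V"]]
virt_depth = [[2.0, None, 2.0, 2.0]]   # a virtual robot standing 2 m away

print(composite(real_rgb, real_depth, virt_rgb, virt_depth))
# [['R', 'R', 'V', 'V']]
```

Note that the last pixel shows the virtual object only because the real depth is missing. That is the failure mode discussed in this article: where the environment has not been mapped, nothing can occlude.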
There are three primary methods by which AR systems build the invisible 3D environment models that are necessary for realistic occlusion. Some need more preparation than others; some are more suited to certain situations than others, and none are ideal for a truly untethered and spontaneous mixed reality experience.


Time-of-Flight Sensor

A Time-of-Flight (TOF) sensor pulses infrared light rapidly at the environment and records how long each emission takes to bounce back to the sensor. Objects that are further away return the light later than nearer objects, enabling the sensor to build up a 3D image of the room space.
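The arithmetic behind this is simple: the light travels to the object and back, so the distance is half the round-trip time multiplied by the speed of light. The sketch below shows only that relationship; the function names are illustrative, and it ignores how a real sensor actually measures the delay.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_distance_m(round_trip_s: float) -> float:
    """Distance to a surface given the round-trip time of a light pulse."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def round_trip_s(distance_m: float) -> float:
    """Round-trip time of a light pulse to a surface at the given distance."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


# How long does a pulse take to return from a surface four metres away?
delay_ns = round_trip_s(4.0) * 1e9
print(f"{delay_ns:.1f} ns")
```

A four-meter round trip takes under 27 nanoseconds, which gives a sense of the timing precision such a sensor has to work with.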
Version 2 of Microsoft’s Kinect sensor uses TOF, as do the LIDAR systems common in autonomous vehicle research. These sensors are also widely used in industrial applications.
Outdoors, TOF has severe limitations, since natural daylight will distort its results, and multiple TOF sensors (which would be helpful to avoid undercuts in AR mapping) are prone to interfere with each other's effectiveness.
A TOF sensor suitable for a portable AR device (dedicated headset or phone) also has a maximum usable range of around four meters, so it certainly can’t help us to put a virtual Godzilla behind the Empire State Building.

Stereo Camera

Stereo cameras can generate 3D geometry by comparing the differences between their two 2D images and deriving a depth map from them. This technique is widely used in AR applications and hardware.
On the plus side, this method can work well in real time, if necessary, allowing for more spontaneous and “live” environment mapping.
Negatively, it works poorly in bad light, fails to account for undercuts (i.e. distinguishing an alcove in a far wall from a doorway) or other geometric anomalies that a preliminary mapping session might have identified, and operates at a significantly lower resolution than the actual camera output, often making for rough and unconvincing mattes.

In this video from Google's Project Tango, where live occlusion maps are generated from RGB-D data, we see the notable difference in quality between the video image and the available depth map.

Worse yet, this approach can be completely undermined when the system is asked to map a blank or featureless area, such as a white wall.
Since a stereo camera mapping system works under the same limitations as a pair of human eyes, it can't distinguish meaningful 3D information beyond a range of four meters. So, once again, Godzilla is out of luck.
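The range limit follows from the standard pinhole stereo relation, in which depth equals focal length times baseline divided by disparity. The sketch below uses assumed values (a 700-pixel focal length and a 6.5 cm baseline, roughly the spacing of human eyes) to show how much the depth estimate shifts when the measured disparity is off by a single pixel. The figures are illustrative and do not describe any specific device.

```python
import math


def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Pinhole stereo relation: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_step_m(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """How far the depth estimate moves if disparity is one pixel too small."""
    disparity_px = focal_px * baseline_m / depth_m
    if disparity_px <= 1.0:
        return math.inf  # less than one pixel of disparity left to measure
    return depth_from_disparity(focal_px, baseline_m, disparity_px - 1.0) - depth_m


FOCAL_PX = 700.0     # assumed focal length in pixels
BASELINE_M = 0.065   # assumed camera spacing, about eye width

for depth in (1.0, 4.0, 20.0):
    step = depth_step_m(FOCAL_PX, BASELINE_M, depth)
    print(f"at {depth:>4.1f} m, a one-pixel error shifts depth by {step:.2f} m")
```

With these assumptions, a one-pixel error costs a couple of centimeters at 1 m, tens of centimeters at 4 m, and many meters at 20 m, because the error grows roughly with the square of the distance.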

Structured Light 3D Sensor

Here a striped infrared light pattern is projected onto the real-world shapes of the environment and its contours reconstructed by calculating the way that the lines distort, from the point of view of an interpreting camera.

By observing how straight lines are distorted when projected onto objects, a structured light sensor (SLS) can compare the differences between the two viewpoints at its disposal and deduce the 3D geometry of the objects and the environment.

SLS is used in the forward-facing depth sensors of the iPhone X (the rear sensors use stereo cameras), the focus of many emerging AR technologies and players, and the primary hardware considered for Apple’s influential ARKit.
SLS is undermined by bright ambient light, and therefore not a logical solution to mapping AR scenes that take place outdoors. Like TOF (see above), SLS uses infrared bounce returns and, in a viable AR scenario, is unlikely to place its sensors much further distant from each other than human eyes are placed. This limits the technique to an effective range of—you guessed it: four meters.

Network Solutions?

The low latency and high data throughput of upcoming 5G networks are likely to prove tempting to AR developers who, in their hearts, might prefer to address these issues on more powerful base-station nodes, and turn the AR headset into a relatively dumb playback device tethered across the network.
But building such responsive network models into core AR frameworks seems likely to limit full-fledged AR systems to urban environments where 5G connectivity is widely available and affordable. In effect, it's yet another potential geographical anchor dragging against the dream of on-the-fly AR experiences.

Redundant Effort in AR Mapping

There is no real intelligence or persistence behind any of the popular current methods of generating 3D models for occlusion. When an AR system “mattes out” another person in your virtual meeting so that they can appear to be in front of a building that will not begin construction for another eight months, that person is just another “bag of pixels” to the occlusion system. If she walks out of view and back into view, the system has to start analysis on her all over again.

Source photo credit: ResearchGate

In terms of static objects, this redundancy of effort is even worse: no matter how recognizable an object might be, AR occlusion systems must currently create an occluding mesh from scratch, every time. There are no shortcuts, no templates, and the ponderous and ongoing nature of the analysis makes lag or poor-quality mattes almost inevitable. Either is fatal for an authentic augmented reality experience.

A “Google Maps” for Augmented Reality Geometry?

The best solution for enabling AR occlusion in public environments would be to download already-existing geometry that has been created and indexed by a tech giant. But generating a 3D “map of the world” at this level of intricacy and resolution is so daunting a task, even for the likes of Apple or Google, that it might need instead to be made practicable by crowdsourcing.
Current efforts in this direction are either limited to the walled gardens of individual tech ecosystems, such as Apple's ARKit and HoloLens' ability to save a user's mappings, or else represented by a myriad of startups apparently hoping for a Google buyout if Poly (see below) becomes a major player in public-facing AR. These include YouAR, Jido, Placenote, and Sturfee, among others.
As such a database grew, it would become more accurate, gradually learning to discard transient structures such as cars, construction equipment, chained bicycles, and seasonal Christmas decorations. Eventually, it would come to represent a consistent and persistent archive of the geometry and occlusion mapping of the area.
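One simple way such a database could separate permanent from transient geometry is to count how often each patch of space is seen occupied across independent scans. The sketch below is a hypothetical illustration of that idea only: the grid-cell representation, the function name `persistent_cells`, and the 80% threshold are all assumptions, not a description of any existing service.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int, int]  # index of an occupied grid cell in a shared map


def persistent_cells(sessions: Iterable[Set[Cell]], min_fraction: float = 0.8) -> Set[Cell]:
    """Keep cells that were occupied in at least min_fraction of the scans."""
    sessions = list(sessions)
    if not sessions:
        return set()
    counts = Counter(cell for session in sessions for cell in set(session))
    needed = min_fraction * len(sessions)
    return {cell for cell, seen in counts.items() if seen >= needed}


WALL, LAMPPOST, PARKED_CAR = (0, 0, 0), (3, 0, 1), (1, 0, 0)

# Five crowdsourced scans of the same street corner; the car is there only twice.
scans = [
    {WALL, LAMPPOST, PARKED_CAR},
    {WALL, LAMPPOST},
    {WALL, LAMPPOST, PARKED_CAR},
    {WALL, LAMPPOST},
    {WALL, LAMPPOST},
]

print(sorted(persistent_cells(scans)))
# [(0, 0, 0), (3, 0, 1)]
```

A real system would also have to align scans taken from different positions and handle partial views, which this sketch leaves out entirely.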
First, however, the information must be gathered at a resolution that’s acceptable for general consumption, at a speed which won’t discourage volunteer contributions, and at a quality that doesn’t need special equipment or elaborate methods. Machine learning seems set to provide the answer.

Assembling AR Mappings with Machine Learning

Mapping the complex contours of a car is currently a considerable challenge for state-of-the-art AR. The curves are complex, the surface is reflective, and parts of the object are transparent.
Machine learning, on the other hand, is already well able to recognize almost any model of car. Having identified a vehicle, an ML-driven environment recognition system could then download simplified, low-res geometry from a common database of 3D car meshes and position it exactly where the identified car is in real life, enabling occlusion for the vehicle, complete with transparency, at the cost of a few kilobytes of data and a few network seconds of scene-setting.
Such a system could also make use of the growing number of AR/VR meshes at Google Poly, which, it seems, may eventually become the Google Maps of AR, or else be folded into Maps as an AR-focused sub-service.
Likewise, ML-enabled AR scanning systems would be able to classify individuals in a scene and understand what (and perhaps who) they are, if they should disappear from view and then reappear again.
Further, local mobile neural networks could solve the four-meter scanning limitation by recognizing objects semantically. They would even be capable of distinguishing between objects that are impermanent (such as parked cars and bystanders), objects which are more likely to need occlusion (lamp-posts), and distant objects such as skyscrapers.
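To make the idea concrete, here is a speculative sketch of how a semantic label might stand in for a depth reading that the sensor cannot provide. Everything in it is hypothetical: the class labels, the rough distances assigned to them, and the function names are invented for illustration, and a real system would estimate distance far more carefully.

```python
from typing import Optional

SENSOR_RANGE_M = 4.0  # usable sensor range discussed in this article

# Hypothetical labels and rough distances, invented for illustration only.
CLASS_DEPTH_PRIOR_M = {
    "lamp_post": 15.0,
    "skyscraper": 500.0,
    "sky": float("inf"),
}


def effective_depth_m(measured_m: Optional[float], label: str) -> Optional[float]:
    """Use the sensor reading when it is in range, else fall back on the label."""
    if measured_m is not None and measured_m <= SENSOR_RANGE_M:
        return measured_m
    return CLASS_DEPTH_PRIOR_M.get(label)  # None if the label is unknown


def is_virtual_hidden(virtual_depth_m: float, measured_m: Optional[float], label: str) -> bool:
    """True if the real surface at this pixel should hide the virtual object."""
    real_m = effective_depth_m(measured_m, label)
    return real_m is not None and real_m < virtual_depth_m


# A virtual monster placed 600 m away, seen through pixels with no sensor depth.
print(is_virtual_hidden(600.0, None, "skyscraper"))  # True: the building hides it
print(is_virtual_hidden(600.0, None, "sky"))         # False: nothing in front
```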
With such a comprehensive understanding of scale and depth, currently unavailable to AR without specialist pre-mapping, it would finally be possible to create augmented reality experiences that use occlusion intelligently and without dimensional limitations.

Generating 3D models for Occlusion with Neural Networks

Not every item in a scene would have a corresponding low-poly mesh available from a network. Sometimes the device would need to recreate the geometry the hard way, as is currently the standard practice.
Luckily, creating 3D meshes from the depth maps of RGB-D images is one of the core pursuits in computer vision research. At present, a handful of companies are cautiously releasing ML-based systems for generating geometry with low-impact, on-device neural networks.
One such is Ubiquity6, which claims to use a mix of deep learning, 3D mapping, and photogrammetry to recreate an observed environment in an impressively fast 30 seconds.

However, the company has released no specific details of its methodology and seems to avoid the subject of occlusion.
Spun out of the Oxford Active Vision Lab, 6D.ai is less shy about the thorny subject of occlusion:

As with the vast majority of seminal AR occlusion systems in the headlines, evidence of object-hiding is generally of the “blink, and you’ll miss it” variety. In this 6D.ai Twitter video, a ball drops off a domestic surface:
A brief glimpse of the geometry (in green, below), suggests the rough edges of the standard polygon shape generated by a low-resolution depth-map layer:

Source photo credit: Twitter

In this video, the company demonstrates the occlusion potential of its system:

The occlusion mapping is very approximate, even in this showcase, and even where objects are simple cuboids.
This may demonstrate an inherent problem in AR systems aimed at mobile devices, which are likely to need to translate even the cleanest poly meshes into relatively pixelated RGB approximations with jagged edges that distort the occlusion. Though 6D.ai claims that its system uses no depth cameras, it could be coming up against this bottleneck in practice. Without details of the implementation, it is hard to tell.
The meshes that the 6D.ai system creates are actually more complex and resource-intensive than would be needed for occlusion, with a lot of redundant detail, such as the indentation in the cushions in the mapping of this sofa:

Source photo credit: artillry.co

Only a small fraction of the points generated (usually by RGB-D depth maps) in this way are needed to create useful occlusion. However, building such a frugal and lightweight mesh may require a more intelligent approach to geometry creation, and perhaps a deeper and more inventive application of local neural networks.
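A common, non-ML baseline for thinning out a dense point set is voxel-grid downsampling, which keeps one averaged point per small cube of space. The sketch below illustrates that general technique as a point of comparison. It is not 6D.ai's method, and the function name and the 10 cm cell size are assumptions.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def voxel_downsample(points: List[Point], voxel_size_m: float) -> List[Point]:
    """Replace all points inside each voxel with their average position."""
    if voxel_size_m <= 0:
        raise ValueError("voxel size must be positive")
    buckets: Dict[Tuple[int, int, int], List[Point]] = {}
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size_m),
            math.floor(y / voxel_size_m),
            math.floor(z / voxel_size_m),
        )
        buckets.setdefault(key, []).append((x, y, z))
    reduced = []
    for pts in buckets.values():
        n = len(pts)
        reduced.append((
            sum(p[0] for p in pts) / n,
            sum(p[1] for p in pts) / n,
            sum(p[2] for p in pts) / n,
        ))
    return reduced


# Three points on a sofa cushion within a few centimetres, plus one on the wall.
cloud = [(0.01, 0.01, 0.01), (0.02, 0.02, 0.02), (0.03, 0.03, 0.03), (1.5, 1.5, 1.5)]
print(len(voxel_downsample(cloud, voxel_size_m=0.1)))
# 2
```

Uniform thinning like this discards detail everywhere equally. The more intelligent approach the article calls for would instead keep detail along silhouette edges, where occlusion errors are most visible.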
We can also note from the above video (1:30) that the “live meshing,” which maps the real world in real time, never attempts to map anything further away than four meters, nor risks pointing the camera at anything beyond that distance. Here is where the machine learning-based AR mapping systems of the future seem set to lend a hand, by freeing the AR system from the three limited methods of geometry mapping mentioned earlier.

Machine Learning Solutions Go Mobile

In 2014, researchers from the University of Bonn proposed a system of “live” segmentation capable of generating occlusion, driven by convolutional neural networks (CNNs). The technique involves classifying every pixel in a video frame according to the depth map data of the RGB-D image.
This is one of many machine learning-based extraction and geometry-generating techniques that may eventually be transferable to mobile devices. All such approaches depend on the evolution of local machine learning hardware and software implementations on popular, portable devices.
Fortunately, there is a notable impetus in ML research towards this optimization and migration to local processing where possible—a movement driven by the demands of IoT and Big Data, and the enthusiasm of consumer hardware producers to leverage machine learning in their devices. Slimmed-down machine learning frameworks and workflows are being met halfway by the major manufacturers incorporating AI-oriented hardware into their product lines.

Useful Limitations

Solving occlusion still leaves some other issues to clear up, such as effective hand presence, matched lighting, shadows, and ghosted overlaid images. However, nothing affects the “reality” of AR more than the potential to integrate virtual worlds into our world.
It may be beneficial to the ultimate development of AR mapping systems that current resources are so scant, and the margins so critical, since these conditions are typically a spur to invention. The question is whether those constraints will produce the ingenious breakthroughs in time to fend off a feared AR winter.
The failure of the ponderous VR model to take flight in over three decades, combined with the (to date) unique phenomenon of Pokémon Go, indicates that AR must aim higher than tethered home video games, or special urban events that a majority of consumers live too far away from. It cannot afford to be an urban or exclusive technology, because the intense effort and infrastructure that will support it needs the same economy of scale as the smartphone sector itself.

