Many Challenges and Opportunities in Mobile AR

Tian Guo
8 min read · Jul 5, 2021


A review of the AR session at MobiSys 2021

What is MobiSys?

MobiSys is an ACM conference that focuses on Mobile Systems, Applications, and Services. It is considered a top conference for mobile computing, and this year it took place on Mars :)

Three brand new mobile AR papers

Today I am going to provide an overview of, and my thoughts on, the Augmented Reality session (Session II). The session includes three papers that each target an important problem for mobile AR.

  • The first paper, FollowUpAR, addresses the inability of existing 6-DoF pose tracking to handle moving physical objects.
  • The second paper, LensCap, presents fine-grained permission control for AR applications to improve visual privacy.
  • The third paper, Xihe, achieves real-time lighting estimation for photorealistic rendering of 3D objects. (Disclaimer: I am one of the authors!)

I am going to present them in the order of Xihe, FollowUpAR, and LensCap, based on my familiarity with the topics.

Xihe: A 3D Vision-based Lighting Estimation Framework for Mobile Augmented Reality

What is the key problem addressed by Xihe?

The diagram below shows three example AR scenes rendered with lighting information provided by ARKit. As you probably agree, none of the objects looks particularly realistic; rather, they all look quite dark even in a well-lit room. Well, it turns out that commercial AR platforms such as ARKit lack support for spatially variant lighting estimation. Xihe aims to bridge this gap by giving mobile AR applications the ability to obtain accurate omnidirectional lighting estimates in real time.

screenshot obtained from the original Xihe paper: ARKit’s visual effect.

Just to be extra clear, spatially variant lighting estimation refers to the ability to produce a lighting estimate at any given world position. Using spatially variant lighting information often results in more photorealistic rendering. For example, the diagram below shows the rendered effects of three Stanford bunnies.

rendered effects with spatially-variant lighting estimation.

How does Xihe solve the problem?

The key idea is to leverage RGB-D images (we used an iPad Pro with a built-in LiDAR sensor) and a co-designed deep learning model to extract accurate lighting information. The idea is intuitively simple, but the main challenge comes down to providing the lighting information in real time (our goal is to keep up with a 30 fps camera stream, i.e., roughly 33 ms per estimate). But why is it hard? It turns out that preprocessing, sending, and understanding point clouds can be time-consuming!

So what did we do? We developed a novel point cloud sampling technique that efficiently compresses the raw point cloud without impacting estimation accuracy. We refer to this technique as Unit Sphere Point Cloud (USPC) sampling.
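
To make the idea a bit more concrete, here is a minimal, illustrative sketch of unit-sphere-based sampling, not Xihe's actual implementation: raw points are matched to a fixed set of unit-sphere anchor directions, and one representative sample is kept per anchor, so downstream components always see a fixed-size input no matter how dense the raw cloud is. The Fibonacci-sphere anchor layout, the anchor count, and the helper names are my own simplifications for illustration.

```python
import numpy as np

def fibonacci_sphere(n_anchors: int) -> np.ndarray:
    """Roughly uniform unit-sphere directions (a stand-in anchor layout)."""
    i = np.arange(n_anchors)
    phi = np.pi * (3.0 - np.sqrt(5.0))           # golden angle
    y = 1.0 - 2.0 * (i + 0.5) / n_anchors        # y in (-1, 1)
    r = np.sqrt(1.0 - y * y)
    return np.stack([np.cos(phi * i) * r, y, np.sin(phi * i) * r], axis=1)

def uspc_sample(points_xyz, points_rgb, anchors):
    """For each anchor direction, keep the color of the nearest-in-angle point.
    Returns a fixed-size (n_anchors, 3) array regardless of raw cloud size."""
    dirs = points_xyz / (np.linalg.norm(points_xyz, axis=1, keepdims=True) + 1e-8)
    sim = dirs @ anchors.T            # cosine similarity: (n_points, n_anchors)
    best = sim.argmax(axis=0)         # closest point per anchor
    return points_rgb[best]

# toy usage: 100k raw points compressed to 1,280 anchor samples
anchors = fibonacci_sphere(1280)
xyz = np.random.randn(100_000, 3)
rgb = np.random.rand(100_000, 3)
compact = uspc_sample(xyz, rgb, anchors)
print(compact.shape)                  # (1280, 3)
```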

So how do we obtain the lighting estimate? We use a co-designed deep learning model called XiheNet, extended from our own prior work PointAR. Using deep learning allows us to effectively extrapolate lighting from partial observations of the environment, a common occurrence in real-world use. What else? We leverage edge resources to augment the mobile device's computation capability, which in turn motivates a USPC-specific encoding scheme to reduce network transfer cost.

screenshot obtained from the Xihe presentation: key components.
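
On the USPC-specific encoding mentioned above: one nice property of anchor-based sampling is that the anchor order can be fixed and shared between the phone and the edge server, so only the per-anchor values need to cross the network. The byte layout below is a hypothetical sketch to illustrate that idea, not the wire format Xihe actually uses.

```python
import numpy as np

def encode_uspc(samples: np.ndarray) -> bytes:
    # Because the anchor order is fixed and shared by phone and edge, only the
    # per-anchor values are sent, quantized to one byte per channel.
    q = np.clip(np.rint(samples * 255.0), 0, 255).astype(np.uint8)
    return q.tobytes()

def decode_uspc(payload: bytes, n_anchors: int = 1280, n_features: int = 3) -> np.ndarray:
    q = np.frombuffer(payload, dtype=np.uint8).reshape(n_anchors, n_features)
    return q.astype(np.float32) / 255.0

samples = np.random.rand(1280, 3).astype(np.float32)   # stand-in for USPC output
payload = encode_uspc(samples)
print(len(payload))        # 1,280 anchors x 3 channels = 3,840 bytes per request
restored = decode_uspc(payload)
```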

With a holistically designed framework, Xihe now provides more photorealistic rendering. See for yourself.

screenshot obtained from the original Xihe paper: Xihe’s visual effect.

How does Xihe handle dynamic lighting conditions, you might ask? We developed a lightweight triggering algorithm that effectively detects lighting changes. Doing so allows us to skip unnecessary lighting estimations while still providing visually coherent rendered scenes.
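
As a rough illustration of the idea (a simplified stand-in, not Xihe's actual triggering logic), one could keep the USPC samples used at the last estimation and only re-run the model when the observed scene drifts past a threshold:

```python
import numpy as np

class LightingChangeTrigger:
    """Simplified stand-in for a lighting-change trigger: re-run lighting
    estimation only when the observed scene colors drift enough from the
    samples used at the last estimation."""
    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self._reference = None            # USPC samples at the last estimation

    def should_estimate(self, uspc_samples: np.ndarray) -> bool:
        if self._reference is None:
            self._reference = uspc_samples
            return True                   # always estimate on the first frame
        drift = float(np.mean(np.abs(uspc_samples - self._reference)))
        if drift > self.threshold:
            self._reference = uspc_samples
            return True
        return False                      # lighting roughly unchanged: skip

trigger = LightingChangeTrigger(threshold=0.05)
for _ in range(5):
    frame_samples = np.random.rand(1280, 3)
    if trigger.should_estimate(frame_samples):
        pass  # call the lighting-estimation model here
```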

Testing the triggering algorithm turned out to be quite challenging: how should we obtain the lighting ground truth? What we ended up with was a built-in AR session recorder that supports record-and-replay, effectively enabling us to compare the triggering algorithm's decisions against always triggering under different lighting conditions. The table below suggests that user movement triggered the most lighting estimations; with our adaptive triggering algorithm, ~76% of requests can be skipped (quite a bit of energy saving)!

screenshot obtained from the Xihe presentation: dynamic lighting condition

Does that mean the problem of lighting estimation for mobile AR has been solved? Not quite. As you can see below, Xihe's performance is bounded by the deep learning models in use. Currently, Xihe works great for low-frequency lighting estimation (matte materials) but can be improved for high-frequency lighting estimation (reflective materials). We are excited to keep working on this important and challenging problem, so stay tuned!

screenshot obtained from the Xihe presentation: AR scenes rendered with Xihe with different materials.

If you are interested in Xihe, take a look at the GitHub repo and the teaser video below.

FollowUpAR: Enabling Follow-up Effects in Mobile AR Applications

What is the key problem addressed by FollowUpAR?

Take a look at the following diagram. The commercial platform ARCore cannot seamlessly overlay virtual objects (the fire or the measurement) relative to the physical objects (the dragon or the Rubik’s cube). The misaligned visual effects, as explained in the paper, can be attributed to the failure to track the physical object’s 6-DoF pose.

screenshot obtained from the original FollowUpAR paper: ARCore’s visual effect.

How does FollowUpAR solve the problem?

The key idea is simple: rather than relying solely on camera observations, FollowUpAR takes advantage of another information source to improve tracking precision. Specifically, FollowUpAR uses an mmWave radar to obtain sparse point clouds for spatial information. The diagram below summarizes the key components of FollowUpAR.

screenshot obtained from the original FollowUpAR paper: key components

Intuitive enough, right? Well, in theory, it is. But it turns out there are at least two challenges in realizing the key idea. The first is how to model the measurement errors of mmWave radar, and the second is how to fuse heterogeneous data with different spatial resolutions (i.e., visual feature points and radar point clouds). FollowUpAR tackles these two challenges with a physical-level theoretical model of the radar error distribution and a modified factor graph for fusing the heterogeneous data.
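
The paper's actual machinery is a factor graph over visual and radar observations. As a back-of-the-envelope illustration of why an explicit error model matters, here is a toy precision-weighted fusion of two noisy measurements of the same quantity (my simplification, not FollowUpAR's algorithm): the sensor with the smaller modeled variance gets the larger say.

```python
import numpy as np

def fuse_precision_weighted(est_vision, var_vision, est_radar, var_radar):
    """Toy fusion of two noisy estimates of the same quantity, each weighted by
    the inverse of its modeled error variance (the radar variance would come
    from an error model like the one FollowUpAR derives)."""
    w_v = 1.0 / var_vision
    w_r = 1.0 / var_radar
    fused = (w_v * est_vision + w_r * est_radar) / (w_v + w_r)
    fused_var = 1.0 / (w_v + w_r)
    return fused, fused_var

# example: per-axis object translation (meters) between two frames
vision_t = np.array([0.020, 0.001, -0.003]); vision_var = 0.004 ** 2
radar_t  = np.array([0.025, 0.000, -0.001]); radar_var  = 0.002 ** 2
fused_t, fused_var = fuse_precision_weighted(vision_t, vision_var, radar_t, radar_var)
print(fused_t, fused_var)
```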

So how well does FollowUpAR do? The virtual objects did successfully follow the physical objects’ movement! Pretty cool, huh? I do wonder how the problem and solution would differ if we used LiDAR instead of mmWave radar. Would using LiDAR simplify the software framework design or introduce new challenges? Maybe we will find out at next year’s MobiSys?

screenshot obtained from the original FollowUpAR paper: FollowUpAR’s visual effect.

More details are in the paper if you are interested in reproducing the work. Below is a teaser video of FollowUpAR for your enjoyment!

LensCap: Split-Process Framework for Fine-Grained Visual Privacy Control for Augmented Reality Apps

What is the key problem addressed by LensCap?

AR applications can derive sensitive visual information from collected camera frames and leak it, for example over the network. LensCap aims to prevent such leakage.

How does LensCap solve the problem?

Conceptually, LensCap splits an AR application into two processes: a visual process that handles interactions with the camera and a network process that handles network communication. The split-process design facilitates fine-grained permission control, in which LensCap gives end users the ability to “decide what forms of visual data can be transmitted to the network, while still allowing visual data to be used for AR purposes on device”. The diagram below shows the design of LensCap on Android.

screenshot obtained from the original LensCap paper: LensCap in Android.
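
LensCap itself is an Android framework, but the split-process idea is easy to illustrate in a few lines of Python. The sketch below is purely conceptual, with made-up data categories rather than LensCap's actual API: the visual process can derive whatever it needs for on-device AR, while anything crossing into the network process is filtered by a user-controlled policy.

```python
from multiprocessing import Process, Queue

# User-facing policy: which kinds of visually derived data may leave the device.
# The categories here are illustrative, not LensCap's actual taxonomy.
NETWORK_POLICY = {"pose_anchor": True, "raw_frame": False, "detected_text": False}

def visual_process(out_q: Queue):
    """Handles the camera; may derive anything it needs for on-device AR."""
    out_q.put(("pose_anchor", {"x": 0.1, "y": 0.0, "z": -0.4}))
    out_q.put(("raw_frame", b"\x00" * 16))    # pretend camera bytes
    out_q.put(None)                           # done

def network_process(in_q: Queue):
    """Only this process may touch the network; data is filtered at the boundary."""
    while (item := in_q.get()) is not None:
        kind, payload = item
        if NETWORK_POLICY.get(kind, False):
            print(f"send {kind} to server")   # a real app would upload here
        else:
            print(f"blocked {kind} at the process boundary")

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=visual_process, args=(q,))
    p2 = Process(target=network_process, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()
```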

Additionally, LensCap provides design suggestions for mobile AR developers for four different use cases, shown below.

screenshot obtained from the original LensCap paper: four template scenarios.

I don’t actively work in the privacy & security area, but I believe that privacy will be a growing concern as AR becomes ubiquitous! Addressing privacy issues from the outset, rather than waiting until the AR market matures, can be quite beneficial. While reading this paper, which mainly concerns visual privacy, I also came across a related paper, Unravelling Spatial Privacy Risks of Mobile Mixed Reality Data, that addresses spatial privacy, in case you are interested.

I am curious how the fine-grained permission model proposed by LensCap compares to the existing Android permission model. Android went through a major overhaul and introduced a new runtime permission model in Android M. As mobile AR (and deep learning) gains more traction, will we see another round of redesign of Android’s security and privacy models? Looking forward to the development!

More details are in the LensCap paper if you are interested in reproducing the work. Below is a teaser video of LensCap for your convenience!

Specialized hardware vs. Mobile AR

In this article, we are not talking about AR on specialized hardware like Microsoft HoloLens or Magic Leap, but AR on the devices you already carry around everywhere in your pocket! Specialized hardware has a bright future, but smartphones represent an unavoidable intermediate step. You may ask why.

In a nutshell, specialized hardware still costs a lot of money (~$3,500 for a HoloLens 2) and has (you guessed it…) very specialized features. Smartphones, on the other hand, are relatively cheap and provide a wide variety of features.

Putting cost aside, specialized hardware can be quite effective. With a little bit of hardware help, features that would otherwise require sophisticated software magic can often be provided trivially. (Sidetrack: getting help from hardware is rather common in computer systems; think of how OS designs leverage CPU protection rings to separate kernel and user space.) For example, HoloLens has two IR cameras for eye tracking, which makes it easier to infer user intention and, in turn, optimize content rendering.

Lastly, allow me to show you Google’s Project Starline, which leverages 3D imaging, real-time compression, and 3D display to deliver a near in-person communication experience. Not gonna lie, I was super excited about the future presented by Starline, but I am also skeptical about its future rollout.

Afterword

What’s next? Our lab will continue the journey of enabling more features for mobile AR platforms in resource-efficient ways. If you are interested in AR, consider following me on Twitter to get updates from both the commercial and research worlds!


Tian Guo

Professor of CS at WPI; passionate about mobile and cloud computing technology; https://tianguo.info