InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds

*Equal Contribution
¹ETH Zürich, ²Max Planck Institute for Intelligent Systems, Tübingen

InstantAvatar reconstructs animatable, high-fidelity human avatars from monocular video within 60 seconds, given poses and masks, and can animate and render the model at an interactive rate.

Abstract

In this paper, we take a significant step towards real-world applicability of monocular neural avatar reconstruction by contributing InstantAvatar, a system that can reconstruct human avatars from a monocular video within seconds and can animate and render these avatars at an interactive rate. To achieve this efficiency, we propose a carefully designed and engineered system that leverages emerging acceleration structures for neural fields in combination with an efficient empty-space skipping strategy for dynamic scenes. We also contribute an efficient implementation that we will make available for research purposes. Compared to existing methods, InstantAvatar converges 130x faster and can be trained in minutes instead of hours. It achieves comparable or even better reconstruction quality and novel pose synthesis results. When given the same time budget, our method significantly outperforms state-of-the-art (SoTA) methods. InstantAvatar can yield acceptable visual quality in as little as 10 seconds of training time.

Video

Method

For each frame, we sample points along the camera rays in posed space. We then transform these points into a normalized space in which the global orientation and translation of the person are removed. In this normalized space, we filter out points that lie in empty space using our occupancy grid. The remaining points are deformed to canonical space by an articulation module and then fed into the canonical neural radiance field to evaluate color and density.
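The per-frame query described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: every function name, the uniform sampling scheme, the grid resolution, and the bounding cube are hypothetical, and `deform_to_canonical` and `canonical_field` are toy stand-ins for the articulation module and the canonical radiance field, whose details are not given on this page.

```python
import numpy as np

def sample_points_along_rays(origins, directions, near, far, n_samples):
    """Uniformly sample points along each ray in posed space -> (R, S, 3)."""
    t = np.linspace(near, far, n_samples)
    return origins[:, None, :] + t[None, :, None] * directions[:, None, :]

def to_normalized_space(points, global_rot, global_trans):
    """Remove global orientation R and translation t: x_n = R^T (x - t)."""
    # For row vectors, (x - t) @ R equals (R^T (x - t))^T.
    return (points - global_trans) @ global_rot

def occupancy_mask(points, grid, bound):
    """True where a point falls in an occupied cell of a boolean grid
    covering the cube [-bound, bound]^3 (assumed layout)."""
    res = grid.shape[0]
    idx = np.floor((points + bound) / (2.0 * bound) * res).astype(int)
    inside = np.all((idx >= 0) & (idx < res), axis=-1)
    idx = np.clip(idx, 0, res - 1)  # safe lookup; 'inside' masks the rest
    return inside & grid[idx[..., 0], idx[..., 1], idx[..., 2]]

def deform_to_canonical(points):
    """Placeholder for the articulation module (identity here)."""
    return points

def canonical_field(points):
    """Toy stand-in for the canonical radiance field: returns (rgb, sigma)."""
    rgb = 0.5 + 0.5 * np.tanh(points)
    sigma = np.exp(-np.sum(points ** 2, axis=-1))
    return rgb, sigma

def query_frame(origins, directions, global_rot, global_trans, grid, bound,
                near=0.5, far=3.5, n_samples=32):
    pts = sample_points_along_rays(origins, directions, near, far, n_samples)
    pts_n = to_normalized_space(pts, global_rot, global_trans)
    keep = occupancy_mask(pts_n, grid, bound)   # empty-space skipping
    rgb = np.zeros(pts.shape)
    sigma = np.zeros(pts.shape[:-1])
    if keep.any():
        # Only surviving points reach the (expensive) deformation and field.
        x_c = deform_to_canonical(pts_n[keep])
        rgb[keep], sigma[keep] = canonical_field(x_c)
    return rgb, sigma, keep

# Hypothetical scene: person 2 units in front of the camera, rotated 90
# degrees about the y-axis, with the central block of an 8^3 grid occupied.
grid = np.zeros((8, 8, 8), dtype=bool)
grid[2:6, 2:6, 2:6] = True
global_rot = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
global_trans = np.array([0.0, 0.0, 2.0])
origins = np.zeros((1, 3))
directions = np.array([[0.0, 0.0, 1.0]])
rgb, sigma, keep = query_frame(origins, directions, global_rot,
                               global_trans, grid, bound=1.0)
```

In this sketch, the occupancy lookup is a cheap integer index into a boolean array, so samples in empty space are discarded before the costlier deformation and field evaluation. Because the grid lives in the normalized space, it does not need to follow the person's global motion from frame to frame.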

Result

Training progression

Novel View

Novel Animation

BibTeX

@article{jiang2022instantavatar,
  author    = {Jiang, Tianjian and Chen, Xu and Song, Jie and Hilliges, Otmar},
  title     = {InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds},
  journal   = {arXiv},
  year      = {2022},
}