  • Publication: ACM SIGGRAPH Emerging Technologies 2023
  • Publication Date: August 6, 2023
  • Organizations mentioned: NVIDIA, UCSD
  • Publication Authors: Michael Stengel, Koki Nagano, Chao Liu, Matthew Chan, Alex Trevithick, Shalini De Mello, Jonghyun Kim, David Luebke
  • Technical background required: High
  • Estimated read time (original text): 4 minutes
  • Sentiment score: 72%, somewhat positive (100% being most positive)

TLDR

An NVIDIA research team has created an AI-powered video chat system that turns regular webcam footage into lifelike 3D models of people’s heads and faces. This makes video calls feel more like in-person conversations, because users can see each other in 3D using their home computers. The technology is cost-effective, runs in real time, and supports both realistic and stylized avatars as well as eye contact in group chats. NVIDIA has tested the system on displays for both individual and multiple viewers.

Methodology:

  • Proposed a neural network architecture called neural 3D lifting to encode 2D video into an implicit 3D neural representation for novel view synthesis.
  • Used generative neural network priors during 3D lifting for multi-view consistency and realism.
  • Enabled mutual eye contact through neural gaze correction to redirect eyes in 3D.
  • Tested system on consumer light field and tracked stereo 3D displays.
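
The geometric goal behind the gaze correction step above can be sketched in a few lines. This is only an illustration of the target that a gaze-redirection model would aim for, not the neural method used in the paper. The function name `gaze_correction_angle`, the coordinate convention, and the inputs (eye position, current gaze direction, and the position the eyes should look at) are all assumptions made for this example.

```python
import math


def _normalize(v):
    """Return v scaled to unit length; reject the zero vector."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [c / norm for c in v]


def gaze_correction_angle(eye_pos, gaze_dir, target_pos):
    """Angle in degrees between the current gaze direction and the
    direction from the eye to the target (hypothetical helper).

    A value of 0 means the person already appears to look at the target.
    """
    desired = _normalize([t - e for t, e in zip(target_pos, eye_pos)])
    current = _normalize(gaze_dir)
    # Clamp the dot product to guard against rounding just outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(current, desired))))
    return math.degrees(math.acos(dot))


# Looking straight ahead at a target straight ahead: no correction needed.
print(gaze_correction_angle([0, 0, 0], [0, 0, 1], [0, 0, 2]))
# Looking sideways while the target is straight ahead: a quarter turn.
print(gaze_correction_angle([0, 0, 0], [1, 0, 0], [0, 0, 1]))
```

In a real system the correction would presumably be applied by the network to the rendered eyes rather than computed as a single angle; the sketch only shows what "redirecting the eyes toward the viewer" means geometrically.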

Key Findings:

  • Neural 3D lifting approach allows 3D conferencing from only 2D video, increasing accessibility.
  • Can generate photorealistic or stylized avatars from 2D video.
  • Light field display provides room-scale 3D telepresence for multiple viewers.
  • Tracked stereo display with eye tracking enables mutual eye contact.
  • Overall, AI techniques reduce the need for expensive 3D capture equipment.

Recommendations:

  • Adopt neural 3D lifting and gaze correction for 3D avatar creation and video conferencing.
  • Integrate system with existing 2D conferencing apps for mainstream adoption.
  • Develop real-time optimizations and dedicated hardware for consumer devices.
  • Explore multi-user shared avatar spaces and creative stylization.
  • Conduct user studies to refine realism and measure the benefits over 2D calling.

Thinking Critically:

Implications:

  • Widespread adoption of AI-enabled 3D video conferencing could make remote collaboration feel much more natural and immersive.
  • 3D avatars could enable more personalized self-expression and emotional connection in virtual interactions.
  • Demand for consumer 3D displays and computing power may increase to support next-gen calling.

Alternative perspectives:

  • The quality of AI-generated 3D avatars may not yet match that of real 3D-scanned humans, or may not be realistic enough for widespread use.
  • Regular 2D video calling may remain dominant if 3D calling requires expensive new hardware or is seen as unnecessary.
  • Some may oppose hyper-realistic avatars due to concerns over misinformation or loss of authenticity.

AI predictions:

  • In 5 years, most major video chat apps will integrate AI-powered 3D avatar capability.
  • Lifelike metaverse interactions between photorealistic human avatars will become commonplace.
  • Specialized AI chips optimized for avatar synthesis will be developed to make 3D calling ubiquitous.

Glossary

  • Neural 3D lifting: A neural network approach to encode a 2D image into an implicit 3D neural representation that can be rendered from novel viewpoints.
  • Triplanar representation: The output of the neural 3D lifting network, which represents the 3D object as features stored on three orthogonal planes.
  • Novel view synthesis: Rendering the triplanar representation from arbitrary new camera angles to synthesize new views.
  • Generative priors: Leveraging generative neural networks as priors during 3D lifting to ensure multi-view consistency.
  • Neural gaze redirection: Using a neural network to detect gaze direction in an image and redirect it to simulate eye contact.
  • Neural talking head: A photorealistic avatar head generated by a neural network that can be driven to match facial expressions and speech.
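
To make the triplanar representation and novel view synthesis entries more concrete, here is a minimal sketch of how a feature could be looked up for a single 3D point. It assumes three small feature grids stored as nested Python lists, point coordinates normalized to [0, 1], and aggregation by summation; these choices, along with the names `bilinear` and `sample_triplane`, are illustrative assumptions and are not taken from the paper.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly sample a grid of feature vectors at (u, v) in [0, 1].

    plane[row][col] is a list of channel values; u indexes columns and
    v indexes rows.
    """
    h, w = len(plane), len(plane[0])
    x = min(max(u, 0.0), 1.0) * (w - 1)
    y = min(max(v, 0.0), 1.0) * (h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - fx) + plane[y0][x1][c] * fx
        bottom = plane[y1][x0][c] * (1 - fx) + plane[y1][x1][c] * fx
        out.append(top * (1 - fy) + bottom * fy)
    return out


def sample_triplane(planes, point):
    """Project a 3D point onto the XY, XZ, and YZ planes and sum the
    sampled features (summation is an assumed aggregation rule)."""
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Toy 2x2 planes with one channel each.
planes = {
    "xy": [[[0.0], [1.0]], [[0.0], [1.0]]],  # feature equals x
    "xz": [[[0.0], [0.0]], [[1.0], [1.0]]],  # feature equals z
    "yz": [[[2.0], [2.0]], [[2.0], [2.0]]],  # constant feature
}
print(sample_triplane(planes, (0.25, 0.5, 0.75)))
```

Rendering a new view would repeat this lookup for many points along each camera ray and pass the features to a decoder that predicts color and density; that decoder and the ray marching are omitted here.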
NVIDIA's AI-powered video chat system turns regular webcam footage into lifelike 3D models
