Social interactions rely on various nonverbal signals, such as facial expressions and body gestures, to convey emotions alongside speech. Generative models have demonstrated promising results in creating full-body nonverbal animations synchronized with speech; however, evaluations using statistical metrics in 2D settings fail to fully capture user-perceived emotions, limiting our understanding of these models' effectiveness. To address this, we evaluate emotional 3D animation generative models within an immersive Virtual Reality (VR) environment, emphasizing user-centric metrics (emotional arousal, realism, naturalness, enjoyment, diversity, and interaction quality) in a real-time human–agent interaction scenario. Through a user study (N=48), we systematically examine perceived emotional quality for three state-of-the-art speech-driven 3D animation methods across two emotions: happiness (high arousal) and neutral (mid arousal). We further compare these generative models against real human expressions obtained via a reconstruction-based method to assess their strengths and limitations and how closely they replicate real human facial and body expressions. Our results show that methods explicitly modeling emotions achieve higher emotion recognition accuracy than those focusing solely on speech-driven synchrony. Users rated the realism and naturalness of happy animations significantly higher than those of neutral animations, highlighting the limitations of current generative models in handling subtle emotional states. Generative models also underperformed the reconstruction-based method in facial expression quality, and all methods received relatively low ratings for animation enjoyment and interaction quality, underscoring the importance of incorporating user-centric evaluations into generative model development. Finally, participants perceived the animation diversity of all generative models positively.
We thank Peiyang Zheng and Julian Magnus Ley for their support with the technical setup of the user study. We also thank Tairan Yin for insightful discussions, proofreading, and valuable feedback. This project has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 860768 (CLIPE project).