VisemeNet: Audio-Driven Animator-Centric Speech Animation

ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2018)

Abstract

We present a novel deep-learning-based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face rig, directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles like mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion curve profiles. Our contribution is an automatic, real-time solution for lip synchronization from audio that integrates seamlessly into existing animation pipelines. We evaluate our results by cross-validation against ground-truth data, animator critique and edits, visual comparison to recent deep-learning lip-synchronization solutions, and by showing our approach to be resilient to diversity in speaker and language.
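For readers who want a concrete picture of the three-stage design described above, here is a minimal sketch of such a pipeline in TensorFlow/Keras. It is not the authors' released network: the layer sizes, feature dimensions, and output counts (N_PHONEME_GROUPS, N_LANDMARKS, N_VISEMES) are illustrative assumptions; see the official repository linked under Source Code & Data for the actual model.

# Hypothetical sketch of a three-stage LSTM pipeline in the spirit of the
# paper's description. All sizes below are illustrative assumptions.
import tensorflow as tf

T, N_AUDIO = 100, 65          # assumed: frames per window, audio feature size
N_PHONEME_GROUPS = 20         # assumed number of phonetic groups
N_LANDMARKS = 76              # assumed flattened 2D landmark coordinates
N_VISEMES = 29                # assumed number of viseme motion curves

audio = tf.keras.Input(shape=(T, N_AUDIO), name="audio_features")

# Stage 1: segment the audio into a stream of phonetic groups.
h1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(audio)
phoneme_groups = tf.keras.layers.Dense(
    N_PHONEME_GROUPS, activation="softmax", name="phoneme_groups")(h1)

# Stage 2: predict facial landmark motion, which carries speech style
# (e.g. mumbling vs. shouting).
h2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(audio)
landmarks = tf.keras.layers.Dense(N_LANDMARKS, name="landmarks")(h2)

# Stage 3: fuse audio, phonetic groups, and landmarks to regress
# animator-centric viseme motion curves.
fused = tf.keras.layers.Concatenate()([audio, phoneme_groups, landmarks])
h3 = tf.keras.layers.LSTM(256, return_sequences=True)(fused)
visemes = tf.keras.layers.Dense(
    N_VISEMES, activation="sigmoid", name="viseme_curves")(h3)

model = tf.keras.Model(audio, [phoneme_groups, landmarks, visemes])
model.summary()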


Paper

VisemeNet.pdf, 2.7MB

Citation

Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. "VisemeNet: Audio-Driven Animator-Centric Speech Animation." ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2018).

Video


Presentation

Slides: pptx

Source Code & Data

Github Code: https://github.com/yzhou359/VisemeNet_tensorflow

BIWI Dataset: http://www.vision.ee.ethz.ch/datasets/b3dac2.en.html

Viseme Annotation Dataset: HERE

Acknowledgements

We acknowledge support from NSERC and NSF (CHS-1422441, CHS-1617333, IIS-1617917). We thank Pif Edwards for his valuable help. Our experiments were performed on the UMass GPU cluster obtained under the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.