Abstract
Speech-driven 3D facial animation has made remarkable progress. However, synthesizing 3D talking faces with head motion remains an open problem, because head motion is largely independent of the speech content and is therefore difficult to model with a purely speech-driven approach. To address this, we propose 3D head-talk, which generates 3D facial animation combined with extreme head motion. The key challenge is to generate natural head movements that match the rhythm of the speech. We first build an end-to-end autoregressive model that combines a dual-tower encoder with a single-tower Transformer decoder: a speech encoder captures long-term audio context, a facial mesh encoder captures subtle changes in the vertices of the 3D face mesh, and the single-tower decoder autoregressively predicts the sequence of 3D facial animation meshes. Next, the predicted facial animation sequence is edited by a head-motion field generator to obtain an output sequence with extreme head motion. Finally, the resulting 3D facial animation under extreme head motion is presented together with the input audio. Quantitative and qualitative results show that our method outperforms current state-of-the-art methods and keeps the remaining regions stable while maintaining the appearance of extreme head motion.
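To make the pipeline described above concrete, the sketch below shows, in PyTorch, how a dual-tower encoder (speech and face mesh) can feed a single-tower Transformer decoder that autoregressively predicts the next mesh frame. This is a minimal illustration under assumed names and sizes (e.g., 5023 vertices as in FLAME/VOCA-style meshes, 768-dimensional wav2vec-style audio features); it is not the authors' implementation, and the head-motion field generator is not included.

import torch
import torch.nn as nn


class SpeechDrivenFaceAnimator(nn.Module):
    # Minimal sketch: dual-tower encoders (speech + face mesh) feeding a
    # single-tower Transformer decoder that predicts the next mesh frame.
    # All module names and sizes are illustrative assumptions.
    def __init__(self, audio_dim=768, vertex_dim=5023 * 3,
                 d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Tower 1: long-term audio context (e.g., wav2vec-style features).
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        # Tower 2: subtle per-vertex changes of the mesh frames so far.
        self.mesh_proj = nn.Linear(vertex_dim, d_model)
        self.mesh_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        # Single-tower decoder: cross-attends to audio, predicts next frame.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.out = nn.Linear(d_model, vertex_dim)

    def forward(self, audio_feats, past_meshes):
        # audio_feats: (B, T_audio, audio_dim); past_meshes: (B, T, vertex_dim)
        memory = self.audio_encoder(self.audio_proj(audio_feats))
        mesh_ctx = self.mesh_proj(past_meshes)
        causal = nn.Transformer.generate_square_subsequent_mask(
            mesh_ctx.size(1)).to(mesh_ctx.device)
        mesh_ctx = self.mesh_encoder(mesh_ctx, mask=causal)
        hidden = self.decoder(mesh_ctx, memory, tgt_mask=causal)
        return self.out(hidden[:, -1:])  # vertices of the next frame


# Autoregressive rollout from a neutral template mesh.
model = SpeechDrivenFaceAnimator()
audio = torch.randn(1, 100, 768)       # placeholder audio features
meshes = torch.zeros(1, 1, 5023 * 3)   # neutral starting frame
for _ in range(30):                    # generate 30 mesh frames
    meshes = torch.cat([meshes, model(audio, meshes)], dim=1)

In the full method, the rolled-out mesh sequence would then be passed to the motion field generator to inject head motion before being presented with the input audio.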







Data availability
Our data are primarily from https://voca.is.tue.mpg.de/, and we use the dataset for research purposes only.
Funding
Open access funding provided by Hunan International Economics University School-level Educational Reform Project [2022](64).
Author information
Contributions
All the authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by DY, RL, QY, YP, XH and JZ. The first draft of the manuscript was written by DY, and all the authors commented on previous versions of the manuscript. All the authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Ethical considerations
Our proposed method can synthesize facial animation for anyone given input 3D facial data and speech. It can be widely used in scenarios such as virtual reality and human-computer interaction, which also makes this forward-looking technology potentially open to misuse. We therefore want to raise users' awareness of the potential risks of misuse and encourage the public to report any suspicious videos to the relevant authorities.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, D., Li, R., Yang, Q. et al. 3D head-talk: speech synthesis 3D head movement face animation. Soft Comput 28, 363–379 (2024). https://doi.org/10.1007/s00500-023-09292-5