Abstract
Speech-driven 3D facial animation has made remarkable progress. However, synthesizing 3D talking faces with head motion remains an open problem, because head motion is largely independent of the speech content and is therefore difficult to model with a purely speech-driven approach. To address this, we propose 3D head-talk, which generates 3D facial animation combined with extreme head motion. The key challenge is to generate natural head movements that match the rhythm of the speech. We first build an end-to-end autoregressive model that combines a dual-tower encoder with a single-tower Transformer decoder: a speech encoder captures long-term audio context, a facial mesh encoder captures subtle changes in the vertices of the 3D face mesh, and the single-tower decoder autoregressively predicts the sequence of 3D facial animation meshes. Next, the predicted facial animation sequence is edited by a head-motion field generator to obtain an output sequence with extreme head motion. Finally, the resulting 3D facial animation under extreme head motion is presented together with the input audio. Quantitative and qualitative results show that our method outperforms current state-of-the-art methods and keeps the remaining regions stable while maintaining the appearance of extreme head motion.
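To make the pipeline described above concrete, the sketch below shows, in PyTorch, how a dual-tower encoder (speech and face mesh) can feed a single-tower Transformer decoder that autoregressively predicts the next mesh frame. This is a minimal illustration under assumed names and sizes (e.g., 5023 vertices as in FLAME/VOCA-style meshes, 768-dimensional wav2vec-style audio features); it is not the authors' implementation, and the head-motion field generator is not included.

import torch
import torch.nn as nn


class SpeechDrivenFaceAnimator(nn.Module):
    # Minimal sketch: dual-tower encoders (speech + face mesh) feeding a
    # single-tower Transformer decoder that predicts the next mesh frame.
    # All module names and sizes are illustrative assumptions.
    def __init__(self, audio_dim=768, vertex_dim=5023 * 3,
                 d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Tower 1: long-term audio context (e.g., wav2vec-style features).
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        # Tower 2: subtle per-vertex changes of the mesh frames so far.
        self.mesh_proj = nn.Linear(vertex_dim, d_model)
        self.mesh_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        # Single-tower decoder: cross-attends to audio, predicts next frame.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.out = nn.Linear(d_model, vertex_dim)

    def forward(self, audio_feats, past_meshes):
        # audio_feats: (B, T_audio, audio_dim); past_meshes: (B, T, vertex_dim)
        memory = self.audio_encoder(self.audio_proj(audio_feats))
        mesh_ctx = self.mesh_proj(past_meshes)
        causal = nn.Transformer.generate_square_subsequent_mask(
            mesh_ctx.size(1)).to(mesh_ctx.device)
        mesh_ctx = self.mesh_encoder(mesh_ctx, mask=causal)
        hidden = self.decoder(mesh_ctx, memory, tgt_mask=causal)
        return self.out(hidden[:, -1:])  # vertices of the next frame


# Autoregressive rollout from a neutral template mesh.
model = SpeechDrivenFaceAnimator()
audio = torch.randn(1, 100, 768)       # placeholder audio features
meshes = torch.zeros(1, 1, 5023 * 3)   # neutral starting frame
for _ in range(30):                    # generate 30 mesh frames
    meshes = torch.cat([meshes, model(audio, meshes)], dim=1)

In the full method, the rolled-out mesh sequence would then be passed to the motion field generator to inject head motion before being presented with the input audio.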







Data availability
Our data are primarily from https://voca.is.tue.mpg.de/, and we use the dataset for research purposes only.
Funding
Open access funding provided by Hunan International Economics University School-level Educational Reform Project [2022](64).
Author information
Contributions
All the authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by DY, RL, QY, YP, XH and JZ. The first draft of the manuscript was written by DY, and all the authors commented on previous versions of the manuscript. All the authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Ethical considerations
Our proposed method can synthesize facial animation for anyone given input 3D facial data and speech. It can be widely used in scenarios such as virtual reality and human-computer interaction, which also makes this forward-looking technology potentially open to misuse. We therefore want to raise users' awareness of the potential risks of misuse and encourage the public to report any suspicious videos to the relevant authorities.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, D., Li, R., Yang, Q. et al. 3D head-talk: speech synthesis 3D head movement face animation. Soft Comput 28, 363–379 (2024). https://doi.org/10.1007/s00500-023-09292-5