3D head-talk: speech synthesis 3D head movement face animation

Soft Computing · Data analytics and machine learning

Abstract

Speech-driven 3D facial animation has made remarkable progress. However, synthesizing 3D talking heads with head motion remains an open problem, because head motion is a largely speech-independent component of appearance and is therefore difficult to model with a speech-driven approach. To address this, we propose 3D head-talk, which generates 3D facial animations combined with extreme head motion. A key challenge in this work is generating natural head movements that match the rhythm of the speech. We first build an end-to-end autoregressive model that combines a dual-tower and a single-tower Transformer: a speech encoder encodes the long-term audio context, a facial mesh encoder encodes subtle changes in the vertices of the 3D facial mesh, and a single-tower decoder autoregressively predicts a sequence of 3D facial animation meshes. Next, the predicted 3D facial animation sequence is edited by a motion field generator that injects head motion, yielding an output sequence containing extreme head motion. Finally, natural 3D facial animation under extreme head motion is rendered together with the input audio. Quantitative and qualitative results show that our method outperforms current state-of-the-art methods and stabilizes the non-facial regions while preserving the appearance of extreme head motion.
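To make the pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of an autoregressive speech-to-mesh Transformer of the kind the abstract outlines. It is not the authors' implementation: every module name, dimension, and hyperparameter is an assumption for illustration (the 5023-vertex figure matches the FLAME topology used by the VOCA data cited under Data availability), and the head-motion field generator is omitted.

    # Minimal sketch, NOT the authors' code: all names, sizes, and
    # hyperparameters below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SpeechEncoder(nn.Module):
        """Encodes long-term audio context into per-frame features."""
        def __init__(self, audio_dim=80, d_model=256, n_layers=4):
            super().__init__()
            self.proj = nn.Linear(audio_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, audio):                  # audio: (B, T, audio_dim)
            return self.encoder(self.proj(audio))  # (B, T, d_model)

    class MeshDecoder(nn.Module):
        """Single-tower decoder: autoregressively predicts the next 3D face
        mesh from past meshes, cross-attending to the audio features."""
        def __init__(self, n_vertices=5023, d_model=256, n_layers=4):
            super().__init__()
            self.embed = nn.Linear(n_vertices * 3, d_model)
            layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.head = nn.Linear(d_model, n_vertices * 3)

        def forward(self, past_meshes, audio_memory):
            # past_meshes: (B, T, n_vertices*3) flattened vertex coordinates
            x = self.embed(past_meshes)
            # Causal mask so each frame attends only to earlier frames
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
            h = self.decoder(x, audio_memory, tgt_mask=mask)
            return self.head(h)                    # next-frame vertex positions

At inference time, the predicted mesh for each frame would be fed back as decoder input for the next frame, and the resulting mesh sequence would then be passed to a separate head-motion field generator, which the abstract describes but this sketch does not attempt to reproduce.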


Data availability

Our data are primarily from https://voca.is.tue.mpg.de/, and we use the dataset for research purposes only.


Funding

Open access funding provided by Hunan International Economics University School-level Educational Reform Project [2022](64).

Author information


Contributions

All the authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by DY, RL, QY, YP, XH and JZ. The first draft of the manuscript was written by DY, and all the authors commented on previous versions of the manuscript. All the authors read and approved the final manuscript.

Corresponding author

Correspondence to Daowu Yang.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Ethical considerations

Our proposed method can synthesize facial animation for anyone from input 3D facial data and speech. It can be widely used in scenarios such as virtual reality and human–computer interaction, but this also leaves this forward-looking technology open to potential misuse. We therefore want to raise users' awareness of the risks of misuse and encourage the public to report any suspicious videos to the relevant authorities.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, D., Li, R., Yang, Q. et al. 3D head-talk: speech synthesis 3D head movement face animation. Soft Comput 28, 363–379 (2024). https://doi.org/10.1007/s00500-023-09292-5

