Abstract
Head and eyebrow movements are an important means of communication, and they are highly synchronized with speech prosody. Endowing a virtual agent with synchronized verbal and nonverbal behavior enhances its communicative performance. In this paper, we propose an animation model for virtual agents based on a statistical model linking speech prosody and facial movement. A fully parameterized Hidden Markov Model is first proposed to capture the tight relationship between speech and the facial movements of a human face, extracted from a video corpus, and then to automatically drive a virtual agent's behaviors from speech signals. The correlation between head and eyebrow movements is also taken into account while building the model. Subjective and objective evaluations were conducted to validate this model.
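The core idea of such speech-driven synthesis can be illustrated with a toy sketch: each HMM state holds a joint Gaussian over concatenated prosody and motion features, and at synthesis time each frame's prosody is scored against the prosody marginal of every state, with the best-matching state's mean motion emitted. This is a deliberately simplified, hypothetical illustration (frame-wise maximum likelihood with hand-set parameters), not the paper's fully parameterized model, which would be trained on an audio-visual corpus and account for state transitions:

```python
import numpy as np

# Hypothetical toy sketch (not the authors' full model): each state holds a
# joint Gaussian over [prosody; motion] features with diagonal covariance.
rng = np.random.default_rng(0)

N_STATES = 3      # number of HMM states (illustrative)
DIM_PROSODY = 2   # e.g. F0 and energy per frame (assumed features)
DIM_MOTION = 3    # e.g. head pitch/yaw and eyebrow raise (assumed features)

# Illustrative state parameters; a real system would estimate these with
# EM (Baum-Welch) from synchronized speech and motion-capture data.
means = rng.normal(size=(N_STATES, DIM_PROSODY + DIM_MOTION))
var = np.ones((N_STATES, DIM_PROSODY + DIM_MOTION))

def synthesize_motion(prosody):
    """Map a (T, DIM_PROSODY) prosody track to (T, DIM_MOTION) motion."""
    # Log-likelihood of each frame under each state's prosody marginal
    # (constant terms dropped, since only the argmax matters).
    diff = prosody[:, None, :] - means[None, :, :DIM_PROSODY]
    loglik = -0.5 * np.sum(diff**2 / var[None, :, :DIM_PROSODY], axis=2)
    states = np.argmax(loglik, axis=1)   # frame-wise best state
    return means[states, DIM_PROSODY:]   # emit each state's mean motion

prosody_track = rng.normal(size=(10, DIM_PROSODY))
motion = synthesize_motion(prosody_track)
print(motion.shape)  # (10, 3): one motion vector per input frame
```

A real trajectory model would additionally smooth the output across frames (e.g. via dynamic features, as in HMM-based speech synthesis) rather than emitting per-frame state means.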
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Ding, Y., Pelachaud, C., Artières, T. (2013). Modeling Multimodal Behaviors from Speech Prosody. In: Aylett, R., Krenn, B., Pelachaud, C., Shimodaira, H. (eds) Intelligent Virtual Agents. IVA 2013. Lecture Notes in Computer Science(), vol 8108. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40415-3_19
Print ISBN: 978-3-642-40414-6
Online ISBN: 978-3-642-40415-3