Abstract
We introduce Iterative Perceptual Learning (IPL), a novel approach to learning computational models for social behavior synthesis from corpora of human–human interactions. IPL combines perceptual evaluation with iterative model refinement. Human observers rate the appropriateness of synthesized behaviors in the context of a conversation, and these ratings are used to refine the machine learning models that predict the timing of social signals. As the ratings identify those moments in the conversation where the production of a specific behavior is inappropriate, we treat features extracted at these moments as negative samples for training a classifier. This is an advantage over the traditional corpus-based approach of extracting negative samples at random non-positive moments. We compare IPL with the traditional corpus-based approach on the timing of backchannels for a listener in speaker–listener dialogs. While both models perform similarly in terms of precision and recall, the backchannels generated with IPL tend to be rated as more appropriate. We additionally investigate how the amount and the variation of the available training data affect the resulting models.
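The contrast between the two negative-sampling strategies described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names, frame indices, and ratings are hypothetical, and the actual system trains an SVM on prosodic and visual features rather than operating on raw frame indices.

```python
import random

def sample_negatives_corpus(all_frames, positive_frames, n, seed=0):
    """Traditional corpus-based strategy: draw negative samples at
    random moments that carry no ground-truth backchannel."""
    rng = random.Random(seed)
    candidates = sorted(set(all_frames) - set(positive_frames))
    return rng.sample(candidates, min(n, len(candidates)))

def sample_negatives_ipl(rated_moments):
    """IPL strategy: moments where human observers rated a synthesized
    backchannel as inappropriate become negative samples directly."""
    return sorted(f for f, appropriate in rated_moments if not appropriate)

# Illustrative data: frame indices 0..99, ground-truth backchannels at a few.
frames = list(range(100))
positives = [10, 40, 75]
# Hypothetical perceptual ratings of synthesized backchannels:
# (frame, rated appropriate?)
ratings = [(12, False), (40, True), (58, False), (90, False)]

neg_corpus = sample_negatives_corpus(frames, positives, n=3)
neg_ipl = sample_negatives_ipl(ratings)  # frames rated inappropriate
```

In the iterative setting, the IPL negatives from one round are added to the training set and the classifier is retrained before the next round of perceptual evaluation.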
Additional information
This publication was supported by the Dutch national program COMMIT and the EU FP7 project SSPNet. We would like to thank the anonymous reviewers for their constructive feedback, which helped us to improve the paper.
Cite this article
de Kok, I., Poppe, R. & Heylen, D. Iterative perceptual learning for social behavior synthesis. J Multimodal User Interfaces 8, 231–241 (2014). https://doi.org/10.1007/s12193-013-0132-1