Iterative perceptual learning for social behavior synthesis

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

We introduce Iterative Perceptual Learning (IPL), a novel approach to learning computational models for social behavior synthesis from corpora of human–human interactions. IPL combines perceptual evaluation with iterative model refinement: human observers rate the appropriateness of synthesized behaviors in the context of a conversation, and these ratings are used to refine the machine learning models that predict the timing of social signals. Because the ratings identify those moments in the conversation at which the production of a specific behavior is inappropriate, we regard features extracted at these moments as negative samples for the training of a classifier. This is an advantage over the traditional corpus-based approach, which extracts negative samples at random non-positive moments. We compare IPL with the traditional corpus-based approach on the timing of backchannels for a listener in speaker–listener dialogs. While both models perform similarly in terms of precision and recall, the backchannels generated with IPL tend to be rated as more appropriate. We additionally investigate how the amount and variation of the available training data affect the outcome of the models.
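The core contrast described above, sampling negatives at observer-rated inappropriate moments rather than at random non-positive moments, can be sketched as follows. This is a minimal illustration with synthetic stand-in data: the feature matrix, frame counts, and the set of observer-rated frames are all hypothetical, and an RBF-kernel SVM stands in for the paper's LIBSVM classifier.

```python
# Sketch of the IPL negative-sampling idea vs. the traditional
# corpus-based approach (all data here is synthetic/hypothetical).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for per-frame prosodic/visual features of a dialog.
n_frames, n_features = 1000, 8
features = rng.normal(size=(n_frames, n_features))

# Frames where the listener actually produced a backchannel (positives).
positive_frames = rng.choice(n_frames, size=50, replace=False)
non_positive = np.setdiff1d(np.arange(n_frames), positive_frames)

# Traditional approach: negatives drawn at random non-positive moments.
random_negatives = rng.choice(non_positive, size=50, replace=False)

# IPL: negatives are frames where a synthesized backchannel was rated
# inappropriate by human observers (faked here for illustration).
rated_inappropriate = rng.choice(non_positive, size=50, replace=False)

def train(negative_frames):
    """Fit a binary classifier on positive vs. negative frames."""
    X = np.vstack([features[positive_frames], features[negative_frames]])
    y = np.concatenate([np.ones(50), np.zeros(50)])
    return SVC(kernel="rbf").fit(X, y)

corpus_model = train(random_negatives)   # baseline
ipl_model = train(rated_inappropriate)   # refined with perceptual ratings
```

In the iterative setting, the IPL model's predictions would be synthesized, rated again, and the newly rated inappropriate moments folded back into the negative set for the next training round.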



Author information

Corresponding author

Correspondence to Ronald Poppe.

Additional information

This publication was supported by the Dutch national program COMMIT and the EU FP7 project SSPNet. We would like to thank the anonymous reviewers for their constructive feedback, which helped us to improve the paper.


About this article

Cite this article

de Kok, I., Poppe, R. & Heylen, D. Iterative perceptual learning for social behavior synthesis. J Multimodal User Interfaces 8, 231–241 (2014). https://doi.org/10.1007/s12193-013-0132-1
