A Realistic, Multimodal Virtual Agent for the Healthcare Domain

Guido M. Linders, Julija Vaitonytė, Kiril O. Mitev, Maryam Alimardani, and Max M. Louwerse
Dept. of Cognitive Science & Artificial Intelligence, Tilburg University, Tilburg, The Netherlands
g.m.linders@tilburguniversity.edu, j.vaitonyte@tilburguniversity.edu, k.o.mitev@tilburguniversity.edu, m.alimardani@tilburguniversity.edu, m.m.louwerse@tilburguniversity.edu

ABSTRACT
We introduce an interactive embodied conversational agent for deployment in the healthcare sector. The agent is operated by a software architecture that integrates speech recognition, dialog management, and speech synthesis, and is embodied by a virtual human face developed using photogrammetry techniques. Together, these features allow for real-time, face-to-face interactions with human users. Although the developed software architecture is domain-independent and highly customizable, the virtual agent will initially be applied to the healthcare domain. Here we give an overview of the different components of the architecture.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; Rendering; • Computer systems organization → Architectures.

KEYWORDS
embodied conversational agents, multimodal communication, healthcare domain

ACM Reference Format:
Guido M. Linders, Julija Vaitonytė, Maryam Alimardani, Kiril O. Mitev, and Max M. Louwerse. 2022. A Realistic, Multimodal Virtual Agent for the Healthcare Domain. In ACM International Conference on Intelligent Virtual Agents (IVA ’22), September 6–9, 2022, Faro, Portugal. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3514197.3551250

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
IVA ’22, September 6–9, 2022, Faro, Portugal
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9248-8/22/09.
https://doi.org/10.1145/3514197.3551250

1 INTRODUCTION
Humans communicate using a large array of verbal and non-verbal cues [1, 6]. These multimodal cues, i.e., speech, eye gaze, head nods, and facial expressions, are equally valuable for virtual agents used in human-agent interactions [13]. The ability of a virtual agent to communicate multimodally is especially beneficial in domains such as healthcare [12] and education [11], which draw heavily on real-time, face-to-face encounters. In addition, humans prefer more realistic artificial entities over less realistic ones [15], highlighting the need for a realistic appearance of virtual agents [14]. Until recently, creating virtual agents that can exhibit appropriate multimodal cues in a real-time interaction, supported by a realistic appearance, has posed significant challenges for developers and researchers.

Figure 1: The face of the agent created using photogrammetry techniques.
These challenges include dialog management, the coupling of verbal and non-verbal cues, and a realistic appearance of the agent. The current project aims to make progress on these aspects. It is an interdisciplinary collaboration between several academic institutions, industry partners, and hospitals, with the goal of creating a virtual agent that is human-like in its appearance and behavior. Such an agent is meant to hold a face-to-face conversation with a user. The agent will first be deployed in the healthcare domain. More specifically, the agent will give healthcare advice and information to users on bariatric surgery, i.e., surgery that facilitates weight loss. Patients will be able to ask the agent questions they might have related to this surgery. While the agent will take over some of the communication between doctors and patients, it is not meant to replace a doctor, but rather to be accessible and offer relevant information to patients at any time.

Creating the agent consisted of, first, constructing a photogrammetry pipeline, such that real faces could be scanned and high-fidelity 3D face models could be created. Currently, five different faces have been created using this pipeline, differing in age, sex, skin color, and facial morphology. Figure 1 shows one of the faces. The movement of the face is controlled through a set of 42 anatomically based facial descriptors, so-called Action Units, based on a custom, extended version of the Facial Action Coding System (FACS) [5]. A set of 13 physiological parameters, e.g., blushing, paleness, and perspiration, has also been implemented. Second, Rasa, an open-source neural network framework for natural language understanding and dialog management, has been used for verbal interaction [3]. Third, the software architecture has been designed to allow for easy integration of additional communication channels in different modalities. Additionally, using translation technology, the agent is able to converse in almost any language.

Below we detail the components that comprise the virtual agent architecture and explain how the agent realizes multimodal communication. We also give an outlook on the future of the agent, along with a brief discussion of some of the design choices and challenges.

2 OVERVIEW OF COMPONENTS
The architecture behind the virtual agent consists of four core components: (1) the input channels, which currently consist only of the Google speech recognition system but allow for additional channels and modalities, (2) a dialog management system implemented using Rasa, (3) a behavior generation system that uses the extended FACS to control the agent’s face and Microsoft Azure to generate speech and visemes (the movements of the mouth and lips), with the 3D face model rendered in Unreal Engine, and (4) a messaging system, implemented using RabbitMQ, that handles the communication between the different components. Figure 2 presents an overview of the architecture.

Figure 2: Overview of the agent architecture.

First, the user’s speech is detected and transcribed. The end of an incoming message is detected by a silence threshold of roughly one second. Optionally, Google Translate can be used to translate the captured message into English when the message is in a language other than English.
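As a minimal sketch of how a transcribed (and optionally translated) message might be handed to the messaging system, the snippet below publishes it to a RabbitMQ queue via the Python client pika; the broker host, queue name, and message format are illustrative assumptions rather than the project’s actual configuration.

    import json

    import pika

    # Connect to the RabbitMQ broker (host and queue name are placeholders).
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="dialog_manager_in")

    def forward_transcript(transcript: str, language: str = "en") -> None:
        """Publish a transcribed (and, if needed, translated) user utterance
        to the queue read by the dialog manager."""
        message = {"text": transcript, "language": language}
        channel.basic_publish(
            exchange="",  # the default exchange routes directly by queue name
            routing_key="dialog_manager_in",
            body=json.dumps(message),
        )

    forward_transcript("What happens after bariatric surgery?")

In the architecture, the same broker also relays the dialog manager’s response onward to the behavior generation system.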
The transcribed (and, if necessary, translated) speech of the user is then sent to the messaging system, which forwards it to the dialog manager. The dialog manager classifies each message into a so-called intent, which captures the information that the user tried to communicate. Based on the intent classification and the dialog history, the dialog manager selects an English text response. The response is then sent to the behavior generation system via the messaging system. If necessary, the behavior generation system translates the English response back into the original language. It then generates and aligns in time both speech (audio) and the visemes that are produced on the agent’s face (a sketch of this step is given at the end of Section 4).

The virtual agent is currently distributed through a virtual machine that is accessible via a web browser, removing the need for a complicated installation process and for a powerful on-site computer. This approach is not optimal, however, since only a single person can access the virtual machine at a time.

3 MULTIMODAL COMMUNICATION
For a human-like agent, the integration of non-verbal cues with speech is crucial. Currently, two main types of non-verbal behavior are generated by the behavior generation system: the mouth and lip movements that accompany the audio signal, and idle movements, which include slight head and eye movements. Idle movements are displayed continuously and help make the agent appear alive [4].

Regarding dialog management, we have constructed a database of question-answer pairs on the topic of bariatrics, together with other utterances, as training material for the dialog manager. Because the dialog manager requires multiple example input utterances for each utterance-response pair, we have been investigating ways to partially automate the generation of these examples. Currently, a single response is generated for each incoming user message.

4 OUTLOOK AND FUTURE WORK
We have developed a realistic-looking communicative agent that is being gradually improved. Its general conversational and turn-taking abilities are, however, not yet sophisticated, due to a rigid one-speaker-at-a-time turn-taking model that does not allow for interruptions and an end-of-turn detection model that is based on a silence threshold. Besides improving these models, we are investigating the statistical tendencies of the verbal and non-verbal communicative units that make a conversation human-like [7, 9], as well as ways to incorporate speaker intentions using dialog act classification [8]. Finally, we are investigating how humans perceive photorealistic agents more generally [14].

The integration of verbal and non-verbal channels is also non-trivial. Previous research has examined this integration using cross-recurrence analysis techniques, both at the inter-personal and intra-personal level [2, 10]. Such findings can guide the coupling of verbal and non-verbal behaviors. Investigating the coupling of facial expressions, eye movements, and speaker intentions, i.e., dialog acts, using cross-recurrence analysis is part of our future directions toward a naturally behaving virtual agent.
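As referenced in Section 2, the following is a minimal sketch of the speech-and-viseme generation step, assuming Microsoft Azure’s Python Speech SDK (azure-cognitiveservices-speech); the subscription key, region, voice name, and example sentence are placeholders, and the mapping from viseme identifiers to the face model’s Action Units is left open.

    import azure.cognitiveservices.speech as speechsdk

    # Placeholder credentials and voice; replace with actual Azure settings.
    speech_config = speechsdk.SpeechConfig(subscription="<key>", region="<region>")
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    # audio_config=None keeps the synthesized audio in the result object
    # instead of playing it on the local sound device.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    visemes = []  # (offset in milliseconds, viseme id) pairs

    def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs) -> None:
        # audio_offset is reported in 100-nanosecond ticks; convert to milliseconds.
        visemes.append((evt.audio_offset / 10_000, evt.viseme_id))

    synthesizer.viseme_received.connect(on_viseme)
    result = synthesizer.speak_text_async("Bariatric surgery facilitates weight loss.").get()
    audio_bytes = result.audio_data  # audio to play back in sync with the viseme timeline

Each viseme event carries a time offset and a viseme identifier, which the behavior generation system can map onto the corresponding mouth and lip movements and schedule against the audio playback.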
ACKNOWLEDGMENTS
This research has been funded by grant no. PROJ-007246 from the Operational Program South, the European Union, the Ministry of Economic Affairs, the Province of Noord-Brabant, and the municipality of Tilburg, awarded to MML. Tilburg University (TiU), BlueTea, Noldus, Breda University of Applied Sciences (BUas), Royal Netherlands Aerospace Centre (NLR), and Fontys Applied University Eindhoven created the agent architecture. BUas created the photogrammetry pipeline and the 3D face models, and integrated Microsoft Azure into the architecture. Spaarne Hospital, together with TiU, NLR, and Noldus, created the bariatrics training data. TiU implemented dialog management. BlueTea and TiU implemented the virtual machine solution.

REFERENCES
[1] Janet B. Bavelas and Nicole Chovil. 2006. Nonverbal and Verbal Communication: Hand Gestures and Facial Displays as Part of Language Use in Face-to-face Dialogue. In The Sage Handbook of Nonverbal Communication, V. Manusov and M. L. Patterson (Eds.). Sage Publications, Inc, 97–115.
[2] Pieter A. Blomsma, Guido M. Linders, Julija Vaitonyte, and Max M. Louwerse. 2020. Intrapersonal dependencies in multimodal behavior. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–8. https://doi.org/10.1145/3383652.3423872
[3] Tom Bocklisch, Joey Faulkner, Nick Pawlowski, and Alan Nichol. 2017. Rasa: Open source language understanding and dialogue management. In 31st Conference on Neural Information Processing Systems (NIPS 2017). Long Beach, CA, USA. https://doi.org/10.48550/arXiv.1712.05181
[4] Raymond H. Cuijpers and Marco A. M. H. Knops. 2015. Motions of robots matter! The social effects of idle and meaningful motions. In International Conference on Social Robotics. Springer, 174–183. https://doi.org/10.1007/978-3-319-25554-5_18
[5] Paul Ekman and Wallace V. Friesen. 1978. Facial action coding system: Investigator’s guide. Consulting Psychologists Press.
[6] Rachael E. Jack and Philippe G. Schyns. 2015. The human face as a dynamic tool for social communication. Current Biology 25, 14 (2015), R621–R634. https://doi.org/10.1016/j.cub.2015.05.052
[7] Guido M. Linders and Max M. Louwerse. 2020. Zipf’s Law in Human-Machine Dialog. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–8. https://doi.org/10.1145/3383652.3423878
[8] Guido M. Linders and Max M. Louwerse. 2022. Surface and contextual linguistic cues in dialog act classification. (2022). Manuscript submitted for review.
[9] Guido M. Linders and Max M. Louwerse. 2022. Zipf’s law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort. Psychonomic Bulletin & Review (2022). https://doi.org/10.3758/s13423-022-02142-9
[10] Max M. Louwerse, Rick Dale, Ellen G. Bard, and Patrick Jeuniaux. 2012. Behavior matching in multimodal communication is synchronized. Cognitive Science 36, 8 (2012), 1404–1426. https://doi.org/10.1111/j.1551-6709.2012.01269.x
[11] Manuela Macedonia, Iris Groher, and Friedrich Roithmayr. 2014. Intelligent virtual agents as language trainers facilitate multilingualism. Frontiers in Psychology 5 (2014), 295. https://doi.org/10.3389/fpsyg.2014.00295
[12] Joao Luis Zeni Montenegro, Cristiano André da Costa, and Rodrigo da Rosa Righi. 2019. Survey of conversational agents in health. Expert Systems with Applications 129 (2019), 56–67.
https://doi.org/10.1016/j.eswa.2019.03.054
[13] Julija Vaitonyte, Pieter A. Blomsma, Maryam Alimardani, and Max M. Louwerse. 2019. Generating facial expression data: Computational and experimental evidence. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents. 94–96. https://doi.org/10.1145/3308532.3329443
[14] Julija Vaitonytė, Pieter A. Blomsma, Maryam Alimardani, and Max M. Louwerse. 2021. Realism of the face lies in skin and eyes: Evidence from virtual and human agents. Computers in Human Behavior Reports 3 (2021), 100065. https://doi.org/10.1016/j.chbr.2021.100065
[15] Nick Yee, Jeremy N. Bailenson, and Kathryn Rickertsen. 2007. A meta-analysis of the impact of the inclusion and realism of human-like faces on user experiences in interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1–10. https://doi.org/10.1145/1240624.1240626