A Realistic, Multimodal Virtual Agent for the Healthcare Domain
Guido M. Linders
Dept. of Cognitive Science & Artificial
Intelligence, Tilburg University
Tilburg, The Netherlands
g.m.linders@tilburguniversity.edu
Julija Vaitonytė
Dept. of Cognitive Science & Artificial
Intelligence, Tilburg University
Tilburg, The Netherlands
j.vaitonyte@tilburguniversity.edu
Kiril O. Mitev
Dept. of Cognitive Science & Artificial
Intelligence, Tilburg University
Tilburg, The Netherlands
k.o.mitev@tilburguniversity.edu
Maryam Alimardani
Dept. of Cognitive Science & Artificial
Intelligence, Tilburg University
Tilburg, The Netherlands
m.alimardani@tilburguniversity.edu
Max M. Louwerse
Dept. of Cognitive Science & Artificial
Intelligence, Tilburg University
Tilburg, The Netherlands
m.m.louwerse@tilburguniversity.edu
ABSTRACT
We introduce an interactive embodied conversational agent for
deployment in the healthcare sector. The agent is operated by a
software architecture that integrates speech recognition, dialog
management, and speech synthesis, and is embodied by a virtual
human face developed using photogrammetry techniques. These
features together allow for real-time, face-to-face interactions with
human users. Although the developed software architecture is
domain-independent and highly customizable, the virtual agent will
initially be applied to the healthcare domain. Here we give an overview
of the different components of the architecture.
CCS CONCEPTS
• Computing methodologies → Artificial intelligence; Rendering; • Computer systems organization → Architectures.
KEYWORDS
embodied conversational agents, multimodal communication, healthcare domain
ACM Reference Format:
Guido M. Linders, Julija Vaitonytė, Maryam Alimardani, Kiril O. Mitev,
and Max M. Louwerse. 2022. A Realistic, Multimodal Virtual Agent for the
Healthcare Domain. In ACM International Conference on Intelligent Virtual
Agents (IVA ’22), September 6–9, 2022, Faro, Portugal. ACM, New York, NY,
USA, 3 pages. https://doi.org/10.1145/3514197.3551250
1 INTRODUCTION
Humans communicate using a large array of verbal and non-verbal
cues [1, 6]. These multimodal cues, e.g., speech, eye gaze, head nods, and facial expressions, are equally valuable for virtual agents used in human-agent interactions [13]. The benefit of the
ability to communicate multimodally on the part of the virtual
agent is evident in such domains as healthcare [12] and education
[11], which draw heavily on real-time, face-to-face encounters.
Figure 1: The face of the agent created using photogrammetry techniques.

In addition, humans prefer more realistic over less realistic artificial
entities [15], highlighting the need for a realistic appearance of
virtual agents [14]. Until recently, creating virtual agents that can exhibit appropriate multimodal cues in real-time interaction, supported by a realistic appearance, has posed significant challenges
for developers and researchers. These challenges include dialog
management, the coupling of verbal and non-verbal cues, and a
realistic appearance of the agent. The current project aimed to make
progress on these aspects.
The current project is an interdisciplinary collaboration between
several academic institutions, industry partners, and hospitals, with the goal of creating a virtual agent that is human-like in its appearance and behavior. Such an agent is meant to hold a face-to-face
conversation with a user. The agent will first be deployed in the
healthcare domain. More specifically, the agent will give healthcare
advice and information to users on bariatric surgery, a surgical procedure that facilitates weight loss. Patients will be able to ask the agent
questions that they might have related to this surgery. While the
agent will take over some of the communication between doctors
and patients, it is not meant to substitute for a doctor but rather to be
accessible and offer relevant information to patients at any time.
Figure 2: Overview of the agent architecture.
Creating the agent as part of the project consisted of first constructing a photogrammetry pipeline, such that real faces could be scanned and high-fidelity 3D face models could be created.
Currently, five different faces have been created using this pipeline
that differ in age, sex, skin color, and facial morphology. Figure 1
shows one of the faces. The movement of the face is controlled
through a set of 42 anatomically based facial descriptors, so-called
Action Units, created based on a custom and extended version of
the Facial Action Coding System (FACS) [5]. A set of 13 physiological parameters, e.g., blushing, paleness, and perspiration, has also been
implemented. Second, Rasa, an open-source neural network framework for natural language understanding and dialog management,
has been used for verbal interaction [3]. Third, the software architecture has been designed to allow for easy integration of different
communication channels in different modalities. Additionally, using
translation technology, the agent is able to converse in almost any
language.
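To give a concrete impression of how such facial descriptors might be driven, the sketch below shows one possible message format for a single frame of face-control parameters. The class, field, and parameter names are illustrative assumptions only; they do not reflect the project's actual data format.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class FaceFrame:
    """One frame of face-control parameters (illustrative names, not the actual format)."""
    action_units: dict[str, float] = field(default_factory=dict)  # 42 extended-FACS descriptors, 0.0-1.0
    physiology: dict[str, float] = field(default_factory=dict)    # 13 physiological parameters
    timestamp_ms: int = 0

frame = FaceFrame(
    action_units={"AU06_cheek_raiser": 0.4, "AU12_lip_corner_puller": 0.6},
    physiology={"blushing": 0.2, "perspiration": 0.0},
    timestamp_ms=0,
)
payload = json.dumps(asdict(frame))  # e.g., published to the renderer over the messaging system
```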
Below we detail the components that comprise the virtual agent
architecture and explain how multimodal communication can be
realized by the agent. We also give an outlook on the future of the
agent, along with a brief discussion on some of the design choices
and challenges.
2 OVERVIEW OF COMPONENTS
The architecture behind the virtual agent consists of four core components: (1) the input channels, which currently consist only of the Google speech recognition system but allow for additional channels and different modalities, (2) a dialog management system
implemented using Rasa, (3) a behavior generation system that uses
extended FACS to control the agent’s face and Microsoft Azure to
generate speech and visemes (the movements of the mouth and
lips), with the 3D face model being rendered in Unreal Engine, and
(4) a messaging system that ensures the communication between
the different components, implemented using RabbitMQ.
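As a minimal sketch of how such a messaging system could connect the components, the Python snippet below uses the pika client for RabbitMQ; the queue names and message schema are assumptions made for illustration, not the project's actual configuration.

```python
import json
import pika  # RabbitMQ client library

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="asr.transcripts")  # speech recognition -> dialog manager
channel.queue_declare(queue="agent.responses")  # dialog manager -> behavior generation

def publish_transcript(text: str, language: str) -> None:
    """Forward a recognized utterance to the dialog manager via the broker."""
    channel.basic_publish(exchange="", routing_key="asr.transcripts",
                          body=json.dumps({"text": text, "language": language}))

def on_response(ch, method, properties, body) -> None:
    """Hand a dialog-manager response over to behavior generation."""
    message = json.loads(body)
    print("agent will say:", message["text"])

channel.basic_consume(queue="agent.responses", on_message_callback=on_response, auto_ack=True)
# channel.start_consuming()  # blocking receive loop; each component runs its own consumer
```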
Figure 2 presents an overview of the architecture. First, the user's speech is detected and transcribed. The end of an incoming message is detected using a silence threshold of roughly one second. Optionally, Google Translate can be used to translate the captured message into English when the message is in a language other than English.
The transcribed (and if necessary translated) speech of a user is
then sent to the messaging system, which further sends it to the
dialog manager. The dialog manager classifies each message into a
so-called intent, which conveys the information that the user tried
to communicate. Based on the intent classification and the dialog
history, the dialog manager selects an English text response. The
response is then sent to the behavior generation system via the
messaging system. If necessary, the behavior generation system
translates the English response back to the original language. It then generates the speech (audio) and the visemes produced on the agent's face and aligns them in time.
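As an illustration of this round trip, the sketch below combines the Google Cloud Translation client, Rasa's standard REST channel, and the Azure Speech SDK's viseme events in Python. The endpoint, sender identifier, credentials, and helper names are assumptions for illustration and do not reflect the project's actual code.

```python
import requests
import azure.cognitiveservices.speech as speechsdk
from google.cloud import translate_v2 as translate  # programmatic counterpart of Google Translate

RASA_URL = "http://localhost:5005/webhooks/rest/webhook"  # Rasa's default REST channel; host assumed
translator = translate.Client()

def get_agent_response(transcript: str, sender: str = "patient-1") -> tuple[str, str]:
    """Translate the transcript to English if needed, query the dialog manager, translate back."""
    source_lang = translator.detect_language(transcript)["language"]
    text_en = transcript if source_lang == "en" else \
        translator.translate(transcript, target_language="en")["translatedText"]
    replies = requests.post(RASA_URL, json={"sender": sender, "message": text_en}).json()
    reply_en = replies[0]["text"] if replies else "Could you rephrase that, please?"
    if source_lang == "en":
        return reply_en, source_lang
    return translator.translate(reply_en, target_language=source_lang)["translatedText"], source_lang

def synthesize_with_visemes(text: str, key: str, region: str):
    """Generate audio with Azure TTS and collect time-aligned viseme events for the face."""
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=None)
    visemes = []  # (time in seconds, viseme id) pairs
    synthesizer.viseme_received.connect(
        lambda evt: visemes.append((evt.audio_offset / 10_000_000, evt.viseme_id)))
    result = synthesizer.speak_text_async(text).get()
    return result.audio_data, visemes
```

The viseme identifiers and their time offsets can then be mapped onto the mouth- and lip-related facial descriptors of the 3D face model.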
The virtual agent is currently distributed through a virtual machine that is accessible via a web browser, alleviating the need
to go through a complicated installation process and the need for
a powerful on-site computer. However, this approach may not be
optimal since only a single person can access the virtual machine
at a time.
3 MULTIMODAL COMMUNICATION
For a human-like agent, the integration of non-verbal cues with
speech is crucial. Currently, two main types of non-verbal behavior
are generated by the behavior generation system. These are the
mouth and lip movements that accompany the audio signals and
some idle movements, which include slight head movements and
eye movements. Idle movements are displayed continuously and
are relevant in making the agent appear alive [4].
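As an illustration of such idle behavior, the sketch below samples small, slow head and gaze offsets for each rendered frame; the amplitudes, frequencies, and parameter names are assumptions rather than the values used in the project.

```python
import math
import random

def idle_pose(t: float) -> dict[str, float]:
    """Small, slow head and eye offsets (in degrees) that keep the agent looking alive."""
    return {
        "head_yaw":   1.5 * math.sin(2 * math.pi * 0.05 * t),        # slow side-to-side sway
        "head_pitch": 1.0 * math.sin(2 * math.pi * 0.08 * t + 1.3),  # slow nodding drift
        "gaze_yaw":   random.gauss(0.0, 0.3),                        # slight gaze jitter per frame
        "gaze_pitch": random.gauss(0.0, 0.2),
    }

# Example: sample a pose for each rendered frame at 60 fps (ten seconds of idle motion).
frames = [idle_pose(i / 60.0) for i in range(600)]
```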
Regarding dialog management, we have constructed a database
of question-answer pairs on the topic of bariatrics, together with other utterances, to be used as training material for the dialog manager. Because the dialog manager requires multiple examples of input utterances for each utterance-response pair, we have been
investigating ways to partially automate the generation of these examples. Currently, a single response is generated for each incoming
message of the user.
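For illustration, the sketch below shows how a single question-answer pair might be expanded into Rasa's NLU training-data format; the intent name, example questions, and answer are illustrative placeholders, not items from the actual bariatrics database.

```python
# Illustrative only: intent name, example questions, and answer are placeholders.
QA_DATABASE = {
    "ask_recovery_time": "Recovery time differs per patient; your care team will advise you.",
}

EXAMPLE_QUESTIONS = {
    "ask_recovery_time": [
        "How long does recovery take after bariatric surgery?",
        "When can I go home after the operation?",
        "What is the usual recovery period after weight loss surgery?",
    ],
}

def to_rasa_nlu_yaml(intent: str, examples: list[str]) -> str:
    """Render one intent block in Rasa's NLU training-data YAML format."""
    lines = [f"- intent: {intent}", "  examples: |"]
    lines += [f"    - {example}" for example in examples]
    return "\n".join(lines)

nlu_yaml = 'version: "3.1"\nnlu:\n' + to_rasa_nlu_yaml("ask_recovery_time",
                                                       EXAMPLE_QUESTIONS["ask_recovery_time"])
print(nlu_yaml)  # written into the Rasa training-data directory
```

The corresponding answer would then be stored as a response template on the Rasa side.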
4 OUTLOOK AND FUTURE WORK
We have developed a realistic-looking communicative agent that is being gradually improved. The general conversational and turn-taking abilities are, however, not yet sophisticated due to a rigid one-speaker-at-a-time turn-taking model that does not allow for interruptions and an end-of-turn detection model that is based on a silence threshold. In addition to improving these models, we are
investigating the statistical tendencies of different verbal and non-verbal communicative units that make a conversation human-like
[7, 9] and the ways to incorporate speaker intentions using dialog
act classification [8]. Finally, we are investigating how humans
perceive photorealistic agents more generally [14].
The integration of verbal and non-verbal channels is also nontrivial. Previous research has examined the integration of verbal
and non-verbal behaviors by using cross-recurrence analysis techniques, both at inter-personal and intra-personal levels [2, 10]. Such
findings can guide the coupling of verbal and non-verbal behaviors.
Investigating the coupling of facial expressions, eye movements, and speaker intentions, i.e., dialog acts, using cross-recurrence analysis techniques is part of future directions toward a naturally behaving virtual agent.
ACKNOWLEDGMENTS
This research has been funded by grant no. PROJ-007246 from the Operational Program South, the European Union, the Ministry of Economic Affairs, the Province of Noord-Brabant, and the municipality of Tilburg, awarded to MML. Tilburg University (TiU),
BlueTea, Noldus, Breda University of Applied Sciences (BUas), Royal
Netherlands Aerospace Centre (NLR), and Fontys Applied University Eindhoven created the agent architecture. BUas created the
photogrammetry pipeline, 3D face models, and integrated Microsoft
Azure into the architecture. Spaarne Hospital together with TiU,
NLR, and Noldus created bariatrics training data. TiU implemented
dialog management. BlueTea and TiU implemented the virtual machine solution.
REFERENCES
[1] Janet B. Bavelas and Nicole Chovil. 2006. Nonverbal and Verbal Communication:
Hand Gestures and Facial Displays as Part of Language Use in Face-to-face
Dialogue. In The Sage Handbook of Nonverbal Communication, V. Manusov and
M. L. Patterson (Eds.). Sage Publications, Inc, 97–115.
[2] Pieter A. Blomsma, Guido M. Linders, Julija Vaitonyte, and Max M. Louwerse.
2020. Intrapersonal dependencies in multimodal behavior. In Proceedings of the
20th ACM International Conference on Intelligent Virtual Agents. 1–8. https://doi.org/10.1145/3383652.3423872
[3] Tom Bocklisch, Joey Faulkner, Nick Pawlowski, and Alan Nichol. 2017. Rasa: Open
source language understanding and dialogue management. In 31st Conference
on Neural Information Processing Systems (NIPS 2017). Long Beach, CA, USA.
https://doi.org/10.48550/arXiv.1712.05181
[4] Raymond H. Cuijpers and Marco A. M. H. Knops. 2015. Motions of robots matter!
The social effects of idle and meaningful motions. In International Conference on
Social Robotics. Springer, 174–183. https://doi.org/10.1007/978-3-319-25554-5_18
[5] Paul Ekman and Wallace V Friesen. 1978. Facial action coding system: Investigator’s
guide. Consulting Psychologists Press.
[6] Rachael E Jack and Philippe G Schyns. 2015. The human face as a dynamic tool
for social communication. Current Biology 25, 14 (2015), R621–R634. https://doi.org/10.1016/j.cub.2015.05.052
[7] Guido M Linders and Max M Louwerse. 2020. Zipf’s Law in Human-Machine
Dialog. In Proceedings of the 20th ACM International Conference on Intelligent
Virtual Agents. 1–8. https://doi.org/10.1145/3383652.3423878
[8] Guido M. Linders and Max M. Louwerse. 2022. Surface and contextual linguistic
cues in dialog act classification. (2022). Manuscript submitted for review.
[9] Guido M. Linders and Max M. Louwerse. 2022. Zipf’s law revisited: Spoken
dialog, linguistic units, parameters, and the principle of least effort. Psychonomic
Bulletin & Review (2022). https://doi.org/10.3758/s13423-022-02142-9
[10] Max M Louwerse, Rick Dale, Ellen G Bard, and Patrick Jeuniaux. 2012. Behavior
matching in multimodal communication is synchronized. Cognitive Science 36, 8
(2012), 1404–1426. https://doi.org/10.1111/j.1551-6709.2012.01269.x
[11] Manuela Macedonia, Iris Groher, and Friedrich Roithmayr. 2014. Intelligent virtual agents as language trainers facilitate multilingualism. Frontiers in Psychology
5 (2014), 295. https://doi.org/10.3389/fpsyg.2014.00295
[12] Joao Luis Zeni Montenegro, Cristiano André da Costa, and Rodrigo da Rosa Righi.
2019. Survey of conversational agents in health. Expert Systems with Applications
129 (2019), 56–67. https://doi.org/10.1016/j.eswa.2019.03.054
[13] Julija Vaitonyte, Pieter A Blomsma, Maryam Alimardani, and Max M Louwerse.
2019. Generating facial expression data: Computational and experimental evidence. In Proceedings of the 19th ACM International Conference on Intelligent
Virtual Agents. 94–96. https://doi.org/10.1145/3308532.3329443
[14] Julija Vaitonytė, Pieter A Blomsma, Maryam Alimardani, and Max M Louwerse.
2021. Realism of the face lies in skin and eyes: Evidence from virtual and human
agents. Computers in Human Behavior Reports 3 (2021), 100065. https://doi.org/10.1016/j.chbr.2021.100065
[15] Nick Yee, Jeremy N Bailenson, and Kathryn Rickertsen. 2007. A meta-analysis of
the impact of the inclusion and realism of human-like faces on user experiences in
interfaces. In Proceedings of the SIGCHI conference on Human Factors in Computing
Systems. 1–10. https://doi.org/10.1145/1240624.1240626