SAMIR: Your 3D Virtual Bookseller
F. Zambetta, G. Catucci, F. Abbattista, and G. Semeraro
Dipartimento di Informatica, University of Bari, Italy
Via E. Orabona, 4 I-70125 – Bari (I)
{zambetta,fabio,semeraro}@di.uniba.it
gracat@email.it
Abstract. Intelligent web agents that exhibit complex behavior, i.e. autonomous rather than merely reactive behavior, are gaining popularity daily, as they allow a simpler and more natural interaction metaphor between the user and the machine, entertaining users and giving them, to some extent, the illusion of interacting with a human-like interface. In this paper we describe SAMIR, an intelligent web agent satisfying the objectives listed above. It uses: a 3D face animated via a morph-target technique to convey expressions to the user; a slightly modified version of the ALICE chatterbot to provide the user with dialoguing capabilities; and an XCS classifier system to manage the consistency between the conversation and the facial expressions. We also show some experimental results obtained by applying SAMIR to a virtual bookselling scenario involving the well-known Amazon.com site.
1 Introduction
Intelligent virtual agents are software components designed to act as virtual advisors in applications, especially web applications, where a high level of human-computer interaction is required. Their aim is to replace classical WYSIWYG interfaces, which casual users often find difficult to manage, with reactive and possibly pro-active virtual guides able to understand users' wishes, converse with them, find information, and execute non-trivial tasks usually activated by pressing buttons and choosing menu items. Frequently these systems are coupled with an animated 2D/3D look and feel, embodying their intelligence via a face or an entire body. This makes it possible to enhance users' trust in these systems by simulating a face-to-face dialogue, as reported in [1].
A very complete agent of this kind, frequently called an ECA (Embodied Conversational Agent), is REA [1], a Real Estate Agent able to converse with users and sell them a house that complies with their wishes and needs. The interaction occurs in real time via sensors acquiring the user's facial expressions and hand pointing; moreover, speech recognition is performed so that users do not need to type their requests. REA answers using its body posture, its facial expressions, and digitized
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 1249-1257, 2003.
Springer-Verlag Berlin Heidelberg 2003
sounds rendering the salesperson's recommendations and utterances. The EMBASSI system [2] originated from a very large consortium engaged in defining the technologies, and their ergonomic requirements, needed to implement an intelligent shop assistant that facilitates user purchasing and information retrieval. These objectives are pursued through multi-modal interaction: common text dialogs as well as speech recognition devices are used to sense user requests, while a 3D face is the front-end of the system. The agent is also able to send its response via classical multimedia content (video clips, hyperlinks, etc.). In [3] an agent is described that is not only animated but also able to answer users based on emotions modeled on the Five Factor Model (FFM) of personality [4] and implemented using Bayesian Belief Networks. Moreover, the ALICE chatterbot is used to let the web agent process and generate responses in classical textual form.
The MS Office assistants deserve to be mentioned in the agent panorama because
of their widespread use, even though they are quite shallow in some respects.
Sometimes they are too invasive and do not exhibit a very complex behavior; however, they suffice to help the inexperienced user in many situations. They are
based upon Microsoft Agent technology that enables multiple applications, or clients,
to use animation services such as loading a character, playing a specified animation
and responding to user input. Spoken text appears in a word balloon but several TTS
(text-to-speech) engines may be used to play it. Another example of a Microsoft
Agent is IMP (Instant Messaging Personalities, see http://www.eclips.com.au), which gives a very good idea of how an agent can help users handle Microsoft Messenger features such as mail checking, instant messaging, and so on. The agents have lip-synced faces pronouncing the messages the user receives, and they tell the user if any of his/her contacts has just gone online, offline, away, etc.
A general observation is that the mentioned systems, though interesting, are
generally heavy to implement, difficult to port onto different platforms, and usually
not embeddable in Web browsers. We pursue a light solution, which should be
portable, easy to implement and fast enough in medium-sized computer
environments.
In this paper we present the SAMIR (Scenographic Agents Mimic Intelligent
Reasoning) system, a digital assistant where an artificial intelligence based Web agent
is integrated with a purely 3D humanoid, robotic, or cartoon-like layout [5].
The remainder of the paper is organized as follows. Section 2 describes the architecture of SAMIR and details its three main modules. Some examples of SAMIR in action are given in Section 3. Finally, conclusions are drawn in Section 4.
2 The Architecture of SAMIR
SAMIR (see Figure 1) is a client-server application composed of three main sub-systems
detailed in the next sections: the Data Management System (DMS), the Behavior
Manager and the Animation System.
The DMS is responsible for directing the flow of information in our system: When
the user issues a request from the web site, via a common form, an HTTP request is
directed to the DMS Server to obtain the HTTP response carrying the chatterbot's answer. At the same time, based on the events raised by the user on the web site and on his/her requests, a communication between the DMS and the Behavior Manager is set up. This results in a string encoding the expression the Animation System should assume. This string specifies coefficients for each of the possible morph targets [6] in our system: We use some high-level morph targets corresponding to the well-known fundamental expressions [7], but even low-level ones are a feasible choice in order to preserve full MPEG-4 compliance. After this interpretation step, a key-frame interpolation is performed to animate the current expression.
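The pipeline just described, per-target coefficients followed by key-frame interpolation, can be sketched as follows; the function and parameter names are illustrative, not taken from the actual FACE implementation.

```python
def blend_morph_targets(neutral, targets, weights):
    """Blend a neutral face mesh with weighted morph-target displacements.

    neutral : list of (x, y, z) vertex positions
    targets : dict mapping target name -> list of per-vertex (dx, dy, dz)
              displacements away from the neutral pose
    weights : dict mapping target name -> coefficient, as carried by the
              expression string produced by the Behavior Manager
    """
    mesh = [list(v) for v in neutral]
    for name, w in weights.items():
        for vertex, delta in zip(mesh, targets[name]):
            for axis in range(3):
                vertex[axis] += w * delta[axis]
    return [tuple(v) for v in mesh]


def interpolate_weights(w_from, w_to, t):
    """Key-frame interpolation: linearly blend two weight sets (0 <= t <= 1)."""
    keys = set(w_from) | set(w_to)
    return {k: (1 - t) * w_from.get(k, 0.0) + t * w_to.get(k, 0.0) for k in keys}
```

Animating an expression then amounts to calling `interpolate_weights` between two key frames each tick and feeding the result to `blend_morph_targets`.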
Fig. 1. The Architecture of SAMIR
2.1 The Animation System
The FACE (Facial Animation Compact Engine) Animation System is an evolution of
the Fanky Animation System [8].
FACE was conceived with lightness and performance in mind, so it supports a variable number of morph targets: For example, we currently use either 12 high-level ones or the entire "low-level" FAP set, in order to achieve MPEG-4 compliance [9]. Using just a small set of high-level parameters can be extremely useful when debugging the behavior module, because it is easier to reason about behavioral patterns in terms of expressions than in terms of longer sets of facial parameters.
Besides, using a reduced set of parameters helps avoid bandwidth limitations, which can be a major advantage in porting this animation module to a small device such as a Pocket PC [10], a process we are beginning to experiment with.
An unlimited number of timelines can be used, allocating one channel for stimulus-response expressions, another for eye-lid non-conscious reflexes, another for head non-conscious reflexes, and so on. We are integrating a TTS engine into our system and, for this reason, another channel for visemes will be used.
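The channel mechanism can be sketched as follows; class and method names are illustrative, and a real engine would interpolate between keys rather than hold the last one.

```python
class Channel:
    """One animation timeline, e.g. expressions, eye-lid reflexes, or visemes."""

    def __init__(self):
        self.keys = []  # list of (time, {morph_target: weight}) pairs

    def add_key(self, time, weights):
        self.keys.append((time, weights))
        self.keys.sort(key=lambda k: k[0])

    def sample(self, time):
        """Return the weights of the latest key at or before `time`
        (hold-last sampling, for brevity)."""
        current = {}
        for key_time, weights in self.keys:
            if key_time <= time:
                current = weights
        return current


def compose(channels, time):
    """Merge all channels into one weight set; for shared morph targets
    the later channel in the list takes priority."""
    combined = {}
    for channel in channels:
        combined.update(channel.sample(time))
    return combined
```

Each reflex or viseme stream lives in its own `Channel`, and `compose` produces the final weight set handed to the morph-target blender every frame.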
In Figure 2 some expressions taken from an animation stream are illustrated.
We are developing a custom editor able to perform the same tasks performed by FaceGen while giving more control to the user: This way, we believe, each user, whether inexperienced or experienced, might enjoy the process of creating a new face tailored to his/her wishes, using specific low-level deformation tools based upon the well-known FFD technique [11].
Fig. 2. Some expressions assumed by a 3D face
Fig. 3. The DMS Architecture

2.2 The Dialogue Management System
The DMS (Dialogue Management System) is responsible for the management of user
dialogues and for the extraction of the necessary information for book searching. The
DMS can be viewed as a client-server application composed mainly of two software modules communicating through the HTTP protocol (see Figure 3). The client-side application is just a simple Java applet whose main aim is to let the user type requests in a human-like language and send them to the server-side application in order to process them. The other important task it is able to perform is retrieving
specific information on the World Wide Web, based on the responses elaborated by the server-side application, through JavaScript technology. On the server side we
have the ALICE Server Engine, which encloses all the knowledge and the core system services needed to process user input. ALICE is an open-source chatterbot developed by the ALICE AI Foundation and based on AIML (Artificial Intelligence Markup Language), an XML-compliant language that gives us the opportunity to exchange dialogue data over the World Wide Web. ALICE has been fully integrated into SAMIR as a Java Servlet, and all the knowledge of the system is stored in AIML files containing the patterns that match user input. Dialogue data are exchanged through simple built-in classes handling classical HTTP socket communication.
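One dialogue turn of this client-server exchange can be sketched as a plain HTTP GET; the endpoint and the `text`/`id` parameter names are illustrative assumptions, not the actual servlet interface.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_request_url(server_url, user_input, session_id):
    """Build the URL for one dialogue turn sent to the ALICE servlet.
    The parameter names are hypothetical, for illustration only."""
    return server_url + "?" + urlencode({"text": user_input, "id": session_id})


def ask_chatterbot(server_url, user_input, session_id):
    """Send the user's utterance over HTTP and return the chatterbot's
    textual answer from the response body."""
    with urlopen(build_request_url(server_url, user_input, session_id)) as reply:
        return reply.read().decode("utf-8")
```

The applet-side role is the same: serialize the utterance into a request, then hand the textual reply to the page (via JavaScript) and to the Behavior Manager.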
In order to obtain a system that lets users navigate a bookshop web site, we wrote some AIML categories aimed at book searching and shopping. An AIML category is a simple unit containing a "pattern" section for matching user input and a corresponding "template" section containing an adequate response and/or action (e.g. a JavaScript execution) to user requests.
Our categories were chosen to cover a very large set of the possible ways of requesting a given book in a real bookstore. We considered a set of seven fields that let a user specify the books he/she is interested in. They include the book title, author, publisher, publication date, subject, ISBN code, and a more general keyword field. Successful examples of book requests for the Amazon bookshop web site are the following: I want a book written by Sepulveda; I am looking for a book entitled Journey and whose author is Celine; I am searching for all books written by Henry Miller and published after 1970; I am interested in a book about horror. In alternative forms, it is possible to send requests like: Could you find some book written by Fernando Pessoa?; Search all books whose author is Charles Bukowski; Give me the book whose ISBN code is 0-13-273350-1; Look for some book whose subject is fantasy.
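The mapping from such phrasings to search fields can be sketched with a few patterns in the spirit of the AIML categories; the patterns below are a made-up sample, not the actual category set.

```python
import re

# Illustrative patterns: each maps one phrasing of a book request onto
# one of the seven search fields (only four fields are sketched here).
FIELD_PATTERNS = [
    ("author", re.compile(r"\b(?:written by|whose author is) (?P<v>.+)", re.I)),
    ("title", re.compile(r"\bentitled (?P<v>.+?)(?= and\b|$)", re.I)),
    ("isbn", re.compile(r"\bISBN code is (?P<v>[\dX-]+)", re.I)),
    ("subject", re.compile(r"\b(?:about|whose subject is) (?P<v>.+)", re.I)),
]


def extract_field(request):
    """Return the first (field, value) pair recognized in the request,
    or None when no pattern matches."""
    for field, pattern in FIELD_PATTERNS:
        match = pattern.search(request)
        if match:
            return field, match.group("v").strip(" ?.")
    return None
```

For instance, "I want a book written by Sepulveda" yields the pair `("author", "Sepulveda")`, which the template side can then turn into a bookstore query.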
Clearly, AIML categories do not cope well with user requests that exhibit a high level of ambiguity, due to the peculiar characteristics of human language interaction.
2.3 The Behavior Generator
The Behavior Generator aims at managing the consistency between the facial
expression of the character and the conversation tone. The module is mainly based on
Learning Classifier Systems (LCS), a machine learning paradigm introduced by
Holland in 1976 [12]. The learning module of SAMIR has been implemented through
an XCS [13], a newer kind of LCS which differs in many respects from Holland's traditional framework. The most appealing characteristic of this system is that it is closely related to Q-learning, but it can generate task representations that are more compact than tabular Q-learning [14]. At discrete time intervals, the agent
observes a state of the environment, takes an action, observes a new state and finally
receives an immediate reward.
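This observe-act-reward loop is the standard reinforcement-learning cycle. A minimal tabular Q-learning update, which XCS generalizes by evolving condition-action rules instead of keeping an exhaustive table, can be sketched as follows; the action set shown is a made-up example.

```python
ACTIONS = ["smile", "neutral", "frown"]  # illustrative expression actions


def q_update(q, state, action, reward, next_state, alpha=0.2, gamma=0.9):
    """One Q-learning step: move Q(state, action) toward the received
    reward plus the discounted value of the best next-state action.
    `q` is a dict mapping (state, action) pairs to value estimates."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q
```

Where tabular Q-learning stores one entry per (state, action) pair, the XCS maintains a population of rules whose conditions may generalize over many states, updated with an accuracy-based version of the same error signal.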
The basic components of an XCS are: the Performance Component, which, on the ground of the detected state of the environment, selects the best action to be performed; the Reinforcement Component, whose aim is to evaluate the reward to be
assigned to the system; and the Discovery Component, which, in case of degrading performance, is devoted to the evolution of new, better-performing rules.
The environment in which SAMIR has to act is represented by the user dialogue
(the higher the user satisfaction, the higher the reward received by SAMIR). At the very beginning of its life, the behavior of SAMIR is controlled by a set of randomly generated rules and, consequently, its capability is very low.
Behavior rules are expressed in the classical format if <condition> then <action>,
where <condition> (the state of the environment) represents a combination of 4
possible events, sensed by four detectors, representing different conversation tones, such
as: user salutation (user performs/does not perform salutation), user request
formulation to the agent (no request, polite, impolite), user compliments/insults to the
agent (no compliment, a compliment, an insult, foul language), and user permanence on the Web page (user changes/does not change the page), while <action> represents the
expression that the Animation System displays during user interaction. In particular,
the expression is built as a linear combination of a set of fundamental expressions that
includes the basic emotion set proposed by Paul Ekman, namely anger, fear, disgust,
sadness, joy, and surprise [7]. Other emotions and many combinations of emotions
have been studied but remain unconfirmed as universally distinguishable. However,
the basic set of expressions has been extended to include some typical human expressions such as bother, disappointment, and satisfaction. The Behavior Manager,
as explained above, is able to produce synthetic facial expressions, to be shown
according to the content of the ongoing conversation. Thus the <action> part
provides the Animation System with the percentage of each one of the expressions, to
be used to compose the desired expression of our character. For example, an expression composed of 40% joy and 60% surprise is coded into the following string:
%Surprise  %Sadness  %Joy   %Fear  %Disgust  %Anger  %Bother  %Disappointment  %Satisfaction
  0110       0000     0100   0000    0000      0000    0000         0000             0000
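Reading each four-bit group as the binary encoding of the tens digit of its percentage (60% → 0110, an assumption consistent with the groups shown for this example), the string can be produced as follows:

```python
# The nine fundamental expressions, in the order used by the encoding.
FIELDS = ["surprise", "sadness", "joy", "fear", "disgust",
          "anger", "bother", "disappointment", "satisfaction"]


def encode_expression(percentages):
    """Encode an expression as nine four-bit groups, one per fundamental
    expression, each holding the tens digit of that field's percentage
    in binary. The binary tens-digit reading is our assumption."""
    return " ".join(format(percentages.get(f, 0) // 10, "04b") for f in FIELDS)
```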
During its life, SAMIR performs several interactions with different users and, on the ground of the received reward, the XCS evolves better behavior rules in order to achieve ever better performance. In a preliminary phase we defined a set of 30 interaction rules covering different situations that arise in the course of an interaction (actions performed, user requests, etc.). This set of pre-defined rules represents the training set, that is, the minimal know-how that SAMIR should possess to start its work on the Web site.
To evaluate the performance of the system, we performed some preliminary experiments aimed at verifying the effectiveness of the User Interaction module in learning a set of 30 predefined rules. Thanks to the inherent features of the XCS, SAMIR has been able to learn the pre-defined rules of behavior quite effectively and to generalize some new behavioral patterns that can update the initial set of rules. In this way, SAMIR is comparable to a human assistant who, after a preliminary training phase, continues to learn new rules of behavior on the basis of personal experience and interaction with human customers.
3 Experimental Results
In this section we present some experimental results obtained from the interaction between SAMIR and some typical users searching for books on topics like literature, fantasy, and horror, or for more specific books for which information like the title, author, and publisher is given.
When the user connects to the site, SAMIR introduces itself and asks the user for his/her name, for authentication and recognition purposes. Figure 4 shows the results of a user request for a horror book. The result of the query is a set of books in this genre available at the Amazon book site. It can be noticed that the book ranked first is by the author Stephen King.
Figure 5 shows an example of a more sophisticated query, requesting Henry Miller books published after 1970. In this case the user hurls a heavy insult at SAMIR and, consequently, its expression is angry.
Fig. 4. Requesting a horror book
4 Conclusions
In this paper we presented a first prototype of a 3D agent able to support users in searching for books on a Web site. At present, the prototype is linked to a specific site, but we are currently implementing an improved version that will be able to query several Web bookstores simultaneously and to report to users a comparison based on different criteria such as prices, delivery times, etc.
Moreover, our work will aim to give our agent a more natural behavior. This can be achieved by improving the dialogues and, eventually, the text-processing
capabilities of the ALICE chatterbot, and by giving the agent fully proactive behavior: the XCS should be able not only to learn new rules to generate facial expressions, but also to modify dialogue rules, suggest interesting links, and supply effective help during site navigation.
Fig. 5. All books by Henry Miller published after 1970
References

[1] J. Cassell et al. (Eds.), Embodied Conversational Agents. MIT Press, Cambridge, 2000.
[2] M. Jalali-Sohi & F. Baskaya, A Multimodal Shopping Assistant for Home E-Commerce. In: Proceedings of the 14th Int'l FLAIRS Conf. (Key West FL, 2001), 2-6.
[3] S. Kshirsagar & N. Magnenat-Thalmann, Virtual Humans Personified. In: Proceedings of the Autonomous Agents Conference (AAMAS), (Bologna, 2002).
[4] R.R. McCrae & O.P. John, An introduction to the five-factor model and its applications, J. of Personality, 60 (1992), 175-215.
[5] F. Abbattista, A. Paradiso, G. Semeraro & F. Zambetta, An agent that learns to support users of a web site. In: R. Roy, M. Koeppen, S. Ovaska, T. Furuhashi and F. Hoffmann (Eds.), Soft Computing and Industry: Recent Applications, Springer, 2002, 489-496.
[6] B. Fleming & D. Dobbs, Animating Facial Features and Expressions. Charles River Media, Hingham, 1998.
[7] P. Ekman, Emotion in the Human Face. Cambridge University Press, Cambridge, 1982.
[8] A. Paradiso, F. Zambetta & F. Abbattista, Fanky: a tool for animating 3D intelligent agents. In: A. de Antonio, R. Aylett, D. Ballin (Eds.), Intelligent Virtual Agents, (Madrid, 2001), Springer, Berlin, 242-243.
[9] MPEG-4 standard specification. http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm.
[10] The Pocket PC website. http://www.microsoft.com/mobile/pocketpc/default.asp.
[11] T.W. Sederberg & S.R. Parry, Free-Form Deformation of Solid Geometric Models, Computer Graphics, 20(4), (1986), 151-160.
[12] J.H. Holland, Adaptation. In: R. Rosen and F.M. Snell (Eds.), Progress in Theoretical Biology, New York: Plenum, 1976.
[13] S.W. Wilson, Classifier Fitness Based on Accuracy. Evolutionary Computation, 3(2), (1995), 149-175.
[14] C.J.C.H. Watkins, Learning from Delayed Rewards, PhD thesis, University of Cambridge, Psychology Department, 1989.