SAMIR: Your 3D Virtual Bookseller
F. Zambetta, G. Catucci, F. Abbattista, and G. Semeraro
Dipartimento di Informatica, University of Bari, Italy
Via E. Orabona, 4 I-70125 – Bari (I)
{zambetta,fabio,semeraro}@di.uniba.it
gracat@email.it
Abstract. Intelligent web agents that exhibit complex behavior, i.e. autonomous rather than merely reactive behavior, are gaining popularity daily, as they allow a simpler and more natural interaction metaphor between the user and the machine, entertaining users and giving them, to some extent, the illusion of interacting with a human-like interface. In this paper we describe SAMIR, an intelligent web agent satisfying the objectives listed above. It uses: a 3D face animated via a morph-target technique to convey expressions to the user; a slightly modified version of the ALICE chatterbot to provide the user with dialoguing capabilities; and an XCS classifier system to manage the consistency between the conversation and the facial expressions. We also show some experimental results obtained by applying SAMIR to a virtual bookselling scenario involving the well-known Amazon.com site.
1 Introduction
Intelligent virtual agents are software components designed to act as virtual advisors in applications, especially web applications, where a high level of human-computer interaction is required. Their aim is to replace classical WYSIWYG interfaces, which casual users often find difficult to manage, with reactive and possibly pro-active virtual guides able to understand users' wishes, converse with them, find information, and execute non-trivial tasks usually activated by pressing buttons and choosing menu items. Frequently these systems are coupled with an animated 2D/3D look and feel, embodying their intelligence via a face or an entire body. This makes it possible to enhance users' trust in these systems by simulating a face-to-face dialogue, as reported in [1].
A very complete agent of this kind, frequently called an ECA (Embodied Conversational Agent), is REA [1], a Real Estate Agent able to converse with users and sell them a house that complies with their wishes and needs. The interaction occurs in real time via sensors acquiring the user's facial expressions and hand pointing; moreover, speech recognition is performed so that users do not need to type their requests. REA answers using its body posture, its facial expressions, and digitized
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 1249-1257, 2003.
Springer-Verlag Berlin Heidelberg 2003
sounds rendering the salesperson's recommendations and utterances. The EMBASSI system [2] originated from a very large consortium engaged in defining the technologies, and their ergonomic requirements, needed to implement an intelligent shop assistant that facilitates user purchasing and information retrieval. These objectives are pursued through multi-modal interaction: common text dialogs as well as speech recognition devices are used to sense user requests, while a 3D face is the front-end of the system. The agent is also able to send its response via classical multimedia content (video clips, hyperlinks, etc.). In [3] an agent is described that is not only animated but also able to answer users based on emotions modeled on the Five Factor Model (FFM) of personality [4] and implemented using Bayesian Belief Networks. Moreover, the ALICE chatterbot is used to let the web agent process and generate responses in classical textual form.
The MS Office assistants deserve to be mentioned in the agent panorama because
of their widespread use, even though they are quite shallow in some respects.
Sometimes they are too invasive and do not exhibit a very complex behavior; however, they suffice to help the inexperienced user in many situations. They are
based upon Microsoft Agent technology that enables multiple applications, or clients,
to use animation services such as loading a character, playing a specified animation
and responding to user input. Spoken text appears in a word balloon but several TTS
(text-to-speech) engines may be used to play it. Another example of a Microsoft
Agent is IMP (Instant Messaging Personalities, see http://www.eclips.com.au), which gives a very good idea of how an agent can help users handle Microsoft Messenger features such as mail checking, instant messaging, and so on. The agents have lip-synced faces pronouncing the messages the user receives, and they tell the user if any of his/her contacts has just gone online, offline, away, etc.
A general observation is that the mentioned systems, though interesting, are
generally heavy to implement, difficult to port onto different platforms, and usually
not embeddable in Web browsers. We pursue a light solution, which should be
portable, easy to implement and fast enough in medium-sized computer
environments.
In this paper we present the SAMIR (Scenographic Agents Mimic Intelligent
Reasoning) system, a digital assistant where an artificial intelligence based Web agent
is integrated with a purely 3D humanoid, robotic, or cartoon-like layout [5].
The remainder of the paper is organized as follows. Section 2 describes the architecture of SAMIR and details its three main modules. Some examples of SAMIR in action are given in Section 3. Finally, conclusions are drawn in Section 4.
2 The Architecture of SAMIR
SAMIR (see Figure 1) is a client-server application composed of three main sub-systems
detailed in the next sections: the Data Management System (DMS), the Behavior
Manager and the Animation System.
The DMS is responsible for directing the flow of information in our system: When
the user issues a request from the web site, via a common form, an HTTP request is
directed to the DMS Server to obtain the HTTP response carrying the chatterbot's answer. At the same time, based on the events raised by the user on the web site and on his/her requests, a communication between the DMS and the Behavior Manager is set up. This results in a string encoding the expression the Animation System should assume. This string specifies coefficients for each of the possible morph targets [6] in our system: We use some high-level morph targets corresponding to the well-known fundamental expressions [7], but even low-level ones are a feasible choice in order to preserve full MPEG-4 compliance. After this interpretation step, a key-frame interpolation is performed to animate the current expression.
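The pipeline just described, per-target coefficients followed by key-frame interpolation, can be sketched as follows; the function and parameter names are illustrative, not taken from the actual FACE implementation.

```python
def blend_morph_targets(neutral, targets, weights):
    """Blend a neutral face mesh with weighted morph-target displacements.

    neutral : list of (x, y, z) vertex positions
    targets : dict mapping target name -> list of per-vertex (dx, dy, dz)
              displacements away from the neutral pose
    weights : dict mapping target name -> coefficient, as carried by the
              expression string produced by the Behavior Manager
    """
    mesh = [list(v) for v in neutral]
    for name, w in weights.items():
        for vertex, delta in zip(mesh, targets[name]):
            for axis in range(3):
                vertex[axis] += w * delta[axis]
    return [tuple(v) for v in mesh]


def interpolate_weights(w_from, w_to, t):
    """Key-frame interpolation: linearly blend two weight sets (0 <= t <= 1)."""
    keys = set(w_from) | set(w_to)
    return {k: (1 - t) * w_from.get(k, 0.0) + t * w_to.get(k, 0.0) for k in keys}
```

Animating an expression then amounts to calling `interpolate_weights` between two key frames each tick and feeding the result to `blend_morph_targets`.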
Fig. 1. The Architecture of SAMIR
2.1 The Animation System
The FACE (Facial Animation Compact Engine) Animation System is an evolution of
the Fanky Animation System [8].
FACE was conceived with lightness and performance in mind, so it supports a variable number of morph targets: For example, we currently use either 12 high-level ones or the entire "low-level" FAP set, in order to achieve MPEG-4 compliance [9]. Using just a small set of high-level parameters can be extremely useful when debugging the behavior module, because it is easier to reason about behavioral patterns in terms of expressions than in terms of longer sets of facial parameters.
Besides, using a reduced set of parameters helps avoid bandwidth limitations, which can be a major advantage in porting this animation module to a small device such as a Pocket PC [10], a process we are beginning to experiment with.
An unlimited number of timelines can be used, allocating one channel for stimulus-response expressions, another for eye-lid non-conscious reflexes, another for head non-conscious reflexes, and so on. We are integrating a TTS engine into our system and, for this reason, another channel for visemes will be used.
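The channel mechanism can be sketched as follows; class and method names are illustrative, and a real engine would interpolate between keys rather than hold the last one.

```python
class Channel:
    """One animation timeline, e.g. expressions, eye-lid reflexes, or visemes."""

    def __init__(self):
        self.keys = []  # list of (time, {morph_target: weight}) pairs

    def add_key(self, time, weights):
        self.keys.append((time, weights))
        self.keys.sort(key=lambda k: k[0])

    def sample(self, time):
        """Return the weights of the latest key at or before `time`
        (hold-last sampling, for brevity)."""
        current = {}
        for key_time, weights in self.keys:
            if key_time <= time:
                current = weights
        return current


def compose(channels, time):
    """Merge all channels into one weight set; for shared morph targets
    the later channel in the list takes priority."""
    combined = {}
    for channel in channels:
        combined.update(channel.sample(time))
    return combined
```

Each reflex or viseme stream lives in its own `Channel`, and `compose` produces the final weight set handed to the morph-target blender every frame.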
In Figure 2 some expressions taken from an animation stream are illustrated.
We are developing a custom editor able to perform the same tasks performed by FaceGen while giving more control to the user: This way, we believe, each user, whether inexperienced or experienced, might enjoy the process of creating a new face tailored to his/her wishes, using specific low-level deformation tools based upon the well-known FFD technique [11].
Fig. 2. Some expressions assumed by a 3D face
Fig. 3. The DMS Architecture

2.2 The Dialogue Management System
The DMS (Dialogue Management System) is responsible for the management of user
dialogues and for the extraction of the necessary information for book searching. The
DMS can be viewed as a client-server application composed mainly of two software modules communicating through the HTTP protocol (see Figure 3). The client-side application is just a simple Java applet whose main aim is to let the user type requests in a human-like language and send them to the server-side application in order to process them. The other important task it is able to perform is retrieving
specific information on the World Wide Web, based on the responses elaborated by the server-side application, through JavaScript technology. On the server side we
have the ALICE Server Engine, which encloses all the knowledge and the core system services needed to process user input. ALICE is an open-source chatterbot developed by the ALICE AI Foundation and based on AIML (Artificial Intelligence Markup Language), an XML-compliant language that gives us the opportunity to exchange dialogue data over the World Wide Web. ALICE has been fully integrated into SAMIR as a Java Servlet, and all the knowledge of the system is stored in AIML files containing the patterns that match user input. Dialogue data are exchanged through simple built-in classes handling classical HTTP socket communication.
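One dialogue turn of this client-server exchange can be sketched as a plain HTTP GET; the endpoint and the `text`/`id` parameter names are illustrative assumptions, not the actual servlet interface.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_request_url(server_url, user_input, session_id):
    """Build the URL for one dialogue turn sent to the ALICE servlet.
    The parameter names are hypothetical, for illustration only."""
    return server_url + "?" + urlencode({"text": user_input, "id": session_id})


def ask_chatterbot(server_url, user_input, session_id):
    """Send the user's utterance over HTTP and return the chatterbot's
    textual answer from the response body."""
    with urlopen(build_request_url(server_url, user_input, session_id)) as reply:
        return reply.read().decode("utf-8")
```

The applet-side role is the same: serialize the utterance into a request, then hand the textual reply to the page (via JavaScript) and to the Behavior Manager.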
In order to obtain a system that lets users navigate a bookshop web site, we wrote some AIML categories aimed at book searching and shopping. An AIML category is a simple unit containing a "pattern" section for matching user input and a corresponding "template" section containing an adequate response and/or action (e.g. a JavaScript execution) to user requests.
Our categories were chosen to cover a very large set of the possible ways of requesting a given book in a real bookstore. We considered a set of seven fields that let a user specify the books he/she is interested in. They include the book title, author, publisher, publication date, subject, ISBN code, and a more general keyword field. Successful examples of book requests for the Amazon bookshop web site are the following: I want a book written by Sepulveda; I am looking for a book entitled Journey and whose author is Celine; I am searching for all books written by Henry Miller and published after 1970; I am interested in a book about horror. In alternative forms, it is possible to send requests like: Could you find some book written by Fernando Pessoa?; Search all books whose author is Charles Bukowski; Give me the book whose ISBN code is 0-13-273350-1; Look for some book whose subject is fantasy.
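The mapping from such phrasings to search fields can be sketched with a few patterns in the spirit of the AIML categories; the patterns below are a made-up sample, not the actual category set.

```python
import re

# Illustrative patterns: each maps one phrasing of a book request onto
# one of the seven search fields (only four fields are sketched here).
FIELD_PATTERNS = [
    ("author", re.compile(r"\b(?:written by|whose author is) (?P<v>.+)", re.I)),
    ("title", re.compile(r"\bentitled (?P<v>.+?)(?= and\b|$)", re.I)),
    ("isbn", re.compile(r"\bISBN code is (?P<v>[\dX-]+)", re.I)),
    ("subject", re.compile(r"\b(?:about|whose subject is) (?P<v>.+)", re.I)),
]


def extract_field(request):
    """Return the first (field, value) pair recognized in the request,
    or None when no pattern matches."""
    for field, pattern in FIELD_PATTERNS:
        match = pattern.search(request)
        if match:
            return field, match.group("v").strip(" ?.")
    return None
```

For instance, "I want a book written by Sepulveda" yields the pair `("author", "Sepulveda")`, which the template side can then turn into a bookstore query.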
Clearly, AIML categories do not cope well with user requests that exhibit a high level of ambiguity, due to the peculiar characteristics of human language interaction.
2.3 The Behavior Generator
The Behavior Generator aims at managing the consistency between the facial
expression of the character and the conversation tone. The module is mainly based on
Learning Classifier Systems (LCS), a machine learning paradigm introduced by
Holland in 1976 [12]. The learning module of SAMIR has been implemented through
an XCS [13], a newer kind of LCS which differs in many respects from Holland's traditional framework. The most appealing characteristic of this system is that it is closely related to Q-learning, but it can generate task representations that are more compact than tabular Q-learning [14]. At discrete time intervals, the agent
observes a state of the environment, takes an action, observes a new state and finally
receives an immediate reward.
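This observe-act-reward loop is the standard reinforcement-learning cycle. A minimal tabular Q-learning update, which XCS generalizes by evolving condition-action rules instead of keeping an exhaustive table, can be sketched as follows; the action set shown is a made-up example.

```python
ACTIONS = ["smile", "neutral", "frown"]  # illustrative expression actions


def q_update(q, state, action, reward, next_state, alpha=0.2, gamma=0.9):
    """One Q-learning step: move Q(state, action) toward the received
    reward plus the discounted value of the best next-state action.
    `q` is a dict mapping (state, action) pairs to value estimates."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q
```

Where tabular Q-learning stores one entry per (state, action) pair, the XCS maintains a population of rules whose conditions may generalize over many states, updated with an accuracy-based version of the same error signal.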
The basic components of an XCS are: the Performance Component, which, on the ground of the detected state of the environment, selects the best action to be performed; the Reinforcement Component, whose aim is to evaluate the reward to be
assigned to the system; and the Discovery Component, which, in case of degrading performance, is devoted to the evolution of new, better-performing rules.
The environment in which SAMIR has to act is represented by the user dialogue
(the higher the user satisfaction, the higher the reward received by SAMIR). At the very beginning of its life, the behavior of SAMIR is controlled by a set of randomly generated rules and, consequently, its capability is very low.
Behavior rules are expressed in the classical format if <condition> then <action>,
where <condition> (the state of the environment) represents a combination of 4
possible events, sensed by four detectors, representing different conversation tones, such
as: user salutation (user performs/does not perform salutation), user request
formulation to the agent (no request, polite, impolite), user compliments/insults to the
agent (no compliment, a compliment, an insult, foul language), and user permanence on the Web page (user changes/does not change the page), while <action> represents the
expression that the Animation System displays during user interaction. In particular,
the expression is built as a linear combination of a set of fundamental expressions that
includes the basic emotion set proposed by Paul Ekman, namely anger, fear, disgust,
sadness, joy, and surprise [7]. Other emotions and many combinations of emotions
have been studied but remain unconfirmed as universally distinguishable. However,
the basic set of expressions has been extended to include some typical human expressions such as bother, disappointment, and satisfaction. The Behavior Manager,
as explained above, is able to produce synthetic facial expressions, to be shown
according to the content of the ongoing conversation. Thus the <action> part
provides the Animation System with the percentage of each one of the expressions, to
be used to compose the desired expression of our character. For example, an expression composed of 40% joy and 60% surprise is coded into the following string:
%Surprise  %Sadness  %Joy   %Fear  %Disgust  %Anger  %Bother  %Disappointment  %Satisfaction
  0110       0000     0100   0000    0000      0000    0000         0000             0000
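Reading each four-bit group as the binary encoding of the tens digit of its percentage (60% → 0110, an assumption consistent with the groups shown for this example), the string can be produced as follows:

```python
# The nine fundamental expressions, in the order used by the encoding.
FIELDS = ["surprise", "sadness", "joy", "fear", "disgust",
          "anger", "bother", "disappointment", "satisfaction"]


def encode_expression(percentages):
    """Encode an expression as nine four-bit groups, one per fundamental
    expression, each holding the tens digit of that field's percentage
    in binary. The binary tens-digit reading is our assumption."""
    return " ".join(format(percentages.get(f, 0) // 10, "04b") for f in FIELDS)
```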
During its life, SAMIR performs several interactions with different users and, on the ground of the received reward, the XCS evolves better behavior rules in order to achieve ever better performance. In a preliminary phase we defined a set of 30 interaction rules covering different situations that arise in the course of an interaction (actions performed, user requests, etc.). This set of pre-defined rules represents the training set, that is, the minimal know-how that SAMIR should possess to start its work on the Web site.
To evaluate the performance of the system, we performed some preliminary experiments aimed at verifying the effectiveness of the User Interaction module in learning a set of 30 predefined rules. Thanks to the inherent features of the XCS, SAMIR has been able to learn the pre-defined rules of behavior quite effectively and to generalize some new behavioral patterns that can update the initial set of rules. In this way, SAMIR is comparable to a human assistant who, after a preliminary training phase, continues to learn new rules of behavior on the basis of personal experience and interaction with human customers.
3 Experimental Results
In this section we present some experimental results obtained from the interaction between SAMIR and some typical users searching for books on topics like literature, fantasy, and horror, or for more specific books for which information like the title, author, and publisher is given.
When the user connects to the site, SAMIR introduces itself and asks the user for his/her name, for authentication and recognition purposes. Figure 4 shows the results of a user request for a horror book. The result of the query is a set of books in this genre available at the Amazon book site. It can be noticed that the book ranked first is by the author Stephen King.
Figure 5 shows an example of a more sophisticated query, requesting Henry Miller books published after 1970. In this case the user hurls a heavy insult at SAMIR and, consequently, its expression is angry.
Fig. 4. Requesting a horror book
4 Conclusions
In this paper we presented a first prototype of a 3D agent able to support users in searching for books on a Web site. At present, the prototype is linked to a specific site, but we are currently implementing an improved version that will be able to query several Web bookstores simultaneously and to report to users a comparison based on different criteria such as prices, delivery times, etc.
Moreover, our work will aim to give our agent a more natural behavior. This can be achieved by improving the dialogues and, eventually, the text-processing
capabilities of the ALICE chatterbot, and by giving the agent fully proactive behavior: the XCS should be able not only to learn new rules to generate facial expressions, but also to modify dialogue rules, suggest interesting links, and supply effective help during site navigation.
Fig. 5. All books by Henry Miller published after 1970
References

[1] J. Cassell et al. (Eds.), Embodied Conversational Agents. MIT Press, Cambridge, 2000.
[2] M. Jalali-Sohi & F. Baskaya, A Multimodal Shopping Assistant for Home E-Commerce. In: Proceedings of the 14th Int'l FLAIRS Conf. (Key West FL, 2001), 2-6.
[3] S. Kshirsagar & N. Magnenat-Thalmann, Virtual Humans Personified. In: Proceedings of the Autonomous Agents Conference (AAMAS), (Bologna, 2002).
[4] R.R. McCrae & O.P. John, An introduction to the five-factor model and its applications, J. of Personality, 60 (1992), 175-215.
[5] F. Abbattista, A. Paradiso, G. Semeraro & F. Zambetta, An agent that learns to support users of a web site. In: R. Roy, M. Koeppen, S. Ovaska, T. Furuhashi and F. Hoffmann (Eds.), Soft Computing and Industry: Recent Applications, Springer, 2002, 489-496.
[6] B. Fleming & D. Dobbs, Animating Facial Features and Expressions. Charles River Media, Hingham, 1998.
[7] P. Ekman, Emotion in the Human Face. Cambridge University Press, Cambridge, 1982.
[8] A. Paradiso, F. Zambetta & F. Abbattista, Fanky: a tool for animating 3D intelligent agents. In: A. de Antonio, R. Aylett, D. Ballin (Eds.), Intelligent Virtual Agents, (Madrid, 2001), Springer, Berlin, 242-243.
[9] MPEG-4 standard specification. http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm.
[10] The Pocket PC website. http://www.microsoft.com/mobile/pocketpc/default.asp.
[11] T.W. Sederberg & S.R. Parry, Free-Form Deformation of Solid Geometric Models, Computer Graphics, 20(4), (1986), 151-160.
[12] J.H. Holland, Adaptation. In: R. Rosen and F.M. Snell (Eds.), Progress in Theoretical Biology, New York: Plenum, 1976.
[13] S.W. Wilson, Classifier Fitness Based on Accuracy. Evolutionary Computation, 3(2), (1995), 149-175.
[14] C.J.C.H. Watkins, Learning from Delayed Rewards, PhD thesis, University of Cambridge, Psychology Department, 1989.