End-to-End Trainable Chatbot for Restaurant Recommendations

AMANDA STRIGÉR
Abstract
Task-oriented chatbots can be used to automate a specific task, such
as finding a restaurant and making a reservation. Implementing such
a conversational system can be difficult, requiring domain knowledge
and handcrafted rules. The focus of this thesis was to evaluate the
possibility of using a neural network-based model to create an end-
to-end trainable chatbot that can automate a restaurant reservation
service. For this purpose, a sequence-to-sequence model was imple-
mented and trained on dialog data. The strengths and limitations of
the system were evaluated and the prediction accuracy of the system
was compared against several baselines. With our relatively simple
model, we were able to achieve results comparable to the most ad-
vanced baseline model. The evaluation has shown some promising
strengths of the system but also significant flaws that cannot be over-
looked. The current model cannot be used as a standalone system
to successfully conduct full conversations with the goal of making a
restaurant reservation. The review has, however, contributed a
thorough examination of the current system and shown where future
work ought to be focused.
Sammanfattning

Chatbots can be used to automate simple tasks, such as finding a
restaurant and booking a table. Creating such a conversational system
can, however, be difficult, time-consuming, and require considerable
domain knowledge. This thesis investigates whether it is possible to
use a neural network to create a chatbot that can learn to automate a
service that helps the user find a restaurant and book a table. To this
end, a so-called sequence-to-sequence model was implemented and
then trained on domain-specific dialog data. The strengths and
weaknesses of the system were evaluated, and its ability to generate
correct responses was compared with several other models. Our
relatively simple model achieved results similar to those of the most
advanced of the other models. The results show the strengths of the
model, but also reveal significant flaws. Because of these flaws, the
system cannot, by itself, be used to create a chatbot that can help a
user find a suitable restaurant. The evaluation has, however,
contributed a thorough examination of the errors the system makes,
which can facilitate future work in the area.
Contents

1 Introduction
   1.1 Purpose and Research Question
2 Background
   2.1 Chatbot Models
   2.2 Datasets
   2.3 Evaluation
   2.4 Difficulties
   2.5 Ethics
3 Related Work
   3.1 Task-Oriented Conversation Systems
   3.2 End-to-End Trainable Systems
   3.3 End-to-End Trainable Task-Oriented Systems
4 Theory
   4.1 Neural Networks
   4.2 Recurrent Neural Networks
   4.3 Sequence-to-Sequence Learning
   4.4 Memory Networks
   4.5 Text Representation
5 Method
   5.1 Problem Definition
   5.2 Data
   5.3 Evaluation
   5.4 Model
   5.5 Implementation
   5.6 Training
   5.7 Chat Program
6 Results
7 Discussion
8 Conclusions and Future Work
Bibliography
A Example Dialogs
B Prediction Errors
Chapter 1
Introduction
Chatbots have a wide variety of possible use cases and are of interest
to both industry and researchers. More and more companies are
considering adding bots to their messaging services. Chatbots are use-
ful as they can answer questions and assist customers without delay
and at all hours. The question is whether chatbot design techniques
have become advanced enough for us to create bots that are useful in
practice.
Chatbots of varying degrees of intelligence have existed for decades,
the first appearing in the 1960s [31]. These early systems used simple
keyword matching to generate an answer. Since then much has
happened, and many more chatbots have been developed using different
techniques such as pattern matching, keyword extraction, natural
language processing, machine learning, and deep learning.
There are generally two different types of chatbots with regard to the
intended use case: conversational bots and task-oriented bots. Con-
versational bots aim to entertain the user by simulating a general con-
versation, usually without a specific end goal. In this case, longer con-
versations are often an indication of a bot performing well. The aim of
a task-oriented bot is, on the other hand, to interpret and perform the
action the user is requesting as fast as possible.
Chatbots have many possible fields of application, and task-oriented
chatbots in particular have a clear utility in real applications as they
are built to help users achieve some specific goal such as to make a
restaurant reservation, order a pizza, book a flight, and so on. Using
bots instead of human assistants could reduce the time and effort re-
quired to get help with the specific task at hand.
Chapter 2
Background
2.2 Datasets
The availability of datasets to train the conversation system on varies
depending on what type of data you need. For general conversational
systems, different forums are often used [18, 23]. Publicly available
task-oriented data is however scarce. Task-oriented data is often col-
lected by companies or collected using Wizard of Oz techniques [7]
and crowdsourcing [32]. With the Wizard of Oz technique, data
is obtained by letting users chat with what they believe to be an au-
tonomous system which in reality is controlled by a human. An al-
ternative to data collected from the real world is to use synthetically
generated data such as the dataset created by Bordes et al. [3]. There
are advantages and disadvantages to using both real and synthetic
data. With synthetic data, you have more control over the test setting
and what you are testing, but it is often not representative of a real
use case. Real data provides an accurate setting for the usage of the
conversational system but contains different difficulties such as gram-
mar mistakes, slang, abbreviations, and other human errors. While
we want the system to be able to handle these conversations they can
make the training more difficult. One way to improve the performance
is to do some preprocessing of the raw data. A few examples of what
can be done are to convert contractions to full words, do spelling and
grammar correction, convert all letters to lower case, remove punctu-
ation, and replace words such as names or numbers with a common
token, e.g., <person>, <number>, and so on [24].
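A minimal sketch of such a preprocessing step could look as follows; the contraction list is only illustrative and the token names are the example tokens mentioned above:

    import re

    CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}

    def preprocess(utterance):
        text = utterance.lower()                      # convert to lower case
        for short, full in CONTRACTIONS.items():
            text = text.replace(short, full)          # expand contractions
        text = re.sub(r"\b\d+\b", "<number>", text)   # replace numbers with a token
        text = re.sub(r"[^\w\s<>]", "", text)         # remove punctuation
        return text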
2.3 Evaluation
When developing a conversation system, an important question is how
to evaluate its performance. Different systems have different proper-
ties and might need to focus on different aspects in the evaluation.
This could be user entertainment or satisfaction, goal achievement,
grammar, consistency, and so on.
One way to evaluate a conversation system is to use humans to
judge the quality of the system. While possibly being the most accurate
approach, human evaluation is also expensive and time-consuming.
2.4 Difficulties
When creating a chatbot there are several issues and difficulties that
need to be addressed. One difficulty is to keep track of user intent,
i.e., what the user wants [10]. Since we are using natural language to
communicate, there is a multitude of ways the user can express an
intent, and the system has to understand them all. If the system asks “do
you have any preferences on a type of cuisine?” the user can answer
in a myriad of ways which all have the same meaning. In addition,
the user intent can depend on multiple variables and change over the
course of the conversation.
Utterances are often dependent on the context in which they were
given [26]. Both the context of what has been said previously in the
conversation and the physical context can be relevant. The physical
context is for example location, time, and information about the user.
Keeping track of this context can be challenging as it grows. With a
large context, it becomes more difficult to identify what parts of the
context are relevant to the current question.
Another issue is the tendency of systems trained on real data to
give common or generic responses [15]. Generic responses can be
used in most situations but do not contribute to the conversation in
any meaningful way. The system favors these safe options as the ob-
jective during training is to maximize the probability of success. Com-
mon expressions such as “yes”, “no”, and “thanks” are generally over-
represented in conversational data. This causes the system to see them
as more likely and therefore a safe bet to use as an answer to almost
anything. These problems can be combated by using different nor-
malization techniques and anti-language models that are designed to
penalize high-frequency and generic responses [15].
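As an illustration of this idea (not the exact formulation used in [15]), a set of candidate responses could be reranked with a score that subtracts a scaled prior likelihood of the response, so that generic answers are penalized:

    def anti_lm_score(logp_given_context, logp_prior, lam=0.5):
        # Reward responses that are likely given the context, but
        # penalize responses that are likely on their own (generic).
        # Both arguments are log probabilities; lam sets the penalty
        # strength.
        return logp_given_context - lam * logp_prior

The candidate with the highest score is then selected as the answer.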
Creating a system that can respond in a consistent way or have a
personality is difficult for learning systems [16]. This is because con-
versational systems are often trained on dialog data collected from
many different users. As a result, the personality of the chatbot be-
comes a combination of all of them. The same question formulated
in different ways can, therefore, receive very different responses. For
example, the question “where were you born” can generate the an-
swer “Stockholm” while the question “where do you come from” can
receive the answer “Madrid”. Personality and consistency are impor-
tant if a chatbot is to seem human. Consistency is necessary for task-
oriented chatbots since the user’s goal should be achieved regardless
of how it is formulated.
Chatbots, especially task-oriented bots, often need to be able to
interact with a knowledge base to be able to use information about
the world. There exist different ways to manage this integration but
it is extra challenging for end-to-end trainable systems. The interaction
with the knowledge base often breaks the differentiability of the
system which can then no longer be trained purely end-to-end. Solu-
tions include having several end-to-end trainable modules and train-
ing them separately [32] or to augment the conversations with the calls
to the knowledge base [3].
2.5 Ethics
Chatbots can be used for many purposes and not all of them are good.
Most people have probably heard of or encountered spambots. Spam-
bots are programs that are designed to automatically send out unso-
licited messages through different channels. The goal is usually to ad-
vertise, boost a site’s search engine ranking, or to trick users in some
way. As chatbots become more advanced they can become difficult to
detect and thus more successful at scamming people.
Another issue that often arises when discussing artificial intelli-
gence is the possibility of computers replacing humans by automating
their work. This discussion very much applies to task-oriented chat-
bots as their entire purpose is to automate tasks. Automation does not,
however, have to be a bad thing. If chatbots can automate simple and
repetitive tasks, it can free up time for the human workers to focus
on the more advanced cases. Workers will have to deal with fewer
boring and repetitive tasks, and users can get quicker responses and easier
access to services.
Chapter 3
Related Work
In this chapter, a few conversation systems are presented and the dif-
ferent models used in the works are briefly discussed. First, conversation
systems used in task-oriented domains are presented. Then, different
end-to-end trainable systems are introduced, and finally, two
end-to-end trainable systems for task-oriented conversations are dis-
cussed.
3.1 Task-Oriented Conversation Systems

Creating adequate rules can be difficult even in narrow areas and will soon
become impossible in broader domains.
These types of models require specific domain knowledge and of-
ten massive amounts of manual design to work well. While capable
of giving good results in limited domains, these models are not usable
across domains and have difficulty handling broad domains.
3.2 End-to-End Trainable Systems

One example is the dual encoder model by Lowe et al. [18], which uses
two RNNs to encode both the context,
i.e., the conversation so far, and a possible response. Given the final
hidden states from both RNNs, the probability that they are a valid
pair is calculated, in other words, how likely the response is given the
context.
3.3 End-to-End Trainable Task-Oriented Systems

One such system consists of several modules, including an intent
network, the database, and the belief state. Each module can be trained
directly from conversation data except the database operator since the
database attributes are explicitly stated. This breaks the differentiabil-
ity of the system and is the reason why the components have to be
trained separately. While the system has an explicit representation of
the database attributes, the intent network has a distributed represen-
tation of the user intent which allows ambiguous input.
Chapter 4
Theory
4.2 Recurrent Neural Networks

(Figures 4.2 and 4.3: recurrent neural networks unrolled over time, with an input x and an output at each time step.)
In Figure 4.2 and Figure 4.3, $x$ is the input at each time step, and $o$ is
the output calculated based on the hidden state so that $o_t = f(Vh_t)$ [5,
Ch. 10]. The hidden state $h$ is calculated based on the previous hidden
state and the current input, i.e., $h_t = f(Ux_t + Wh_{t-1})$ [5, Ch. 10]. $U$,
$V$, and $W$ are the network parameters to be learned during training.
Output and input can be given at each time step but might not always
be necessary. Depending on what the network should do there could
be one or many inputs and one or many outputs. An example where
there are many inputs but only one output is in sentiment analysis,
where a sentence is classified as either having a positive or a negative
sentiment. The sentence is fed to the network word by word and af-
ter the whole sequence of words has been processed the output is the
sentiment classification. A case where there are multiple outputs but
only one input is when RNNs are used to generate image descriptions.
Here the input is an image and the output is a sequence of words de-
scribing the image.
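As a concrete sketch of this recurrence, one time step of a vanilla RNN can be written as below, using tanh for the activation f (a common but not the only choice):

    import numpy as np

    def rnn_step(x_t, h_prev, U, W, V):
        # Hidden state: h_t = f(U x_t + W h_{t-1})
        h_t = np.tanh(U @ x_t + W @ h_prev)
        # Output: o_t = f(V h_t); in practice a softmax over V h_t is
        # common when the output is a class or a token
        o_t = np.tanh(V @ h_t)
        return h_t, o_t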
In RNNs the network parameters are shared by all time steps [5, Ch. 10].
Because the parameters are shared the gradients depend on previous
time steps. To train an RNN an algorithm called backpropagation
through time (BPTT) is used [5, Ch. 10]. BPTT works just like the reg-
ular backpropagation algorithm on an unrolled RNN.
When using the backpropagation algorithm there are difficulties
in learning long-term dependencies [2]. Since backpropagation com-
putes gradients using the chain rule, many small numbers are multi-
plied together causing the gradient of layers far from the final layer
to become extremely small or vanish. There are ways to avoid this
problem, such as using gated RNNs, e.g., long short-term memory
networks (LSTMs) [9] or the more recent gated recurrent units (GRUs) [4].
LSTMs were first introduced by Hochreiter and Schmidhuber [9]
and are a type of RNN that are designed to better handle long-term
dependencies. The difference from a regular RNN is how the hidden
state is computed. LSTMs have an internal cell state and have the abil-
ity to add or remove information from the state. What to add or re-
move is regulated by structures called gates. Generally, there are three
gates called input, forget and output gates. The input and forget gates
control the cell state by deciding how much of the new input to add
to the state and what to throw away from the cell state. The output
gate decides what to output based on the cell state. These gates al-
low LSTMs to handle long-term dependencies. The LSTM learns how
its memory should behave by learning the parameters for the gates.
There exist many different versions of the LSTM network. Some of
them are evaluated in the work by Greff et al. [6].
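For reference, one common formulation of the LSTM updates is the following, where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, and the $W$, $U$, and $b$ are learned parameters; exact variants differ, as surveyed by Greff et al. [6]:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$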
4.3 Sequence-to-Sequence Learning

The basic idea of mapping an input sequence to an output sequence
can be used for many advanced tasks. Two examples
are machine translation [28] and conversational models [29]. In ma-
chine translation the goal is to map a sequence of words in one lan-
guage to the corresponding sequence of words in another language.
In conversational systems the aim is instead to translate from an input
sentence to a system response.
Sequence-to-sequence models exist in many variations but the ba-
sic structure consists of two RNNs [4]. One encodes the input into a
vector representation capturing the meaning and important informa-
tion of the input sequence. The other RNN then takes this vector as
input and uses it to produce the output sequence. This process is de-
picted with the unrolled RNNs in Figure 4.4.
In the figure, each box is an RNN cell, such as LSTM or GRU. The
encoder reads the input, token by token until it gets a special token
marking the end of the input. In the figure, this token is <EOS>. Based
on the context generated by the encoder, the decoder generates the
output sentence, token by token until it generates a stop token. At each
time step, the previously generated token is fed back into the network
as the input.
At each time step the decoder produces a probability distribution over
the vocabulary [28]. These probabilities can be used to either generate
a sequence, token by token, or calculate the probability of an existing
sequence. To generate a sequence in a greedy manner you select the
token with the highest probability at each time step and feed it back
into the network to generate the next token [28]. Instead of just con-
sidering the most probable token at each step, the top x tokens can
be examined. In this way we can use beam-search to find the most
probable sequence of tokens. To calculate the probability of an existing
sequence given the input sequence we estimate the conditional
probability $p(y_1, y_2, \dots, y_m \mid x_1, x_2, \dots, x_n)$ [28], where the $x_i$ are the
input tokens and the $y_i$ are the output tokens. In the example in
Figure 4.4, this corresponds to the probability of the output sequence
given the input sequence.
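To make the greedy procedure concrete, a decoding loop might look like the sketch below; decoder_step is a hypothetical function that returns the next decoder state together with a probability distribution over the vocabulary:

    def greedy_decode(decoder_step, encoder_state, sos_id, eos_id, max_len=50):
        # Generate the response token by token, always picking the
        # most probable next token and feeding it back as input.
        state, token, output = encoder_state, sos_id, []
        for _ in range(max_len):
            state, probs = decoder_step(state, token)
            token = int(probs.argmax())
            if token == eos_id:            # stop token generated
                break
            output.append(token)
        return output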
Chapter 5
Method
Table 5.1: An example of a conversation between a user and the system. The
conversation has the same structure as the conversations in the dataset
by Bordes et al. [3]. System turns are prefixed with “System:” and user
turns with “User:”. The lines beginning with a star are results from an
API call. The dialog is divided into five tasks that highlight different
difficulties and essential capabilities of a chatbot.

Task 1 (issuing API calls):
User: Hi!
System: Hello, what can I help you with today?
User: Can you make a restaurant reservation in Rome for eight with Italian cuisine.
System: I'm on it!
System: Which price range are you looking for?
User: In a cheap price range please.
System: Ok, let me look into some options for you.
System: api_call(Rome, Italian, Eight, Cheap)

Task 2 (updating API calls):
User: Actually I would prefer for six.
System: Sure. Is there anything else to update?
User: No.
System: Ok, let me look into some options for you.
System: api_call(Rome, Italian, Six, Cheap)

Task 3 (displaying options):
* Restaurant1 R_phone phone1
* Restaurant1 R_cuisine italian
* Restaurant1 R_address address1
* Restaurant1 R_location Rome
* Restaurant1 R_number six
* Restaurant1 R_price cheap
* Restaurant1 R_rating 6
* Restaurant2 R_phone phone2
* Restaurant2 R_cuisine italian
* Restaurant2 R_address address2
* Restaurant2 R_location Rome
* Restaurant2 R_number six
* Restaurant2 R_price cheap
* Restaurant2 R_rating 4
System: What do you think of this option: Restaurant1
User: No, I don't like that.
System: Sure, let me find another option for you.
System: What do you think of this option: Restaurant2
User: Let's do it!
System: Great, let me do the reservation.

Task 4 (providing extra information):
User: Can you provide me the address?
System: Here it is: address2
User: You rock!
System: Is there anything else I can help you with?
User: No thanks.
System: You're welcome.

Task 5 (full dialog): tasks 1–4 together form one complete dialog.
5.2 Data
Data collected from real conversations usually contains various noise
such as spelling errors, wrong information, interpretation errors and
so on which makes the training more difficult. Larger or more open
domains are also more difficult to learn from, as the variation of the
dialogs is greater and the number of possible answers increases. To
limit the difficulties presented by the data itself the chatbot was trained
and tested in a very limited domain using mostly synthetically gener-
ated data. Using synthetic data allows for more control over the format
of the dialog which leads to a more straightforward evaluation of the
strengths and limitations of the system. To further test the system in
a more realistic setting, data adapted from real human-bot conversa-
tions is used.
The conversation data used for training and evaluating the system
was created by Bordes et al. [3]. The dataset is an open resource to
test end-to-end goal-oriented dialog systems in a restaurant booking
domain. The data consists of six different subsets that all aim to test
different abilities that are deemed essential for a chatbot in a task-
oriented domain. These subsets are referred to as tasks 1–6. Tasks
1–5 consist of synthetically generated dialogs, while task 6 consists of
real human-bot conversations. The data used in task 6 is an adapta-
tion of the restaurant reservation dataset from the second dialog state
tracking challenge (DSTC2) [8]. The dataset was originally designed
to train and test dialog state tracker components. A state tracker keeps
track of an estimated dialog state at each step of the conversation in
order to choose actions, such as what restaurant to suggest.
The different tasks are described below. An example containing
parts of tasks 1–5 is provided in Table 5.1. Examples of each individual
task can be found in Appendix A.
• Task 5: conduct full dialogs, combines tasks 1–4 to test the sys-
tem’s ability to carry out a complete conversation that contains
all difficulties presented by each subtask.
All tasks are further split into three separate datasets for training,
development, and testing. We used the same split as the creators of
the dataset [3]. For tasks 1–5, these sets all contain 1,000 dialogs each.
Task 6 has more conversations but uses a different split, using more
conversations for training and fewer for development. More statistics
on the datasets are presented in Table 5.2 and Table 5.3. Tasks 1–5
have two separate test sets, one that uses the same vocabulary as the
training set and one that uses words that have not been seen during
training to test how well the system can generalize to new tokens. This
test is called an out of vocabulary (OOV) test [3]. Both test sets consist
of 1,000 dialogs each. Task 6 does not have this kind of test set.
The dataset also has a knowledge base related to tasks 1–5. This
knowledge base is a text file containing information about all restau-
rants that appear in the dataset. The knowledge base contains 1,200
restaurants in total.
5.3 Evaluation
The goal of the chatbot is to assist the user in booking a table at a
restaurant. Its performance was evaluated by measuring the system’s
ability to predict the next response, given the conversation so far. The
correct response in this evaluation is considered to be the response
from the actual dialog in the dataset. For example, consider the con-
versation consisting of the utterances $\{u_1, b_1, u_2, b_2, \dots, u_n, b_n\}$, where $u_i$
is a user utterance and $b_i$ is a bot utterance. This conversation will be
split into $c_1 = \{u_1\}$, $r_1 = \{b_1\}$, $c_2 = \{u_1, b_1, u_2\}$, $r_2 = \{b_2\}$, and so on,
where $c_i$ is the context and $r_i$ is the correct response given that context.
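In code, this splitting can be sketched as follows, assuming the utterances alternate strictly between user and bot:

    def context_response_pairs(utterances):
        # utterances = [u1, b1, u2, b2, ...]; bot utterances are at
        # the odd indices.
        pairs = []
        for i in range(1, len(utterances), 2):
            pairs.append((utterances[:i], utterances[i]))
        return pairs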
To only consider the response from the actual dialog to be the cor-
rect response results in a rather strict evaluation. This might be un-
desirable if multiple answers can be equally correct. For example, it
would be strange to say that “thank you” is incorrect if the correct
answer is “thanks”. As it turns out, however, we do not have this
ambiguity in our dataset. The number of different bot utterances is
limited and the bot does not express the same thing in different ways,
and thus a response can be considered as correct or incorrect without
ambiguity.
The response accuracy is measured both for each response and for
the entire conversation. The response accuracy measures the percent-
age of responses where the bot was able to predict the correct response.
The conversation accuracy is calculated as the percentage of conversa-
tions where every response was predicted correctly. Every response is
required to be correct because even just one mistake can keep the user
from completing their task. These evaluation metrics are the same as
in the study by Bordes et al. [3] and the results are compared against
the baselines provided in that paper.
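Given a model's prediction function (a hypothetical interface), the two metrics can be computed as in the sketch below:

    def evaluate(dialogs, predict):
        # dialogs: a list of dialogs, each a list of
        # (context, correct_response) pairs as defined above.
        n_responses = n_correct = n_dialogs_correct = 0
        for dialog in dialogs:
            all_correct = True
            for context, response in dialog:
                n_responses += 1
                if predict(context) == response:
                    n_correct += 1
                else:
                    all_correct = False
            n_dialogs_correct += all_correct
        per_response = n_correct / n_responses
        per_conversation = n_dialogs_correct / len(dialogs)
        return per_response, per_conversation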
5.4 Model
The goal of the system is to predict the correct response given the
context. During training the goal is, therefore, to train the model to
assign the observed response as high a probability as possible by
maximizing the probability of the observed response given the context,
i.e., $p(\text{response} \mid \text{context})$.
Both the context and response can be seen as a sequence of words,
which changes the goal to predicting the most likely sequence of to-
kens given a previously seen sequence of tokens, i.e., maximizing
$p(r_1, \dots, r_n \mid c_1, \dots, c_m)$. Given this formulation of the system goal, a
seemingly fitting model to use is sequence-to-sequence learning [4]
(described in Section 4.3). This is a well-known model that exists in
many variations and has been used successfully for several natural
language tasks.
The responses are generated using the greedy approach described
in Section 4.3, creating the response word by word by selecting the
word with the highest probability at each step. This differs from how
the responses are predicted in the work by Bordes et al. [3]. In their
work a response is selected from a set of candidates. This set of can-
didate responses consists of all bot utterances from the entire dataset
(including the test set). The probability of each candidate given the
context is calculated and the highest scoring response is selected as
the answer.
A retrieval-based inference function can be implemented for a
sequence-to-sequence model (as mentioned in Section 4.3). The reason
for choosing a generative approach instead is mainly because of the
simplicity and the computational difference. A retrieval-based func-
tion has to calculate a score for every possible candidate response to
find the one with the highest probability. A generative function cre-
ates a response by selecting the highest scoring word at each time
step. Thus, a generative approach can require much less computa-
tional time, especially if the candidate set is large. There exist ways
to speed up a retrieval-based approach, such as using clustering tech-
niques to first find a subset of candidate responses to select an answer
from [10], but the simplicity of a generative approach made it the pre-
ferred option. Even though the prediction is done differently the accu-
racy results should still be comparable.
5.5 Implementation
The system was implemented using the open-source machine learning
library TensorFlow [20]. The sequence-to-sequence model is based on
the tf.contrib.seq2seq module in TensorFlow 1.0.
Both the encoder and decoder consist of a dynamic RNN, mean-
ing that they put no restriction on the batch size or sequence length.
5.6 Training
The model was trained using weighted cross-entropy loss for se-
quences of logits using the TensorFlow function sequence_loss defined
in the module tf.contrib.seq2seq. The loss is minimized using the Adam
optimizer as described by Kingma and Ba [11]. Adam is “an algorithm
for first-order gradient-based optimization of stochastic objective func-
tions, based on adaptive estimates of lower-order moments” [11]. The
TensorFlow implementation defined in the tf.train module was used
with a learning rate of 0.001.
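A rough sketch of this setup in TensorFlow 1.x is shown below; the tensor shapes and the stand-in projection are assumptions made only to keep the sketch self-contained:

    import tensorflow as tf  # TensorFlow 1.x (uses tf.contrib)

    VOCAB_SIZE = 1000  # illustrative; in practice taken from the data

    # In the real model the logits come from the decoder RNN; here a
    # placeholder plus a dense projection stand in for it.
    decoder_outputs = tf.placeholder(tf.float32, [None, None, 128])
    targets = tf.placeholder(tf.int32, [None, None])    # correct token ids
    weights = tf.placeholder(tf.float32, [None, None])  # 1.0 for tokens, 0.0 for padding

    logits = tf.layers.dense(decoder_outputs, VOCAB_SIZE)  # [batch, time, vocab]
    loss = tf.contrib.seq2seq.sequence_loss(logits, targets, weights)
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)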
The model was trained for several epochs, i.e., several runs over
all training examples. After each epoch, the loss and accuracy of the
model were calculated on the validation dataset. The training was
terminated when the accuracy no longer improved. The validation
dataset was also used to identify parameters of the model such as em-
bedding size and number of units in the encoder and decoder. The
model that achieved the best accuracy on the validation set was then
used to calculate the accuracy on the test sets.
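Put together, the training procedure can be summarized by the loop below, where the model's methods are hypothetical names:

    def train_with_early_stopping(model, train_data, val_data, max_epochs=100):
        best_acc, best_params = -1.0, None
        for epoch in range(max_epochs):
            model.train_one_epoch(train_data)   # one pass over the training data
            acc = model.accuracy(val_data)      # validate after each epoch
            if acc > best_acc:
                best_acc, best_params = acc, model.get_parameters()
            else:
                break                           # accuracy stopped improving
        model.set_parameters(best_params)       # the best model is used on the test sets
        return model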
Chapter 6
Results
The accuracy results of our model are presented in the last column
(seq-to-seq) of Table 6.1. On task 1 (issuing API calls) and task 2
(updating API calls), our sequence-to-sequence model achieves 100%
accuracy which means that every response was predicted correctly.
When testing using words that were not seen during training, the ac-
curacy per response dropped to around 80% and the accuracy per conver-
sation to 0%. For both tasks, the errors on the out of vocabulary test set
were almost exclusively API calls with the wrong entity, i.e., “api_call
italian madrid two expensive” instead of “api_call thai tokyo two ex-
pensive”. Examples of prediction errors can be found in Appendix B.
For task 3 (displaying options) and task 4 (providing extra information),
the accuracy is almost the same for both the regular test set and the
out of vocabulary test set. On both tasks, we get 0% per conversation
accuracy. Most errors are again caused by the use of the wrong entity.
The errors on task 3 have the form “what do you think of this option
wrong_restaurant”. The results from an API call that are present in the
context are often ignored, and some restaurant from the training data
is used instead. On task 4, all errors occur when the user asks for in-
formation about a restaurant, i.e., an address or phone number. The
system either gives the wrong information or none at all.
Table 6.1: Accuracy results for several baseline models from Bordes et al. [3], together with the results of our
sequence-to-sequence model in the last column. The accuracy of each model is given as a percentage per response and,
in parentheses, per conversation. The test sets that use words that are not in the training data are marked by OOV
(out of vocabulary).
Task | Rule-based | TF-IDF match | Nearest neighbor | Supervised embeddings | Memory network | Memory network (+ match type) | Seq-to-seq
Task 1 (OOV) | 100 (100) | 5.8 (0) | 44.1 (0) | 60.0 (0) | 72.3 (0) | 96.5 (82.7) | 81.1 (0)
Task 2 (OOV) | 100 (100) | 3.5 (0) | 68.3 (0) | 68.3 (0) | 78.9 (0) | 94.5 (48.4) | 78.9 (0)
Task 3 (OOV) | 100 (100) | 8.3 (0) | 58.8 (0) | 65.0 (0) | 74.4 (0) | 75.2 (0) | 75.3 (0)
Task 4 (OOV) | 100 (100) | 9.8 (0) | 28.6 (0) | 57.0 (0) | 57.6 (0) | 100 (100) | 57.0 (0)
Task 5 (OOV) | 100 (100) | 4.6 (0) | 48.4 (0) | 58.2 (0) | 65.5 (0) | 77.7 (0) | 66.9 (0)
Task 6 | 33.3 (0) | 1.6 (0) | 21.9 (0) | 22.6 (0) | 41.1 (0) | 41.0 (0) | 47.1 (1.8)
On task 5 (conduct full dialogs), our system achieved 99% per re-
sponse accuracy and 85% per conversation. Similar to previous tasks,
the system makes errors on task 5 when using entities. For the regular test set, all
errors were caused by providing the wrong address or phone number
to a restaurant. On the out of vocabulary test we still have issues with
using the correct entity, making errors when issuing API calls, pro-
viding information, and suggesting restaurants. Also, we make some
other errors where the first word of the sentence is wrong, e.g., “where
price range are you looking for” and “any many people would be in
your party”.
The errors on task 6 are more diverse. Commonly the prediction is
close to the expected answer but with a few wrong words. In other
cases the prediction is completely nonsensical, such as “would are
restaurants in the expensive of town do tasty kind of food would area
range”.
The accuracy of our sequence-to-sequence model is compared
against several baseline systems, as shown in Table 6.1. The baseline
results are taken from Bordes et al. [3]. The baseline models consist of
a rule-based system, two classical information retrieval models, and
two network-based models. For more details on the implementation
of the baseline models, please refer to the paper by Bordes et al. [3].
The sequence-to-sequence model outperforms the traditional infor-
mation retrieval models on all tasks. It also gets equal or better results
compared to the supervised embedding model. The accuracy of our
model is similar to that of the memory network without match type
features. With match type features, memory networks achieve the best
result on most tasks, the only exceptions being tasks 5 and 6 where our
model performs slightly better. Match type features seem to improve
the result particularly on the out of vocabulary tests. The rule-based
system achieves 100% accuracy on tasks 1–5 on both test sets, but has
worse results on task 6.
Chapter 7
Discussion
On tasks 3 and 4, the system has
difficulty using results from API calls to suggest restaurants and
provide information about them. For these two tasks it does not seem
to make a difference if the system has seen the entities during training
or not, as we get 0% conversation accuracy on both the OOV tests and
the regular tests. Both task 3 and task 4 struggle with the same short-
coming of the system but get quite different per response accuracies.
This is most likely because of the different lengths of the conversations
in both tasks. As presented in Table 5.3, task 3 consists of conversations
with 17 utterances on average while task 4 has an average of eight (not
counting results from API calls). On task 5, where we conduct full
dialogs, the system still makes the same errors but much more rarely.
Rule-based systems have the best performance on most tasks. This
is probably because tasks 1–5 are synthetically generated, have a sim-
ple structure, and are thus predictable and not too difficult to write
rules for that achieve 100% accuracy. On task 6, which consists of
human-bot conversations, both memory networks and our sequence-
to-sequence model outperform the rule-based system. This shows that
neural networks are useful in real-life situations where the input is
more varied and the conversations are less structured. A network can
learn to deal with situations that we have a hard time predicting in ad-
vance, and thus creating rules for. This suggests that machine learning
is useful in real situations where we have non-synthetic or unstruc-
tured data.
While performing better than the rule-based system on task 6, our
model still achieves an accuracy of less than 50%. Being able to answer
correctly less than half of the time is not sufficient for use in a real-
world application. Even one error in a conversation, such as making
the incorrect API call, can prevent the user from completing their task.
A per conversation accuracy of 1.8% is clearly unacceptable. The re-
sults do however indicate that rule-based and network-based systems
might have different strengths and weaknesses. Thus a combination
of the two might be a good idea. We could use rules for situations that
are easy to predict and write rules for, while the network can learn to
deal with unexpected situations or situations where the input is hard
to define with a rule-based system. The network could, for example,
learn to recognize greetings, find entities in a request, and so on.
Chapter 8
Conclusions and Future Work
The evaluation has contributed a thorough examination of the current
system and shown where future work might make the most difference. The
results could also provide a baseline for future work.
How the user feels when using the system is just as important as the
raw prediction accuracy. The chat
application is, however, rather minimal at the moment. As the goal
of this thesis was to evaluate the power of using neural networks to
create a chatbot, the application was built almost solely around the
sequence-to-sequence model. There are therefore
many improvements that can lead to enhanced user experience. One
simple improvement is to calculate the probability of our response
at each step of the conversation and give the user a warning if the
probability is below some threshold. This message could be of the
form “Sorry, I’m not certain that I understood your request but my
best guess is: ...”. This limits the damage if the system is wrong and
gives the user a hint of what kind of questions the system can answer
with certainty and which it cannot. This threshold has to be set ap-
propriately so that the message does not show too often and create
additional annoyance instead of being helpful. Sometimes the chatbot
responds in a way that does not further the user towards their goal.
When this faulty response is entered into the context it could confuse
the chatbot when making future predictions. A way to combat this
could be to detect when we made an error and not append that part to
the context. An alternative is to train the system to recover from this
state by training it on these types of conversations as well.
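The confidence-warning idea could be sketched as follows; model.predict returning a (response, probability) pair, and the threshold value, are assumptions:

    def respond(context, model, threshold=0.5):
        response, prob = model.predict(context)
        if prob < threshold:
            # Warn the user that the system is uncertain about its answer.
            return ("Sorry, I'm not certain that I understood your request, "
                    "but my best guess is: " + response)
        return response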
Above we have mentioned some ideas that build on our findings
in this thesis, but there are of course many more possibilities for future
work in this research area.
Bibliography
[16] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill
Dolan. “A Persona-Based Neural Conversation Model”. In:
CoRR abs/1603.06155 (2016). URL: http://arxiv.org/abs/1603.06155.
[17] Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Nosewor-
thy, Laurent Charlin, and Joelle Pineau. “How NOT To Eval-
uate Your Dialogue System: An Empirical Study of Unsuper-
vised Evaluation Metrics for Dialogue Response Generation”.
In: CoRR abs/1603.08023 (2016). URL: http://arxiv.org/abs/1603.08023.
[18] Ryan Lowe, Nissan Pow, Iulian V. Serban, Laurent Charlin,
Chia-Wei Liu, and Joelle Pineau. “Training End-to-End Dialogue
Systems with the Ubuntu Dialogue Corpus”. In: Dialogue & Dis-
course 8.1 (2017), pp. 31–65.
[19] Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Lau-
rent Charlin, and Joelle Pineau. “On the Evaluation of Dia-
logue Systems with Next Utterance Classification”. In: CoRR
abs/1605.05414 (2016). URL: http://arxiv.org/abs/1605.05414.
[20] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo,
Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jef-
frey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia,
Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Lev-
enberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray,
Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya
Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay
Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Mar-
tin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Sys-
tems. Software available from tensorflow.org. 2015. URL: http:
//tensorflow.org/.
[21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and
Jeff Dean. “Distributed Representations of Words and Phrases
and their Compositionality”. In: Advances in Neural Information
Processing Systems. 2013, pp. 3111–3119.
[22] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.
“BLEU: A Method for Automatic Evaluation of Machine Trans-
lation”. In: Proceedings of the 40th Annual Meeting on Association
for Computational Linguistics. ACL ’02. Philadelphia, Pennsylva-
nia: Association for Computational Linguistics, 2002, pp. 311–
318. DOI: 10.3115/1073083.1073135.
[23] Alan Ritter, Colin Cherry, and William B. Dolan. “Data-driven
Response Generation in Social Media”. In: Proceedings of the
Conference on Empirical Methods in Natural Language Processing.
EMNLP ’11. Edinburgh, United Kingdom: Association for Com-
putational Linguistics, 2011, pp. 583–593. ISBN: 978-1-937284-11-
4.
[24] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron
Courville, and Joelle Pineau. “Building End-to-end Dialogue
Systems Using Generative Hierarchical Neural Network Mod-
els”. In: Proceedings of the Thirtieth AAAI Conference on Artifi-
cial Intelligence. AAAI’16. Phoenix, Arizona: AAAI Press, 2016,
pp. 3776–3783.
[25] Lifeng Shang, Zhengdong Lu, and Hang Li. “Neural Re-
sponding Machine for Short-Text Conversation”. In: CoRR
abs/1503.02364 (2015). URL: http://arxiv.org/abs/1503.02364.
[26] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brock-
ett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao,
and Bill Dolan. “A Neural Network Approach to Context-
Sensitive Generation of Conversational Responses”. In: CoRR
abs/1506.06714 (2015). URL: http://arxiv.org/abs/1506.06714.
[27] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob
Fergus. “End-To-End Memory Networks”. In: Proceedings of the
28th International Conference on Neural Information Processing Sys-
tems. NIPS’15. Montreal, Canada: MIT Press, 2015, pp. 2440–
2448.
[28] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to Se-
quence Learning with Neural Networks”. In: Advances in Neural
Information Processing Systems. 2014, pp. 3104–3112.
Appendix A
Example Dialogs
Table A.1: An example of a dialog from task 1 (issuing API calls) [3].
hi
hello what can i help you with today
can you book a table for two with indian food
i’m on it
<silence>
where should it be
madrid please
which price range are you looking for
i am looking for a cheap restaurant
ok let me look into some options for you
<silence>
api_call indian madrid two cheap
Table A.2: An example of a dialog from task 2 (updating API calls) [3].
hi
hello what can i help you with today
i’d like to book a table in a cheap price range for six people with span-
ish cuisine in madrid
i’m on it
<silence>
ok let me look into some options for you
<silence>
api_call spanish madrid six cheap
instead could it be for four
sure is there anything else to update
actually i would prefer in rome
sure is there anything else to update
no
ok let me look into some options for you
<silence>
api_call spanish rome four cheap
thanks
you’re welcome
Table A.3: An example of a dialog from task 3 (displaying options) [3]. The
dialog starts with results from an API call for the system to use in the task.
hello
hello what can i help you with today
i’d like to book a table in madrid with italian cuisine for two in a
expensive price range
i’m on it
<silence>
ok let me look into some options for you
<silence>
what do you think of this option: the_fancy_place
no this does not work for me
sure let me find an other option for you
<silence>
what do you think of this option: the_nice_place
i love that
great let me do the reservation
Table A.4: An example of a dialog from task 4 (providing extra information) [3].
hi
hello what can i help you with today
can you book a table at restaurant1
great let me do the reservation
can you provide the address
here it is address1
may i have the phone number of the restaurant
here it is phone1
Table A.5: An example of a dialog from task 5 (conduct full dialogs) [3].
good morning
hello what can i help you with today
i’d like to book a table for two in london in a moderate price range
i’m on it
<silence>
any preference on a type of cuisine
with spanish cuisine
ok let me look into some options for you
<silence>
api_call spanish london two moderate
instead could it be with british food
sure is there anything else to update
no
ok let me look into some options for you
<silence>
api_call british london two moderate
restaurant1 R_phone phone1
restaurant1 R_cuisine british
restaurant1 R_address address
restaurant1 R_location london
restaurant1 R_number two
restaurant1 R_price moderate
restaurant1 R_rating 5
...
<silence>
what do you think of this option: restaurant1
i love that
great let me do the reservation
what is the phone number of the restaurant
here it is phone1
thank you
is there anything i can help you with
no thank you
you’re welcome
Table A.6: A dialog from the training data in task 6 (DSTC2) [3].
<silence>
Hello, welcome to the Cambridge restaurant system. You can ask for restau-
rants by area, price range or food type. How may I help you?
i want a moderately priced restaurant that serves
What kind of food would you like?
canapes food
What part of town do you have in mind?
it doesnt matter
api_call canapes R_location moderate
api_call no result
<silence>
yippee_noodle_bar is a nice restaurant in the centre of town in the moderate
price range
gastropub food
Sorry there is no canapes restaurant in the moderate price range
i want a gastropub restaurant
the cow pizza kitchen and bar is a nice restaurant in the centre of town
serving gastropub food
what is the address
Sure, the cow pizza kitchen and bar is on
the_cow_pizza_kitchen_and_bar_address
what is the phone number
The phone number of the cow pizza kitchen and bar is
the_cow_pizza_kitchen_and_bar_phone
what is the area
the cow pizza kitchen and bar is a nice place in the centre of town
thank you goodbye
you are welcome
Table A.7: A dialog from the training data in task 6 (DSTC2) [3]. In this
conversation, the system fails to suggest a new restaurant to the user.
<silence>
Hello, welcome to the Cambridge restaurant system. You can ask for restau-
rants by area, price range or food type. How may I help you?
i want a moderately priced restaurant in the west part of town
api_call R_cuisine west moderate
saint_johns_chop_house R_post_code saint_johns_chop_house_post_code
saint_johns_chop_house R_cuisine british
saint_johns_chop_house R_location west
saint_johns_chop_house R_phone saint_johns_chop_house_phone
saint_johns_chop_house R_address saint_johns_chop_house_address
saint_johns_chop_house R_price moderate
saint_johns_chop_house R_rating 3
prezzo R_post_code prezzo_post_code
prezzo R_cuisine italian
prezzo R_location west
prezzo R_phone prezzo_phone
prezzo R_address prezzo_address
prezzo R_price moderate
prezzo R_rating 9
<silence>
prezzo is a nice restaurant in the west of town in the moderate price range
is there anything else
You are looking for a restaurant is that right?
give me a different restaurant
prezzo is a nice restaurant in the west of town in the moderate price range
goodbye
you are welcome
Table A.8: A dialog from the training data in task 6 (DSTC2) [3]. In this
example, the user does not use full sentences.
<silence>
Hello, welcome to the Cambridge restaurant system. You can ask for restau-
rants by area, price range or food type. How may I help you?
restaurant any area kosher food
What kind of food would you like?
kosher
api_call kosher R_location R_price
api_call no result
<silence>
cote is a nice restaurant in the centre of town serving french food
type of food
cote is a great restaurant
phone number
The phone number of cote is cote_phone
thank you good bye
you are welcome
Appendix B
Prediction Errors
Table B.1: Prediction error made by the system on the out of vocabulary test
set of task 1. The words “thai” and “tokyo” have not been seen during
training and thus the system favors the words “italian” and “madrid”
instead.
context: hello
hello what can i help you with today
may i have a table
i’m on it
<silence>
any preference on a type of cuisine
with thai cuisine
where should it be
tokyo please
how many people would be in your party
for two people please
which price range are looking for
i am looking for a expensive restaurant
ok let me look into some options for you
<silence>
correct: api_call thai tokyo two expensive
prediction: api_call italian madrid two expensive
Table B.2: Prediction error made by the system on the out of vocabulary test
set of task 1. Only the first word of the prediction is wrong.
context: hello
hello what can i help you with today
may i have a table in seoul
i’m on it
<silence>
any preference on a type of cuisine
with thai cuisine
correct: how many people would be in your party
prediction: where many people would be in your party
Table B.3: Prediction error made by the system on task 3. The system does
not use the provided API call results and suggests the wrong restaurant.
hello
hello what can i help you with today
can you book a table in rome in a moderate price range
i’m on it
<silence>
any preference on a type of cuisine
with british food
how many people would be in your party
for two please
ok let me look into some options for you
<silence>
correct: what do you think of this option
resto_rome_moderate_british_8stars
prediction: what do you think of this option
resto_rome_moderate_french_7stars
Table B.4: Prediction error made by the system on task 4. The system fails to
provide the information asked for.
hello
hello what can i help you with today
can you make a restaurant reservation at
resto_hanoi_expensive_japanese_1stars
great let me do the reservation
may i have the phone number of the restaurant
correct: here it is resto_hanoi_expensive_japanese_1stars_phone
prediction: here it is
Table B.5: Prediction error made by the system on task 6. The system
understands that a phone number is asked for but has difficulties using the
correct entities in sentences.
context: <silence>
hello welcome to the cambridge restaurant system you can ask for
restaurants by area price range or food type how may i help you
i would like to find a restaurant that serves halal food
you are looking for a halal restaurant right
halal
i m sorry but there is no restaurant serving halal food
is there a portugeuse food
could you please repeat that
how about portugeuse food
there are restaurants serving portuguese food what area do you
want
any area
there are restaurants serving portuguese food in any part of town
what price range would you like
any price range
api_call halal r_location r_price
api_call no result
<silence>
nandos serves portuguese food
whats the phone number
correct: the phone number of nandos is nandos_phone
prediction: the phone number of nandos_city_centre is nandos_phone
Table B.6: Prediction error made by the system on task 6. The prediction
only differs from the correct response in the first word and could be equally
correct.
context: <silence>
hello welcome to the cambridge restaurant system you can ask for
restaurants by area price range or food type how may i help you
<silence>
what kind of food would you like
persian restaurant in the north part of town
i m sorry but there is no restaurant serving persian food
chinese food
could you please repeat that
chinese food
correct: did you say you are looking for a restaurant in the north of
town
prediction: would you say you are looking for a restaurant in the north
of town
Table B.7: Prediction error made by the system on task 6. The prediction is a
nonsensical sentence.
context: <silence>
hello welcome to the cambridge restaurant system you can ask for
restaurants by area price range or food type how may i help you
<silence>
what kind of food would you like
i dont care in the north part of town
correct: there are restaurants in the north of town serving any kind of
food what price range would you like
prediction: would are restaurants in the expensive of town do tasty kind
of food would area range