
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, Dragomir R. Radev
Department of Computer Science, Yale University
{tao.yu, r.zhang, k.yang, michihiro.yasunaga, dragomir.radev}@yale.edu

Abstract

We present Spider, a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task so that different complicated SQL queries and databases appear in the train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Therefore, Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and have the exact same programs in the train set and the test set. We experiment with various state-of-the-art models, and the best model achieves only 9.7% exact matching accuracy in a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task with the most recent updates are publicly available at https://yale-lily.github.io/seq2sql/spider.

[Figure 1: annotators check the database schema (e.g., database: college, with tables instructor(id, name, department_id, salary, ...) and department(id, name, building, budget, ...) linked by a foreign key) and create a complex question, "What are the name and budget of the departments with average instructor salary greater than the overall average?", together with its SQL:
SELECT T2.name, T2.budget
FROM instructor AS T1 JOIN department AS T2 ON T1.department_id = T2.id
GROUP BY T1.department_id
HAVING avg(T1.salary) > (SELECT avg(salary) FROM instructor)]
Figure 1: Our corpus annotates complex questions and SQLs. The example contains joining of multiple tables, a GROUP BY component, and a nested query.

1 Introduction

Semantic parsing (SP) is one of the most important tasks in natural language processing (NLP). It requires both understanding the meaning of natural language sentences and mapping them to meaningful executable queries such as logical forms, SQL queries, and Python code. Recently, some state-of-the-art methods with Seq2Seq architectures are able to achieve over 80% exact matching accuracy even on some complex benchmarks such as ATIS and GeoQuery. These models seem to have already solved most problems in this field.

However, previous tasks in this field have a simple but problematic task definition, because most of these results are produced by semantic "matching" rather than semantic parsing. Existing datasets for SP have two shortcomings. First, those that have complex programs (Zelle and Mooney, 1996; Li and Jagadish, 2014; Yaghmazadeh et al., 2017a; Iyer et al., 2017) are too small in terms of the number of programs for training modern data-intensive models, and each covers only a single database, meaning that the same database is used for both training and testing the model. More importantly, the number of logic forms or SQL labels is small, and each program has about 4-10 natural language paraphrases used to expand the size of the dataset. Therefore, the exact same target programs appear in both the train and test sets. The models can achieve decent performance even on very complex programs by memorizing the patterns of question and program pairs during training and decoding the programs exactly as seen in the training set during testing. Finegan-Dollak et al. (2018) split the dataset by programs
so that no identical program appears in both the train and test sets. They show that the models built on this question-splitting data setting fail to generalize to unseen programs. Second, existing datasets that are large in terms of the number of programs and databases, such as WikiSQL (Zhong et al., 2017), contain only simple SQL queries and single tables. In order to test a model's real semantic parsing performance on unseen complex programs and its ability to generalize to new domains, an SP dataset that includes a large amount of complex programs and databases with multiple tables is a must.

However, compared to other large, realistic datasets such as ImageNet for object recognition (Deng et al., 2009) and SQuAD for reading comprehension (Rajpurkar et al., 2016), creating such an SP dataset is even more time-consuming and challenging in some aspects due to the following reasons. First, it is hard to find many databases with multiple tables online. Second, given a database, annotators have to understand the complex database schema to create a set of questions such that their corresponding SQL queries cover all SQL patterns. Moreover, it is even more challenging to write different complex SQL queries. Additionally, reviewing and quality-checking of question and SQL pairs takes a significant amount of time. All of these processes require very specific knowledge of databases.

To address the need for a large and high-quality dataset for a new complex and cross-domain semantic parsing task, we introduce Spider, which consists of 200 databases with multiple tables, 10,181 questions, and 5,693 corresponding complex SQL queries, all written by 11 college students spending a total of 1,000 man-hours. As Figure 1 illustrates, given a database with multiple tables including foreign keys, our corpus creates and annotates complex questions and SQL queries including different SQL clauses such as joins and nested queries. In order to generate the SQL query given the input question, models need to understand both the natural language question and the relationships between tables and columns in the database schema.

In addition, we also propose a new task for the text-to-SQL problem. Since Spider contains 200 databases with foreign keys, we can split the dataset with complex SQL queries in a way that no database overlaps in train and test, which overcomes the two shortcomings of prior datasets, and defines a new semantic parsing task in which the model needs to generalize not only to new programs but also to new databases. Models have to take questions and database schemas as inputs and predict unseen queries on new databases.

To assess the task difficulty, we experiment with several state-of-the-art semantic parsing models. All of them struggle on this task. The best model achieves only 9.7% exact matching accuracy in the database split setting. This suggests that there is large room for improvement.

2 Related Work and Existing Datasets

Several semantic parsing datasets with different queries have been created. The output can be in many formats, e.g., logic forms. These datasets include ATIS (Price, 1990; Dahl et al., 1994), GeoQuery (Zelle and Mooney, 1996), and JOBS (Tang and Mooney, 2001a). They have been studied extensively (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Das et al., 2010; Liang et al., 2011; Banarescu et al., 2013; Artzi and Zettlemoyer, 2013; Reddy et al., 2014; Berant and Liang, 2014; Dong and Lapata, 2016). However, they are domain specific and there is no standard label guidance for multiple SQL queries.

Recently, more semantic parsing datasets using SQL as programs have been created. Iyer et al. (2017) and Popescu et al. (2003a) labeled SQL queries for the ATIS and GeoQuery datasets. Other existing text-to-SQL datasets also include Restaurants (Tang and Mooney, 2001b; Popescu et al., 2003a), Scholar (Iyer et al., 2017), Academic (Li and Jagadish, 2014), Yelp and IMDB (Yaghmazadeh et al., 2017b), Advising (Finegan-Dollak et al., 2018), and WikiSQL (Zhong et al., 2017). These datasets have been studied for decades in both the NLP community (Warren and Pereira, 1982; Popescu et al., 2003b, 2004; Li et al., 2006; Giordani and Moschitti, 2012; Wang et al., 2017; Iyer et al., 2017; Zhong et al., 2017; Xu et al., 2017; Yu et al., 2018; Huang et al., 2018; Wang et al., 2018; Dong and Lapata, 2018; McCann et al., 2018) and the Database community (Li and Jagadish, 2014; Yaghmazadeh et al., 2017b). We provide detailed statistics on these datasets in Table 1.

Most of the previous work trains models without schemas as inputs because they use a single database for both training and testing.

Thus, they do not need to generalize to new domains. Most importantly, these datasets have a limited number of labeled logic forms or SQL queries. In order to expand the size of these datasets and apply neural network approaches, each logic form or SQL query has about 4-10 paraphrases of the natural language input. Most previous studies follow the standard question-based train and test split (Zettlemoyer and Collins, 2005). This way, the exact same target queries (with similar paraphrases) in the test set appear in the training set as well. Utilizing this assumption, existing models can achieve decent performances even on complex programs by memorizing database-specific SQL templates. However, this accuracy is artificially inflated, because the model merely needs to decide which template to use during testing. Finegan-Dollak et al. (2018) show that template-based approaches can get even higher results. To avoid this inflated result, Finegan-Dollak et al. (2018) propose a new, program-based splitting evaluation, where the exact same queries do not appear in both training and testing. They show that under this framework, the performance of all current state-of-the-art semantic parsing systems drops dramatically even on the same database, indicating that these models fail to generalize to unseen queries. This indicates that current studies in semantic parsing have limitations.

We also want the model to generalize not only to unseen queries but also to unseen databases. Zhong et al. (2017) published the WikiSQL dataset. In their problem definition, the databases in the test set do not appear in the train or development sets. Also, the task needs to take different table schemas as inputs. Therefore, the model has to generalize to new databases. However, in order to generate about 90,000 questions and SQL pairs for about 26,000 databases, Zhong et al. (2017) made simplified assumptions about the SQL queries and databases. Their SQL labels cover only a single SELECT column and aggregation, and WHERE conditions. Moreover, all the databases contain only single tables. No JOIN, GROUP BY, ORDER BY, etc. are included.

Recently, researchers have constructed some datasets for code generation including IFTTT (Quirk et al., 2015), DJANGO (Oda et al., 2015), HEARTHSTONE (Ling et al., 2016), NL2Bash (Lin et al., 2018), and CoNaLa (Yin et al., 2018). These tasks parse natural language descriptions into a more general-purpose programming language such as Python (Allamanis et al., 2015; Ling et al., 2016; Rabinovich et al., 2017; Yin and Neubig, 2017).

3 Corpus Construction

All questions and SQL queries were written and reviewed by 11 computer science students who were native English speakers. As illustrated in Figure 2, we develop our dataset in five steps, spending around 1,000 hours of human labor in total: §3.1 Database Collection and Creation, §3.2 Question and SQL Annotation, §3.3 SQL Review, §3.4 Question Review and Paraphrase, §3.5 Final Question and SQL Review.

[Figure 2: Database Collection & Creation (200 databases, 150 man-hours) -> Question and SQL Annotation (20-50 examples per DB, 500 man-hours) -> SQL Review (150 man-hours) -> Question Review & Paraphrase (150 man-hours) -> Final Review & Processing (150 man-hours).]
Figure 2: The annotation process of our Spider corpus.

3.1 Database Collection and Creation

Collecting databases with complex schemas is hard. Although relational databases are widely used in industry and academia, most of them are not publicly available. Only a few databases with multiple tables are easily accessible online.

Our 200 databases covering 138 different domains are collected from three resources. First, we collected about 70 complex databases from different college database courses, SQL tutorial websites, online csv files, and textbook examples. Second, we collected about 40 databases from DatabaseAnswers (http://www.databaseanswers.org/), which contains over 1,000 data models across different domains. These data models contain only database schemas. We converted them into SQLite, populated them using an online database population tool (http://filldb.info/), and then manually corrected some important fields so that the table contents looked natural. Finally, we created the remaining 90 databases based on WikiSQL. To ensure domain diversity, we selected about 500 tables in about 90 different domains to create these 90 databases. To create each database,

we chose several related tables from the WikiSQL dev or test splits, and then created a relational database schema with foreign keys based on the tables we selected. We had to create some intersection tables in order to link several tables together. For most other cases, we did not need to populate these databases, since tables in WikiSQL are from Wikipedia, which already had real-world data stored.
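As a concrete illustration of this step, the sketch below creates a small relational SQLite database with a foreign key in Python; the tables and values are invented for the example and are not part of the released corpus.

import sqlite3

# A minimal sketch of building one relational database with an explicit
# foreign key, in the spirit of the process described above (toy schema).
conn = sqlite3.connect("toy_college.sqlite")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT, budget REAL)")
conn.execute(
    "CREATE TABLE instructor ("
    " id INTEGER PRIMARY KEY, name TEXT, salary REAL, department_id INTEGER,"
    " FOREIGN KEY (department_id) REFERENCES department (id))")
conn.executemany("INSERT INTO department VALUES (?, ?, ?)",
                 [(1, "Statistics", 120000.0), (2, "History", 80000.0)])
conn.executemany("INSERT INTO instructor VALUES (?, ?, ?, ?)",
                 [(1, "Smith", 90000.0, 1), (2, "Jones", 60000.0, 2)])
conn.commit()
conn.close()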
We manually corrected some database schemas if they had column names that did not make sense or missed some foreign keys. For table and column names, it is common to use abbreviations in databases. For example, 'student id' might be represented by 'stu id'. For our task definition, we manually changed each column name back to regular words so that the system only has to handle semantic parsing issues.

3.2 Question and SQL Annotation

For each database, we ask eight computer science students proficient in SQL to create 20-50 natural questions and their SQL labels. To make our questions diverse, natural, and reflective of how humans actually use databases, we did not use any template or script to generate question and SQL queries. Our annotation procedure ensures the following three aspects.

A) SQL pattern coverage. We ensure that our corpus contains enough examples for all common SQL patterns. For each database, we ask annotators to write SQL queries that cover all the following SQL components: SELECT with multiple columns and aggregations, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, JOIN, INTERSECT, EXCEPT, UNION, NOT IN, OR, AND, EXISTS, LIKE, as well as nested queries. The annotators made sure that each table in the database appears in at least one query.

B) SQL consistency. Some questions have multiple acceptable SQL queries with the same result. However, giving totally different SQL labels to similar questions can hinder the training of semantic parsing models. To avoid this issue, we designed the annotation protocol so that all annotators choose the same SQL query pattern if multiple equivalent queries are possible. More detail is explained in our appendix.

C) Question clarity. We did not create questions that are (1) vague or too ambiguous, or (2) require knowledge outside the database to answer.

First, ambiguous questions refer to questions that do not have enough clues to infer which columns to return and which conditions to consider. For example, we would not ask "What is the most popular class at University X?" because the definition of "popular" is not clear: it could mean the rating of the class or the number of students taking the course. Instead, we choose to ask "What is the name of the class which the largest number of students are taking at University X?". Here, "popular" refers to the size of student enrollment, so the "student enrollment" column can be used in the condition to answer this question. We recognize that ambiguous questions appear in real-world natural language database interfaces, and we agree that future work needs to address this issue by having multi-turn interactions between the system and users for clarification. However, our main aim here is to develop a corpus to tackle the problem of handling complex queries and generalizing across databases, which no existing semantic parsing dataset could do. Moreover, the low performances of current state-of-the-art models already show that our task is challenging enough without ambiguous questions. In addition, questions are required to specify the information to return; otherwise, we do not know whether, say, the class id would also be acceptable in the previous case. Most questions in the existing semantic parsing datasets are ambiguous. This is not a big problem when a single dataset is used, because there are enough domain-specific examples to learn which columns are returned by default. However, it becomes a serious problem in cross-domain tasks, since the default return values differ across domains and users.

Second, humans sometimes ask questions that require common sense knowledge outside the given database. For instance, when people ask "Display the employee id for the employees who report to John", the correct SQL is

SELECT employee_id
FROM employees
WHERE manager_id = (
    SELECT employee_id
    FROM employees
    WHERE first_name = 'John')

which requires the common knowledge that "X reports to Y" corresponds to an "employee-manager" relation.
We do not include such questions and leave them as a future research direction.

Annotation tools We open each database in a web-based interface powered by the sqlite-web tool (https://github.com/coleifer/sqlite-web). It allows the annotators to see the schema and content of each table, execute SQL queries, and check the returned results. This tool was extremely helpful for the annotators to write executable SQL queries that reflect the true meaning of the given questions and return correct answers.

3.3 SQL Review

Once the database is labeled with question-query pairs, we ask a different annotator to check if the questions are clear and contain enough information to answer the query. For a question with multiple possible SQL translations, the reviewers double check whether the SQL label is correctly chosen under our protocol. Finally, the reviewers check if all the SQL labels in the current database cover all the common SQL clauses.

3.4 Question Review and Paraphrase

After SQL labels are reviewed, native English speakers review and correct each question. They first check if the question is grammatically correct and natural. Next, they make sure that the question reflects the meaning of its corresponding SQL label. Finally, to improve the diversity in questions, we ask annotators to add a paraphrased version to some questions.

3.5 Final Review

Finally, we ask the most experienced annotator to conduct the final question and SQL review. This annotator makes the final decision if multiple reviewers are not sure about some annotation issues. Also, we run a script to execute and parse all SQL labels to make sure they are correct.
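A minimal sketch of such a check is shown below; it assumes each annotation is available as a (database file, SQL string) pair, which is an illustrative format rather than the released one.

import sqlite3

def check_gold_labels(examples):
    # Execute every gold SQL label against its SQLite database and collect
    # the ones that fail to parse or run (validation sketch, assumed format).
    failures = []
    for db_path, sql in examples:
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            failures.append((db_path, sql, str(err)))
        finally:
            conn.close()
    return failures

# Hypothetical usage:
# bad = check_gold_labels([("toy_college.sqlite", "SELECT name FROM instructor")])
# for db_path, sql, err in bad:
#     print(db_path, err)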
4 Dataset Statistics and Comparison

We summarize the statistics of Spider and other text-to-SQL datasets in Table 1. Compared with other datasets, Spider contains databases with multiple tables and SQL queries with many complex SQL components. For example, Spider contains about twice as many nested queries and 10 times as many ORDER BY (LIMIT) and GROUP BY (HAVING) components as all previous text-to-SQL datasets combined. Spider has 200 distinct databases covering 138 different domains such as college, club, TV show, government, etc. Most domains have one database, thus containing 20-50 questions, and a few domains such as flight information have multiple databases with more than 100 questions in total. On average, each database in Spider has 28 columns and 9 foreign keys. The average question length and SQL length are about 13 and 21, respectively. Our task uses different databases for training and testing, evaluating cross-domain performance. Therefore, Spider is the only text-to-SQL dataset that contains both databases with multiple tables in different domains and complex SQL queries. It tests the ability of a system to generalize not only to new SQL queries and database schemas but also to new domains.

5 Task Definition

On top of the proposed dataset, we define a text-to-SQL task that is more realistic than prior work. Unlike most of the previous semantic parsing or text-to-SQL tasks, in our task models are tested on both different complex SQL queries and different complex databases in different domains. This aims to ensure that models can only make the correct prediction when they truly understand the semantic meaning of the questions, rather than relying on memorization. Also, because our databases cover different domains, our corpus tests a model's ability to generalize to new databases. In this way, model performance on this task can reflect real semantic parsing ability.

In order to make the task feasible and to focus on the more fundamental part of semantic parsing, we make the following assumptions:

• In our current task, we do not evaluate model performance on generating values. Predicting correct SQL structures and columns is more realistic and critical at this stage, given the low performances of various current state-of-the-art models on our task. In a real-world situation, people need to double check what the condition values are and finalize them over multiple tries. It is unrealistic to predict condition values without interacting with users. In reality, most people know what values to ask about but do not know the SQL logic. A more reasonable way is to ask users to use an interface to search for the values, then ask more specific questions. Also, other previous work with value prediction uses one single database in both train and test, which makes it possible to overfit. However, in our task, we have different databases from different domains in train and test.

• As mentioned in the previous sections, we exclude some queries that require outside knowledge such as common sense inference and math calculation. For example, imagine a table with birth and death year columns. To answer questions like "How long is X's life length?", we use SELECT death_year - birth_year. Even though this example is easy for humans, it requires some common knowledge of the definition of life length and the use of a math operation, which is not the focus of our dataset.

• We assume all table and column names in the database are clear and self-contained. For example, some databases use database-specific shortcut names for table and column names such as "stu id", which we manually converted to "student id" in our corpus.

Dataset #Q # SQL # DB # Domain # Table / DB ORDER BY GROUP BY NESTED HAVING
ATIS 5,280 947 1 1 32 0 5 315 0
GeoQuery 877 247 1 1 6 20 46 167 9
Scholar 817 193 1 1 7 75 100 7 20
Academic 196 185 1 1 15 23 40 7 18
IMDB 131 89 1 1 16 10 6 1 0
Yelp 128 110 1 1 7 18 21 0 4
Advising 3,898 208 1 1 10 15 9 22 0
Restaurants 378 378 1 1 3 0 0 4 0
WikiSQL 80,654 77,840 26,521 - 1 0 0 0 0
Spider 10,181 5,693 200 138 5.1 1335 1491 844 388

Table 1: Comparisons of text-to-SQL datasets. Spider is the only text-to-SQL dataset that contains both databases with multiple tables in different domains and complex SQL queries. It was designed to test the ability of a system to generalize not only to new SQL queries and database schemas but also to new domains.

6 Evaluation Metrics

Our evaluation metrics include Component Matching, Exact Matching, and Execution Accuracy. In addition, we measure the system's accuracy as a function of the difficulty of a query. Since our task definition does not require predicting value strings, our evaluation metrics do not take value strings into account.

We will release the official evaluation script along with our corpus so that the research community can share the same evaluation platform.

Component Matching To conduct a detailed analysis of model performance, we measure the average exact match between the prediction and the ground truth on different SQL components. For each of the following components:

• SELECT  • WHERE  • GROUP BY
• ORDER BY  • KEYWORDS (including all SQL keywords without column names and operators)

we decompose each component in the prediction and the ground truth into bags of sub-components, and check whether or not these two sets of components match exactly. To evaluate each SELECT component, for example, consider SELECT avg(col1), max(col2), min(col1): we first parse and decompose it into the set (avg, min, col1), (max, col2), and see if the gold and predicted sets are the same. Previous work directly compared decoded SQL with gold SQL. However, some SQL components do not have order constraints. In our evaluation, we treat each component as a set, so that, for example, SELECT avg(col1), min(col1), max(col2) and SELECT avg(col1), max(col2), min(col1) would be treated as the same query. To report a model's overall performance on each component, we compute the F1 score on exact set matching.
Exact Matching We measure whether the predicted query as a whole is equivalent to the gold query. We first evaluate on the SQL clauses as described in the last section. The predicted query is correct only if all of the components are correct. Because we conduct set comparison in each clause, this exact matching metric can handle the "ordering issue" (Xu et al., 2017).
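To make the set-based comparison concrete, the Python sketch below operates on already-decomposed clauses (each clause a collection of simple tuples); the decomposition format and clause names are assumptions for illustration, not the official evaluation script.

CLAUSES = ["SELECT", "WHERE", "GROUP BY", "ORDER BY", "KEYWORDS"]

def clause_match(pred_items, gold_items):
    # Order-insensitive comparison of one clause, e.g. the SELECT items
    # {("avg", "col1"), ("max", "col2")} of the prediction vs. the gold.
    return set(pred_items) == set(gold_items)

def exact_match(pred, gold):
    # A query counts as exactly matched only if every clause matches as a set.
    return all(clause_match(pred.get(c, ()), gold.get(c, ())) for c in CLAUSES)

def clause_f1(pairs, clause):
    # F1 of exact set matches for one clause over (pred, gold) pairs,
    # skipping examples where neither side uses the clause.
    tp = fp = fn = 0
    for pred, gold in pairs:
        p, g = set(pred.get(clause, ())), set(gold.get(clause, ()))
        if not p and not g:
            continue
        if p == g:
            tp += 1
        else:
            fp += int(bool(p))
            fn += int(bool(g))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# e.g. exact_match({"SELECT": [("avg", "col1"), ("max", "col2")]},
#                  {"SELECT": [("max", "col2"), ("avg", "col1")]})  ->  True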

Execution Accuracy Since Exact Matching can produce false negatives when the semantic parser generates novel but correct syntax structures, we also consider Execution Accuracy. All our databases have executable SQLite files, so we can measure execution accuracy as well. However, it is also important to note that Execution Accuracy can produce false positives, as a predicted SQL query could return the same result (for example, 'NULL') as the gold SQL while being semantically different. So we can use the two metrics to complement each other. (We will provide execution accuracy results in a later version; please check our website for the latest updates on the task at https://yale-lily.github.io/seq2sql/spider.)
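A minimal sketch of execution accuracy on one example, assuming the database is an SQLite file and treating results as unordered multisets (names are illustrative):

import sqlite3
from collections import Counter

def execution_match(db_path, pred_sql, gold_sql):
    # True when the predicted and gold queries return the same rows on the
    # database; an unexecutable prediction counts as wrong. As noted above,
    # identical results do not guarantee semantic equivalence.
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False
    finally:
        conn.close()
    return Counter(pred_rows) == Counter(gold_rows)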
Finally, our evaluation also considers multiple acceptable keys when JOIN and GROUP BY are in the query. For example, suppose "stu id" in one table refers to "stu id" in another table; then GROUP BY on either column is acceptable.
SQL Hardness Criteria To better understand model performance on different queries, we divide SQL queries into 4 levels: easy, medium, hard, extra hard. We define the difficulty based on the number of SQL components, selections, and conditions, so that queries that contain more SQL keywords (GROUP BY, ORDER BY, INTERSECT, nested subqueries, column selections and aggregators, etc.) are considered harder. For example, a query is considered hard if it includes more than two SELECT columns, more than two WHERE conditions, and a GROUP BY over two columns, or contains EXCEPT or nested queries. A SQL query with more additions on top of that is considered extra hard. Figure 3 shows examples of SQL queries in the 4 hardness levels.

[Figure 3]
Easy
What is the number of cars with more than 4 cylinders?
SELECT COUNT(*)
FROM cars_data
WHERE cylinders > 4

Medium
For each stadium, how many concerts are there?
SELECT T2.name, COUNT(*)
FROM concert AS T1 JOIN stadium AS T2
ON T1.stadium_id = T2.stadium_id
GROUP BY T1.stadium_id

Hard
Which countries in Europe have at least 3 car manufacturers?
SELECT T1.country_name
FROM countries AS T1 JOIN continents AS T2
ON T1.continent = T2.cont_id
JOIN car_makers AS T3
ON T1.country_id = T3.country
WHERE T2.continent = 'Europe'
GROUP BY T1.country_name
HAVING COUNT(*) >= 3

Extra Hard
What is the average life expectancy in the countries where English is not the official language?
SELECT AVG(life_expectancy)
FROM country
WHERE name NOT IN
    (SELECT T1.name
     FROM country AS T1 JOIN country_language AS T2
     ON T1.code = T2.country_code
     WHERE T2.language = "English"
     AND T2.is_official = "T")

Figure 3: SQL query examples in 4 hardness levels.
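The precise thresholds live in the evaluation script released with the corpus; the sketch below only illustrates the flavor of such a heuristic on pre-extracted query statistics, with invented field names and simplified rules.

def hardness(stats):
    # Rough difficulty heuristic over pre-computed query statistics
    # (a simplified illustration, not the official criteria).
    # stats example: {"select_cols": 1, "where_conds": 1, "group_by_cols": 0,
    #                 "order_by": False, "set_ops": 0, "nested": 0}
    hard_signals = sum([
        stats["select_cols"] > 2,
        stats["where_conds"] > 2,
        stats["group_by_cols"] > 1,
        stats["set_ops"] > 0 or stats["nested"] > 0,
    ])
    component_count = (stats["select_cols"] + stats["where_conds"]
                       + stats["group_by_cols"] + int(stats["order_by"]))
    if hard_signals >= 2:
        return "extra hard"
    if hard_signals == 1:
        return "hard"
    if component_count > 2:
        return "medium"
    return "easy"

# The Easy example in Figure 3 (one SELECT column, one WHERE condition)
# maps to "easy" under this toy heuristic.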
7 Methods

In order to analyze the difficulty and demonstrate the purpose of our corpus, we experiment with several state-of-the-art semantic parsing models. As our dataset is fundamentally different from prior datasets such as GeoQuery and WikiSQL, we adapted these models to our task as follows. We created a 'big' column list by concatenating the columns of all tables in the database as one input to all models. Also, for each model, we limit the column selection space for each question example to all columns of the database the question is asking about, instead of all column names in the whole corpus.

Seq2Seq Inspired by neural machine translation (Sutskever et al., 2014), we first apply a basic sequence-to-sequence model, Seq2Seq. Then, we also explore Seq2Seq+Attention from (Dong and Lapata, 2016) by adding an attention mechanism (Bahdanau et al., 2015). In addition, we include Seq2Seq+Copying by adding an attention-based copying operation similar to (Jia and Liang, 2016).

Method Easy Medium Hard Extra Hard All
Example Split
Seq2Seq 24.3% 9.5% 6.3% 1.5% 11.2%
Seq2Seq+Attention (Dong and Lapata, 2016) 31.2% 13.9% 11.6% 3.3% 15.5%
Seq2Seq+Copying 30.0% 12.6% 10.0% 3.3% 14.8%
Iyer et al. (2017) 18.6% 12.0% 9.0% 1.8% 9.8%
SQLNet (Xu et al., 2017) 36.2% 15.6% 7.9% 4.9% 17.4%
TypeSQL (Yu et al., 2018) 48.6% 38.2% 18.1% 19.8% 34.3%
Database Split
Seq2Seq 17.9% 2.7% 1.3% 0.6% 5.4%
Seq2Seq+Attention (Dong and Lapata, 2016) 17.9% 2.9% 1.8% 1.3% 5.7%
Seq2Seq+Copying 15.1% 3.4% 1.0% 1.3% 5.2%
Iyer et al. (2017) 7.9% 2.1% 1.3% 1.1% 3.1%
SQLNet (Xu et al., 2017) 23.7% 5.9% 2.3% 0.3% 8.3%
TypeSQL (Yu et al., 2018) 29.6% 6.1% 2.3% 0.3% 9.7%
Table 2: Accuracy of Exact Matching on SQL queries with different hardness levels.

Method SELECT WHERE GROUP BY ORDER BY KEYWORDS


Example Split
Seq2Seq 25.8% 8.2% 18.9% 13.0% 21.8%
Seq2Seq+Attention 30.0% 11.4% 24.1% 19.3% 22.4%
Seq2Seq+Copying 28.7% 8.7% 24.2% 21.6% 19.7%
Iyer et al. (2017) 19.2% 8.1% 16.1% 9.6% 13.6%
SQLNet 46.1% 32.5% 30.6% 61.4% 77.6%
TypeSQL 68.7% 55.4% 40.5% 67.0% 73.6%
Database Split
Seq2Seq 13.7% 3.7% 3.2% 4.9% 8.9%
Seq2Seq+Attention 14.0% 5.0% 3.2% 6.1% 9.4%
Seq2Seq+Copying 12.0% 2.7% 5.2% 6.9% 6.7%
Iyer et al. (2017) 6.3% 1.9% 3.0% 3.6% 3.5%
SQLNet 24.0% 18.0% 11.8% 47.1% 61.9%
TypeSQL 36.2% 14.7% 6.4% 49.5% 59.4%
Table 3: F1 scores of Component Matching on all SQL queries.

The original model does not take the schema into account because it has the same schema in both train and test. We modify the model so that it considers the table schema information by passing a vocabulary mask that limits the model to decoding words from the SQL keywords and the table and column names of the current database.
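A minimal sketch of the two adaptations, flattening the schema into one 'big' column list and masking the decoder's output vocabulary, is given below; the toy vocabulary, schema layout, and function names are assumptions for illustration (the actual models work on embeddings and logits rather than strings).

SQL_KEYWORDS = ["select", "from", "where", "group", "by", "having", "order",
                "limit", "join", "on", "as", "avg", "count", "min", "max",
                "(", ")", ",", "=", ">", "<"]

def big_column_list(schema):
    # Flatten a {table: [columns]} schema into one token list that is
    # concatenated with the question as model input (sketch).
    tokens = []
    for table, columns in schema.items():
        tokens.append(table)
        tokens.extend(f"{table}.{col}" for col in columns)
    return tokens

def decode_mask(vocab, schema):
    # 1/0 mask over the output vocabulary: the decoder may only emit SQL
    # keywords plus the table and column names of the current database.
    allowed = set(SQL_KEYWORDS) | set(big_column_list(schema))
    return [1 if token in allowed else 0 for token in vocab]

# Toy example:
# schema = {"instructor": ["id", "name", "salary", "department_id"],
#           "department": ["id", "name", "budget"]}
# decode_mask(["select", "paris", "instructor.salary"], schema)  ->  [1, 0, 1]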
(Iyer et al., 2017) Iyer et al. (2017) apply an attention-based sequence-to-sequence model similar to (Luong et al., 2015) to SQL datasets, with automatic dataset expansion through paraphrasing and SQL templates. In addition, they show how user interactions improve results consistently. In our case, we did not consider the user interaction part.

SQLNet introduced by (Xu et al., 2017) uses column attention, employs a sketch-based method, and generates SQL as a slot-filling task. This fundamentally avoids the sequence-to-sequence structure when ordering does not matter in SQL query conditions. Because it is originally designed for WikiSQL, we also extend its SELECT and WHERE modules to the other components.

TypeSQL is the state-of-the-art model on the WikiSQL task (Yu et al., 2018). It improves upon SQLNet by proposing a different training procedure and utilizing types extracted from either a knowledge graph or table content to help the model better understand entities and numbers in the question. In our experiment, we use the question type information extracted from database content. Also, we extend their modules to the other components.

8 Experimental Results and Discussion

We summarize the performance of all models on our test set, including accuracy of exact matching in Table 2 and F1 scores of component matching in Table 3. For the final training dataset, we also select and include 752 queries and 1,659 questions that follow our annotation protocol from six existing datasets: Restaurants, GeoQuery, Scholar,

Academic, IMDB, and Yelp. We report results on two different settings for all models: (1) Example split, where examples are randomly split into 7,862 train, 1,831 dev, and 2,147 test examples; questions for the same database can appear in both train and test. (2) Database split, where 206 databases are randomly split into 130 train, 36 dev, and 40 test databases; all questions for the same database are in the same split.
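A sketch of such a database-level split is shown below, assuming each example carries a db_id field identifying its database (the field name and default counts here simply mirror the setting described above).

import random

def database_split(examples, n_train_db=130, n_dev_db=36, seed=0):
    # Group examples by database before splitting, so that no database
    # (and hence no schema) is shared between train, dev, and test (sketch).
    db_ids = sorted({ex["db_id"] for ex in examples})
    random.Random(seed).shuffle(db_ids)
    train_dbs = set(db_ids[:n_train_db])
    dev_dbs = set(db_ids[n_train_db:n_train_db + n_dev_db])
    train = [ex for ex in examples if ex["db_id"] in train_dbs]
    dev = [ex for ex in examples if ex["db_id"] in dev_dbs]
    test = [ex for ex in examples if ex["db_id"] not in train_dbs | dev_dbs]
    return train, dev, test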

Overall Performance The performances of the Seq2Seq-based models, including Seq2Seq, Seq2Seq+Attention, Seq2Seq+Copying, and Iyer et al. (2017), are very low. However, they are able to generate nested and complex queries, so they can get a few hard and extra hard examples correct. But in the vast majority of cases, they predict invalid SQL queries with grammatical errors. The attention and copying mechanisms do not help much either. In contrast, SQLNet and TypeSQL, which utilize SQL structure information to guide the SQL generation process, significantly outperform the other Seq2Seq models. While they can produce valid queries, they are unable to generate nested queries or queries with keywords such as EXCEPT and INTERSECT. In general, the overall performances of all models are low, indicating that our task is challenging and there is still large room for improvement.

Example Split vs Database Split As discussed in Section 5, another challenge of the dataset is to generalize to new databases. To study this, in Table 2 and Table 3 we compare model performances under the two settings. For all models, the performance under the database split is much lower than that under the example split. In addition, we observe that all models perform poorly on column selection. This shows that our dataset presents a challenge for the model to generalize to new databases.

Complexity of Database Schema In order to show how the complexity of the database schema affects model performance, Figure 4 plots the exact matching accuracy as a function of the number of foreign keys in a database. The performance decreases as the database has more foreign keys. The first reason is that the model has to choose column and table names from many candidates in a complex database schema. Second, a complex database schema presents a great challenge for the model to capture the relationships between different tables with foreign keys. This indicates that the task requires more effective methods to encode the relation of tables with foreign keys.

[Figure 4]
Figure 4: Exact matching accuracy as a function of the number of foreign keys.

9 Conclusion

In this paper we introduce Spider, a large, complex and cross-domain semantic parsing and text-to-SQL dataset, which directly benefits both the NLP and DB communities. Based on Spider, we define a new challenging and realistic semantic parsing task. Experimental results of several state-of-the-art models on this task suggest plenty of room for improvement.

Acknowledgement

We thank Graham Neubig, Tianze Shi, Catherine Finegan-Dollak, and three anonymous reviewers for their discussion and feedback. We also thank Barry Williams for providing datasets from DatabaseAnswers.

References

Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2123–2132. JMLR.org.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland. Association for Computational Linguistics.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, HLT '94, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In NAACL.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742. Association for Computational Linguistics.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan Dhanalakshmi Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In ACL 2018. Association for Computational Linguistics.

Alessandra Giordani and Alessandro Moschitti. 2012. Translating questions to SQL queries with generative parsers discriminatively reranked. In COLING (Posters), pages 401–410.

Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. In NAACL.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. CoRR, abs/1704.08760.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Fei Li and HV Jagadish. 2014. Constructing an interactive natural language interface for relational databases. VLDB.

Yunyao Li, Huahai Yang, and HV Jagadish. 2006. Constructing a generic natural language interface for an XML database. In EDBT, volume 3896, pages 737–754. Springer.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.

Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. 2018. NL2Bash: A corpus and semantic parser for natural language interface to the Linux operating system. In LREC.

Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In ACL (1). The Association for Computer Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation (T). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE '15.

Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In Proceedings of the 20th International Conference on Computational Linguistics, page 141. Association for Computational Linguistics.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003a. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003b. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 149–157. ACM.

P. J. Price. 1990. Evaluation of spoken language systems: the ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, pages 91–95.

Chris Quirk, Raymond Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-then-that recipes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 878–888.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In ACL (1), pages 1139–1149. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Lappoon R. Tang and Raymond J. Mooney. 2001a. Using multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of the 12th European Conference on Machine Learning, pages 466–477, Freiburg, Germany.

Lappoon R. Tang and Raymond J. Mooney. 2001b. Using multiple clause constructors in inductive logic programming for semantic parsing. In ECML, volume 1, pages 466–477. Springer.

Chenglong Wang, Alvin Cheung, and Rastislav Bodik. 2017. Synthesizing highly expressive SQL queries from input-output examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 452–466. ACM.

Chenglong Wang, Po-Sen Huang, Alex Polozov, Marc Brockschmidt, and Rishabh Singh. 2018. Execution-guided neural program decoding. In ICML Workshop on Neural Abstract Machines and Program Induction v2 (NAMPI).

David H. D. Warren and Fernando C. N. Pereira. 1982. An efficient easily adaptable system for interpreting natural language queries. Computational Linguistics, 8(3-4):110–122.

Yuk Wah Wong and Raymond J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007), Prague, Czech Republic.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017a. SQLizer: Query synthesis from natural language. Proceedings of the ACM on Programming Languages.

Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017b. SQLizer: Query synthesis from natural language. Proc. ACM Program. Lang., 1(OOPSLA):63:1–63:26.

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from Stack Overflow. In International Conference on Mining Software Repositories (MSR).

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In ACL (1), pages 440–450. Association for Computational Linguistics.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of NAACL. Association for Computational Linguistics.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, pages 1050–1055, Portland, OR. AAAI Press/MIT Press.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. UAI.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.
