PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau
Element AI, a ServiceNow company
{torsten.scholak,dzmitry.bahdanau}@servicenow.com
Abstract

Large pre-trained language models for textual data have an unconstrained output space; at each decoding step, they can produce any of 10,000s of sub-word tokens. When fine-tuned to target constrained formal languages like SQL, these models often generate invalid code, rendering it unusable. We propose PICARD¹, a method for constraining auto-regressive decoders of language models through incremental parsing. PICARD helps to find valid output sequences by rejecting inadmissible tokens at each decoding step. On the challenging Spider and CoSQL text-to-SQL translation tasks, we show that PICARD transforms fine-tuned T5 models with passable performance into state-of-the-art solutions.

¹ The PICARD code is available at https://github.com/ElementAI/picard.

[Figure 1: Exact-set-match accuracy of the highest-scoring prediction as a function of beam size on the Spider text-to-SQL development set. With PICARD turned on, token predictions had to pass PICARD checking at every decoding step. Only the top-2, -4, and -8 token predictions of each hypothesis were considered in the beam search. With PICARD turned off (none), all token predictions were considered and none were checked. The models, T5-Base, -Large, and -3B, did not have access to any database content, only to the database schemas.]

1 Introduction

While there have been many successes in applying large pre-trained language models to downstream tasks, our ability to control and constrain the output of these models is still very limited. Many enterprise applications are out of reach because they require a degree of rigour and exactitude that language models are not yet able to deliver. If the target is a formal language like SQL, then we would like the model to adhere exactly and provably to the SQL specification with all its lexical, grammatical, logical, and semantic constraints. Unfortunately, with pre-training alone, language models may not satisfy these correctness requirements.

For text-to-SQL translation, the most widespread solution to constrained decoding is to make invalid SQL unrepresentable. For a while now it has been possible to restrict auto-regressive decoding to only those token sequences that correctly parse to SQL abstract syntax trees (Yin and Neubig, 2018; Lin et al., 2019; Wang et al., 2020). More recently, semi-auto-regressive improvements to this parsing paradigm have been proposed (Rubin and Berant, 2021). While effective, these approaches have in common that they come at the expense of a custom vocabulary of special control tokens, a custom model architecture, or both. Unfortunately, this makes them incompatible with generic pre-trained language model decoders. A less invasive and more compatible approach is to not constrain the generation process, but instead to filter finalized beam hypotheses by validity (Suhr et al., 2020; Lin et al., 2020). Yet, such filtering comes at the expense of a very large beam size.

We address the expenses of these approaches with a novel incremental parsing method for constrained decoding called PICARD, which stands for "Parsing Incrementally for Constrained Auto-Regressive Decoding."
PICARD is compatible with any existing auto-regressive language model decoder and vocabulary—including, but not limited to, those of large pre-trained transformers—and it does not require very large beam sizes. PICARD is entirely absent from pre-training or fine-tuning of the model, and can be easily and optionally enabled at inference time. PICARD operates directly on the output of the language model which, in the case of text-to-SQL translation, is the readable surface form of the SQL code.

In our experiments, we find that PICARD can significantly improve the performance of a large pre-trained language model (Raffel et al., 2020) after it is fine-tuned on the text-to-SQL task. On the Spider text-to-SQL dataset (Yu et al., 2018), we find that a T5-Base model with PICARD can outperform a T5-Large model without it, and likewise for a T5-Large and a T5-3B model. Significantly, with the help of PICARD, a T5-3B model can be raised to state-of-the-art performance on the Spider and CoSQL datasets (Yu et al., 2019).

2 The PICARD Method

PICARD warps model prediction scores and integrates trivially with existing algorithms for greedy and beam search used in auto-regressive decoding from language models. Its arguments are the token ids of the current hypothesis and, for each vocabulary token, the log-softmax scores predicted by the model's language modeling head. PICARD also has access to SQL schema information, in particular, information about the names of tables and columns and about which column resides in which table.

At each generation step, PICARD first restricts prediction to the top-k highest-probability tokens and then assigns a score of −∞ to those that fail PICARD's numerous checks (see Figure 2). These checks are enabled by fast incremental parsing (O'Sullivan and Gamari, 2021) based on monadic combinators (Leijen and Meijer, 2001). There are four PICARD mode settings that control their comprehensiveness: off (no checking), lexing, parsing without guards, and parsing with guards—the highest mode. A prediction that passes a higher mode will always pass a lower mode, but not necessarily vice versa.

[Figure 2: Illustration of constrained beam search with beam size 2 and PICARD. Each vertical column represents three token predictions for a hypothesis, from top to bottom in descending order by probability. In this example, PICARD is configured to check only the top-2 highest ones; the rest are automatically dismissed by setting their score to −∞. Tokens rejected by PICARD (red, ×) are also assigned a score of −∞. Accepted tokens (green, ✓) keep their original score.]
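To make this concrete, here is a minimal sketch of the score-warping step in Python. The predicate `check_fn` is a stand-in for PICARD's incremental lexer/parser (the released implementation is written in Haskell on top of attoparsec); the function name and signature are illustrative, not the published API.

```python
import torch

def picard_warp_scores(input_ids, scores, tokenizer, check_fn, top_k=2):
    """One decoding step of PICARD-style constrained beam search (sketch).

    input_ids: LongTensor (num_beams, seq_len), current hypotheses.
    scores:    FloatTensor (num_beams, vocab_size), log-softmax scores.
    check_fn:  str -> bool, stand-in for the incremental lexer/parser.
    """
    # Everything outside the top-k candidates is dismissed up front
    # with a score of -inf.
    warped = torch.full_like(scores, float("-inf"))
    top_scores, top_tokens = scores.topk(top_k, dim=-1)
    for beam in range(scores.size(0)):
        prefix = input_ids[beam].tolist()
        for score, token in zip(top_scores[beam].tolist(),
                                top_tokens[beam].tolist()):
            # Detokenize the hypothesis extended by the candidate token
            # and ask the checker whether it can still be completed to
            # valid SQL; rejected tokens keep a score of -inf.
            text = tokenizer.decode(prefix + [token],
                                    skip_special_tokens=True)
            if check_fn(text):
                warped[beam, token] = score  # accepted: original score
    return warped
```

Because only the top-k candidates per hypothesis are ever detokenized and checked, the per-step overhead stays small; a function of this shape can be slotted into existing beam-search implementations that support score warping.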
2.1 Lexing

In lexing mode, PICARD checks the output on a lexical level only. It attempts to convert the partial, detokenized model output to a white-space-delimited sequence of individual SQL keywords like select, punctuation like (), operators like + and -, literals like the string and number values in SQL conditions, and identifiers like aliases, tables, and columns—without being sensitive to the order in which these lexical items appear. In doing so, PICARD can detect spelling errors in keywords or reject table and column names that are invalid for the given SQL schema. For instance, consider the question "What are the email, cell phone and home phone of each professional?" from Spider's development set on the dog_kennels database. Our fine-tuned T5-Large model predicts select email_address, cell_phone, home_phone from professionals, whereas the ground truth selects cell_number instead of the invalid cell_phone column. This mistake is caught and avoided by PICARD in lexing mode.
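As an illustration, a drastically simplified, order-insensitive lexical check might look as follows. It validates words against SQL keywords and schema names while exempting the trailing fragment, which may be an incomplete token mid-generation; PICARD's actual lexer additionally handles punctuation, operators, literals, and aliases.

```python
import re

SQL_KEYWORDS = {"select", "from", "where", "group", "by", "having",
                "order", "limit", "join", "on", "as", "and", "or", "not",
                "in", "like", "distinct", "count", "avg", "min", "max",
                "sum", "asc", "desc"}

def lexing_check(partial_sql, schema_names):
    """Order-insensitive lexical check (sketch). `schema_names` is the
    set of lower-cased table and column names of the current database."""
    words = re.findall(r"[a-z_][a-z0-9_]*", partial_sql.lower())
    if words and not partial_sql[-1:].isspace():
        # The last word may be an unfinished token; exempt it so that
        # valid continuations are not rejected prematurely.
        words = words[:-1]
    return all(w in SQL_KEYWORDS or w in schema_names for w in words)
```

On the example above, cell_phone is not in the name set of dog_kennels, so every hypothesis containing that identifier is rejected as soon as the word is complete.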
2.2 Parsing without Guards

In the lowest parsing mode above lexing—referred to as parsing without guards—PICARD checks the output on a grammatical level. PICARD attempts to parse the detokenized model output to a data structure that represents the abstract syntax tree (AST) of the predicted SQL query. Contrary to lexing mode, the order in which keywords and clauses appear now matters. PICARD can reject invalid query structures, e.g., detect missing from clauses or incorrect orderings of clauses and keywords. It can also detect a range of issues with compositions of SQL expressions:
First, if PICARD matches on a tid.cid pattern, but the table with the id tid does not contain a column with the id cid, then that parse is rejected. Second, if PICARD first matches on an alias.cid pattern and later matches on the tid as alias pattern, but tid does not contain cid, then that parse is also rejected. An equivalent rule exists for sub-queries bound to table aliases. Lastly, PICARD prohibits duplicate binding of a table alias in the same select scope, but permits shadowing of aliases defined in a surrounding scope, which can happen in nested SQL queries.
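A sketch of how the first and third of these rules could be checked, assuming the parser has already collected the qualified column references and the alias bindings of the current select scope (these data structures are illustrative, not PICARD's internals):

```python
def check_column_references(schema, refs, aliases):
    """Reject tid.cid or alias.cid references whose (possibly aliased)
    table does not contain the column (rules one and two, sketch).

    schema:  dict mapping table id -> set of column ids
    refs:    iterable of (qualifier, column id) pairs seen so far
    aliases: dict mapping alias -> table id, from `tid as alias` matches
    """
    for qualifier, cid in refs:
        tid = aliases.get(qualifier, qualifier)  # resolve alias if bound
        if cid not in schema.get(tid, set()):
            return False
    return True

def check_alias_bindings(bindings_in_scope):
    """Prohibit duplicate binding of an alias within one select scope
    (rule three, sketch); shadowing across nested scopes is handled by
    keeping one such list per scope."""
    return len(bindings_in_scope) == len(set(bindings_in_scope))
```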
2.3 Parsing with Guards

In its highest parsing mode, PICARD engages in additional analyses—called guards—while assembling the SQL AST. If PICARD matches on tid.cid or alias.cid, then the guards require that the table tid or the alias alias, respectively, is eventually brought into scope by adding it to the from clause. Moreover, the alias alias is constrained to resolve to a table or a sub-query that has the column cid in it. If PICARD matches on a bare cid pattern, then another guard requires that exactly one table containing a column with that id is eventually brought into scope. These guards are enforced eagerly in order to fail fast and to eject invalid hypotheses from the beam at the earliest possible time, which is immediately after parsing the from clause.
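For illustration, the guard for a bare cid could be sketched like this, where tables_in_scope holds the tables brought into scope by the from clause (the names are hypothetical):

```python
def guard_unqualified_column(cid, tables_in_scope, schema):
    """Guard for an unqualified column reference (sketch): exactly one
    table in the current from-clause scope may own the column;
    otherwise the hypothesis is ejected from the beam."""
    owners = [tid for tid in tables_in_scope
              if cid in schema.get(tid, set())]
    return len(owners) == 1
```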
Only with these guards is PICARD able to reject a wrong prediction from our fine-tuned T5-Large model like select maker, model from car_makers for the question "What are the makers and models?" Here, the correct table to use would have been model_list, since it is the only one in Spider's car_1 schema that contains both a maker and a model column.

Additional checks and guards are conceivable, for instance, checking that only expressions of the same type are compared, or that the column types selected by union, except, or intersect queries match. We leave these additional checks to future work.
3 Experiments

Our experiments are mainly focused on Spider (Yu et al., 2018), a large multi-domain and cross-database dataset for text-to-SQL parsing. We train on the 7,000 examples in the Spider training set and evaluate on Spider's development set and its hidden test set. We also report results on the CoSQL SQL-grounded dialog state tracking task (Yu et al., 2019), where we predict a SQL query for each question given the previous questions in an interaction context. For this task, we train on both the Spider text-to-SQL training data and the CoSQL dialog state tracking training data, and evaluate on the CoSQL development and test sets.

Spider and CoSQL are both zero-shot settings: there is no overlap between questions or databases across the respective training, development, and test sets.

On Spider, we determine model performance based on three metrics: exact-set-match accuracy, execution accuracy, and test-suite execution accuracy (Zhong et al., 2020). Exact-set-match accuracy compares the predicted and the ground-truth SQL queries by parsing both into a normalized data structure. This comparison is not sensitive to literal query values, and the metric can decrease under semantics-preserving SQL query rewriting. Execution accuracy compares the results of executing the predicted and ground-truth SQL queries on the database contents shipped with the Spider dataset. This metric is sensitive to literal query values, but suffers from a high false-positive rate (Zhong et al., 2020). Lastly, test-suite execution accuracy extends execution accuracy to multiple database instances per SQL schema. The contents of these instances are optimized to lower the number of false positives and to provide the best approximation of semantic accuracy.

On CoSQL, we measure model performance in terms of question match accuracy and interaction match accuracy. Both metrics are based on exact-set-match accuracy. Interaction match accuracy is the joint accuracy over all questions in an interaction: an interaction counts as correct only if every question in it is matched correctly.
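Assuming per-question exact-set-match results grouped by interaction, the two CoSQL metrics reduce to the following sketch (the official evaluation script is the authoritative definition):

```python
def question_match(interactions):
    """Fraction of all questions whose prediction is an exact-set match.
    `interactions` is a list of lists of booleans, one list per dialog."""
    matches = [m for interaction in interactions for m in interaction]
    return sum(matches) / len(matches)

def interaction_match(interactions):
    """Joint accuracy: an interaction counts only if every one of its
    questions is an exact-set match."""
    return sum(all(i) for i in interactions) / len(interactions)
```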
We are encouraged by the results of Shaw et al. (2021), who showed that a pre-trained T5-Base or T5-3B model can not only learn the text-to-SQL task, but also generalize to unseen databases, and even that T5-3B can be competitive with the then state of the art (Choi et al., 2021; Wang et al., 2020)—all without modifications to the model. We therefore use T5 as the baseline for all our experiments.

In order to allow for generalization to unseen databases, we encode the schema together with the questions, using the same serialization scheme as Shaw et al. (2021). In experiments using database content, we detect and attach the database values to the column names in a fashion similar to the BRIDGE model by Lin et al. (2020). When fine-tuning for the CoSQL dialog state tracking task, we append the previous questions of the interaction to the input in reverse chronological order. Inputs exceeding the 512-token limit of T5 are truncated. The target is the SQL from the Spider and/or CoSQL training sets, unmodified except for a conversion of keywords and identifiers to lower case. We fine-tune T5 for up to 3072 epochs using Adafactor (Shazeer and Stern, 2018), a batch size of 2048, and a learning rate of 10⁻⁴.
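A sketch of such an input encoding is given below. The delimiter conventions shown here are an assumption for illustration; Shaw et al. (2021) and the PICARD repository define the exact format.

```python
def serialize_input(question, db_id, schema, history=()):
    """Serialize the question, schema, and (for CoSQL) prior questions
    into one T5 input string (illustrative, after Shaw et al., 2021).

    schema:  dict mapping table name -> list of column names
    history: previous questions, most recent first
    """
    tables = " | ".join(
        f"{table} : {', '.join(cols)}" for table, cols in schema.items()
    )
    return " | ".join([question, *history, db_id, tables])

# serialize_input("What are the makers and models?", "car_1",
#                 {"model_list": ["modelid", "maker", "model"]})
# -> 'What are the makers and models? | car_1 |
#     model_list : modelid, maker, model'
```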
Results. Our findings on the Spider dataset are summarized in Table 1 and Figure 1. Our reproductions of Shaw et al. (2021)'s results with T5 cannot compete with the current state of the art on Spider. The issue is that these models predict a lot of invalid SQL: for instance, 12% of the SQL queries generated by the T5-3B model on Spider's development set result in an execution error. However, when these same models are augmented with PICARD, we find substantial improvements. First, invalid SQL predictions become rare. For T5-3B with PICARD, only 2% of the predictions are unusable; in these cases, beam search exited without finding a valid SQL prediction. Second, and most significantly, PICARD lifts the T5-3B model to state-of-the-art performance. We measure an exact-set-match accuracy of 75.5% on the development set and 71.9% on the test set. The execution accuracies are 79.3% and 75.1%, respectively. These numbers are on par with or higher than those of the closest competitor, LGESQL + ELECTRA (Cao et al., 2021) (see Table 1). Furthermore, we achieve a test-suite execution accuracy of 71.9% on Spider's development set.

Our findings on the CoSQL dialog state tracking dataset (see Table 2) are similar to those for Spider. PICARD significantly improves performance, and our fine-tuned T5-3B model achieves state-of-the-art results.

PICARD does not only improve performance; it is also fast. During evaluation of the T5-3B model on Spider, the decoding speed with beam size 4 on an NVIDIA A100-SXM4-40GB GPU was, on average, 2.5 seconds per sample without PICARD and 3.1 seconds per sample with PICARD.

Beam Size. Figure 1 shows results on Spider without and with PICARD when parsing with guards, for different beam sizes and sizes of T5. For each model size, PICARD increases performance with increasing beam size. These increases are strongest for the step from beam size 1 to 2, less pronounced from 2 to 4, and saturate for beam sizes above 4. Even with greedy search (beam size 1), PICARD allows for some modest improvements. Note that, without PICARD, these models do not benefit from beam search. The number k of highest-probability tokens that are processed by PICARD at each decoding step has a modest to negligible impact on performance: it is largest for T5-Base, smaller for T5-Large, and almost undetectable for T5-3B. We do not study the case k = 1, because it reduces beam search to constrained greedy search.

[Figure 3: Exact-set-match accuracy on the Spider development set as a function of beam size for top-4 PICARD on T5-Large (schema only) and for different operation modes: turned off, lexing, parsing without guards, and parsing with guards. In each mode, PICARD is either used incrementally at each step or only when finalizing a hypothesis.]

Ablations. Figure 3 condenses our ablation analysis for PICARD. We show results for our T5-Large model in all four PICARD checking modes and for four different beam sizes on the Spider development set. When checking incrementally at each decoding step, lexing shows a small improvement over the unconstrained T5 model. The results without PICARD and with PICARD in lexing mode are largely independent of the beam size. This is different when PICARD is switched into the more sophisticated parsing modes: both with and without guards, improvements from PICARD increase rapidly with increasing beam size, where parsing with guards clearly has a strong lead over parsing without them.
System                                            Development      Test
                                                  EM%    EX%     EM%    EX%

BRIDGE v2 + BERT (ensemble) (Lin et al., 2020)    71.1   70.3    67.5   68.3
SmBoP + GraPPa† (Rubin and Berant, 2021)          74.7   75.0    69.5   71.1
RATSQL + GAP† (Shi et al., 2021)                  71.8   -       69.7   -
DT-Fixup SQL-SP + RoBERTa† (Xu et al., 2021)      75.0   -       70.9   -
LGESQL + ELECTRA† (Cao et al., 2021)              75.1   -       72.0   -

T5-Base (Shaw et al., 2021)                       57.1   -       -      -
T5-3B (Shaw et al., 2021)                         70.0   -       -      -
T5-Base (ours)                                    57.2   57.9    -      -
T5-Base+PICARD                                    65.8   68.4    -      -
T5-Large                                          65.3   67.2    -      -
T5-Large+PICARD                                   69.1   72.9    -      -
T5-3B (ours)                                      69.9   71.4    -      -
T5-3B+PICARD                                      74.1   76.3    -      -
T5-3B†                                            71.5   74.4    68.0   70.1
T5-3B+PICARD†                                     75.5   79.3    71.9   75.1

Table 1: Our results (bottom) and relevant prior art (top) on the Spider text-to-SQL task. Shown are the exact-set-match accuracy (EM) and execution accuracy (EX) percentages on Spider's development and test sets. Our results are for a beam of size 4, with PICARD parsing with guards for the top-2 token predictions. A dagger (†) indicates use of database content; otherwise, schema only.
System                               Development      Test
                                     QM%    IM%     QM%    IM%

RATSQL + SCoRe (Yu et al., 2021)     52.1   22.0    51.6   21.2
T5-3B                                53.8   21.8    51.4   21.7
T5-3B+PICARD                         56.9   24.2    54.6   23.7

Table 2: Our results (bottom) and relevant prior art (top) on the CoSQL dialog state tracking task. Shown are the question match accuracy (QM) and interaction match accuracy (IM) percentages on CoSQL's development and test sets. Our results are for a beam of size 4, with PICARD parsing with guards for the top-2 token predictions.
In order to compare PICARD with the filtering-by-validity approach of Suhr et al. (2020) and Lin et al. (2020), we also studied what happens when PICARD only checks hypotheses once the model predicts their finalization with the end-of-sequence token.² In this restrained mode, PICARD is still effective, but much less so than in normal incremental operation. The gap between these two modes of operation only begins to shrink for large beam sizes. This is understandable, since Lin et al. (2020) used beam sizes of at least 16 and up to 64 to reach optimal results with filtering, while Suhr et al. (2020) used a beam of size 100.
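For comparison with the incremental sketch in Section 2, the restrained mode differs only in when the checker fires; a minimal sketch, assuming a check_fn that validates a complete query:

```python
def warp_final_only(input_ids, scores, eos_token_id, check_fn, tokenizer):
    """Restrained mode (sketch): a hypothesis is checked only when the
    end-of-sequence token would finalize it; all other tokens keep
    their scores, as in unconstrained beam search."""
    for beam in range(scores.size(0)):
        text = tokenizer.decode(input_ids[beam], skip_special_tokens=True)
        if not check_fn(text):
            # Invalid query: forbid finalization of this hypothesis.
            scores[beam, eos_token_id] = float("-inf")
    return scores
```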
4 Conclusion

We propose and evaluate a new method, PICARD, for simple and effective constrained decoding with large pre-trained language models. On both the Spider cross-domain, cross-database text-to-SQL dataset and the CoSQL SQL-grounded dialog state tracking dataset, we find that the PICARD decoding method not only significantly improves the performance of fine-tuned but otherwise unmodified T5 models; it also lifts a T5-3B model to state-of-the-art results on the established exact-match and execution accuracy metrics.

Acknowledgements

We thank Lee Zamparo for his contributions to the experiments on the CoSQL dataset. Further, we would like to thank Pete Shaw for his input on the reproduction of the T5 results on Spider. We would also like to extend our gratitude to Tao Yu and Yusen Zhang for their efforts in evaluating our model on the test splits of the Spider and CoSQL datasets. Finally, we thank our anonymous reviewers for their time and valuable suggestions.

² This is not exactly equivalent to filtering a completely finalized beam, because the hypotheses rejected by PICARD never enter it and never take up any space.
References

Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: Line graph enhanced text-to-SQL model with mixed local and non-local relations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2541–2555, Online. Association for Computational Linguistics.

DongHyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2021. RYANSQL: Recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases. Computational Linguistics, 47(2):309–332.

Daan Leijen and Erik Meijer. 2001. Parsec: Direct style monadic parser combinators for the real world. Technical Report UU-CS-2001-27.

Kevin Lin, Ben Bogin, Mark Neumann, Jonathan Berant, and Matt Gardner. 2019. Grammar-based neural text-to-SQL generation.

Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2020. Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing. Findings of the Association for Computational Linguistics: EMNLP 2020.

Bryan O'Sullivan and Ben Gamari. 2021. attoparsec: Fast combinator parsing for bytestrings and text. Software available on the Haskell package repository.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Ohad Rubin and Jonathan Berant. 2021. SmBoP: Semi-autoregressive bottom-up semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 311–324, Online. Association for Computational Linguistics.

Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and Kristina Toutanova. 2021. Compositional generalization and natural language variation: Can a semantic parsing approach handle both? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 922–938, Online. Association for Computational Linguistics.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR.

Peng Shi, Patrick Ng, Zhiguo Wang, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Cicero Nogueira dos Santos, and Bing Xiang. 2021. Learning contextual representations for semantic parsing with generation-augmented pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13806–13814.

Alane Suhr, Ming-Wei Chang, Peter Shaw, and Kenton Lee. 2020. Exploring unexplored generalization challenges for cross-database semantic parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8372–8388, Online. Association for Computational Linguistics.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon J. D. Prince, and Yanshuai Cao. 2021. Optimizing deeper transformers on small datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2089–2102, Online. Association for Computational Linguistics.

Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter Lasecki, and Dragomir Radev. 2019. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1962–1979, Hong Kong, China. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Alex Polozov, Christopher Meek, and Ahmed Hassan Awadallah. 2021. SCoRe: Pre-training for context representation in conversational semantic parsing. In International Conference on Learning Representations.
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-SQL with distilled test suites. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).