Natural Language To SQL in Low-Code Platforms
∗ Work done while working at OutSystems.
† For questions about the paper reach out to filipe.assuncao@outsystems.com

Abstract

We present a pipeline allowing developers to write natural language (NL) to retrieve data. In this study, we collect, label, and validate data covering the SQL queries most often performed by OutSystems users. We use that data to train a NL model that generates SQL. Alongside this, we describe the entire pipeline, which comprises a feedback loop that allows us to quickly collect production data and use it to retrain our SQL generation model. Using crowd-sourcing, we collect 26k NL and SQL pairs and obtain an additional 1k pairs from production data. Finally, we develop a UI that allows developers to input a NL query in a prompt and receive a user-friendly representation of the resulting SQL query. We use A/B testing to compare four different models in production and observe a 240% improvement in terms of adoption of the feature, 220% in terms of engagement rate, and a 90% decrease in failure rate when compared against the first model that we put into production, showcasing the effectiveness of our pipeline in continuously improving our feature.

1 Introduction

Low-code platforms, such as Service Studio provided by OutSystems, are widely used for application and web development, especially by developers with limited engineering backgrounds (Sahay et al., 2020). Data retrieval in low-code platforms is particularly important: it is one of the most searched topics (Al Alamin et al., 2021) and represents 50% of development time according to our internal reports. In OutSystems, developers retrieve data from databases using an aggregate, a visual representation of a SQL query. In Appendix A.1 we describe the different components of an aggregate.

Bringing NL-based data retrieval to such a platform raises production concerns like multiple data sources, extensive data model schemas, high throughput requirements, and UI design. Previous studies have not applied this approach to production settings (Arcadinho et al., 2022; Scholak et al., 2021; Wang et al., 2019; Yu et al., 2019, 2018). We put forward a production pipeline for the Natural Language to SQL (NL2SQL) task. We explore different models for SQL generation, and use a data-driven approach to collect the right labels for supervised classification according to the production data distribution. We also design a UI (see Figure S1 for an example) for developers to write the NL and receive the corresponding SQL query as an aggregate.

We conduct offline experiments for the NL2SQL task and achieve a good balance between accuracy and response time using a curated in-house test set that mimics our users' usage distribution. Additionally, we perform online A/B testing to compare model performance. Here, the model is evaluated according to adoption, engagement, and failure rate. Our SQL generation model, based on T5 with constrained decoding (as described in Arcadinho et al. (2022)), increases engagement by 220%, adoption by 240%, and decreases the failure rate by 90% when compared against a baseline model that we initially deployed, based on (Wang et al., 2019).

Thus, our main contribution is a data retrieval pipeline for low-code developers with no SQL background. Our pipeline performs data collection, SQL generation, and model retraining. As far as we know, no previous study has showcased a complete pipeline. We also produce an efficient model for NL2SQL: it has high accuracy, can handle large data model schemas, and rarely generates invalid SQL. Finally, we describe a UI that, given NL queries, provides the users with SQL queries in a simplified visual representation (an aggregate).
The remainder of the paper is organized as follows. Section 2 describes the pipeline and its different components. Section 3 details the experiments. We discuss related work in Section 4 and present our conclusions in Section 5.

2 Method

We develop a pipeline to produce a visual representation of SQL (i.e., an aggregate) from pairs of textual utterances and data model schemas. Our pipeline has three main sections: data collection, model training, and model deployment (Figure 1).

2.1 Data collection

We combine data from three sources: public datasets, datapoints collected from Amazon Mechanical Turk (MTurk), and data from production.

2.1.1 Public data

We use data from Spider (Yu et al., 2018) since it is a cross-domain public dataset with a diverse set of SQL query structures of varying difficulty. Spider comprises ∼6k SQL queries and ∼10k questions across 200 databases. While useful in production and for comparing our method with other state-of-the-art approaches, we note that Spider deviates from our requirements in the following points:

1. We can only rely on the tables and column names (i.e., the database schema) because the data of our users is confidential. Spider, however, contains data points where the utterance depends on the values in the database for the creation of the appropriate SQL query.

2. The database schemas of our users' applications are larger than those found in Spider (see Figure 2). Models like RATSQL (Wang et al., 2019) cannot handle our real-world database schemas due to their size, and the Spider dataset lacks that representativeness.

3. Spider's table and column names differ significantly from those in our real-world databases and, in general, are less complex.

4. OutSystems-specific properties (such as variables and functions, not native to SQL) need to be predicted by the model. These properties are, obviously, not found on public datasets.

5. The most commonly asked questions in our real-world setting are not present in public datasets. Thus, if we simply evaluate models on Spider, we do not have an accurate estimation of the model performance in production.

2.1.2 Crowdsourced labelled data

To mitigate the points discussed above, we complement Spider's data with additional labeled data. We have few utterance-to-SQL pairs from production, but we have many SQLs created by users. Thus, we use crowdsourcing to label SQLs created by developers with the corresponding NL, as suggested by Roh et al. (2019). We use MTurk since it allows us to select workers with technical backgrounds. We collect 26k pairs of SQL and the corresponding NL utterance. Figure S2 shows how the MTurk task was shown to labellers.

Since the primary goal of the data collection is to mimic our users' SQL usage distribution, we need to label a representative set of examples. However, SQLs are very granular, and thus we cannot simply find the k-most frequent SQLs. Instead, we develop templates that capture the overall structure of the SQL query (e.g., "SELECT * FROM [TABLE] JOIN [TABLE] on [COL] = [COL]"). We find the most common templates and sample queries from them according to the global distribution. Then, we send queries to MTurk according to the templates' distribution.
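As an illustration, templatization can be approximated with a few regular expressions. The sketch below is not the production implementation, only a minimal example of mapping a concrete query to a coarse template such as the one above.

    import re

    def sql_template(sql: str) -> str:
        """Reduce a concrete SQL query to a coarse structural template (illustrative only)."""
        s = sql.strip().rstrip(";")
        s = re.sub(r"'[^']*'", "[LIT]", s)                               # mask string literals
        s = re.sub(r"\b\d+(\.\d+)?\b", "[LIT]", s)                       # mask numeric literals
        s = re.sub(r"\b(FROM|JOIN)\s+\w+", r"\1 [TABLE]", s, flags=re.I) # mask table names
        s = re.sub(r"\b\w+\.\w+\b", "[COL]", s)                          # mask qualified column names
        return re.sub(r"\s+", " ", s).upper()

    assert sql_template("select * from Users join Accounts on Users.Id = Accounts.UserId") \
        == "SELECT * FROM [TABLE] JOIN [TABLE] ON [COL] = [COL]"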
We validate queries for model training using a semi-automatic process to identify workers with bad answers. In practice, we manually validate triggers from the automatic validation process and create a denylist for future tasks.

We use the features provided by MTurk and do feature engineering on the labeled pairs. The resulting features are described in Appendix A.2. We use a decision tree classifier to select the threshold score to use as a trigger, i.e., if the NL has a score above this threshold, the reviewer needs to manually validate the NL. After each crowdsourcing iteration, we also automatically analyze the workers who had a bad performance on our tasks and provide feedback to allow them to improve their labeling. If it is the second time that a worker has had a bad performance, we exclude them from future tasks. From all the experiments, we collect a total of 26k valid data points.
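The snippet below is a minimal sketch of this trigger: a shallow decision tree trained on MTurk-derived features whose predicted probability decides whether a labeled pair is routed to a human reviewer. The feature columns and the toy target are illustrative, not the exact production ones.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    # Illustrative feature columns: work time (s), has more than two words,
    # NL/SQL Levenshtein distance, SQL complexity (see Appendix A.2).
    X = rng.random((200, 4))
    y = (X[:, 0] < 0.2).astype(int)  # toy target: suspiciously fast answers are "bad"

    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    def needs_manual_review(features: np.ndarray, threshold: float = 0.5) -> bool:
        """Send the (SQL, NL) pair to a reviewer when the predicted probability of a bad answer exceeds the threshold."""
        return clf.predict_proba(features.reshape(1, -1))[0, 1] > threshold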
2.1.3 Data from production

Having deployed the trained model to production, we then start collecting production data. We start by analyzing the model performance on each template. Then, we select the templates with the worst performance and collect the corresponding SQLs. Note that we only select SQLs that were edited by our users, together with the corresponding NL.
Figure 1: Machine learning pipeline comprising the three main components: data collection (left), model training (top right), and model deployment (bottom right).
In the end, we will have several SQLs for the same NL, where each SQL corresponds to an edit made by the user, and we need to determine which one of the SQLs corresponds to the given NL. We run the quality estimator (described in Section 2.2) to help with the review. After that, the data is uploaded to an internal labeling tool to be labeled. Finally, we add that data to the training set and retrain the model.

2.2 Quality estimation

As stated in the previous section, for each pair of utterance and predicted SQL query, we collect all the changes that the user made to the SQL query. We use this information to retrain the model. The interactions can be divided into four types:

1. The model predicts the right SQL, and the user does not change anything.

2. The model predicts the wrong SQL, and the user corrects it.

3. The model predicts the right SQL. The user changes their mind and changes the SQL.

4. The model predicts the wrong SQL. However, the user doesn't change it.

We select only the interactions where the user made changes to the predicted SQL to retrain the model. However, this presents challenges since the final SQL is not always correct for the utterance (case 4), and the user might change their mind during development (case 3). As such, we build a system that can filter out wrong pairs of SQL and utterances and select the correct ones. To this end, we employ two different systems, one based on rules and another based on a quality estimator.

2.2.1 Rule-based system

This system uses the same features described in the previous section, with the only change being the weights in the decision tree.

2.2.2 ML-based estimator

Following the work done in (Arcadinho et al., 2022), we fine-tune CodeBERT (Feng et al., 2020) to detect whether a SQL query matches an utterance or not. For this, we use the training dataset of our NL to SQL model and generate possible SQLs for each utterance. Then, we annotate a pair as positive if the SQL matches the gold SQL or if the query result is the same as the gold one. Having this model fine-tuned, for each interaction we compute the probability of each SQL being a match for the given utterance and select the pair with the highest score. Finally, we filter out all the pairs that have a score below a given threshold.
2.3 SQL generation

We use as input the NL query of the collected SQLs (see Section 2.1), and our goal is to generate the corresponding SQL as output. Figure S3 illustrates the full pipeline of our SQL generation.

T5QL (Arcadinho et al., 2022) is a Seq2Seq method that converts the input sequence into the final SQL. For instance, the input of Figure S3 is converted to "Give me the users sorted by country | User, Account: country" and the correct SQL is "from User select * order by country".

T5QL comprises a generator (a T5 model (Raffel et al., 2020)) and a ranker (a CodeBERT model (Feng et al., 2020)); we note, however, that T5QL is easily adaptable to use other models as either the generator or the ranker. The generator is trained to output SQLs that correspond to the input sequence. We use beam search to output several candidates, and the ranker's task is to re-rank the generator's candidates. As observed in (Arcadinho et al., 2022), using a re-ranker can improve the generator's performance for SQL generation.

As in traditional Seq2Seq frameworks, the generator iteratively generates a token using as input the input sequence and the generation up to that point. However, unlike traditional methods, in T5QL the next-token generation is constrained to produce only valid SQL. To guarantee this property, we build a context-free grammar (CFG) of SQL statements and use the parser's lookahead to fetch valid continuations (i.e., tokens) of the current generation; then, we mask all invalid tokens, thus only allowing the generator to generate valid tokens.
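The sketch below shows one way to express this kind of grammar-constrained beam search with the Hugging Face generate() API; the valid_next_tokens helper stands in for the CFG parser lookahead and is purely illustrative.

    import torch
    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    def valid_next_tokens(prefix_sql: str) -> list[int]:
        # Placeholder: a real implementation asks the SQL CFG parser which terminals
        # can legally follow `prefix_sql` and maps them back to tokenizer ids.
        return list(range(len(tokenizer)))

    def prefix_allowed_tokens_fn(batch_id: int, generated_ids: torch.Tensor) -> list[int]:
        prefix_sql = tokenizer.decode(generated_ids, skip_special_tokens=True)
        return valid_next_tokens(prefix_sql)

    inputs = tokenizer("Give me the users sorted by country | User, Account: country", return_tensors="pt")
    candidates = model.generate(
        **inputs,
        num_beams=12,
        num_return_sequences=12,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,  # masks invalid tokens at each step
        max_new_tokens=64,
    )
    print(tokenizer.batch_decode(candidates, skip_special_tokens=True))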
The generator outputs multiple candidates by using beam search. We use the ranker to learn to re-rank the candidates by giving it as input a pair of NL and SQL and a label informing the ranker if the two correspond to each other. To train the ranker, we give it multiple pairs where only one of them is correct and the others are: eleven generations using beam search and two random SQLs from the training data. We train CodeBERT in a cross-encoding setting.
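For concreteness, a sketch of how such ranker training examples could be assembled is given below; the helper name and its arguments are illustrative and not part of T5QL.

    import random

    def ranker_examples(utterance: str, gold_sql: str, beam_candidates: list[str], training_sqls: list[str]):
        """Build (NL, SQL, label) triples: one positive and up to thirteen negatives per utterance."""
        negatives = [c for c in beam_candidates[:11] if c != gold_sql]
        negatives += random.sample(training_sqls, 2)
        examples = [(utterance, gold_sql, 1)] + [(utterance, sql, 0) for sql in negatives]
        random.shuffle(examples)
        return examples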
Since, in real-world scenarios, the data model schemas can be very big (because users can have many tables and columns), we use LongT5 (Guo et al., 2021) in production: it uses sparse attention, which allows us to handle bigger input sizes without the need for large GPUs.

2.4 Model Training

After collecting data, the next step is to retrain the model. This section focuses on the training part and explains the strategy used for retraining the model. In the evaluation step, we detail the metrics used to evaluate the models and their limitations.

2.4.1 Training

As explained in Section 2.1, the model is trained using a new version of the training dataset in each training iteration. Data points from the SQL templates where the model had the worst performance are added to the train set.

By doing this, we perform active learning where the query strategy is clustering (Zhan et al., 2022) on the predicted SQL template. The cluster selection is based on the previous model's performance and on user feedback, such as the average number of changes made to a prediction.

After collecting the data from the previously trained model, we fine-tune the model.
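A minimal sketch of this selection step is shown below: production pairs are grouped by predicted SQL template and the worst-performing templates are oversampled for the next training round (field names and the budget are illustrative).

    import random
    from collections import defaultdict

    def select_retraining_data(pairs, per_template_accuracy, budget=1000):
        """pairs: dicts with 'nl', 'sql' and 'template' keys; lower template accuracy means a higher sampling weight."""
        by_template = defaultdict(list)
        for pair in pairs:
            by_template[pair["template"]].append(pair)

        weights = {t: 1.0 - per_template_accuracy.get(t, 0.0) for t in by_template}
        total = sum(weights.values()) or 1.0

        selected = []
        for template, items in by_template.items():
            k = min(len(items), round(budget * weights[template] / total))
            selected.extend(random.sample(items, k))
        return selected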
2.4.2 Evaluation

When evaluating a trained model, it is important to consider both the computed metrics (e.g., the model's accuracy) and the metrics that capture inference time and the overall experience observed after deploying into production. We divide the computed metrics into two types: model performance metrics and server performance metrics.

To evaluate model performance, we use a combination of metrics comprising exact match, execution match (Zhong et al., 2020), and Tree Edit Distance (TED).

The first two metrics are the norm when evaluating Text-to-SQL models on the Spider benchmark (Yu et al., 2018). However, it is hard to relate these metrics to user perception. For instance, a query with an error in the WHERE conditions gets the same score as a completely wrong SQL: exact match and execution match are binary, which removes any notion of distance between SQL queries. To address this, we use the TED between the abstract syntax trees (ASTs) of the gold and predicted SQLs. Using the TED metric, we can compute the number of changes one needs to perform to transform one SQL into the other, which closely relates to the error perceived by the user. Furthermore, we also give different weights when computing the TED; for example, we penalize an error in the selected tables more than an error in a selected column.
The weights were selected based on the number of clicks a user needs to perform to correct the span.
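Below is a minimal sketch of computing a TED between toy SQL ASTs with the zss package; the tree construction is illustrative, and the per-node-type weights used in production would be plugged in through custom edit costs.

    from zss import Node, simple_distance

    def toy_ast(clauses: dict) -> Node:
        """Build a toy AST from, e.g., {'from': ['User'], 'select': ['*'], 'order_by': ['country']}."""
        root = Node("query")
        for clause, values in clauses.items():
            clause_node = Node(clause)
            for value in values:
                clause_node.addkid(Node(value))
            root.addkid(clause_node)
        return root

    gold = toy_ast({"from": ["User"], "select": ["*"], "order_by": ["country"]})
    pred = toy_ast({"from": ["Account"], "select": ["*"], "order_by": ["country"]})

    # Unweighted TED: one relabel (User -> Account). Production weighting would make
    # this table error cost more than, e.g., a wrong column in the ORDER BY clause.
    print(simple_distance(gold, pred))  # 1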
Looking at server performance, we also compare the response time on the test dataset by looking at the 50th, 90th, and 99th percentiles. This is done because some improvements could impact the response time; for instance, bigger models have a slower response time.
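A minimal sketch of this latency summary is shown below (the sample response times are made up).

    import numpy as np

    def latency_percentiles(response_times_s):
        p50, p90, p99 = np.percentile(response_times_s, [50, 90, 99])
        return {"p50": p50, "p90": p90, "p99": p99}

    print(latency_percentiles([0.8, 1.2, 1.4, 2.1, 5.3, 1.1, 0.9]))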
Finally, as seen in Figure 1, all this analysis is done at the template level, and we then aggregate the results using micro and macro averages. Doing this helps us select the next set of SQL templates we should collect to improve our model.

2.5 Deployment

In this section, a deployment refers to introducing a new service version. It should be noted that new service versions result from new models, from changes in the pre/post-processing pipelines (these steps and the model forward pass constitute our model pipeline), and from changes in the API code. We use Amazon S3 as our model registry and FastAPI as the Python web framework to build the API that calls our model pipeline. This API, alongside the model pipeline code, is containerized in our service image, which runs in a Kubernetes infrastructure (we use Amazon EKS to run Kubernetes on AWS).
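The snippet below is a minimal sketch of what such a FastAPI endpoint can look like; the route, payload fields, and the model_pipeline stub are illustrative and not the actual OutSystems service API.

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class NL2SQLRequest(BaseModel):
        utterance: str
        schema_description: str  # serialized data model schema

    class NL2SQLResponse(BaseModel):
        sql: str

    def model_pipeline(utterance: str, schema_description: str) -> str:
        # Stub: in the real service this runs pre-processing, the generator with
        # constrained decoding, the ranker, and post-processing.
        return "from User select * order by country"

    @app.post("/v1/nl2sql", response_model=NL2SQLResponse)
    def nl2sql(request: NL2SQLRequest) -> NL2SQLResponse:
        return NL2SQLResponse(sql=model_pipeline(request.utterance, request.schema_description))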
We use a standard GitOps-based (Beetz and Harrer, 2022) CI/CD approach to automate the deployments of the service. The high-level sequence of events, from implementing a change in the service code to having it in production, is described below.

1. When we are ready to release a new change in the service (be it a new model, a post/pre-processing approach, or a bug fix), we create a pull request in GitHub.

2. Creating this PR triggers our CI pipeline to run the unit tests on top of our changes.

3. If the tests pass, then the model is automatically placed in the Development environment, where we can test it out through an API.

4. Once we are at a stage where we feel comfortable with passing these new changes to the Quality Assurance (QA) environment, we merge the pull request into the main branch.

5. Merging the pull request into the main branch automatically triggers integration tests.

6. If the integration tests pass, the QA environment is updated, and a new production candidate release is created.

7. When we feel comfortable advancing the changes that are in QA, we manually approve the production release candidate, which triggers our CD pipeline and puts these changes in Production.

For us it is crucial that step 7 of the above list happens without downtime, meaning that we can deploy a new version of the service without making it unavailable for the users. All of our deployment strategies follow this principle, and in practice, we alternate between three different strategies based on the need and the release content:

• Rolling Update: this is the simplest approach to update our service. It replaces the old version of the service with the new one, one instance (Kubernetes pod) at a time. With this strategy, users are served by the new service version within a few minutes. In practice, we only use this strategy for low-risk bug fixes.

• Progressive Rollout: this strategy is similar to a rolling update, but it only updates a fraction of instances instead of all of them. For example, a progressive rollout starting at 20% involves updating 20% of the resources with the new model version, which then receive 20% of incoming requests. The remaining resources and requests are kept on the old version of the service. Once we are confident that the new service version is responding well (looking at service metrics such as errors and latency before progressing), we advance this fraction progressively until 100% of requests and resources are served by the new version. This approach minimizes deployment risk by only progressing to more users after ensuring the service is responding as expected. We use this strategy when deploying code changes with some risk, such as optimizations in the pre/post-processing pipelines or the service API code, without a new feature.

• A/B testing: in the A/B testing deployment strategy, we update a fraction of the instances with the new model version, but unlike the progressive rollout strategy, this fraction doesn't increase progressively.
Instead, an experiment with this split fraction (which we usually set at 50%) is started. After statistical significance is reached and a winning feature is found, the experiment ends. All the resources are then updated (with the usual rolling update) so that users are only served by the winning feature. Thus, this deployment strategy is associated with starting an experiment to measure the impact (based on business metrics) of the new service version. It is the most common deployment strategy we use when deploying a new model or a new constrained-decoding version. We do this because we want to go beyond the offline metrics we evaluated in the lab and check whether the new model can positively impact the users. Some A/B testing experiments can be found in Section 3.2, and a sketch of the significance check follows this list.
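As a minimal sketch (with made-up counts), the decision to end an experiment can be backed by a two-proportion z-test on, e.g., adoption; this is an illustration of such a statistical check, not the exact test used in production.

    from statsmodels.stats.proportion import proportions_ztest

    adopters = [960, 1700]    # users who adopted the feature: control vs. new version
    exposed = [10000, 10000]  # users exposed to each variant
    stat, p_value = proportions_ztest(adopters, exposed)
    print(f"z={stat:.2f}, p={p_value:.4f}")  # declare a winner once p is below the chosen significance level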
2.6 Feedback Loop

Once our model is deployed and serves our customers, we start to receive user feedback on our feature. It is crucial to gather this feedback to set up a feedback loop in which we use the feedback data to retrain our model. This feedback loop sets up a virtuous cycle of data: the user feedback data is used to create better models, which makes the product receive more usage and, consequently, more user feedback. That feedback can then be used to improve the model's performance further.

There are two types of user feedback, implicit and explicit. Both have advantages and drawbacks, and sometimes combining both can lead to the best results, as observed in (Zhao et al., 2018). For this feature, the desired experience dictated that we only have implicit user feedback. We collect user feedback through the product's telemetry data. Then, we can observe whether the user kept the suggestion provided by the model or whether the suggestion was changed or deleted. When the suggestion is changed, we also collect the final state of the user's SQL query.
We correlate these telemetry events to find our gold SQL (the final SQL query after the user's edits). On the input side, our service stores each request (the NL query and the data model schema). By merging these data sources, we get our labeled data through implicit user feedback.
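The join itself can be as simple as the sketch below, which merges service request logs with IDE telemetry to recover (NL, final SQL) pairs; the column names and example rows are illustrative.

    import pandas as pd

    requests = pd.DataFrame({
        "request_id": [1, 2],
        "nl": ["show active users", "orders sorted by date"],
        "data_model_schema": ["User(...)", "Order(...)"],
    })
    telemetry = pd.DataFrame({
        "request_id": [1, 2],
        "suggestion_kept": [False, True],
        "final_sql": ["from User select * where Active = 1", "from Order select * order by Date"],
    })

    labeled = requests.merge(telemetry, on="request_id")
    # Only edited suggestions go through quality estimation and, eventually, retraining.
    edited = labeled[~labeled["suggestion_kept"]]
    print(edited[["nl", "final_sql"]])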
3 Experiments

As mentioned in Section 2.5, before releasing a new model to all users, we first evaluate its potential impact on a sample of the users. In this section, we analyze the performance of a set of models. First, we perform offline experiments on an internal test set. Then, we share the results computed on a sample of users through A/B testing. All the results are compared against a baseline, RATSQL+GAP (Shi et al., 2021) fine-tuned on our internal dataset. The composition of the internal dataset is shown in Table 1. We note that the validation dataset is a subset of the train dataset, thus we omit it.

Table 1: Internal dataset used in the offline experiment.

                    Train      Test
    Instances       29 000     4 860
    Public          25.86%     0%
    Crowdsourcing   73.26%     97.84%
    Internal        1.88%      2.16%

We run all the tests on a G4DN.xlarge machine with a Tesla V100 with 16GB of GPU memory. This setup allows the tests to run in less than 12h.

3.1 Offline experiments

Our first deployed model was a version of RATSQL+GAP, which we refer to as the baseline henceforth. We achieved two major improvements since first deploying this model. The first improvement was to use T5QL; (Arcadinho et al., 2022) shows that augmenting T5 with Constrained Decoding (CD) and adding a re-ranker of the generator's predictions can boost the performance of small models (e.g., T5-Base). The second major improvement was the introduction of context rules into CD, which constrain the generation even more based on the context, together with a reduction of the numerical precision of the model. The results of each improvement are summarized in Table 2 and discussed next.

3.1.1 Maximum input size

The maximum input size is where we see the biggest improvement. This is justified by the fact that RATSQL+GAP (Shi et al., 2021) uses a fine-tuned version of the BART encoder (Lewis et al., 2019), which has a maximum input size of 1024 tokens. However, our token length distribution shows that the real data has instances with more tokens than what the baseline model supports (see Figure 2). Using the settings described previously, we increase the input size from 1024 tokens to 20k tokens.
This limit was further pushed to 26k by allowing the model to run in FP16 precision.

Figure 2: Input size (number of tokens) for a sample of our data. The sample used had a size of 50,000.

3.1.2 Response time

The response time measures each model's average time to answer a request. In Table 2, one can see that the first iteration of the T5QL model increased the response time of our model; however, by reducing the precision of the model we achieve results competitive with the baseline.

3.1.3 Errors
Finally, analyzing the TED of each improvement, one can observe that by reducing the numerical precision of the model, the TED also increases. This is expected since FP16 introduces small rounding errors that affect the final prediction. Nevertheless, the final error rate is still significantly better than our defined baseline, by ≈ 654%.
3.2 Online experiments

After the deployment of the model to production and during the A/B testing phase, we have a dashboard to track the experiment and monitoring systems to detect failures. We measure adoption, engagement, and success during the experiments. In the following subsections, we explain each metric and then detail the results.

3.2.1 Adoption

Adoption is measured by two different metrics: the adoption rate and the adoption funnel. The adoption rate is the percentage of users who have used the Natural Language to Aggregate (NL2Agg) feature, i.e., the percentage of users that created aggregates using natural language relative to the total number of users exposed to the feature in the last four weeks.

3.2.2 Engagement

The percentage of successful aggregates out of the total number of created aggregates in the last four weeks. An aggregate is successful when it is not deleted, edited, or registered with errors.

3.2.3 Failures

The percentage of invalid aggregates relative to the total number of suggested aggregates in the last four weeks. An aggregate is considered invalid when it doesn't meet all the validation criteria.

3.2.4 Results

In Table 3, we present the results for two A/B testing experiments. In the first experiment, we compare our baseline (RATSQL) against T5QL, while in the second experiment, we compare a half-precision version of T5QL, which we call T5QL + FP16, against our initial T5QL model (which is in full precision). As we can see in Table 3, the first experiment validates the considerable improvement of T5QL compared to our baseline. Although the latencies are similar (our baseline has a slightly lower latency in the 99th percentile but a slightly higher median latency), the number of failures considerably decreased, and the business metrics (i.e., adoption and engagement) also had considerable improvements. On the other hand, in the second experiment, our control group (T5QL) and the group of users with the new feature (T5QL + FP16) don't show statistically significant differences (given the sample size collected and the differences observed) in the business metrics.
We can observe that the 99th percentile of latencies decreased considerably in half-precision, but that didn't translate into the expected improvement in the business metrics.

The roll-out plan until putting the feature into General Availability (GA) was done in three different steps. First, we opened the feature to 40% of the users; 15 days later it was opened to 70%; and 15 days after that we released it to all OutSystems users.

Table 2: Deployed models performance. The baseline is RATSQL+GAP (Wang et al., 2019), T5QL is the model proposed in (Arcadinho et al., 2022), and CD+FP16 is the same as (Arcadinho et al., 2022) but with added business rules and lowered precision.

                   Max input size   Response Time (s)   TED
    Baseline       1 024            2.12                6.93
    T5QL           20 000           10.37               0.89
    T5QL + FP16    26 000           2.38                1.06
Table 3: Relative results of A/B testing compared with the baseline.

                                  Latency
                                  Median   p99      Weekly Adoption   Engagement     Failures
    Experiment 1   Baseline       1.77s    5.31s    9.6%              1.7%           35%
                   T5QL           1.45s    5.90s    17.0% (1.8x)      3.7% (2.2x)    6% (0.2x)
    Experiment 2   T5QL           0.87s    7.91s    23.5%             3.6%           4.2%
                   T5QL + FP16    0.92s    5.45s    22.8% (0.97x)     3.7% (1.03x)   3.4% (0.81x)
Ron Kohavi and Roger Longbotham. 2017. Online controlled experiments and a/b testing. Encyclopedia of Machine Learning and Data Mining, 7(8):922–929.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2227–2236.

Tao Yu, Rui Zhang, He Yang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, et al. 2019. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. arXiv preprint arXiv:1909.05378.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.

Xueying Zhan, Qingzhong Wang, Kuan-Hao Huang, Haoyi Xiong, Dejing Dou, and Antoni B. Chan. 2022. A comparative survey of deep active learning. ArXiv, abs/2203.13450.

Qian Zhao, F. Maxwell Harper, Gediminas Adomavicius, and Joseph A. Konstan. 2018. Explicit or implicit feedback? Engagement or satisfaction? A field experiment on machine-learning-based recommender systems. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC '18, pages 1331–1340, New York, NY, USA. Association for Computing Machinery.

Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-SQL with distilled test suites. arXiv preprint arXiv:2010.02840.
A Appendix

A.1 OutSystems Aggregate

In an aggregate, the developer can select entities (2) and apply a set of filters (3); the numbers refer to the annotations in Figure S1. The result of the aggregate, like the result of a SQL query, is later shown to end-users on the screen (e.g., as a list or a table). However, low-code developers are typically not SQL experts, and thus creating aggregates can be challenging and time-consuming.
A.2 MTurk

The MTurk task is represented in Figure S2. After the experiments were done, our first approach to automatic validation used CrowdTruth (Inel et al., 2014). The goal of this library is to process the results from MTurk and measure the inter-annotation agreement quality between the workers and the units (in our case, the units are the SQL and NL pairs). The framework uses the inter-dependency between the three main components of a crowdsourcing system: worker, input data, and annotation. However, the metrics calculated by CrowdTruth did not show a correlation between the worker and the quality of the resulting NL. Thus, we use the features provided by MTurk and do feature engineering on the labeled pairs. The resulting features are the following (a sketch of how some of them can be computed follows the list):

• Work time - time spent by the worker (in seconds) writing the NL for the input SQL;

• Has more than two words - check if the NL has more than two words;

• Is terminal on NL - inspect if all the terminals are present in the NL;

• SQL Complexity - measure the difficulty of the SQL based on the number of SQL components in the query (Yu et al., 2018). E.g., the query SELECT COUNT(*) FROM cars_data WHERE cylinders > 4 is classified with a hardness of Easy;

• Time per Complexity - dividing the time spent on a query by its hardness;

• SQL question Levenshtein distance - calculate the Levenshtein distance between the NL and the SQL query;

• Order by direction - measure if the NL has the most frequent words used to describe the directions (ASC and DESC) of the SQL queries. A dictionary with the most frequent terms was built;

• Limit Value - similar to the previous variable. In this one, we construct a dictionary of words describing the SQL query's limit. It is a boolean that determines whether the NL presents the words that state the limit of the SQL.
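The snippet below is a minimal sketch of computing a few of these features for one labeled pair; it is illustrative and does not reproduce the exact production feature definitions.

    def levenshtein(a: str, b: str) -> int:
        """Plain dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def mturk_features(nl: str, sql: str, work_time_s: float, hardness: int) -> dict:
        return {
            "work_time": work_time_s,
            "has_more_than_two_words": len(nl.split()) > 2,
            "sql_complexity": hardness,                  # e.g., 1 = Easy ... 4 = Extra Hard
            "time_per_complexity": work_time_s / hardness,
            "nl_sql_levenshtein": levenshtein(nl.lower(), sql.lower()),
        }

    print(mturk_features("count cars with more than 4 cylinders",
                         "SELECT COUNT(*) FROM cars_data WHERE cylinders > 4",
                         work_time_s=35.0, hardness=1))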
Figure S1: OutSystems IDE with the result of a natural language query. We can see the overall experience of creating an aggregate using NL and how one can correct or change the predicted result. The normal usage of this feature starts by writing a NL query in (1), then verifying that all the selected entities are correct (2), and checking possible filters (3).
Figure S2: This image represents what the user saw when labeling a SQL query. In total, we did 12 different
experiments to perfect the MTurk task.