Scale Guide to Model Customization
A Deep Dive into Fine-Tuning and Retrieval Augmented Generation
Table of Contents
Data Analysis Use Case: Fine-Tuning GPT-4 for Text2SQL
Retrieval Augmented Generation Deep Dive
About Scale
Introduction
With Scale GenAI Platform, customers use their proprietary data to customize GenAI
applications. Our customers are building use cases like:
• Data analysis and business intelligence applications using fine-tuned LLMs for
Text2SQL to make analysts more efficient and embed a culture of data-driven
decision-making.
To build accurate and effective applications, these use cases require high-quality proprietary data as well as expert fine-tuning and retrieval augmented generation (RAG). In this guide, we will go in-depth into what it takes to implement fine-tuning and RAG to give you a better understanding of how to use these tools for your own Generative AI use cases.
Part 1: Fine-Tuning GPT-4 for Text2SQL

Data Analysis Use Case: Fine-Tuning GPT-4 for Text2SQL
Our machine learning team at Scale has recently fine-tuned GPT-4 to achieve state-of-the-art performance (84% accuracy) for generalized text-to-SQL translation on one of the most popular benchmark datasets, the SpiderDev set. In this section, we will discuss why Text2SQL is an important use case, why it is hard in practice, where fine-tuning can help, how we implemented a real-world solution, and finally, what our results were.
Because most business users cannot write SQL themselves, a massive bottleneck forms around the data analysts who build these tools, which then proliferate and become hard to navigate. Multi-billion dollar industries have emerged to provide generalist or highly specialized analytics tools to bridge this gap.
Also, looking at the leaderboard of benchmark datasets like the famous SpiderDev set makes it appear that the problem is pretty much solved.
However, in many cases, real-world databases will have hundreds of columns with custom names. The schema might not fit into the context window of the prompt, and even if it does, the model still does not understand the meaning of the column names and how they relate to each other. For example, does a "date" column in a vehicle sales database record the time the sale was recorded or the time the sale took place? A very robust understanding of typical business terms for the given databases and the column contents is essential, especially to correctly apply aggregation and window functions. The relationships between multiple table schemas are also difficult to convey in the prompt, but this is required for slightly more complex operations like JOINs.
→ Example: Ensure an education model will not provide actual answers to students, but only give hints.

MAXIMIZE ROI
→ Example: A fine-tuned GPT-3.5 can easily outperform GPT-4 at the natural language to SQL task.
Fine-tuning is an effective way to improve the specificity of a certain skill that the model is capable of performing but has not yet mastered. It can be used to teach a model highly specific terms and instructions and improve its capabilities. A good way to figure out if fine-tuning is going to work is by experimenting with prompt engineering. As a rule of thumb, if prompt engineering shows promising results, then fine-tuning will likely be effective for the given task.

Conversely, fine-tuning is not a good way to add new data or knowledge, such as the database schema or even detailed explanations of columns and their relationships, to the model. Instead, this type of context information is best infused into a model using Retrieval Augmented Generation, or RAG (see Part 2 of this guide for a deep dive on RAG). Hence, a real-world solution will likely have to include both fine-tuning and RAG to achieve acceptable results.
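To make the fine-tuning side concrete, here is a minimal sketch of what a single Text2SQL training example could look like in a chat-style JSONL format commonly used for instruction fine-tuning. The table schema, question, SQL query, and file name below are made up for illustration and are not taken from our training data.

import json

# One illustrative training record: a system instruction, a user question with
# schema context, and the target SQL as the assistant turn.
example = {
    "messages": [
        {
            "role": "system",
            "content": "You are a Text2SQL assistant. Answer with a single SQL query.",
        },
        {
            "role": "user",
            "content": (
                "Schema: vehicle_sales(sale_id, sale_date, recorded_at, dealer_id, price)\n"
                "Question: What was the total sales value in March 2023?"
            ),
        },
        {
            "role": "assistant",
            "content": (
                "SELECT SUM(price) FROM vehicle_sales "
                "WHERE sale_date >= '2023-03-01' AND sale_date < '2023-04-01';"
            ),
        },
    ]
}

# Fine-tuning datasets are typically stored as one JSON object per line (JSONL).
with open("text2sql_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")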
The Spider dataset is the standard benchmark for comparing natural language to SQL models and methods. However, real-world enterprise SQL databases differ from Spider in both size and complexity. Whereas 90% of the databases in Spider's Train and Dev datasets contain fewer than 50 columns, enterprise databases contain up to and beyond 1,000 unique columns. This discrepancy renders the common approach of providing the entire database schema in the prompt infeasible for real-world use cases, given token limit constraints and the "lost in the middle" problem.

As many SQL queries require only a fraction of all columns, we solve the above dilemma with a fine-tuned retrieval system, which retrieves those database features relevant to a user's question. Given a customer's database schema, we can fine-tune an embedding model to learn the unique ways customers refer to their database. Once the embedding model is deployed into the backend of our Enterprise Generative AI Platform (EGP), we can easily create, populate, and query the retrieval Knowledge Base:
# "client" is an authenticated EGP SDK client; "knowledge_base", "data_source",
# and "chunking_strategy_config" are assumed to have been created beforehand.
upload = client.knowledge_bases().uploads().create_remote_upload(
    knowledge_base=knowledge_base,
    data_source_config=data_source,
    chunking_strategy_config=chunking_strategy_config,
)
Comparing the base and fine-tuned model outputs, we can see that the fine-tuned model not only interprets the terms in the natural language query correctly but also applies a better and more efficient query structure, using a subquery instead of a left join.
What’s next?
Our solution is not quite at the top of the SpiderDev leaderboard, as most of the submitted architectures rely on prompt engineering and data pre-processing that is extremely tailored to this benchmark. However, our model does achieve top-5 performance and, crucially, demonstrates comparable accuracy even when deployed in much more complex, real-world contexts and databases.

If you're interested in using our Text2SQL model for your business use case or want to learn more about our solutions in fine-tuning and RAG, book a demo with Scale here.
Part 2: Retrieval Augmented Generation Deep Dive
Retrieval Augmented Generation (RAG) has captured the attention of the Generative AI community because it significantly extends the available use cases and applications of Generative AI models. RAG means retrieving external information and then injecting it into the prompt of an LLM call so that the model can generate outputs with specific context from unique data sources. In this section, we will discuss why RAG is important, how it works, and how to improve its usefulness for enterprise use cases.
In practice, it is not possible to continuously retrain the model on all the latest data before calling it. We want the model to "apply" its knowledge to new data. A good analogy is teaching a law school student how to assess whether a transaction document contains a tax deed. There is specific tax law course content that the student learns during the curriculum. However, after graduating, the student will encounter entirely new cases in practice. The student needs to apply her learned skills to assess tax deeds. But if she wants to know how many of the law firm's transaction documents had a tax deed, she needs access to all the relevant documents and information for that specific question and has to go through them. The course content alone will not suffice, as it would be impossible to include all the cases in the university course.

The easiest solution to this problem would be to include all the required new information in the prompt when calling the model. In our example, this would mean taking the text of all the transaction documents, feeding it into the model, and asking, "How many of these mention a tax deed?". However, there is a problem: the model has a limited "context window." This means it can only take a maximum number of input tokens in its prompt due to computational constraints. While the latest models like GPT-4 can take up to 32,000 tokens (about 50 pages of text), this is still not enough to answer questions about large volumes of documents or data, not even for the sample question above.

The solution involves introducing an additional step before calling the LLM: retrieving the most relevant information from a so-called "knowledge base" and then adding this retrieved data to the prompt.

RAG, in contrast to fine-tuning, inserts data into the prompt (context window) of the model during inference. The steps involved include chunking, embedding, vector stores, reranking, and more, all of which we will dive deeper into now.
Pre-Processing
The first step in RAG is to transform the raw data that should be used for retrieval into a format that can be effectively searched based on the user prompt. For example, the relevant database might be a collection of 10,000 legal transaction documents. The output of the pre-processing step is often called a "knowledge base": a combination of a vector store with searchable embeddings and a set of metadata associated with these embeddings.

After reading in the raw text (which for PDF documents can require additional OCR or language models), the first step is to split or "chunk" the raw text into smaller pieces. The most basic approach is to split by a fixed number of tokens. The ideal chunk size is determined by the individual use case and input data and should be iteratively optimized based on the system accuracy (see below).
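As an illustration, here is a minimal chunking sketch that splits raw text into fixed-size, slightly overlapping token windows. The tokenizer choice, chunk sizes, and file path are assumptions for demonstration; in practice they are tuned per use case as described above.

import tiktoken  # assumed tokenizer library; any tokenizer works

def chunk_text(text: str, chunk_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size token windows with a small overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
    return chunks

# Example usage on a single document's raw text (path is a placeholder).
raw_text = open("transaction_document_001.txt").read()
chunks = chunk_text(raw_text)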
An important but often overlooked next step is to extract metadata from the chunks and overall input documents. This could involve steps like generating summaries of each chunk, extracting references, headings, and document metadata (like timestamps), and more. Which metadata to extract depends on the specific use case.

The next step is to generate embeddings for the chunks and store them in a vector store. The embedding model and vector store used vary widely and depend on the relevant business requirements. For example, some embedding models or vector stores can be hosted locally, while others are hosted externally. The embedding model can also be fine-tuned for specific types of text, leading to higher overall retrieval performance.
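To make this step concrete, here is one possible sketch, assuming the open-source sentence-transformers library for the embedding model and FAISS as a locally hosted vector store; both library choices and the example chunks are illustrative assumptions, not the components used by Scale's platform.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative chunks; in practice these come from the chunking step above.
chunks = [
    "The parties agree that a tax deed shall be executed at closing.",
    "Section 4.2 sets out the closing conditions for the transaction.",
]

# Embed the chunks with an off-the-shelf (or fine-tuned) embedding model.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
embeddings = np.asarray(
    embedder.encode(chunks, normalize_embeddings=True), dtype="float32"
)

# Store the embeddings in a simple in-memory vector index. With normalized
# vectors, inner-product search is equivalent to cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Keep metadata alongside the index, keyed by the row ids FAISS returns.
metadata = {
    i: {"source": "transaction_document_001.txt", "chunk_index": i}
    for i in range(len(chunks))
}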
Lastly, we store the metadata in a relevant metadata store. This can either be kept directly with the chunks in the vector store (if it supports metadata storage) or in a separate relational database.
When accessing data from external applications and sources, the pre-processing
step is often automated with so-called data connectors. A data connector would
first authenticate to a source of data like Google Drive and then asynchronously
execute the pre-processing steps on a regular basis to ensure the knowledge base
stays in sync with the data stored in the external resource.
The following is an example of how to create a knowledge base using Scale’s EGP
APIs and link it to a Google Drive data connector:
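A minimal sketch of what this can look like follows; the create call, its parameters, and the data-source and chunking fields are hypothetical placeholders written in the style of the upload snippet shown earlier, not the literal EGP SDK API.

# Sketch only: the "create" call, its parameters, and the data source fields
# below are illustrative assumptions, not the exact EGP SDK surface.
knowledge_base = client.knowledge_bases().create(
    name="legal-transaction-documents",          # hypothetical name
    embedding_model="fine-tuned-embeddings-v1",  # hypothetical model id
)

# Link a Google Drive data connector so the knowledge base stays in sync.
data_source = {
    "type": "google_drive",              # hypothetical field names
    "drive_folder_id": "YOUR_FOLDER_ID",  # placeholder
}
chunking_strategy_config = {"strategy": "token", "chunk_size": 512, "overlap": 64}

upload = client.knowledge_bases().uploads().create_remote_upload(
    knowledge_base=knowledge_base,
    data_source_config=data_source,
    chunking_strategy_config=chunking_strategy_config,
)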
Retrieval
The core retrieval process has three elements: translating the user query into embeddings (question encoding), similarity search, and reranking.
To search the newly stored embeddings in the vector store, we need to translate
the question the user has asked in natural language (e.g., “How many litigations
did we have in March?”) into embedding space as well. This is usually done using
either the same model used to embed the chunks during the pre-processing step,
or another model trained alongside that model.
Next, we can query the vector store by running a similarity search between the
question embeddings and all the embeddings stored in the vector store. The
result is a list of chunk ids that are ordered by similarity score from most to least
relevant. While a k nearest neighbor (kNN) search is the most common technique,
some vector stores offer more advanced search options for higher relevance
results.
We also have to specify how many results (chunks) we want to get from the vector store. A higher number of results makes it more likely that the correct piece of information is contained in the results, but the LLM context window is limited, which puts a constraint on the number of chunks we can return. Additionally, dealing with a very large number of retrieved chunks can increase the time the full generation pipeline takes due to various downstream steps such as reranking, filtering, and more. Furthermore, more results make it harder to rank the chunks so that the most relevant ones end up at the top.
Often, we would like to use a stronger reranking model than kNN vector search, such as a cross-encoder, but it would be too expensive to run on the thousands or millions of chunks present in the vector store. In this case, we employ a two-stage reranking process: first, we narrow down the chunks to the top K (e.g., K=500) using the vector store retrieval, and then we rerank only the retrieved chunks with the second-stage reranking model. This allows us to search the whole vector store while applying the more expensive, higher-quality reranker only to a small candidate set.
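Continuing the illustrative sentence-transformers/FAISS setup from the pre-processing sketch above (an assumed stack, not a statement about Scale's implementation), a two-stage retrieval could look roughly like this:

import numpy as np
from sentence_transformers import CrossEncoder

# "index", "chunks", and "embedder" are the FAISS index, chunk list, and
# embedding model from the pre-processing sketch above.
query = "How many litigations did we have in March?"
q_emb = np.asarray(embedder.encode([query], normalize_embeddings=True), dtype="float32")

# Stage 1: narrow down to the top-K candidates via kNN search in the vector store.
scores, ids = index.search(q_emb, 500)
candidates = [chunks[i] for i in ids[0] if i != -1]

# Stage 2: rerank only the retrieved candidates with a cross-encoder, which
# scores (query, chunk) pairs jointly and would be too slow to run on the
# whole vector store.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative
pair_scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(pair_scores, candidates), reverse=True)]
top_chunks = reranked[:10]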
In our experience, the reranking step is a crucial element in the retrieval process that ensures a high-quality LLM response. Hence, more and more companies are giving special attention to reranking, with techniques like maximum marginal relevance (MMR) gaining traction. Cohere recently even published a model specifically for reranking. At Scale, we have also observed that fine-tuning the reranking model on the specific dataset used for retrieval can dramatically improve results.
In many RAG solutions, the entire retrieval step is abstracted in a single API call to
an existing knowledge base, with lower-level APIs available to further customize
the individual sub-steps and configurations.
import requests

url = f"https://api.egp.scale.com/v2/knowledge-bases/{knowledge_base_id}/query"
payload = {
    "include_embeddings": True,
    "query": "Among all our transaction documents this month, how many mention a tax deed?",
    "top_k": 10,
}
# "headers" carries the account's EGP API credentials.
response = requests.post(url, json=payload, headers=headers)
Reasoning
The last step is to fetch the relevant chunks based on the IDs returned by the vector store and compose the final prompt for the language model from the initial user question and the retrieved content. One of the challenges here is to ensure that the language model answers the user question only, or mostly, based on the information that was found during retrieval. This includes making sure that the model answers "I don't know" if the relevant information is not found, instead of making up a response. For example, when asking a question about a specific litigation case whose document is not in the knowledge base, we want to ensure that the LLM answers "I don't have information about this case" instead of making up a response based on similar documents. One way to achieve this is careful prompt engineering, outlining strict rules for the LLM when answering questions based on retrieved context. However, this is often not enough, especially for smaller open-source models that have been trained on significantly less data compared to models like GPT-4.
For example, at Scale we have fine-tuned an MPT-7B model for RAG using this approach.
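Whatever model is used, the prompt-composition step described above can be sketched roughly as follows; the template wording and the "I don't know" instruction are illustrative assumptions, and "top_chunks" refers to the reranked chunks from the retrieval sketch earlier.

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Compose the final LLM prompt from retrieved chunks and the user question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the context below. If the answer is not "
        "contained in the context, reply exactly with: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# "top_chunks" comes from the two-stage retrieval sketch above.
prompt = build_rag_prompt(
    "Among all our transaction documents this month, how many mention a tax deed?",
    top_chunks,
)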
Behind the scenes, between prompt and response, we can follow all the previously discussed steps: ingestion, embedding, metadata storage, retrieval, reranking, and LLM reasoning, all working in concert to generate the final answer.
Evaluation

We want to look at two techniques Scale is using for internal and external RAG benchmarking: the Single Word Benchmark (SWB) and the Span Evaluation Benchmark (SEB).
With such end-to-end benchmarks, however, it can be hard to tell whether the retrieval step or the LLM is at fault for wrong answers. Hence, it is often important to pair such end-to-end evaluation with a retrieval-only evaluation benchmark.
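As an illustration of what a retrieval-only benchmark can measure, the following sketch computes recall@k over a small set of labeled question-to-chunk pairs; the metric choice, data format, and toy numbers are assumptions for demonstration.

def recall_at_k(labels: list[set[int]], retrieved: list[list[int]], k: int = 10) -> float:
    """Fraction of questions for which at least one relevant chunk is in the top k."""
    hits = sum(1 for gold, ids in zip(labels, retrieved) if gold & set(ids[:k]))
    return hits / len(labels)

# Toy example with three questions: each label set holds the ids of the chunks
# that actually contain the answer; each retrieved list is the retriever output.
labels = [{3}, {17, 42}, {8}]
retrieved = [[3, 9, 12], [5, 42, 7], [1, 2, 4]]
print(recall_at_k(labels, retrieved, k=3))  # 2 of 3 questions hit -> 0.67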
• Vary the size of chunks based on the document type and content.

RAG Fine-Tuning
• Fine-tuning for RAG matters most when using smaller models and when the context is highly domain-specific.
What’s next?
Retrieval Augmented Generation is a fast-moving field, and this overview does not fully capture what is possible with RAG. At Scale, we are helping leading enterprises customize large language models with fine-tuning and RAG using the Scale GenAI Platform, unlocking their most important use cases. We are building highly effective RAG systems leveraging our comprehensive set of EGP APIs, some of which were shown in this guide. To learn more about using these APIs for your project, book a demo here.
About Scale
Transform your data into customized enterprise-ready Generative AI Applications.
Accelerate and scale your organization’s Generative AI journey with the full-stack
platform to build, test, and deploy enterprise-ready applications, customized with
your own data.