
Guide to LLM Customization
A Deep Dive into Fine-Tuning and Retrieval Augmented Generation


Table of Contents
Data Analysis Use Case: Fine-Tuning GPT-4 for Text2SQL
Custom Copilots Use Case: Retrieval Augmented Generation Deep Dive
About Scale

Introduction
With Scale GenAI Platform, customers use their proprietary data to customize GenAI applications. Our customers are building use cases like:

• Highly customized wealth management copilots that make advisors more effective by helping them tap into their knowledge bases quickly and accurately.

• Data analysis and business intelligence applications using fine-tuned LLMs for Text2SQL to make analysts more efficient and embed a culture of data-driven decision-making.

To build accurate and effective applications, these use cases require high-quality proprietary data, expert fine-tuning, and retrieval augmented generation (RAG). In this guide, we will go in-depth into what it takes to implement fine-tuning and RAG so you have a better understanding of how to use these tools for your own Generative AI use cases.


Part 1: Fine-Tuning GPT-4 for Text2SQL

Data Analysis Use Case: Fine-Tuning GPT-4 for Text2SQL
Our machine learning team at Scale has recently fine-tuned GPT-4 to achieve state-of-the-art performance (84% accuracy) for generalized text-to-SQL translation on one of the most popular benchmark datasets, the Spider Dev set. In this section, we will discuss why Text2SQL is an important use case, why it is hard in practice, where fine-tuning can help, how we implemented a real-world solution, and finally, what our results were.

Why is Text2SQL important?


Most business decisions today are data-driven decisions. This means that organizations collect, aggregate, and interpret large amounts of available information about their business or the market environment with a set of tools and processes that are often summarized as business intelligence or BI. However, obtaining the relevant pieces of information from the vast amounts of available data typically requires analytical expertise (SQL or similar) and knowledge of the relevant databases, dashboards, or related tools. This often creates a massive bottleneck and reliance on data analysts to build these tools, which then proliferate and become hard to navigate. Multi-billion dollar industries have emerged to provide generalist or highly specialized analytics tools to bridge this gap.

The advent of large language models is poised to change this paradigm with the ability to generate SQL queries directly from natural language questions such as "How many vehicles did we sell last year?". Building generative models that can robustly generate SQL queries for any given set of databases hence has the potential to disrupt an entire industry and truly democratize access to structured data at large.


Why are LLMs still bad at SQL in the real world?

Running some basic tests with models like OpenAI's ChatGPT yields very promising results, and the leaderboards of benchmark datasets like the well-known Spider Dev set make it appear that the problem is pretty much solved.


However, despite the impressive code generation capabilities of state-of-the-art language models like GPT-4, they are not immediately useful for generating queries that run on custom, real-world databases. First of all, the LLMs do not know the schema of the databases in question out of the box. The most obvious solution is to provide the schema to the model in addition to the prompt.
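
For illustration, a minimal schema-in-the-prompt request might look like the following sketch (the table definition is invented for the example; only the question comes from the scenario above):

# Illustrative only: a toy vehicle-sales schema passed to the model alongside the question.
schema = """
CREATE TABLE vehicle_sales (
    sale_id INTEGER PRIMARY KEY,
    model TEXT,
    sale_date DATE,
    price REAL
);
"""

question = "How many vehicles did we sell last year?"

prompt = (
    "Given the following database schema:\n"
    f"{schema}\n"
    "Write a SQL query that answers this question:\n"
    f"{question}"
)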

However, in many cases, real-world databases will have hundreds of columns with custom names. The schema might not fit into the context window of the prompt, and even if it does, the model still does not understand the meaning of the column names and how they relate to each other. For example, does a "date" column in a vehicle sales database record the time the sale was recorded or the time the sale took place? A very robust understanding of typical business terms for the given databases and the column contents is essential, especially to correctly apply aggregation and window functions. The relationships between multiple table schemas are also difficult to convey in the prompt, but this is required for slightly more complex operations like JOINs.


How can fine-tuning and retrieval help to resolve these challenges?

At Scale, we are working with enterprise customers across many industries to build customized Generative AI solutions for their respective use cases. In most of these applications, we fine-tune an underlying base model to solve the relevant business problems at the required accuracy level. Fine-tuning not only can improve the performance of a model for a given task, but can also drive model safety and alignment, ensuring a certain tone and behavior. It is also a good way to improve ROI, as it can be used to teach smaller (and cheaper) models a very specific skill and eventually even outperform much bigger, generalized models at this task.

INCREASED PERFORMANCE
→ Teach the model highly domain-specific knowledge or skills
→ Example: Fine-tune a model on a specific subdomain of patent law using previous company data

IMPROVED CONFIDENCE
→ Align the behavior of the model with specific objectives
→ Example: Ensure an education model will not provide actual answers to students, just give hints

MAXIMIZE ROI
→ Save costs, tokens, and latency by fine-tuning a smaller model
→ Example: A fine-tuned GPT-3.5 can easily outperform GPT-4 at natural language to SQL tasks


Fine-tuning is an effective way to improve the specificity of a certain skill that the
model is capable of performing but has not yet mastered. It can be used to teach a
model highly specific terms and instructions and improve its capabilities. A good
way to figure out if fine-tuning is going to work is by experimenting with prompt
engineering. As a rule of thumb, if prompt engineering shows promising results,
then fine-tuning will likely be effective for the given task.

GOOD FOR
✓ Emphasizing knowledge that already exists in the model
✓ Customizing the structure or tone of responses
✓ Teaching a model very complex instructions

NOT GOOD FOR
✕ Adding new knowledge to the base model
✕ Quickly iterating on a new use case

Conversely, fine-tuning is not a good way to add new data or knowledge, such as the database schema or even detailed explanations of columns and their relationships, to the model. Instead, this type of context information is best infused into a model using Retrieval Augmented Generation, or RAG (see Part 2 of this guide for a deep dive on RAG). Hence, a real-world solution will likely have to include both fine-tuning and RAG to achieve acceptable results.

How did we implement our solution?


Our solution reflects a system intended for enterprise customers. Accordingly,
we benchmarked multiple techniques that could be used for real-world use cases
against a baseline:

• Full database schema with an off-the-shelf model (baseline)
• Schema RAG with in-context learning (ICL)
• Fine-tuned model combined with schema RAG and ICL

We’ll now walk through each of these in more detail.


Database Schema Retrieval

The Spider dataset is the standard benchmark for comparing natural language to
SQL models and methods. However, real-world enterprise SQL databases differ from
Spider in both size and complexity. Whereas 90% of the databases in Spider’s Train
and Dev datasets contain fewer than 50 columns, enterprise databases contain up to
and beyond 1000 unique columns. This discrepancy renders the common approach of
providing the entire database schema in the prompt infeasible for real-world use cases,
given token limit constraints and the “lost in the middle problem.”

As many SQL queries require only a fraction of all columns, we solve the above dilemma with a fine-tuned retrieval system, which retrieves the database features relevant to a user's question. Given a customer's database schema, we can fine-tune an embedding model to learn the unique ways customers refer to their database. Once the embedding model is deployed into the backend of our Enterprise Generative AI Platform (EGP), we can easily create, populate, and query the retrieval knowledge base.


from scale_egp.sdk.client import EGPClient
from scale_egp.sdk.models import S3DataSourceConfig, CharacterChunkingStrategyConfig

# Instantiate client
client = EGPClient()

# This is pseudocode: create a fine-tuned embedding model from stored weights
embedding_model = client.models().create(
    model_type="embedding",
    model_template="sentence_transformers_embedding_model",
    model_template_config={
        "weights_uri": "s3://model_weights/presigned_url"
    },
)

# Create a knowledge base backed by the embedding model
knowledge_base = client.knowledge_bases().create(
    name="my-knowledge-base",
    embedding_model_id=embedding_model.id,
)

# Configure the knowledge base and upload the schema data
data_source = S3DataSourceConfig(
    s3_bucket="my-bucket",
    s3_prefix="my-schema",
)
chunking_strategy_config = CharacterChunkingStrategyConfig()

upload = client.knowledge_bases().uploads().create_remote_upload(
    knowledge_base=knowledge_base,
    data_source_config=data_source,
    chunking_strategy_config=chunking_strategy_config,
)

# Query the schema for a given question
query = "What was last month's total expense for service provider X?"
retrieved_schema = client.knowledge_bases().query(
    knowledge_base=knowledge_base,
    query=query,
    top_k=20,
)

In-Context Learning

With the schema retrieval providing a lower token count, we can supply more relevant context to address misalignment between business terms and the database schema. On initial deployment, we collect terms and information relevant to SQL logic. For example, "the term MPG refers to miles per gallon", or "stock means filter on asset_type='equity'". This initial retrieval mechanism is the primer in the engine of our data flywheel. As users then interact with the tool, we collect real-world samples that can be retrieved for in-context learning. With this additional corpus, we provide queries from similar questions in the context window provided to the LLM.


Fine-Tuning

Tying together prompt engineering and the above retrieval mechanisms, we optimize the density of information available to the out-of-the-box model. To get the final accuracy boost, we turn to fine-tuning. For customers, this means collecting real user data and fine-tuning an LLM to learn the nuances of their data and terminology. The fine-tuning not only fills in gaps in the retrieval data but also hones in on trends or relationships unknown to users. Once the model is complete, each step is seamlessly integrated with the EGP SDK:

from typing import List

# Create a custom LLM
LLM_MODEL = client.models().create(
    model_type="llm",
    model_template="llm_engine_model",
    model_template_config={
        "weights_uri": "s3://model_weights/presigned_url"
    },
)


class Text2SQLApplication:

    name = "Text2SQL"
    description = "Natural language to SQL queries for My Company"
    llm_model = LLM_MODEL.id

    def __init__(self, schema_knowledge_base_id: str, icl_knowledge_base_id: str):
        self.schema_kb_id = schema_knowledge_base_id
        self.icl_kb_id = icl_knowledge_base_id

    def create_prompt(self, question: str, schema_chunks: List[str], icl_chunks: List[str]) -> str:
        # Prompt construction omitted in the original example
        ...
        return rag_prompt

    def generate(self, question: str, schema_k: int, icl_k: int) -> str:
        # Retrieve relevant schema information
        schema_chunks = client.knowledge_bases().query(
            knowledge_base=self.schema_kb_id,
            query=question,
            top_k=schema_k,
        )
        # Retrieve in-context-learning samples
        icl_chunks = client.knowledge_bases().query(
            knowledge_base=self.icl_kb_id,
            query=question,
            top_k=icl_k,
        )
        # Generate the prompt from the retrieved context
        rag_prompt = self.create_prompt(question, schema_chunks, icl_chunks)
        # Generate SQL
        generate_response = client.completions().create(
            model=self.llm_model,
            prompt=rag_prompt,
        )
        return generate_response.completion.text

To prove out the viability of this system, we leveraged OpenAI's fine-tuning API to train GPT-3.5 and GPT-4. For each question-query pair in the Spider training set, we used the above methodology to create prompts with at most 20 features and five relevant question/SQL-query pairs. For each generated prompt, we simply set the target to the respective SQL query. We used the same approach to generate a validation set of question-query pairs from the Spider dev set. After packaging the train and validation sets into files, uploading the data, fine-tuning the models, and generating validation predictions were handled entirely by OpenAI's robust APIs.
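
As a rough illustration of this packaging step (the file name, system prompt, and example pair below are placeholders rather than the exact data we used), the prompt-target pairs can be written in OpenAI's chat fine-tuning format and submitted through the OpenAI Python client:

import json
from openai import OpenAI

# Placeholder pairs of (RAG prompt, gold SQL query); in practice these are built
# from the Spider training set using the retrieval pipeline described above.
train_pairs = [
    ("Schema: ...\nExamples: ...\nQuestion: How many singers do we have?",
     "SELECT count(*) FROM singer"),
]

def to_chat_record(rag_prompt: str, gold_sql: str) -> dict:
    # One fine-tuning example: the RAG prompt as user input, the gold SQL as the target.
    return {
        "messages": [
            {"role": "system", "content": "Translate the question into a SQL query using the provided schema and examples."},
            {"role": "user", "content": rag_prompt},
            {"role": "assistant", "content": gold_sql},
        ]
    }

with open("train.jsonl", "w") as f:
    for rag_prompt, gold_sql in train_pairs:
        f.write(json.dumps(to_chat_record(rag_prompt, gold_sql)) + "\n")

openai_client = OpenAI()
train_file = openai_client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = openai_client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",  # GPT-4 fine-tuning requires additional access from OpenAI
)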


Validating Against Spider

We validate the system and benchmark performance using the Spider dataset. For schema retrieval, we fine-tuned a Sentence Transformer to match questions with their relevant database columns and achieved 97% recall@20. For in-context learning, we leverage an out-of-the-box Sentence Transformer. For both GPT-3.5 and GPT-4, we measure the execution accuracy of generated SQL queries for a baseline (prompt with the entire database schema), RAG (schema retrieval and in-context learning), and finally the respective model fine-tuned on the RAG prompts. We observe performance improvements at each stage, resulting in a best execution accuracy of 83.6% on the Spider Dev set. Thus, we not only achieve state-of-the-art-level performance but also have a system optimized to provide enterprise customers with the best commercially available natural-language-to-SQL capability.
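
As a rough sketch of what an execution-accuracy check looks like (simplified relative to the official Spider evaluation script, which also handles value normalization and ordering), a predicted query counts as correct if it returns the same result set as the gold query when run against the benchmark's SQLite database:

import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    # A prediction is correct if it returns the same (order-insensitive) rows as the gold query.
    conn = sqlite3.connect(db_path)
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # queries that fail to execute count as incorrect
    finally:
        conn.close()
    # Compare string representations so mixed-type rows sort without errors.
    return sorted(map(repr, predicted_rows)) == sorted(map(repr, gold_rows))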

What are the results?


Baseline: Entire Schema in the Prompt

With the Spider validation data, we can calculate a baseline execution accuracy. For each question-query pair, we pack the entire schema of the respective SQL database into the prompt and ask GPT-4 to answer the question given that context. From this simple baseline, GPT-4 achieves an execution accuracy of 70% (a D+).


Adding Schema RAG and In-Context Learning (ICL)

Using a structured way to find the relevant parts of the DB schema with RAG shows consistent improvements in performance across models. For GPT-3.5, we see an improvement of 6 ppts, from 60% to 66%, and for GPT-4 a slightly smaller bump from 70% to 73%.

Adding Schema RAG, ICL, and fine-tuning


When additionally fine-tuning the model with specific prompt-response pairs, we
see consistent further performance improvements both for GPT-3.5 and GPT-4.
The final, fine-tuned GPT-4 with schema RAG and ICL achieves 84% accuracy on
Spider, up from the 70% in the baseline version, which marks an impressive 14
ppts improvement. For GPT-3.5 the increase is even more pronounced, reaching
82% (almost as good as GPT-4) with RAG and fine-tuning, which is up 22 ppts
from the baseline of using only prompt
engineering. For GPT-3.5, the biggest
increase is from fine-tuning itself,
pushing performance from 66% to 82%
with this technique alone.

A comparison of the performance across the three different approaches for both GPT-3.5 and GPT-4 is summarized below.
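
In summary, the execution accuracies reported in this section are:

Approach                              GPT-3.5    GPT-4
Baseline (entire schema in prompt)      60%       70%
+ Schema RAG and ICL                    66%       73%
+ Fine-tuning on RAG prompts            82%       84%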

Let's look at a practical query example to show the difference between using GPT-4 out of the box versus the RAG and fine-tuned version. In this example, the fine-tuned model not only interprets the terms of the natural language query correctly but also applies a better and more efficient query structure, using a subquery instead of a left join.

What's next?

Our solution is not quite on top of the Spider Dev leaderboard, as most of the submitted architectures rely on prompt engineering and data pre-processing that is extremely tailored to this benchmark. However, our model does achieve top-5 performance and, crucially, demonstrates comparable accuracy even when deployed in much more complex, real-world contexts and databases.

If you're interested in using our Text2SQL model for your business use case or want to learn more about our solutions in fine-tuning and RAG, book a demo with Scale.


Part 2: Retrieval Augmented Generation Deep Dive

Custom Copilots Use Case: Retrieval Augmented Generation Deep Dive
Retrieval Augmented Generation (RAG) has captured the attention of the Generative AI community because it significantly extends the available use cases and applications of Generative AI models. RAG means retrieving external information and then injecting it into the prompt of an LLM call so that the model can generate outputs with specific context from unique data sources. In this section, we're going to discuss why RAG is important, how it works, and how to improve its usefulness for enterprise use cases.


What Is RAG and Why Does It Matter?

LLMs are trained on many billions of tokens, providing extensive knowledge and powerful reasoning capabilities. In many cases, enterprises may then opt to fine-tune an LLM with highly specialized data to further enhance the model's knowledge and capabilities for a specific use case or domain. However, after the training and fine-tuning are complete, the "knowledge" of the model, i.e., the data it can use to generate responses, is fixed. Hence, when asked a question about data or documents it has not seen before, the model cannot answer.

In practice, it is not possible to continuously retrain the model on all the latest
data before calling it. We want the model to “apply” its knowledge to new data.
A good analogy would be teaching a law school student how to assess whether a
transaction document has a tax deed. There is specific tax law course content that
the student learns during the curriculum. However, after graduating, the student
will encounter entirely new cases in practice. The student needs to apply her
learned skills to assess tax deeds. But if she wants to know how many of the law
firm’s transaction documents had a tax deed, she needs to have access to all the
relevant documents and information of this specific case and go through them.
The course content alone will not suffice, as it would be impossible to include all
the cases in the university course.

The easiest solution to this problem would be to include all the required new information in the prompt when calling the model. In the example above, this would mean taking the text of all the transaction documents, feeding it into the model, and asking, "How many of these mention a tax deed?". However, there is a problem: the model has a limited "context window." This means it can only take a maximum amount of input text for its prompt due to computational constraints. While the latest models like GPT-4 can take up to 32,000 tokens (about 50 pages of text), this is still not enough to answer questions about large volumes of documents or data, not even enough for the sample question above.

The solution involves introducing an additional step before calling the LLM: retrieving the most relevant information from a so-called "knowledge base" and then adding this retrieved data to the prompt.

RAG vs. Fine-Tuning


Based on RAG’s success in providing many practical LLM applications, a debate
has emerged about when to use RAG and Fine-Tuning. This is a false dichotomy
because these two techniques are complementary. Fine-tuning is for changing and
adjusting the model’s behavior, i.e., “teaching” the model new skills like “writing
patent litigation claims.” RAG is about providing additional context to the model
at the time it is called. The two techniques are often needed in concert. For
example, a tax lawyer needs both specialized training (fine-tuning) AND access to
the relevant case documents (RAG) in order to assess their content.

Fine-tuning involves changing the actual weights of a pre-trained model, using highly domain- and task-specific data. There are multiple techniques, including instruction fine-tuning (prompt-response pairs) and RLHF. The most important part of fine-tuning is developing a high-quality dataset in the necessary domain and with data relevant to the use cases for which the model will be used.

RAG, on the other hand, inserts data into the prompt (context window) of the model during inference. The steps here involve chunking, embedding, vector stores, reranking, and more, all of which we will dive deeper into now.


How Does a RAG System Work?

A RAG implementation typically includes three primary components: pre-processing, retrieval, and reasoning, each of which has further sub-components. Pre-processing takes the raw data that should be used by the LLM and transforms it into a format that can be used for retrieval at inference time, i.e., when the published model is asked to produce an output such as an answer to a question. This process involves data connectors, chunking, chunk processing, metadata extraction, embedding generation, and storing embeddings in a vector store. Retrieval is the core process of searching a vector store for the embeddings most similar to a user query and then re-ranking the results for relevance. Finally, reasoning is the actual call to the language model, combining the original prompt and the retrieved context to generate the answer that the user will eventually read.

Let's look at each of the components in a bit more detail.


Pre-Processing
The first step in RAG is to transform the raw data that should be used for retrieval into a format that can be effectively searched based on the user prompt. For example, the relevant database might be a collection of 10,000 legal transaction documents. The output of the pre-processing step is often called a "knowledge base." This is a combination of a vector store with searchable embeddings and a set of metadata associated with these embeddings.

After reading in the raw text (which for PDF documents can require additional OCR or language models), the first step is to split or "chunk" the raw text into smaller pieces. The most basic approach is to split by a fixed number of tokens. The ideal chunk size is determined by the individual use case and input data and should be iteratively optimized based on the system accuracy (see below).
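
A minimal sketch of fixed-size chunking with overlap (token counting is approximated here by whitespace splitting; a production system would typically use the tokenizer of the embedding model):

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    # Split text into overlapping chunks of roughly `chunk_size` tokens.
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks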

Next, it is often necessary to "sanitize" the chunks to improve their semantic meaning. For example, in legal contract documents, we want to ensure that the chunks always start and finish with clauses or sub-sections. When data is extracted from tables, HTML documents, or similar, sanitization might involve removing tags or other unnecessary artifacts.

An important but often overlooked next step is to extract metadata from the
chunks and overall input documents. This could involve steps like generating
summaries of each chunk, extracting references, headings, document metadata
(like timestamps, etc.), and more. The metadata depends on the specific use case.

The next step is to generate embeddings for the chunks and store them in a vector store. The embedding model and vector store used vary widely and depend on the relevant business requirements. For example, some embedding models or vector stores can be hosted locally, while others are hosted externally. The embedding model can also be fine-tuned for specific types of text, leading to higher overall retrieval performance.


Lastly, we store the metadata in a relevant metadata store. This can either be
directly with the chunks in the vector store (if it allows metadata storage) or in a
separate relational database.

When accessing data from external applications and sources, the pre-processing
step is often automated with so-called data connectors. A data connector would
first authenticate to a source of data like Google Drive and then asynchronously
execute the pre-processing steps on a regular basis to ensure the knowledge base
stays in sync with the data stored in the external resource.

The following is an example of how to create a knowledge base using Scale’s EGP
APIs and link it to a Google Drive data connector:

import requests

# `headers` holds the authentication headers for the EGP API (omitted here)

# create a new knowledge base
url = "https://api.egp.scale.com/v2/knowledge-bases"
payload = {
    "embedding_config": {
        "embedding_model": "sentence-transformers/all-MiniLM-L12-v2"
    },
    "knowledge_base_name": "example_knowledge_base",
}
response = requests.post(url, json=payload, headers=headers)
knowledge_base_id = response.json()["data"]["knowledge_base_id"]

# upload documents to the knowledge base using a Google Drive data connector
url = f"https://api.egp.scale.com/v2/knowledge-bases/{knowledge_base_id}/uploads"
payload = {
    "data_source_config": {
        "source": "GoogleDrive",
        "drive_id": "DRIVE_ID_GOES_HERE"
    },
    "chunking_strategy_config": {
        "strategy": "character",
        "separator": "\n\n",
        "chunk_size": 1000,
        "chunk_overlap": 200
    }
}
response = requests.post(url, json=payload, headers=headers)


Retrieval

The core retrieval process has three elements: translating the user query into embeddings (question encoding), similarity search, and reranking.

To search the newly stored embeddings in the vector store, we need to translate the question the user has asked in natural language (e.g., "How many litigations did we have in March?") into embedding space as well. This is usually done using either the same model used to embed the chunks during the pre-processing step, or another model trained alongside that model.

Next, we can query the vector store by running a similarity search between the
question embeddings and all the embeddings stored in the vector store. The
result is a list of chunk ids that are ordered by similarity score from most to least
relevant. While a k nearest neighbor (kNN) search is the most common technique,
some vector stores offer more advanced search options for higher relevance
results.
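
A minimal sketch of question encoding plus similarity search, using the sentence-transformers library and the same embedding model named in the pre-processing example above (an in-memory search stands in for the vector store here):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# `chunks` is the list of pre-processed text chunks from the knowledge base.
chunks = [
    "Litigation summary for March 2023 ...",
    "Tax deed clause from the Q2 share purchase agreement ...",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

# Encode the user question with the same model and run a cosine-similarity search.
query_embedding = model.encode("How many litigations did we have in March?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=2)[0]

# `hits` is ordered from most to least similar: [{"corpus_id": ..., "score": ...}, ...]
top_chunks = [chunks[hit["corpus_id"]] for hit in hits]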

We also have to specify how many results (chunks) we want to get from the vector
store. A higher number of results makes it more likely that the correct piece of
information is contained in the results, but the LLM context window is limited,
which puts a constraint on the number of chunks we can return. Additionally,
dealing with a very large number of retrieved chunks can increase the time the full
generation pipeline takes due to various downstream pieces such as reranking,
filtering, and more. Furthermore, a larger number of results also makes it more difficult to rank the chunks so that the most relevant ones appear at the top.

Often, we would like to use a stronger reranking model than kNN vector search, such as a cross-encoder, but it would be too expensive to run on the thousands or millions of chunks present in the vector store. In this case, we employ a two-stage reranking process: first, we narrow down the chunks to the top K (e.g., K=500) using the vector store retrieval, and then we rerank only the retrieved chunks with the second-stage reranking model. This allows us to search the whole vector store while also getting the benefits of the reranker.
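
A minimal sketch of the second reranking stage, assuming the first-stage candidates have already been retrieved from the vector store (the cross-encoder checkpoint named here is a public general-purpose model, not necessarily the one a production system would use):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How many litigations did we have in March?"
# `candidate_chunks` would be the top-K (e.g. K=500) chunks from the first-stage vector search.
candidate_chunks = [
    "Litigation summary for March 2023 ...",
    "Tax deed clause from the Q2 share purchase agreement ...",
]

# Score every (query, chunk) pair with the cross-encoder and keep the highest-scoring chunks.
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, candidate_chunks), key=lambda pair: pair[0], reverse=True)]
top_chunks = reranked[:10]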

In our experience, the reranking step is a crucial element of the retrieval process that ensures a high-quality LLM response. Hence, more and more companies are giving special attention to reranking, with techniques like maximum marginal relevance (MMR) gaining traction. Cohere recently even published a model specifically for reranking. At Scale, we have also observed that fine-tuning the reranking model on the specific dataset used for retrieval can dramatically improve results.

In many RAG solutions, the entire retrieval step is abstracted in a single API call to
an existing knowledge base, with lower-level APIs available to further customize
the individual sub-steps and configurations.

url = f"https://api.egp.scale.com/v2/knowledge-bases/{knowledge_base_id}/query"
payload = {
    "include_embeddings": True,
    "query": "Among all our transaction documents this month, how many mention a tax deed?",
    "top_k": 10
}
response = requests.post(url, json=payload, headers=headers)

Another important step that is often performed during retrieval is metadata filtering. This includes matching certain names, keywords, dates, etc. in the user query and using the previously extracted metadata per chunk in order to filter out irrelevant pieces of information. For example, if the user asked for contracts for the month of March, we can filter out all the chunks that come from documents of a different month, no matter their semantic relevance. Performing the filtering on the results of the vector store search is called post-filtering. Another approach is pre-filtering, which means applying the metadata filter during or before the similarity search. This can improve the speed of the similarity search for large knowledge bases; however, not all vector stores support this technique.
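
A minimal sketch of post-filtering, assuming each retrieved chunk carries a metadata dictionary populated during pre-processing (the field names and documents are illustrative):

from datetime import date

retrieved_chunks = [
    {"text": "Share purchase agreement, clause 7.2 ...", "metadata": {"document_date": date(2023, 3, 14)}},
    {"text": "Lease amendment, schedule B ...", "metadata": {"document_date": date(2023, 1, 9)}},
]

# Post-filtering: keep only chunks from March documents, regardless of semantic similarity.
march_chunks = [
    chunk for chunk in retrieved_chunks
    if chunk["metadata"]["document_date"].month == 3
]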


At Scale, we observe that metadata filtering, in connection with high-quality metadata extraction during pre-processing, is typically one of the most critical operations driving retrieval accuracy. It is particularly critical in cases where the RAG system has access to many documents that are semantically very similar, like thousands of transaction documents in a law firm.

Reasoning

The last step is to fetch the relevant chunks based on the IDs returned by the vector store and compose the final prompt for the language model from the initial user question and the retrieved content. One of the challenges here is to ensure that the language model answers the user question only, or mostly, based on the information that was found during retrieval. This includes making sure that the model answers "I don't know" if the relevant information is not found, instead of making up a response. For example, when asking a question about a specific litigation case whose document is not in the knowledge base, we want to ensure that the LLM answers "I don't have information about this case" instead of making up a response based on similar documents. One way to achieve this is careful prompt engineering, outlining strict rules for the LLM when answering questions based on retrieved context. However, this is often not enough, especially for smaller open-source models that have been trained on significantly less data than models like GPT-4.
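
A minimal sketch of such a grounded prompt, with an explicit instruction to refuse when the retrieved context does not contain the answer (the exact wording is illustrative, not a prescribed template):

top_chunks = ["<retrieved chunk 1>", "<retrieved chunk 2>"]
question = "Among all our transaction documents this month, how many mention a tax deed?"

retrieved_context = "\n\n".join(top_chunks)
prompt = (
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the required information, reply exactly: "
    '"I don\'t have information about this case."\n\n'
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {question}"
)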

An effective way to alleviate this problem is to perform RAG fine-tuning. RAG fine-tuning is a specialized case of instruction fine-tuning, where the model is trained on prompt-response pairs with an added bit of context or data for each pair. This conditions the model to "expect" retrieved context when answering questions and makes it much less likely that the model will hallucinate when used in a RAG system.
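
As a rough illustration (the format and wording are assumptions, not the exact training schema we use), a RAG fine-tuning example is simply an instruction-tuning pair with the retrieved context prepended to the prompt:

# Illustrative training record: retrieved context + question as the prompt, grounded answer as the target.
rag_finetuning_example = {
    "prompt": (
        "Context:\n"
        "Clause 7.2: The Seller shall deliver a tax deed at Closing ...\n\n"
        "Question: Does this agreement include a tax deed?"
    ),
    "response": "Yes. Clause 7.2 requires the Seller to deliver a tax deed at Closing.",
}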

For example, at Scale we have fine-tuned an MPT-7B model for RAG using this technique and have observed dramatic improvements in hallucination rates and "helpful answers".

In many RAG systems, the reasoning step is followed by a process of generating references or citations for the model response. This means that one or multiple sentences in the model response include a reference to the exact chunk that was used to generate this response. There are multiple techniques to generate these citations, such as similarity matching or even having another LLM generate the matches.

To put everything together, let's go back to our earlier example. When asking the retrieval-augmented LLM the question "How many of our transaction documents last month included a tax deed?", we get a very cohesive answer, including citations.


Behind the scenes, between prompt and response we can follow all the previously
discussed steps: ingestion, embedding, metadata storage, retrieval, reranking and
LLM reasoning, working in concert to generate the final answer.


How to Measure Accuracy?


One of the most important questions for RAG systems used in any business
context is how accurate they are. However, evaluating RAG accuracy is both novel
and multi-dimensional, which is why it is often harder than expected.

At most companies, the chosen path of evaluation is to manually check (using human judgment) the correctness of answers to a set of questions for which the ground truth is known. In addition, it is often important to evaluate additional aspects, for example, whether the retrieved context was relevant for the answer. At Scale, we are able to evaluate even very large sets of test questions using evaluation APIs and our network of human experts. However, in order to enable rapid experimentation, it is also advisable to use an automated way of evaluating RAG accuracy in addition to human evaluation.

We want to look at two techniques Scale is using for internal and external RAG
benchmarking: Single Word Benchmark (SWB) and Span Evaluation Benchmark
(SEB).

Single Word Benchmark


Here we use the documents in the knowledge base to formulate a set of questions that have definite answers of one or at most two words. For example, "Who is the author of this document?" or "What was the EBITDA in 2022?". We then record the ground truth answers for these short questions. During evaluation, we can use exact or fuzzy matching (or another LLM) to compare the RAG system's response for each evaluation question with the recorded ground truth answer. This technique evaluates the end-to-end performance of the RAG system, which is useful for testing the effectiveness of all the components operating together. However, the evaluation does not give any insight into which parts of the RAG system (e.g., pre-processing, retrieval, reasoning) are at fault for wrong answers. Hence, it is often important to pair such end-to-end evaluation with a retrieval-only evaluation benchmark.
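
A minimal sketch of the matching step, using fuzzy string similarity from the Python standard library (the similarity threshold is an arbitrary choice for illustration):

from difflib import SequenceMatcher

def is_correct(predicted: str, ground_truth: str, threshold: float = 0.9) -> bool:
    # Exact match first, then a fuzzy comparison to tolerate small formatting differences.
    predicted, ground_truth = predicted.strip().lower(), ground_truth.strip().lower()
    if predicted == ground_truth:
        return True
    return SequenceMatcher(None, predicted, ground_truth).ratio() >= threshold

# Example: compare the RAG system's answer against the recorded ground truth.
print(is_correct("4.2 million", "4.2 Million"))  # True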

Span Evaluation Benchmark


For this benchmark, we also use the set of documents in the database to formulate test questions. These questions can be open-ended and do not need to have short answers. Importantly, when formulating a question, we mark the exact span of words (e.g., page 10, words 40-120) that contains the information required to answer it. This could be one or multiple spans. We call these spans "ground truth context." We do not care about the ground truth answer for the question at this point. During evaluation, we run the RAG system on the collected test questions and obtain a set of reranked chunks after the retrieval step. Instead of feeding these into the LLM, we programmatically match these chunks against the ground truth spans we have collected. This allows us to automatically evaluate the accuracy of the pre-processing and retrieval steps with high precision. Because the LLM generation is not involved in this evaluation, it helps isolate problems within the RAG system.
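
A minimal sketch of the programmatic matching, assuming each ground-truth span and each retrieved chunk is represented by its (document_id, start_word, end_word) position (this representation is an assumption for illustration):

def overlaps(chunk: tuple, span: tuple) -> bool:
    # A retrieved chunk matches a ground-truth span if they overlap within the same document.
    chunk_doc, chunk_start, chunk_end = chunk
    span_doc, span_start, span_end = span
    return chunk_doc == span_doc and chunk_start <= span_end and span_start <= chunk_end

def retrieval_recall(retrieved_chunks: list, ground_truth_spans: list) -> float:
    # Fraction of ground-truth spans covered by at least one retrieved chunk.
    hits = [any(overlaps(chunk, span) for chunk in retrieved_chunks) for span in ground_truth_spans]
    return sum(hits) / len(ground_truth_spans)

# Ground truth: the span marked when the test question was written (e.g. page 10, words 40-120).
spans = [("doc_17", 400, 480)]
retrieved = [("doc_17", 350, 450), ("doc_03", 0, 100)]
print(retrieval_recall(retrieved, spans))  # 1.0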

How can we improve accuracy?

Now that we have built the RAG system and an evaluation method, the natural next question is: how can we improve the accuracy of our RAG system? Based on what we outlined above, at Scale we typically see three core levers for improving retrieval performance, listed below with practical suggestions.


Chunking and Embedding

• Vary the size of chunks based on the document type and content
• Create custom chunking logic based on the known document format (e.g., joining tables across pages, separating chunks based on section headers)
• Perform more advanced metadata extraction and chunk sanitization during pre-processing
• Use higher-performance and/or fine-tuned embedding models
• Use higher-performance vector stores

Reranking and Filtering

• Implement more advanced reranking algorithms or even fine-tune a cross-encoder to increase the relevance of the top retrieved data
• Filter the retrieved chunks based on metadata like dates, named entities, and cross-references

RAG Fine-Tuning

• Fine-tune an LLM to always expect a certain type of context when answering questions
• This is most important when using smaller models and when the context is highly domain-specific


What’s next?
Retrieval Augmented Generation is a fast-moving field, and this overview does
not fully capture what is possible with RAG. At Scale, we are helping leading
enterprises customize large language models with fine-tuning and RAG using the
Scale GenAI Platform to help them unlock their most important use cases. We are
building highly effective RAG systems leveraging our comprehensive set of EGP
APIs, some of which were shown in this article. To learn more about using these
APIs for your project, book a demo here.

About Scale
Transform your data into customized enterprise-ready Generative AI Applications.

Accelerate and scale your organization’s Generative AI journey with the full-stack
platform to build, test, and deploy enterprise-ready applications, customized with
your own data.

Scale GenAI Platform provides streamlined and centrally managed infrastructure, freeing you to address more Generative AI use cases across your organization and accelerating your time to production. Customize, test, and deploy all major closed and open-source foundation, embedding, and reranking models from OpenAI, Cohere, Meta, and more. Build Generative AI applications for any use case, including fine-tuned customer service chatbots, data analysis applications using Text2SQL, or custom copilots using retrieval augmented generation (RAG).

Scale GenAI Platform is trusted by leading enterprises including BCG, Global Atlantic Financial Group, Howard Hughes, and more.

Visit scale.com/genai-platform to learn more.

