Generative AI With Python

Contents

Preface ....................................................................................................................................................... 15

1 Introduction to Generative AI 29

1.1 Introduction to Artificial Intelligence ........................................................................... 30


1.2 Pillars of Generative AI Advancement ......................................................................... 35
1.2.1 Computational Power ........................................................................................... 35
1.2.2 Model and Data Size .............................................................................................. 36
1.2.3 Investments .............................................................................................................. 37
1.2.4 Algorithmic Improvements ................................................................................. 37
1.3 Deep Learning ......................................................................................................................... 38

1.4 Narrow AI and General AI .................................................................................................. 40

1.5 Natural Language Processing Models .......................................................................... 42


1.5.1 NLP Tasks ................................................................................................................... 42
1.5.2 Architectures ............................................................................................................ 45
1.6 Large Language Models ...................................................................................................... 47
1.6.1 Training ...................................................................................................................... 47
1.6.2 Use Cases .................................................................................................................. 48
1.6.3 Limitations ................................................................................................................ 50
1.7 Large Multimodal Models .................................................................................................. 51

1.8 Generative AI Applications ............................................................................................... 52


1.8.1 Consumer .................................................................................................................. 53
1.8.2 Business ..................................................................................................................... 53
1.8.3 Prosumer ................................................................................................................... 54
1.9 Summary ................................................................................................................................... 54

2 Pretrained Models 57

2.1 Hugging Face ........................................................................................................................... 58

2.2 Coding: Text Summarization ........................................................................................... 60


2.3 Exercise: Translation ............................................................................................................ 62


2.3.1 Task ............................................................................................................................. 63
2.3.2 Solution ...................................................................................................................... 63
2.4 Coding: Zero-Shot Classification .................................................................................... 64

2.5 Coding: Fill-Mask ................................................................................................................... 67

2.6 Coding: Question Answering ........................................................................................... 68

2.7 Coding: Named Entity Recognition ............................................................................... 70

2.8 Coding: Text-to-Image ........................................................................................................ 71

2.9 Exercise: Text-to-Audio ...................................................................................................... 72


2.9.1 Task ............................................................................................................................. 73
2.9.2 Solution ...................................................................................................................... 73
2.10 Capstone Project: Customer Feedback Analysis ...................................................... 74

2.11 Summary ................................................................................................................................... 77

3 Large Language Models 79

3.1 Brief History of Language Models .................................................................................. 80

3.2 Simple Use of LLMs via Python ........................................................................................ 81


3.2.1 Coding: Using OpenAI ........................................................................................... 81
3.2.2 Coding: Using Groq ................................................................................................ 84
3.2.3 Coding: Large Multimodal Models ................................................................... 87
3.2.4 Coding: Running Local LLMs ............................................................................... 90
3.3 Model Parameters ................................................................................................................. 93
3.3.1 Model Temperature ............................................................................................... 93
3.3.2 Top-p and Top-k ...................................................................................................... 95
3.4 Model Selection ...................................................................................................................... 96
3.4.1 Performance ............................................................................................................. 97
3.4.2 Knowledge Cutoff Date ........................................................................................ 98
3.4.3 On-Premise versus Cloud-Based Hosting ....................................................... 98
3.4.4 Open-Source, Open-Weight, and Proprietary Models ............................... 98
3.4.5 Price ............................................................................................................................. 99
3.4.6 Context Window .................................................................................................... 99
3.4.7 Latency ....................................................................................................................... 99
3.5 Messages ................................................................................................................................... 99
3.5.1 User ............................................................................................................................. 100


3.5.2 System ........................................................................................................................ 100


3.5.3 Assistant .................................................................................................................... 100
3.6 Prompt Templates ................................................................................................................. 101
3.6.1 Coding: ChatPromptTemplates ......................................................................... 101
3.6.2 Coding: Improve Prompts with LangChain Hub .......................................... 102
3.7 Chains ......................................................................................................................................... 104
3.7.1 Coding: Simple Sequential Chain ..................................................................... 105
3.7.2 Coding: Parallel Chain ........................................................................................... 106
3.7.3 Coding: Router Chain ............................................................................................ 109
3.7.4 Coding: Chain with Memory .............................................................................. 113
3.8 Safety and Security ............................................................................................................... 117
3.8.1 Security ...................................................................................................................... 118
3.8.2 Safety .......................................................................................................................... 118
3.8.3 Coding: Implementing LLM Safety and Security .......................................... 119
3.9 Model Improvements .......................................................................................................... 124

3.10 New Trends ............................................................................................................................... 125


3.10.1 Reasoning Models .................................................................................................. 126
3.10.2 Small Language Models ....................................................................................... 127
3.10.3 Test-Time Computation ....................................................................................... 128
3.11 Summary ................................................................................................................................... 130

4 Prompt Engineering 133

4.1 Prompt Basics .......................................................................................................................... 134


4.1.1 Prompt Process ........................................................................................................ 134
4.1.2 Prompt Components ............................................................................................. 135
4.1.3 Basic Principles ........................................................................................................ 136
4.2 Coding: Few-Shot Prompting ........................................................................................... 142
4.3 Coding: Chain-of-Thought ................................................................................................. 144

4.4 Coding: Self-Consistency Chain-of-Thought ............................................................. 145

4.5 Coding: Prompt Chaining ................................................................................................... 149

4.6 Coding: Self-Feedback ......................................................................................................... 151

4.7 Summary ................................................................................................................................... 155


5 Vector Databases 157

5.1 Introduction ............................................................................................................................. 157

5.2 Data Ingestion Process ........................................................................................................ 159

5.3 Loading Documents .............................................................................................................. 160


5.3.1 High-Level Overview .............................................................................................. 161
5.3.2 Coding: Load a Single Text File .......................................................................... 161
5.3.3 Coding: Load Multiple Text Files ....................................................................... 163
5.3.4 Exercise: Load Multiple Wikipedia Articles .................................................... 164
5.3.5 Exercise: Loading Project Gutenberg Book .................................................... 166
5.4 Splitting Documents ............................................................................................................ 167
5.4.1 Coding: Fixed-Size Chunking .............................................................................. 169
5.4.2 Coding: Structure-Based Chunking .................................................................. 173
5.4.3 Coding: Semantic Chunking ............................................................................... 176
5.4.4 Coding: Custom Chunking .................................................................................. 178
5.5 Embeddings .............................................................................................................................. 182
5.5.1 Overview .................................................................................................................... 182
5.5.2 Coding: Word Embeddings ................................................................................. 184
5.5.3 Coding: Sentence Embeddings .......................................................................... 190
5.5.4 Coding: Create Embeddings with LangChain ............................................... 193
5.6 Storing Data ............................................................................................................................. 195
5.6.1 Selection of a Vector Database .......................................................................... 196
5.6.2 Coding: File-Based Storage with a Chroma Database ............................... 196
5.6.3 Coding: Web-Based Storage with Pinecone .................................................. 198
5.7 Retrieving Data ....................................................................................................................... 202
5.7.1 Similarity Calculation ............................................................................................ 202
5.7.2 Coding: Retrieve Data from Chroma Database ............................................ 204
5.7.3 Coding: Retrieve Data from Pinecone ............................................................. 205
5.8 Capstone Project .................................................................................................................... 207
5.8.1 Features ..................................................................................................................... 208
5.8.2 Dataset ....................................................................................................................... 209
5.8.3 Preparing the Vector Database .......................................................................... 209
5.8.4 Exercise: Get All Genres from the Vector Database ................................... 213
5.8.5 App Development .................................................................................................. 214
5.9 Summary ................................................................................................................................... 218


6 Retrieval-Augmented Generation 221

6.1 Introduction ............................................................................................................................. 222

6.2 Coding: Simple Retrieval-Augmented Generation ................................................ 225


6.2.1 Knowledge Source Setup ..................................................................................... 225
6.2.2 Retrieval ..................................................................................................................... 227
6.2.3 Augmentation ......................................................................................................... 228
6.2.4 Generation ................................................................................................................ 229
6.2.5 RAG Function Creation ......................................................................................... 230
6.3 Advanced Techniques .......................................................................................................... 232
6.3.1 Advanced Preretrieval Techniques ................................................................... 232
6.3.2 Advanced Retrieval Techniques ......................................................................... 234
6.3.3 Advanced Postretrieval Techniques ................................................................. 250
6.4 Coding: Prompt Caching ..................................................................................................... 250

6.5 Evaluation ................................................................................................................................. 256


6.5.1 Challenges in RAG Evaluation ............................................................................ 256
6.5.2 Metrics ....................................................................................................................... 257
6.5.3 Coding: Metrics ....................................................................................................... 259
6.6 Summary ................................................................................................................................... 261

7 Agentic Systems 263

7.1 Introduction to AI Agents .................................................................................................. 264

7.2 Available Frameworks ......................................................................................................... 265

7.3 Simple Agent ........................................................................................................................... 267


7.3.1 Agentic RAG .............................................................................................................. 267
7.3.2 ReAct ........................................................................................................................... 271
7.4 Agentic Framework: LangGraph ..................................................................................... 275
7.4.1 Simple Graph: Assistant ....................................................................................... 275
7.4.2 Router Graph ............................................................................................................ 279
7.4.3 Graph with Tools .................................................................................................... 284
7.5 Agentic Framework: AG2 ................................................................................................... 289
7.5.1 Two Agent Conversations .................................................................................... 290
7.5.2 Human in the Loop ................................................................................................ 293
7.5.3 Agents Using Tools ................................................................................................ 299
7.6 Agentic Framework: CrewAI ............................................................................................. 303
7.6.1 Introduction ............................................................................................................. 303


7.6.2 First Crew: News Analysis Crew ........................................................................ 304


7.6.3 Exercise: AI Security Crew .................................................................................... 319
7.7 Agentic Framework: OpenAI Agents ............................................................................ 328
7.7.1 Getting Started with a Single Agent ................................................................ 328
7.7.2 Working with Multiple Agents .......................................................................... 329
7.7.3 Agent with Search and Retrieval Functionality ............................................ 332
7.8 Agentic Framework: Pydantic AI .................................................................................... 333

7.9 Monitoring Agentic Systems ............................................................................................ 336


7.9.1 AgentOps ................................................................................................................... 336
7.9.2 Logfire ......................................................................................................................... 340
7.10 Summary ................................................................................................................................... 342

8 Deployment 345

8.1 Deployment Architecture .................................................................................................. 345

8.2 Deployment Strategy .......................................................................................................... 347


8.2.1 REST API Development ......................................................................................... 347
8.2.2 Deployment Priorities ........................................................................................... 348
8.2.3 Coding: Local Deployment .................................................................................. 350
8.3 Self-Contained App Development ................................................................................. 355

8.4 Deployment to Heroku ....................................................................................................... 361


8.4.1 Create a New App ................................................................................................... 361
8.4.2 Download and Configure CLI .............................................................................. 362
8.4.3 Create app.py File ................................................................................................... 363
8.4.4 Procfile Setup ........................................................................................................... 365
8.4.5 Environment Variables ......................................................................................... 365
8.4.6 Python Environment ............................................................................................. 366
8.4.7 Check the Result Locally ....................................................................................... 366
8.4.8 Deployment to Heroku ......................................................................................... 367
8.4.9 Stop Your App .......................................................................................................... 368
8.5 Deployment to Streamlit ................................................................................................... 369
8.5.1 GitHub Repository .................................................................................................. 369
8.5.2 Creating a New App ............................................................................................... 370
8.6 Deployment with Render ................................................................................................... 372

8.7 Summary ................................................................................................................................... 374


9 Outlook 375

9.1 Advances in Model Architecture ..................................................................................... 375

9.2 Limitations and Issues of LLMs ........................................................................................ 376


9.2.1 Hallucinations ......................................................................................................... 376
9.2.2 Biases .......................................................................................................................... 377
9.2.3 Misinformation ....................................................................................................... 378
9.2.4 Intellectual Property .............................................................................................. 379
9.2.5 Interpretability and Transparency .................................................................... 379
9.2.6 Jailbreaking LLMs .................................................................................................... 379
9.3 Regulatory Developments ................................................................................................. 381

9.4 Artificial General Intelligence and Artificial Superintelligence ........................ 381

9.5 AI Systems in the Near Term ............................................................................................ 382

9.6 Useful Resources .................................................................................................................... 384

9.7 Summary ................................................................................................................................... 384

The Author ............................................................................................................................................... 387


Index .......................................................................................................................................................... 389

Chapter 6
Retrieval-Augmented Generation
Sometimes it is the people no one imagines anything of who do the
things no one can imagine.
—Alan Turing in the movie The Imitation Game

In this chapter, we’ll cover one of the most impactful concepts in the field of large lan-
guage models (LLMs): retrieval-augmented generation (RAG). So far, we have collected
different puzzle pieces like LLMs, prompt engineering, and vector databases, which will
now fit perfectly in place.
One important aspect is that vector databases, covered in Chapter 5, typically repre-
sent the backbone of a RAG system. For steering the model on how to create the final
output, we’ll come back to prompt engineering and reuse the concepts we learned in
Chapter 4. Finally, the data will be passed to an LLM, and at that stage, we come back to
the knowledge we gained in Chapter 3. In this regard, RAG is a mountain we can only climb
after having gained fitness in these three different disciplines.
We’ll start this chapter with an introduction to the concept of RAG in Section 6.1. After
gaining a general understanding, we’ll proceed into the individual process steps of
retrieval, augmentation, and generation.
We’ll first develop a simple RAG system in Section 6.2. This system will work surpris-
ingly well but won’t be perfect. For that reason, we’ll present an overview of advanced
RAG techniques in Section 6.3. These improvements can take place in different stages
of the RAG system. Advanced techniques for the pre-retrieval phase are presented in
Section 6.3.1, techniques for the retrieval phase in Section 6.3.2, and techniques for the
post-retrieval phase in Section 6.3.3.
An alternative to RAG, called prompt caching, is the focus of our attention in Section
6.4. When you develop a RAG system and want to improve it, you’ll need metrics for
evaluating and quantifying these improvements. Thus, we dedicated Section 6.5 to
RAG evaluation techniques.
Let’s start with what RAG is, why it is needed, and how it works.


6.1 Introduction
Before we delve into how RAG works, let’s pause for a second and reflect on why it is
needed. Imagine you’re working with a large knowledge source (e.g., a long document),
and you want to “chat” with it.
The simplest approach would be to make an LLM call in which you pass your question
as well as the complete document as the context. This approach is usually not feasible
because the large document may exceed the context window of the LLM, and sending the
complete document with each request is inefficient. Each request would take comparatively
long to answer and could be quite costly because many unnecessary tokens are sent with
every single request.
One special solution is the use of prompt caching, which we’ll discuss in detail in Sec-
tion 6.4. However, prompt caching is more of a solution for edge cases. The more common
approach is to use RAG. With RAG, in each LLM request, only the most relevant docu-
ments are passed along with the user query to the LLM. So, RAG results in an efficient
and quick solution while also overcoming some limitations of LLMs. With RAG, you
have the following capabilities:
- Access to company-internal information
- Access to up-to-date information that was created after the knowledge cutoff date for LLM training
- More fact-based results

Figure 6.1 shows the general RAG process. A user creates a query that will be passed to a
retriever. The retriever is responsible for fetching relevant information from some
external source.

(Figure content: the pipeline is split into the phases retrieval, augmentation, and generation. A user query goes to a retriever, which fetches relevant documents from an external data source; the documents and the instructions are passed to a large language model, which produces the output.)

Figure 6.1 RAG General Pipeline


In most cases, this external source is a vector database, but you’re not limited to vector
databases. Alternatively, you could connect to and search the internet for relevant
information. Ideally, the retriever finds some relevant information.
The next stage of the pipeline is the augmentation step. In this step, we use prompt
engineering to bundle instructions to the relevant documents. These instructions will
tell an LLM how to process the relevant documents but also how to behave when the
relevant documents are not a good match from which to extract information. As the
result of the augmentation step, we have documents and corresponding instructions.
These elements are now passed to the last stage of the pipeline: the LLM.
The LLM will process the instructions, extract information from the documents, and
create an output according to the user query.
The external source is, in most cases, a vector database. Let’s look at the RAG system
with a vector database in the backend, as shown in Figure 6.2.

(Figure content: text documents are embedded with an embedding model and stored in a vector database. The user query is embedded with the same model, a nearest neighbor search retrieves the most similar embeddings, and the nearest neighbors are passed to the LLM, which generates the response.)

Figure 6.2 Simple RAG System Based on a Vector Database Backend

The vector database, which represents a corpus of documents and their embeddings,
must already be created, which we covered in Chapter 5.
In a RAG system that relies on a vector database, the user sends a query to the system.
This query is embedded based on the same embedding model that was used for embed-
ding the documents. The most similar documents to the query embedding are
retrieved from the database. In the next step, the user query together with the most
similar documents are passed to the LLM to formulate the response.
Now that you have a general idea, let’s dive into the details:
1. Retrieval process
In the retrieval process, the user query results in a search against some data source.


The user query, at a minimum, requires a text and a limit indicating how many
documents should be returned. The query can be searched for in a variety of differ-
ent knowledge sources. The knowledge source represents a large number of docu-
ments. In this knowledge source, all documents are ranked based on their similarity
to the user query, as shown in Figure 6.3.

(Figure content: the user query, for example with {max_results: 3}, is compared against the knowledge source; all documents are ranked by relevance from rank 1 to rank N, the top-ranked documents are returned, and the process continues with augmentation.)

Figure 6.3 Retrieval Process

The user query is embedded, and its embedding is compared to all embeddings of
the documents in the knowledge source. Based on a similarity metric like cosine
similarity, the documents are ranked (a minimal sketch of this ranking idea follows
after this list). A predefined number of the most relevant documents is extracted
and passed to the next stage—the augmentation.
2. Augmentation process
In the augmentation step, the retrieved information is combined with the original
query to create an enhanced prompt. This phase begins with the integration of rele-
vant documents or passages that the retrieval step has surfaced.
Typically, the retrieved information is clearly differentiated from the user’s query.
At this stage, prompt engineering comes into play and has a vital role. We need to
instruct the LLM on how to process the information. Explicit instructions are passed
about staying within the bounds of the information provided to avoid hallucina-
tions. Also, we should clearly state how to respond if the information does not
answer the user query.
The result is a prompt that provides the language model with both the context it
needs to generate an accurate response and the guidance that is required to use that
context effectively. The augmented prompt serves as the foundation for generating
responses that are relevant to the user query and grounded in the retrieved informa-
tion. In the next step, we’ll take a closer look at the generation process.


3. Generation process
The generation process in RAG represents the final crucial stage where the language
model transforms the augmented prompt into a coherent and contextually relevant
response. Unlike standard LLM output generation in traditional language models,
RAG-based generation incorporates the retrieved information while maintaining
natural language fluency.

When the language model receives the augmented prompt, it begins the process of
analyzing the retrieved context and the original query in tandem.
One of the key challenges during generation is maintaining consistency with the
retrieved information while avoiding hallucinations. The model must strike a good bal-
ance between purely extractive answers, like directly quoting the retrieved content,
and more abstractive responses that synthesize and reformulate the information. This
balance often involves techniques where the model's generation is explicitly guided to
stay within the boundaries of the provided context.
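The following is a minimal sketch of the similarity ranking described in the retrieval step. The tiny four-dimensional vectors and document names are made up for illustration; in a real system, the embedding model produces the vectors (with hundreds of dimensions), and the vector database performs the nearest neighbor search.

#%% minimal sketch: ranking documents by cosine similarity (illustrative values only)
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity = dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical embeddings of the query and three documents
query_embedding = np.array([0.1, 0.8, 0.3, 0.0])
document_embeddings = {
    "doc_about_ww1": np.array([0.2, 0.7, 0.2, 0.1]),
    "doc_about_rome": np.array([0.9, 0.1, 0.0, 0.3]),
    "doc_about_ww2": np.array([0.1, 0.6, 0.4, 0.0]),
}

# rank all documents by similarity to the query and keep the top k
k = 2
ranked = sorted(
    document_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
print([doc_id for doc_id, _ in ranked[:k]])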
While all this sounds quite complicated, actually it isn’t, as you’ll see as we start imple-
menting RAG in the next section.

6.2 Coding: Simple Retrieval-Augmented Generation


Let’s start developing our very first RAG application by mapping out what we’ll create.
Our small RAG system will be backed by a vector database. This vector database will
represent a knowledge base for human history, populated with knowledge fetched
from Wikipedia. We’ll load a number of Wikipedia articles on the topic and store them
in the database. Later, we’ll set up a RAG system that will interact with these documents
and answer questions based on that represented knowledge. The RAG system will also
clearly state if it does not know how to answer the question.

6.2.1 Knowledge Source Setup


First things first, we’ve got to pack our bags with all the tools we need. Listing 6.1 shows
all packages we’ll need in this script. You can already deduce many aspects of our
implementation, for example, that we’ll use Groq as LLM backend and that we’ll use
OpenAI embeddings for embedding our texts.

#%% packages
import os
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(usecwd=True))
from langchain_groq import ChatGroq
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

Listing 6.1 Simple RAG System: Required Packages (Source: 06_RAG/10_simple_RAG.py)

You must also set up a configuration file that holds your application programming
interface (API) credentials for interacting with the LLM. That file is called .env and is
shown in Listing 6.2.

GROQ_API_KEY = …
OPENAI_API_KEY = …

Listing 6.2 Simple RAG System: API Credentials (Source: .env)

The first step is now to create the vector database. Since we covered this topic in Chap-
ter 5, we won’t bore you by explaining all the details again. If some steps in this chapter
are unclear, please return to Chapter 5.
The general setup is the following: The database will be stored in a folder called rag_
store. If this folder already exists, the assumption is that you already ran the code
(because the folder is created when the database is created). If the folder does
not exist yet, the database is created. The database creation consists of these steps:
1. Loading data from Wikipedia
2. Splitting the data into smaller chunks
3. Embedding the data using OpenAI embeddings
4. Storing these embeddings, together with the chunks, into the vector database

Listing 6.3 shows the steps for loading the dataset and storing it into a vector database.

#%% load dataset
persist_directory = "rag_store"
if os.path.exists(persist_directory):
    vector_store = Chroma(persist_directory=persist_directory,
                          embedding_function=OpenAIEmbeddings())
else:
    data = WikipediaLoader(
        query="Human History",
        load_max_docs=50,
        doc_content_chars_max=1000000,
    ).load()

    # split the data
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200
    ).split_documents(data)

    # create persistent vector store
    vector_store = Chroma.from_documents(chunks, embedding=OpenAIEmbeddings(),
                                         persist_directory="rag_store")

Listing 6.3 Simple RAG System: Vector Database Creation/Loading (Source: 06_RAG/10_sim-
ple_RAG.py)

Next, we’ll walk through the three steps of RAG, namely, retrieval, augmentation, and
generation. We start with the retrieval step.

6.2.2 Retrieval
To retrieve something, we must create a retriever. Listing 6.4 shows how to set up a
retriever and how to invoke it. Helpfully, vector_store (our Chroma database in-
stance) has a method as_retriever() that allows you to define the vector database as
the retriever. Plus, we can define some important parameters like the similarity func-
tion to use for finding the most relevant documents, as well as how many documents
to return. Let’s try it out by setting up a question and subsequently invoking the re-
triever with that question.

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3})
question = "what happened in the first world war?"
relevant_docs = retriever.invoke(question)

Listing 6.4 Simple RAG System: Retriever Set Up (Source: 06_RAG/10_simple_RAG.py)

We can check qualitatively if the retrieval process works by printing the most similar
documents. For simplicity, we’ll limit the output for each document to just 100 charac-
ters, as shown in Listing 6.5.

#%% print content of relevant docs
for doc in relevant_docs:
    print(doc.page_content[:100])
    print("\n--------------")

This transformation was catalyzed by wars of unparalleled scope and devastation. World War I was a g
--------------
=== World War I ===
--------------
World War I saw the continent of Europe split into two major opposing alliances; the Allied Powers,
--------------
A tenuous balance of power among European nations collapsed in 1914 with the outbreak of the First W
--------------

Listing 6.5 Simple RAG System: Retrieved Documents (Source: 06_RAG/10_simple_RAG.py)

Good! All documents found seem relevant since they are related to World War I or mention
the starting year 1914.
But these samples are just snippets taken from chunks. You might try to figure out the
specific answer based on these snippets, but they can be hard to read and understand
since they are not necessarily complete sentences, only pieces of them. Thus, we don't
yet have a good answer to the question. That is what RAG is for: in the next step,
we augment the documents with proper instructions on how they should be handled to
answer the question.

6.2.3 Augmentation
In the augmentation step, we bring the most similar documents into a well-defined for-
mat: a single concatenated string that contains all document contents. We can create
this string with join(). To this function, we pass a list that is then concatenated
into a single string. The list items are the page_content of our relevant_docs, as
follows:

context = "\n".join([doc.page_content for doc in relevant_docs])

This step was the tedious work; now, we need to get creative when defining a prompt
that instructs the LLM how to formulate an answer. As shown in Listing 6.6, we can
define the role that our LLM should play. We’ll instruct it to use the documents to
answer the question. Furthermore, we tell it to not hallucinate and instead to clearly
state that it does not know the answer if necessary.

#%% create prompt
messages = [
    ("system", "You are an AI assistant that can answer questions about the history of human civilization. You are given a question and a list of documents and need to answer the question. Answer the question only based on these documents. These documents can help you answer the question: {context}. If you are not sure about the answer, you can say 'I don't know' or 'I don't know the answer to that question.'"),
    ("human", "{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages=messages)

Listing 6.6 Simple RAG System: Augmentation Setup (Source: 06_RAG/10_simple_RAG.py)


For a start, this step should be enough for augmentation. You can later come back to
this stage to improve the system. For now, we’re satisfied and can proceed with the
next stage—the generation of a final answer.

6.2.4 Generation
We’ve come to the last step—the generation of an answer. This step is straightforward.
We’ll create a model. In this case, we can use an open-weight model from Google called
gemma2. This model is comparably small but quite fast. Classified as a small lan-
guage model (SLM), gemma2 allows for very quick inferences, which might increase
user acceptance, because the answer is provided nearly instantaneously. The longest
process is the retrieval of the documents.
The second step in generation should be familiar to you from Chapter 3—the creation
of a chain. In this case, the chain starts with our prompt. The prompt output is passed
to the model. Finally, the model output is passed to StrOutputParser() so that only the
content part of the model output is shown, as follows:

model = ChatGroq(model_name="gemma2-9b-it", temperature=0)


chain = prompt | model | StrOutputParser()

We’ve placed our last jigsaw piece; let’s test it by invoking the chain and checking the
answer, as shown in Listing 6.7.

#%% invoke chain
answer = chain.invoke({"question": question, "context": context})
print(answer)

World War I was a global conflict from 1914 to 1918.

It involved two main alliances:

* **The Allied Powers:** Primarily composed of the United Kingdom, France, Russia, Italy, Japan, Portugal, and various Balkan states.
* **The Central Powers:** Primarily composed of Germany, Austria-Hungary, the Ottoman Empire, and Bulgaria.

The war resulted in the collapse of four empires: Austro-Hungarian, German, Ottoman, and Russian. It had a death toll estimated between 10 and 22.5 million people.

The war saw the use of new industrial technologies, making traditional military tactics obsolete. It also witnessed horrific events like the Armenian, Assyrian, and Greek genocides within the Ottoman Empire.

Listing 6.7 Simple RAG System: Generation (Source: 06_RAG/10_simple_RAG.py)


A pretty good answer! This test checks that the system formulates a good response
based on the available knowledge from the source. An equally important step
is to also test the opposite functionality—whether the model correctly identifies its own
limitations and clearly states that it does not know.
In a real-world example, you probably want to embed this functionality into a broader
context of code. So, what we need to do should be clear: We need to bundle everything
into a function.

6.2.5 RAG Function Creation


Our function simple_rag_system will consume a question and return a string. The steps
inside are exactly the same as before:
1. Retrieving the relevant documents
2. Creating the context based on these relevant documents
3. Creating a detailed prompt that tells the model how to apply the context as well as
the user question
4. Creating a chain that bundles the prompt, the model, and an output handler
together

Listing 6.8 shows the definition of our simple_rag_system function, which consumes a
question as the input parameter and returns the RAG system answer.

# %% bundle everything in a function
def simple_rag_system(question: str) -> str:
    relevant_docs = retriever.invoke(question)
    context = "\n".join([doc.page_content for doc in relevant_docs])
    messages = [
        ("system", "You are an AI assistant that can answer questions about the history of human civilization. You are given a question and a list of documents and need to answer the question. Answer the question only based on these documents. These documents can help you answer the question: {context}. If you are not sure about the answer, you can say 'I don't know' or 'I don't know the answer to that question.'"),
        ("human", "{question}"),
    ]
    prompt = ChatPromptTemplate.from_messages(messages=messages)
    model = ChatGroq(model_name="gemma2-9b-it", temperature=0)
    chain = prompt | model | StrOutputParser()
    answer = chain.invoke({"question": question, "context": context})
    return answer

Listing 6.8 Simple RAG System: Function (Source: 06_RAG/10_simple_RAG.py)


Now, time to test it out. Does the function correctly return that it does not know an
answer? As shown in Listing 6.9, it does!

# %% Testing the function
question = "What is a black hole?"
simple_rag_system(question=question)

"I don't know the answer to that question. \n"
Listing 6.9 Simple RAG System: Test (Source: 06_RAG/10_simple_RAG.py)

Our RAG system does not know anything about black holes, which is exactly what we
hoped for, given that it relied on a knowledge source based solely on human history.
We achieved this limitation through the system message by passing the instruction
Answer the question only based on these documents. … If you are not sure about the answer,
you can say 'I don't know' or 'I don't know the answer to that question.
Let’s reflect on what we’ve learned so far. We’ve developed our first RAG system, based
on a vector database. Given a user question, the most relevant documents were
retrieved. These retrieved documents were used in combination with detailed instruc-
tions to the LLM on how to make use of them.
You’ll be impressed by how well this works. But after working with it for a while, you
might see some issues with this simple RAG. Issues might occur anywhere along the
complete pipeline, for example:
- Retrieval
  The retrieved documents might not be relevant.
- Augmentation
  – If multiple documents hold the same or very similar information, the retrieved documents might be repetitive and thus are not very informative.
  – If documents stem from different sources, their context might be hard to differentiate, which might result in an incorrect mix of different documents in the final answer. Imagine you have a system with controversial papers and contradicting views. A terrible answer will result if the system assigns views to the wrong authors.
- Generation
  Issues at this step typically result from the upstream issues mentioned earlier in this list.

Numerous possible improvements can be made, and in the following sections, we’ll
discuss some of these options.


6.3 Advanced Techniques


Although our simple RAG system is impressive, there’s still a lot of room for improve-
ment. Opportunities for improvement can be found at different stages in the pipeline.
For instance, some improvements can be made to the indexing pipeline.
Practically, we can improve the data before we add it to our vector database, which we’ll
cover when we discuss advanced pre-retrieval techniques in Section 6.3.1. Other tech-
niques tackle the problem at the retrieval step of the pipeline, which we cover in Sec-
tion 6.3.2. Unsurprisingly, the last step of the pipeline can be improved as well; we’ll
explore improvements in the generation step in Section 6.3.3.

6.3.1 Advanced Preretrieval Techniques


If the RAG system fails to fetch the right documents, the problem might be found in the
data ingestion pipeline. Several ways are available to improve the data before it is
stored into the vector database. Figure 6.4 shows some common approaches to opti-
mize the data ingestion pipeline.

(Figure content: the data ingestion pipeline consists of data source, data loading, data chunking, embedding, and storing. The corresponding optimization techniques are data cleaning, metadata enhancing and context enrichment, chunk size optimization, and embedding model fine-tuning.)
Figure 6.4 Pre-Retrieval Techniques

The different techniques correspond to the step in which they are applied, which we’ll
discuss in the following sections.

Data Cleaning
Data cleaning in RAG systems involves preparing and refining the source documents
to optimize them for retrieval and generation. This crucial preprocessing step begins
with basic operations like removing duplicate content, standardizing formatting, and
handling special characters.
The text is then typically segmented into appropriate chunks that balance context
preservation with retrieval granularity—chunks that are too large may contain irrele-
vant information, while chunks that are too small might lose important context. (We
covered these topics in detail in Chapter 5.) Advanced cleaning steps might include the
following:
- Removing boilerplate content
- Standardizing date formats
- Resolving abbreviations
- Handling multilingual content

The cleaning process also often involves metadata enrichment, where chunks are
tagged with relevant identifiers, timestamps, or categorizations to enhance retrieval
accuracy. Clean data forms the foundation for effective RAG performance, as it directly
impacts both the quality of retrieved matches and the coherence of generated
responses.
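As a small, hedged illustration of these cleaning steps (not taken from the book's repository), the following sketch removes exact duplicates, normalizes whitespace, and standardizes one date format; the regular expressions and the assumed DD/MM/YYYY input format are our own choices and would need to be adapted to your documents.

#%% minimal data-cleaning sketch (illustrative only)
import re
from typing import List

def clean_chunks(chunks: List[str]) -> List[str]:
    cleaned = []
    seen = set()
    for chunk in chunks:
        # normalize whitespace
        text = re.sub(r"\s+", " ", chunk).strip()
        # standardize dates like 28/07/1914 to 1914-07-28 (assumes DD/MM/YYYY input)
        text = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", text)
        # drop exact duplicates
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

print(clean_chunks(["World War I   began on 28/07/1914.",
                    "World War I began on 28/07/1914."]))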

Metadata Enhancing
Metadata enhancement in RAG systems involves enriching document chunks with
additional contextual information that goes beyond the raw text content. This process
includes tagging documents with attributes such as creation dates, authors, topics, and
categorical classifications.
More sophisticated enhancement might involve generating embeddings for semantic
search, calculating readability scores, or extracting key entities and relationships.
Some systems implement automated topic modeling or classification to add thematic
tags, while others maintain hierarchical relationships between chunks to preserve doc-
ument structure.
This rich layer of metadata not only improves retrieval accuracy but also enables more
nuanced filtering and contextual understanding during the generation phase. This
ultimately leads to more relevant and well-sourced responses.
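A minimal sketch of this idea, assuming the LangChain Document class used elsewhere in this book: each chunk is tagged with attributes that the retriever can later filter on. The attribute names and values here are our own examples.

#%% minimal metadata-enhancement sketch (attribute names are examples)
from datetime import date
from langchain_core.documents import Document

raw_chunks = [
    "World War I was a global conflict from 1914 to 1918.",
    "The Industrial Revolution began in Great Britain.",
]

documents = [
    Document(
        page_content=chunk,
        metadata={
            "source": "wikipedia",                    # origin of the chunk
            "topic": "human history",                 # thematic tag
            "ingested_at": date.today().isoformat(),  # ingestion timestamp
            "chunk_id": i,                            # position within the source
        },
    )
    for i, chunk in enumerate(raw_chunks)
]
print(documents[0].metadata)

With a Chroma-backed retriever like the one from Section 6.2, such metadata could then be used for filtering, for example via search_kwargs={"k": 3, "filter": {"topic": "human history"}}.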

Chunk Size Optimization


Chunk size optimization in RAG systems requires carefully balancing multiple compet-
ing factors to maximize retrieval effectiveness. While smaller chunks offer more pre-
cise retrieval and reduce token consumption, they risk fragmenting important context
and breaking up coherent ideas.
On the other hand, larger chunks preserve more context but can introduce noise and
irrelevant information into the retrieval results, potentially diluting the quality of
generated responses. As explored in Chapter 5, the optimal chunk size often varies

233
233
6 Retrieval-Augmented Generation 6 Retrieval-Augmented Generation

depending on the nature of the content. Technical documentation might benefit from
smaller, focused chunks, while narrative content might require larger chunks to main-
tain coherence.
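To get a feeling for this trade-off, you can split the same text with different chunk sizes and compare the results, as in the following sketch; the sizes and the 20 percent overlap are arbitrary example values.

#%% minimal chunk-size comparison sketch (sizes are arbitrary examples)
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = (
    "World War I was a global conflict from 1914 to 1918. "
    "It involved the Allied Powers and the Central Powers. "
    "The war resulted in the collapse of four empires. "
) * 10

for chunk_size in [100, 500, 1000]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.2),  # 20 percent overlap
    )
    chunks = splitter.split_text(sample_text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, "
          f"first chunk has {len(chunks[0])} characters")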

Context Enrichment
Context enrichment in RAG systems involves augmenting the base content with addi-
tional information to enhance its utility and retrievability. This process goes beyond
basic metadata tagging by incorporating related information, cross-references, and
derived insights that make the content more valuable for retrieval and generation.
Advanced enrichment might include linking related concepts, adding definitions for
technical terms, or incorporating domain-specific knowledge. For example, when pro-
cessing medical documents, the system might automatically expand abbreviations,
add standardized medical codes, or link to relevant research papers.
Real-world applications might include adding geographic coordinates to location-
based content, temporal relationships for event-based information, or industry-spe-
cific taxonomies.
The enriched context helps the retrieval system make more intelligent matches and
provides the generation model with richer, more nuanced information to work with,
ultimately leading to more comprehensive and accurate responses.
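As a small sketch of the abbreviation idea, the following code expands abbreviations from a hand-maintained dictionary before a chunk is embedded; the dictionary entries are purely illustrative, and real systems often rely on domain-specific terminology resources instead.

#%% minimal context-enrichment sketch: expanding abbreviations (dictionary is illustrative)
import re

ABBREVIATIONS = {
    "WWI": "World War I",
    "WWII": "World War II",
    "UN": "United Nations",
}

def expand_abbreviations(text: str) -> str:
    # append the expansion after the abbreviation so both forms remain searchable
    for abbr, expansion in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b", f"{abbr} ({expansion})", text)
    return text

print(expand_abbreviations("WWI reshaped the world order long before the UN existed."))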

Embedding Model Fine-Tuning


Embedding model fine-tuning in RAG systems adapts general-purpose embedding
models to better capture domain-specific semantic relationships and nuances. The
process typically begins with selecting a base embedding model like BERT (bidirec-
tional encoder representations from transformers) or sentence transformers, then
training it further on domain-specific data to better understand specialized vocabu-
lary, technical concepts, and industry-specific relationships.
The fine-tuning process requires careful curation of training pairs that represent the
types of semantic similarities most important for the specific use case—for instance, in
legal applications, fine-tuning might include matching different phrasings of the same
legal concept, while in technical documentation, fine-tuning might focus on connect-
ing problem descriptions with their solutions.
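A minimal sketch of such a fine-tuning run, assuming the sentence-transformers library and its classic fit() API; the base model name and the two training pairs (a problem description matched with its solution) are made-up examples, and a real run would need many more curated pairs.

#%% minimal embedding fine-tuning sketch (model name and training pairs are examples)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# pairs of texts that should end up close together in the embedding space
train_examples = [
    InputExample(texts=["printer shows error 0x1A",
                        "how to fix printer error code 0x1A"]),
    InputExample(texts=["VPN connection drops every few minutes",
                        "troubleshooting unstable VPN sessions"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# in-batch negatives: other pairs in the batch serve as negative examples
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
model.save("finetuned-domain-embeddings")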

6.3.2 Advanced Retrieval Techniques


In this section, we’ll cover several advanced retrieval techniques. One limitation of
our simple RAG solution is that it is not well suited to finding specific keywords. For
keyword search, a few different algorithms are available. After learning about these
algorithms, we’ll use them in a hybrid RAG system. We’ll also cover reciprocal rank
fusion, which is a ranking technique that merges multiple retrieval results to improve

234 234
6 Retrieval-Augmented
6.3 Advanced Generation
Techniques

relevance without complex weighting schemes. In a practical implementation of a


hybrid search, we’ll combine vector search with keyword-based methods to enhance
document retrieval. You’ll also learn about query expansion, a technique that enhances
user queries. Finally, you’ll learn about context enrichment. With these techniques, you
can refine and expand retrieved passages by incorporating additional document-based
context.

Keyword-Search Algorithms
For keyword searches, multiple algorithms are available, such as TF-IDF (term frequency-
inverse document frequency) and BM25 (Best Match 25). A keyword search is also
known as a sparse vector search, in contrast to the embedding-based search we already
know, which is often called a dense vector search.
The most popular sparse vector searches are TF-IDF and BM25. Both are numerical statis-
tics used in information retrieval. Their purpose is to assess the importance of a word
in a document relative to a collection of documents, as follows:
 Term frequency-inverse document frequency (TF-IDF)
TF-IDF is the product of term frequency and inverse document frequency. Term fre-
quency (TF) represents the number of times a term appears in a document. Inverse
document frequency (IDF) measures how unique or rare a word is across all docu-
ments in the corpus. The more documents a word appears in, the lower its IDF value
becomes.
Words like “the” or “is” appear in almost every document, so their IDF values are low.
Other terms that are rare across several documents (i.e., they appear in just a few
documents) thus have a high IDF value.
Terms with high TF-IDF values indicate that a term frequently appears in a particular
document but is not common in the corpus of documents. These terms are consid-
ered more relevant for identifying key topics of a document.
TF-IDF is a popular search algorithm for ranking documents based on how relevant
they are to a user query. The idea is to give more weight to terms that are significant
in the document but rare in the corpus.
 Best Match 25 (BM25)
A related concept that builds upon TF-IDF is BM25, which introduces a sophisticated
way of weighting terms based on term frequency, document length, and term satu-
ration.
BM25 adds the concept of term frequency saturation. In TF-IDF, a term’s importance increases linearly with its frequency. BM25 instead uses a saturation function. In other words, after a certain point, additional occurrences of a term in a document don’t further increase the relevance. The rationale behind this approach is that, if a term appears multiple times in a document, it has a certain impact, but above a certain threshold each additional occurrence adds less and less.


Another concept BM25 applies is document length normalization. TF-IDF does not balance between long and short documents. To overcome this limitation, BM25 adjusts for a document’s length. The rationale is that longer documents naturally have more occurrences of a term, so BM25 normalizes the score to prevent bias towards longer documents.
BM25 is widely used in modern search engines because it overcomes multiple lim-
itations of TF-IDF. With saturation, it better models real texts, where term frequency
eventually stops increasing the relevance of a term. Also, it compensates for longer
documents with document length normalization.

To summarize, we have two different worlds: sparse vector search and dense vector search. You now know how each of them works. However, we want the best of both worlds, so we need a way to combine the search results from sparse and dense searches.

Coding: Sparse Search with BM25 and TF-IDF


Let’s implement both BM25 and TF-IDF to compare their results. You can find the corre-
sponding script at 06_RAG/25_BM25_TFIDF.py.
Listing 6.10 shows how to load relevant packages and create a function to preprocess text. A new functionality that we’ll use is BM25Okapi from the rank_bm25 package.

#%% packages
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from typing import List
import string
#%% Documents
def preprocess_text(text: str) -> List[str]:
    # convert to lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # split at each space into a list of single words
    return text.split()

Listing 6.10 TF-IDF and BM25: Packages and Preprocessing Function (Source: 06_RAG/25_
BM25_TFIDF.py)

In the preprocessing function, we want to convert all text to lowercase. Then, all punc-
tuation is removed. Finally, all the text is split at each space. As a result, we have a list of
single words for each document in the corpus. Now, we need some text to process and
play with. This corpus is shown in Listing 6.11.


corpus = [
"Artificial intelligence is a field of artificial intelligence. The field
of artificial intelligence involves machine learning. Machine learning is an
artificial intelligence field. Artificial intelligence is rapidly evolving.",
"Artificial intelligence robots are taking over the world. Robots are
machines that can do anything a human can do. Robots are taking over the world.
Robots are taking over the world.",
"The weather in tropical regions is typically warm. Warm weather is common 6
in these regions, and warm weather affects both daily life and natural
ecosystems. The warm and humid climate is a defining feature of these
regions.",
"The climate in various parts of the world differs. Weather patterns change
due to geographic features. Some regions experience rain, while others are
dry."
]

Listing 6.11 TF-IDF and BM25: Corpus (Source: 06_RAG/25_BM25_TFIDF.py)

The first document is about artificial intelligence and shows many occurrences of this term. Actually, the term is repeated excessively, so our expectation is that TF-IDF will over-rank it compared to BM25. The second document deals with a related topic: robots. The third document repeats the term “warm” several times, and the last document is rather generic.
Now, let’s start preprocessing the corpus. One essential point at this step is to check
how these different functionalities expect the input data to be passed. We worked with
TfidfVectorizer earlier in this section. This functionality expects documents to be
passed as a list of strings (list[str]). Thus, for each document, we want to remove the
stopwords but keep the document as a string of multiple words. The BM25Okapi class requires a different format: List[List[str]]. With this format, each document is split into its individual words, which are then passed as a list of strings. Passing the data in the expected format is essential for getting correct results.


Listing 6.12 shows how the similarity calculation is implemented. First, the documents are preprocessed. The resulting object tokenized_corpus provides the data in the format that the BM25 class requires. Then, we’ll initialize bm25 as an instance of the BM25Okapi class. The user_query is defined and tokenized. Remember, the BM25 and TF-IDF classes require the user query in different formats. Finally, we’ll calculate the scores with bm25.get_scores() and print the results to the screen.


# Preprocess the corpus
tokenized_corpus = [preprocess_text(doc) for doc in corpus]

# %% Sparse Search (BM25)
bm25 = BM25Okapi(tokenized_corpus)

#%% Set up user query
user_query = "artificial intelligence involves learning"

# Tokenize the query: BM25 expects a list of words, TF-IDF expects a single string
tokenized_query_BM25 = user_query.lower().split()
tokenized_query_tfidf = ' '.join(tokenized_query_BM25)

bm25_similarities = bm25.get_scores(tokenized_query_BM25)
print(f"Tokenized Query BM25: {tokenized_query_BM25}")
print(f"Tokenized Query TFIDF: {tokenized_query_tfidf}")
print(f"BM25 Similarities: {bm25_similarities}")

Tokenized Query BM25: ['artificial', 'intelligence', 'involves', 'learning']


Tokenized Query TFIDF: artificial intelligence involves learning
BM25 Similarities: [2.11043413 0. 0. 0. ]

Listing 6.12 TF-IDF and BM25: Similarity Calculation (Source: 06_RAG/25_BM25_TFIDF.py)

The first few lines just showcase the different input formats of BM25 and TF-IDF que-
ries. In the last output line, you’ll see the similarities between the user query and the
documents. The highest and only similarity is found for the first document.
Next, let’s implement TF-IDF, as shown in Listing 6.13, and compare the results. Create
an instance of TfidfVectorizer() and vectorize the documents with tfidf.fit_transform(). The user query is transformed with the same vectorizer, and finally, its similarity to the TF-IDF matrix can be calculated.

#%% calculate tfidf


tfidf = TfidfVectorizer()
tokenized_corpus_tfidf = [' '.join(words) for words in tokenized_corpus]
tfidf_matrix = tfidf.fit_transform(tokenized_corpus_tfidf)

query_tfidf_vec = tfidf.transform([tokenized_query_tfidf])
tfidf_similarities = cosine_similarity(query_tfidf_vec, tfidf_matrix).flatten()
print(f"TFIDF Similarities: {tfidf_similarities}")

TFIDF Similarities: [0.6630064 0.08546995 0. 0. ]

Listing 6.13 TF-IDF and BM25: Similarity Calculation (Source: 06_RAG/25_BM25_TFIDF.py)


The first two documents show some similarity, and the algorithm correctly returns the first document as the most similar. To merge ranked lists like these from different searches, reciprocal rank fusion comes to the rescue. But before turning to that topic, let’s pause for a second and look at the full picture: our hybrid RAG pipeline.

Hybrid RAG Pipeline


Now, you can build your own hybrid RAG systems. Figure 6.5 shows the mode of operation for a hybrid RAG pipeline.

[Figure: the user query is passed to both a dense search and a sparse search; their results are merged via rank fusion into a list of relevant documents, which is combined with instructions and sent to the large language model to produce the RAG output.]

Figure 6.5 Hybrid RAG: Pipeline

Compared to the diagram shown earlier in Figure 6.1, which shows a general, simple
RAG, notice that all changes so far concern the retrieval step of the pipeline.
The user query is passed to a dense search, typically, a vector database. But at the same
time, the query is also passed to a sparse search, like TF-IDF or BM25. As a result, we get
two ordered lists of documents: one coming from the similarity search in the vector
database, the other coming from the keyword search (sparse search) algorithm. However, we can only work with a single ordered list of documents. Thus, we need to aggregate these lists; at this point, rank fusion comes into play.
The remaining parts of this extended RAG pipeline are identical to our simple RAG system. The ordered list of the most relevant documents is combined with a set of instructions to augment the prompt, and the augmented prompt is passed to the LLM, which processes the information and comes up with an answer.
One question is left to answer: How are the outputs from dense and sparse search com-
bined? The algorithm that is used here is called reciprocal rank fusion.


Reciprocal Rank Fusion


Reciprocal rank fusion is an algorithm used to combine multiple ranked lists into a sin-
gle, unified ranking.
Other algorithms for rank fusion are available like CombSUM, CombMNZ, or Borda
Count. But reciprocal rank fusion stands out for its simplicity, its robustness, and its
ability to handle incomplete rankings. Let’s find out how it works. For each item in the
different ranked lists, reciprocal rank fusion calculates a score based on the following
formula:

RRF(d) = Σ (over all ranked lists) 1 / (k + r(d))
For each document d (out of a corpus of documents D), a score is calculated based on its
ranks in the different lists. k is a constant that is often set to 60. Thus, the only variable
part of the formula is r(d), which is the rank of the document in a given list. Then, the
sum is calculated over all lists.
Let’s look at a real example. Figure 6.6 shows how reciprocal rank fusion works for four
documents and displays their ranks as resulting from a dense search and from a sparse
search.

Rank   Dense Search Result   Sparse Search Result   Reciprocal Rank
1      Document D            Document A             1/1
2      Document A            Document C             1/2
3      Document C            Document B             1/3
4      Document B            Document D             1/4

a) Reciprocal Rank Calculation

Document A: 1/2 + 1/1 = 1.5
Document D: 1/1 + 1/4 = 1.25
Document C: 1/3 + 1/2 = 0.83
Document B: 1/4 + 1/3 = 0.58

b) Reciprocal Rank Fusion

Figure 6.6 Reciprocal Rank Fusion: Rank Calculation (Left) and Fusion (Right)

The ranks from the two lists and the four documents are shown on the left. The ranks range from 1 to 4, and the corresponding reciprocal ranks are simply the result of 1 ÷ rank. In this example, Document D has rank 1 in the dense search result and rank 4 in the sparse search result.
The way to aggregate both results is simple: the reciprocal ranks are summed. Keeping our focus on Document D: It gets a reciprocal rank of 1/1 from the dense search and 1/4 from the sparse search. Both are summed to 1.25. The same approach is applied to all other documents.


Subsequently, the documents are sorted in decreasing order based on their reciprocal
rank sums. As a result, Document A is the most relevant document, followed by Docu-
ment D, and so on.
The reciprocal rank algorithm has the beneficial mathematical property of diminishing returns: the score decreases non-linearly as a document’s rank increases. Intuitively, this relationship makes sense since a difference at low ranks (like 1 and 2, or 2 and 3) is much more important than the difference between rank 1000 and 1001.
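If you want to verify the numbers from Figure 6.6 yourself, a few lines of Python are enough. Note that the figure uses plain reciprocal ranks (1 ÷ rank) without the k = 60 constant, so this little check does the same:

# Ranks of the four documents in the dense and sparse result lists (Figure 6.6)
dense_ranks = {"A": 2, "B": 4, "C": 3, "D": 1}
sparse_ranks = {"A": 1, "B": 3, "C": 2, "D": 4}

# Sum of reciprocal ranks per document, sorted in decreasing order
scores = {doc: 1 / dense_ranks[doc] + 1 / sparse_ranks[doc] for doc in dense_ranks}
for doc, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"Document {doc}: {score:.2f}")
# Document A: 1.50, Document D: 1.25, Document C: 0.83, Document B: 0.58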

Coding: Hybrid Search


In this code exercise, you’ll set up a hybrid search that applies both a dense search and
a sparse search. The two resulting lists are then merged based on rank fusion. Along the
way, you’ll discover why hybrid searches might outperform any of the individual
search engines.
Let’s start with loading all the required functionality. Listing 6.14 shows the required
packages. Notable additions include TfidfVectorizer to perform TF-IDF and ENGLISH_
STOP_WORDS to remove unnecessary words that might distract TF-IDF.

from langchain_openai import OpenAIEmbeddings


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(usecwd=True))

Listing 6.14 Hybrid Search: Required Packages (Source: 06_RAG/20_hybrid_search.py)

We need some sentences to play with, so we’ll set up a list of docs, as shown in Listing
6.15.

#%% Documents
docs = [
"The weather tomorrow will be sunny with a slight chance of rain.",
"Dogs are known to be loyal and friendly companions to humans.",
"The climate in tropical regions is warm and humid, often with frequent
rain.",
"Python is a powerful programming language used for machine learning.",
"The temperature in deserts can vary widely between day and night.",
"Cats are independent animals, often more solitary than dogs.",
"Artificial intelligence and machine learning are rapidly evolving
fields.",
"Hiking in the mountains is an exhilarating experience, but it can be
unpredictable due to weather changes.",
"Winter sports like skiing and snowboarding require specific types of
weather conditions.",


"Programming languages like Python and JavaScript are popular choices for
web development."
]

Listing 6.15 Hybrid Search: Sample Documents (Source: 06_RAG/20_hybrid_search.py)

We start with the sparse similarity analysis based on TF-IDF. For this algorithm, be
aware that it might be distracted by stopwords. Stopwords are common words that are
typically filtered out during text preprocessing in natural language processing (NLP)
tasks. These words are usually the most common words in a language that don't con-
tribute significantly to the meaning or sentiment of a text.
In English, common stopwords include articles (e.g., a, an, the); prepositions (in, on, at,
with, by); pronouns (I, you); conjunctions (and, but, or); and verbs (is, am, are, was,
were). Depending on the algorithm, removing these stopwords might be an important
step to reduce noise, improve efficiency, and focus on meaning. In practice, stopword
removal looks like this:
 Original text: “The car is sitting on the mat”
 After stopword removal: “car sitting mat”

The object docs_without_stopwords can be created with a list comprehension. In this process, each document string is separated into words by splitting at each space with doc.split(). Each word is then checked in lowercase against ENGLISH_STOP_WORDS and only retained if it is not a stopword. Finally, the resulting list of words is converted back to a string with join(), as follows:

docs_without_stopwords = [
    ' '.join([word for word in doc.split() if word.lower() not in ENGLISH_STOP_WORDS])
    for doc in docs
]

We’ll set up a sparse search and a dense search, starting with a sparse search based on
TF-IDF. We’ll instantiate a TfidfVectorizer before we can create a tfidf_matrix, as fol-
lows:

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs_without_stopwords)

Before we can calculate similarities based on cosine_similarity, we need a user_query embedded with vectorizer.transform, as follows:

user_query = "Which weather is good for outdoor activities?"

query_sparse_vec = vectorizer.transform([user_query])


sparse_similarities = cosine_similarity(query_sparse_vec, tfidf_matrix).flatten()

The object sparse_similarities is of the type array, and for 10 sample sentences, it
has 10 similarity values corresponding to our user query. In the next step, we want to
get the indices of the most similar documents. For this step, we’ll create a function
getFilteredDocsIndices. This function consumes the similarities and additionally a threshold parameter that excludes all documents whose similarity falls below that value.
Multiple ways exist to get this done. As shown in Listing 6.16, we can sort the similari-
ties, filter them if they are below the threshold value, and finally return only the filtered
list of document indices.

def getFilteredDocsIndices(similarities, threshold = 0.0):
    filt_docs_indices = sorted(
        [(i, sim) for i, sim in enumerate(similarities) if sim > threshold],
        key=lambda x: x[1],
        reverse=True
    )
    return [i for i, sim in filt_docs_indices]

Listing 6.16 Hybrid Search: Function for Getting Filtered Indices (Source: 06_RAG/20_hybrid_
search.py)

Let’s try the function out with our sparse_similarities, as shown in Listing 6.17.

#%% filter documents below threshold and get indices
filtered_docs_indices_sparse = getFilteredDocsIndices(similarities=sparse_similarities, threshold=0.2)
filtered_docs_indices_sparse

[0, 7, 8]

Listing 6.17 Hybrid Search: Filter Sparse Documents (Source: 06_RAG/20_hybrid_search.py)

In this case, the most similar document is the document with index 0, followed by
index 7, and then index 8.
If you check the user query and the sample sentences, you’ll find that this does not
work very well. We asked, “Which weather is good for outdoor activities?” and TF-IDF
search returned that the most similar document is “The weather tomorrow will be
sunny with a slight chance of rain.” That response is just not correct. The search did find
the appropriate documents “Hiking in the mountains is an exhilarating experience,
but it can be unpredictable due to weather changes“ and “Winter sports like skiing and
snowboarding require specific types of weather conditions,” but the rank order was
wrong. The algorithm puts too much weight on the word “weather” and thus incorrectly ranks the weather forecast document first.


Let’s move on and evaluate the performance of the dense search. For this step, we must
embed our sample documents. We’ll use OpenAI for that purpose, as follows:

embeddings = OpenAIEmbeddings()
embedded_docs = [embeddings.embed_query(doc) for doc in docs]

Then, we must embed the user_query:

query_dense_vec = embeddings.embed_query(user_query)

Finally, based on cosine similarity, we get a list of similarity scores for each document,
as shown in Listing 6.18.

dense_similarities = cosine_similarity([query_dense_vec], embedded_docs)


dense_similarities

array([[0.81677377, 0.74636589, 0.78294997, 0.71723575, 0.783497 ,


0.71934968, 0.71673343, 0.84183792, 0.84227999, 0.73753417]])

Listing 6.18 Hybrid Search: Get Dense Similarities (Source: 06_RAG/20_hybrid_search.py)

So far, so good. But what we really want is a list of the most similar documents. Earlier,
we created a function for filtering and returning indices of documents, and we’ll reuse
this function now. To get a small result list of documents, we’ll pass a threshold param-
eter of 0.8, as shown in Listing 6.19.

filtered_docs_indices_dense = getFilteredDocsIndices(similarities=dense_similarities[0], threshold=0.8)
filtered_docs_indices_dense

[8, 7, 0]

Listing 6.19 Hybrid Search: Filter Dense Documents (Source: 06_RAG/20_hybrid_search.py)

This list looks better. It has found the documents on outdoor activities (index 8 and index 7) and ranks them higher than the document on the weather forecast (index 0).
Now, we have two similarity lists and need to aggregate them into one final list. At this
point, reciprocal rank fusion comes into play, as shown in Listing 6.20.

def reciprocal_rank_fusion(filtered_docs_indices_sparse, filtered_docs_indices_dense, alpha=0.2):
    # Create a dictionary to store the ranks
    rank_dict = {}

    # Assign ranks for sparse indices
    for rank, doc_index in enumerate(filtered_docs_indices_sparse, start=1):
        if doc_index not in rank_dict:
            rank_dict[doc_index] = 0
        rank_dict[doc_index] += (1 / (rank + 60)) * alpha

    # Assign ranks for dense indices
    for rank, doc_index in enumerate(filtered_docs_indices_dense, start=1):
        if doc_index not in rank_dict:
            rank_dict[doc_index] = 0
        rank_dict[doc_index] += (1 / (rank + 60)) * (1 - alpha)

    # Sort the documents by their reciprocal rank fusion score
    sorted_docs = sorted(rank_dict.items(), key=lambda item: item[1], reverse=True)

    # Return the sorted document indices
    return [doc_index for doc_index, _ in sorted_docs]

Listing 6.20 Hybrid Search: Function For Reciprocal Rank Fusion (Source: 06_RAG/20_hybrid_
search.py)

Listing 6.20 shows the implementation: The function consumes the two lists of
indices—one from the sparse search and the other from the dense search. Additionally,
we can add a weighting parameter alpha, which assigns different weights to the differ-
ent searches.
The rank_dict is initialized as an empty dictionary. There are two loops, one for each list of indices, and each loop iterates over all documents in its list. For each document index, a score is added based on the formula (1 / (rank + 60)) * alpha (for the sparse search) and (1 / (rank + 60)) * (1 - alpha) (for the dense search).
As a result, we get a list of aggregated document indices, in decreasing order. The first
list element represents the most similar document based on reciprocal rank fusion.
At this point, we can apply this function to our data. With an alpha value of 0.2, we’re
giving the sparse search a weighting of 20% in the final result, and the dense search gets
a weighting of 80%, as follows:

reciprocal_rank_fusion(filtered_docs_indices_sparse, filtered_docs_indices_
dense, alpha=0.2)

[8, 7, 0]

The final result is identical to the dense search. But play with the weighting factors, and
you’ll see that the order changes at some point.
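A quick way to experiment is a small loop over a few alpha values, reusing the objects defined above; the exact point at which the order flips depends on your ranked lists:

for alpha in (0.2, 0.5, 0.8):
    fused = reciprocal_rank_fusion(filtered_docs_indices_sparse,
                                   filtered_docs_indices_dense, alpha=alpha)
    print(f"alpha={alpha}: {fused}")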
This hybrid search could be integrated into a hybrid RAG system. The corresponding
documents to these indices would then be passed as context information to the LLM together with the user query and additional instructions. As you can imagine, however, typically you don’t need to implement features like reciprocal rank fusion on your own; these capabilities are already integrated into many vector databases.
In this section, we implemented a hybrid search based on a sparse search with TF-IDF.
Another approach for improving the retrieval is query expansion, which we’ll turn to
next.

Coding: Query Expansion


You might know the saying, “Most computer problems sit between the keyboard and
the chair.” While it might seem lazy to say that our RAG system is fine and the user is at
fault, a kernel of truth lies within: at the end of the day, a RAG response can only be as good as the user query. How much effort do you put into setting up your queries?
One possible answer (which, admittedly, is our answer) is that querying is an iterative
approach, which is refined over a series of steps. If a user query is imprecise, we should
not expect the RAG response to be perfect. But still some approaches can help the user
by improving the query. One possible solution is query expansion. This technique
enhances the ability to retrieve relevant documents. Let’s look at several expansion
approaches next:
 Synonym expansion
– Expanding the query with synonyms ensures retrieval of documents that use dif-
ferent terminology but discuss the same concept.
– Original query → Expanded query
– Example: “Climate change” → [“global warming,” “climate change,” “environ-
mental impact”]
 Related terms expansion
– Including related terms that are often discussed together with the original query
topic improves document retrieval that covers broader or more specific aspects
of the topic.
– Example: “Machine learning” → [“machine learning,” “deep learning,” “artificial
intelligence,” “neural networks”]
 Conceptual expansion
– Adding subcategories or specific types of the original concept allows for retriev-
ing a wider range of documents related to different forms of renewable energy.
– Example: “Renewable energy” → [“renewable energy,” “solar power,” “wind
energy,” “green technology,” “sustainable energy”]
 Phrase variations expansion
– Including variations of how a concept might be phrased ensures retrieval from
documents that use different wording to discuss the same idea.


– Example: “Health benefits of exercise” → [“health benefits of physical activity,”


“exercise health benefits,” “positive effects of exercise”]
 Contextual expansion
– Including related fields or tasks in the expansion can help retrieve more diverse
documents that approach the original concept from different angles.
– Example: “Natural language processing” → [“natural language processing,” “text
analysis,” “computational linguistics,” “language modeling”]
 Temporal expansion
– Expanding for common abbreviations and alternative names ensures that rele-
vant documents using different temporal or popular references are also
retrieved.
– Example: “COVID-19 vaccines” → [“COVID-19 vaccines”, “COVID vaccines”, “coro-
navirus immunization”, “SARS-CoV-2 vaccine”]
 Entity-based expansion
– When a query involves an entity, expanding the query with related people, prod-
ucts, or attributes associated with the entity can enhance retrieval quality.
– Example: “Tesla” → [“Tesla,” “Elon Musk,” “electrical vehicles,” “autonomous
driving”]

By incorporating these types of expansions, you can maximize the information retrieved for a query, resulting in more effective responses in a RAG system.
Implementing query expansion is a rather easy task, as in the following:

from langchain_groq import ChatGroq


from langchain_core.prompts import ChatPromptTemplate
from dotenv import load_dotenv
load_dotenv('.env')

Listing 6.21 shows the function definition. We’ve set up a function query_expansion for improving the query. The function receives the query and returns a list with the requested number of expanded queries.

def query_expansion(query: str, number: int = 5, model_name: str = "llama3-70b-8192") -> list[str]:
    messages = [
        ("system", """You are part of an information retrieval system. You are
given a user query and you need to expand the query to improve the search results.
Return ONLY a list of expanded queries.
Be concise and focus on synonyms and related concepts.
Format your response as a Python list of strings.
The response must:
1. Start immediately with [
2. Contain quoted strings
3. End with ]
Example correct format:
["alternative query 1", "alternative query 2", "alternative query 3"]
"""),
        ("user", "Please expand the query: '{query}' and return a list of {number} expanded queries.")
    ]
    prompt = ChatPromptTemplate.from_messages(messages)
    chain = prompt | ChatGroq(model_name=model_name)
    res = chain.invoke({"query": query, "number": number})
    return eval(res.content)

Listing 6.21 Query Expansion: Function (Source: 06_RAG/90_query_expansion.py)

The major work is in setting up the messages, especially convincing the model to avoid
babbling and instead directly returning a string that starts with the list. You might
choose to implement Pydantic to get a structured output as a list of strings. We leave
this task up to you as a quick exercise.
The final result should look like “[…]” instead of something like “Here is the expanded
list of queries:\n\n[…].” It is not enough to simply instruct the model to return a Python list; you should provide detailed instructions on what the response should look like. We were successful with the instruction set shown in Listing 6.22.

The response must:


1. Start immediately with [
2. Contain quoted strings
3. End with ]
Example correct format:
["alternative query 1", "alternative query 2", "alternative query 3"]

Listing 6.22 Query Expansion: Output Format (Source: 06_RAG/90_query_expansion.py)

The user message holds the information to process the query and return the number of
expanded queries.
In the return statement, the model response is evaluated to a Python object, so that we
successfully get a list of strings holding the expanded queries. We can test this func-
tionality out, as follows:

res = query_expansion(query="artificial intelligence", number=3)


res

['machine learning', 'natural language processing', 'computer vision']
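How could you use these expanded queries in retrieval? One option, sketched below, is to run your retriever once per query and merge the ranked lists with reciprocal rank fusion. The retrieve function is a hypothetical placeholder for whatever search you use (for example, a vector database query that returns ranked document indices):

def expanded_retrieval(user_query: str, retrieve, number: int = 3, k: int = 60) -> list[int]:
    # retrieve(query) is assumed to return a ranked list of document indices
    queries = [user_query] + query_expansion(query=user_query, number=number)
    scores = {}
    for query in queries:
        for rank, doc_index in enumerate(retrieve(query), start=1):
            scores[doc_index] = scores.get(doc_index, 0) + 1 / (rank + k)
    # Best fused score first
    return [doc for doc, _ in sorted(scores.items(), key=lambda item: item[1], reverse=True)]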


Context Enrichment
If you work with a corpus of many different documents, distinguishing between these documents can be difficult. For example, if you work with financial documents, an essential task might be to distinguish between quarters and companies. Normal chunking techniques have difficulties with this kind of distinction.
Figure 6.7 shows the concept behind contextual retrieval, which can overcome this issue. Each chunk is enhanced with its own context. This context provides useful information on the location of the chunk in a specific document. For example, the enhanced chunk might refer to the previous and next quarter of the company under study.

[Figure: a corpus of documents is split into chunks; the complete corpus is passed to the LLM with prompt caching; the LLM produces a context for each chunk, and each context + chunk pair is stored in the vector database.]

Figure 6.7 Contextual Retrieval (Source: Adapted from https://www.anthropic.com/news/contextual-retrieval)

How does contextual retrieval work? Each document from a larger corpus is chunked.
The complete corpus, or the relevant documents from the corpus, are passed to the
LLM, together with a chunk for extracting contextual information for that particular
chunk.
As you can imagine, sending a complete, relevant document together with each chunk
to an LLM can become extremely costly. Now, the concept of prompt caching comes
into play. Since prompt caching is a RAG alternative, we’ll cover it in more detail in Sec-
tion 6.4. But to anticipate this topic, just know that prompt caching is a cheap solution
to the problem. With this technique, the complete relevant document is cached for a
limited time and does not need to be sent each time to the LLM.
In the end, all chunks are processed, and each chunk is enhanced with contextual information. This new chunk (context plus original chunk) can now be stored in a vector database. Studies have found that this approach can significantly outperform both simple RAG systems and hybrid RAG systems.
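In Python-like pseudocode, the enrichment step could look as follows. The prompt wording is an assumption, and generate stands in for whichever LLM call you use, ideally one that caches the full document as described in Section 6.4:

def contextualize_chunk(document: str, chunk: str, generate) -> str:
    # Ask the LLM for a short context snippet that situates the chunk within the document
    prompt = (
        "Here is the full document:\n"
        f"{document}\n\n"
        "Here is one chunk from it:\n"
        f"{chunk}\n\n"
        "Write one or two sentences that situate this chunk within the document."
    )
    context = generate(prompt)
    # The combined text (context plus original chunk) is what gets embedded and stored
    return f"{context}\n{chunk}"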


6.3.3 Advanced Postretrieval Techniques


After the documents are retrieved, you can still improve upon your RAG system. One
interesting approach is prompt compression. Imagine that the prompt enhanced with
all contexts from the retrieval process is getting long. As a result, the LLM must deal
with increased noise, which reduces the LLM’s performance. Also, since you need to
pay for tokens, the cost is high. What can you do to save costs? You can compress the
length of the context while maintaining its essential information, which is the idea
behind prompt compression, as shown in Figure 6.8.

[Figure: a user query with a very long context (increased noise, reduced LLM performance) is first sent to a small, cheap LLM that removes irrelevant tokens; the compressed query is then sent to the main LLM.]

Figure 6.8 Prompt Compression

The augmented user query, which possibly has become quite long, is sent to an LLM to
be shortened in a way that maintains its meaning. But if we send the augmented query
to an LLM to shorten it, we then need to send it to the “final” LLM to get the RAG
response. We even have two LLM calls—so what is the point?
This approach only makes sense if you send the long, augmented prompt to a small,
fast, and cheap LLM for shortening. For the RAG response, you can then send the
prompt to the LLM of your choice.
If you have many LLM calls and large contexts, this approach can significantly lower
your costs, increase the performance, and enhance efficiency.
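A minimal sketch of this idea with LangChain might look as follows. The model name and the instruction wording are assumptions, and specialized libraries (for example, LLMLingua) offer more principled token-level compression:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Small, cheap model used only for compression; the final answer still comes from your main LLM
compressor_llm = ChatOpenAI(model="gpt-4o-mini")

compress_prompt = ChatPromptTemplate.from_messages([
    ("system", "Shorten the following context as much as possible while keeping every "
               "fact that is needed to answer the user query. Return only the shortened context."),
    ("user", "Query: {query}\n\nContext:\n{context}")
])

def compress_context(query: str, context: str) -> str:
    chain = compress_prompt | compressor_llm
    return chain.invoke({"query": query, "context": context}).content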

6.4 Coding: Prompt Caching


Prompt caching is a new approach developed by Anthropic and is provided currently
only as a beta feature. Recall that passing a complete document to the LLM with every
request is inefficient. While still true, Anthropic has developed an approach that
addresses this problem. Figure 6.9 shows how the concept of prompt caching works.


[Figure: the first user query is sent together with the complete document corpus, which is written to the LLM cache; subsequent user queries reuse the cached corpus, and the LLM generates the responses.]

Figure 6.9 Prompt Caching

With prompt caching, the user sends a request to the LLM. In this request, the user
passes the query together with the complete document. Wait, what? Didn’t we just say
that sending the complete document is extremely inefficient?
Yes, but the user also tells the model to cache the document. The LLM provider caches
the complete document when passed the first time. The document is not cached for-
ever, only for a limited time. At the time of writing, a document is cached for 5 minutes.
Every subsequent user query within that time interval sent to the LLM together with
the complete document does not require a new caching. Instead, the document is
taken from the cache to create the answer.
Some specific use cases for this technique include the following:
 Repetitive tasks
 Running many examples
 Long conversations
 Using RAG as contextual retriever (as discussed earlier in Section 6.3.2)

Of these use cases, we want to focus on the last one: prompt caching is an enabler of contextual retrieval (introduced in Section 6.3.2), which can significantly improve your RAG results. For now, we’ll focus on how to implement prompt caching itself. You can read more about contextual retrieval at https://www.anthropic.com/news/contextual-retrieval.
We’ll develop a continuous chat with an LLM. The conversation is based on some fixed
context. In our example, we’ll use the complete book The Hound of the Baskervilles. This context information is cached for a limited time, so that we create the cached tokens
once—at the beginning of the conversation. With a normal RAG, we would send this
context with every request as input tokens. But since the document is now cached, we can achieve some inference gains (faster model responses) as well as dramatically reduced costs.
The creation of the cache is slightly more expensive than sending normal input tokens
(by about 25%). The idea is to outweigh this added effort with massively reduced costs
since reading cached tokens is 90% cheaper than reading input tokens.
Listing 6.23 shows the packages we’ll use in this script. As the LLM, we’ll use Anthropic’s
Claude. For a nicely rendered output, we’ll also load rich.

import anthropic
import os
from rich.console import Console
from rich.markdown import Markdown
from langchain_community.document_loaders import TextLoader
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(usecwd=True))

Listing 6.23 Prompt Caching: Required Packages (Source: 06_RAG/40_prompt_caching.py)

At the time of writing, prompt caching is a beta feature not well integrated into Lang-
Chain. This integration is coming, but for now, you’ll need to interact directly with the
Anthropic client, as follows:

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

The complete prompt caching functionality is bundled into a class, as shown in Listing
6.24. We’ll develop it step by step.

class PromptCachingChat:
    def __init__(self, initial_context: str):
        self.messages = []
        self.context = None
        self.initial_context = initial_context

Listing 6.24 Prompt Caching: Class Initialization (Source: 06_RAG/40_prompt_caching.py)

First, we create the class. During the instantiation of the class object, we pass some
initial_context as a string. The other properties (messages and context) are initialized
as an empty list and as None, respectively; both will be populated in the process.
We’ll set up several methods in this exercise. As shown in Listing 6.25, the method run_model will invoke the LLM.


    def run_model(self):
        self.context = client.beta.prompt_caching.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1024,
            system=[
                {
                    "type": "text",
                    "text": "You’re a patent expert. You’re given a patent and will be asked to answer questions about it.\n",
                },
                {
                    "type": "text",
                    "text": f"Initial Context: {self.initial_context}",
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            messages=self.messages,
        )
        # add the model response to the messages
        self.messages.append({"role": "assistant",
                              "content": self.context.content[0].text})
        return self.context

Listing 6.25 Prompt Caching: Run Model Method (Source: 06_RAG/40_prompt_caching.py)

The LLM is called with a system message. But the system prompt does not only hold the
information on how the model should behave and answer. Additionally, the initial con-
text is passed, alongside a cache_control parameter that will enable the caching. This
parameter requires the type ephemeral.
The messages need to be passed as well. The model’s response is stored in the object
self.context. As soon as the model output is available, the messages are appended with
it.
Good! But we want the user to interact with it, so we need to fetch the user_query and
call our method run_model. As shown in Listing 6.26, our next method user_turn con-
sumes the user_query as an input.

    def user_turn(self, user_query: str):
        self.messages.append({"role": "user", "content": user_query})
        self.context = self.run_model()
        return self.context

Listing 6.26 Prompt Caching: User Method (Source: 06_RAG/40_prompt_caching.py)


The messages object is appended with the user message, and in the next step, the run_model method is called. We’ll keep the model output in the context.
Basically, we’re done. We have a system that reacts to user queries and calls the model.
But for nicer outputs, we’ll finally define a method called show_model_response, as
shown in Listing 6.27.

    def show_model_response(self):
        console = Console()
        console.print(Markdown(self.messages[-1]["content"]))
        console.print(f"Usage: {self.context.usage}")

Listing 6.27 Prompt Caching: Showing Model Output (Source: 06_RAG/40_prompt_caching.py)

In this function, we’ll use the rich package and print the last model output in Mark-
down. Additionally, we will display the token usage to check if the prompt caching actu-
ally worked.
Let’s test this function out. In the first step, we need to load our book. The code shown
in Listing 6.28 provides access to the book file and loads it with TextLoader.

file_path = os.path.abspath(__file__)
current_dir = os.path.dirname(file_path)
parent_dir = os.path.dirname(current_dir)

file_path = os.path.join(parent_dir, "05_VectorDatabases", "data", "HoundOfBaskerville.txt")
file_path

#%% (3) Load a single document
text_loader = TextLoader(file_path=file_path, encoding="utf-8")
doc = text_loader.load()
initialContext = doc[0].page_content

Listing 6.28 Prompt Caching: Test (Source: 06_RAG/40_prompt_caching.py)

Now, we can create an instance of our class, the object promptCachingChat. During ini-
tialization, we pass the initialContext. Then, we start the conversation with user_turn
and send our first query, as follows:

promptCachingChat = PromptCachingChat(initial_context=initialContext)
promptCachingChat.user_turn("what is special about the hound of baskerville?")
promptCachingChat.show_model_response()

The result is shown in Figure 6.10.


Figure 6.10 Prompt Caching: First Chat Round with Cache Creation

The model provides a detailed output, but what is more interesting is the fact that
93,104 cache tokens were created. In the subsequent chat, we’ll see if the caching
worked and the cached tokens are used. Let’s now ask if the hound is the murderer, as
follows:

promptCachingChat.user_turn("Is the hound the murderer?")


promptCachingChat.show_model_response()
print(promptCachingChat.context.usage)

Again, the result is shown in Figure 6.11.

Figure 6.11 Prompt Caching with Reading Cached Tokens


No further cached tokens were created, and all cached tokens were read. Amazing! The
model provides the result much more quickly. Was it cheaper than without caching?
We’ll use the assumptions that normal input tokens cost 1 unit, that cache creation costs 1.25 units, and that reading cached input tokens costs 0.1 units.
In the first round of the conversation, prompt caching was more expensive than with-
out caching, but the tide starts turning in the second round because reading from the
cache is pretty cheap compared to using normal tokens.
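A quick back-of-the-envelope calculation with these assumed unit costs and the 93,104 cached tokens from our example (ignoring the comparatively few query and output tokens) makes the break-even point visible:

cached_tokens = 93_104  # cache created in the first chat round (see Figure 6.10)

for rounds in (1, 2, 5):
    without_caching = rounds * cached_tokens * 1.0                            # full input price every round
    with_caching = cached_tokens * 1.25 + (rounds - 1) * cached_tokens * 0.1  # one cache write, then cheap reads
    print(rounds, int(without_caching), int(with_caching))
# From the second round on, the cached variant is already cheaper, and the gap keeps growing.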
Thus concludes our discussion of prompt caching. Next, you’ll learn how to measure
the performance of your RAG systems.

6.5 Evaluation
RAG systems work surprisingly well, but what does “well” actually mean? And how can
we compare the effectiveness, accuracy, and efficiency of different RAG approaches?
For this reason, RAG evaluation techniques try to measure the following parameters:
 Effectiveness
Does our RAG system retrieve and use the relevant knowledge?
 Accuracy
How good are the generated responses? Are they grounded in facts and coherent?
 Efficiency
How well does our system perform in different scenarios?

To make all this measurable and comparable, different metrics can be applied. So many metrics exist that we can present only a small selection here.
First, we’ll look at some of the challenges that RAG systems face. Based on these find-
ings, we’ll discuss some metrics and how they can make these challenges measurable.
After several metrics have been presented, we’ll implement them in Python. Let’s start
with a closer look at some of the challenges RAG systems face.

6.5.1 Challenges in RAG Evaluation


How can we measure the quality of an answer? Some hard facts can be checked, for
instance, how good the data retrieved from the knowledge source is. We’ll focus on two
challenges:
 Retrieval quality
If the retriever fetches incorrect or irrelevant documents, the RAG response is likely incorrect. A cascading effect occurs in that RAG quality is inherently tied to the qual-
ity of the retrieved data. Luckily, several metrics are available for determining
retrieval quality.


 Subjectivity
If you ask multiple human evaluators, they will provide different ratings on a RAG
response. Everybody has personal preferences regarding brevity and style. For
example, a technical expert might prioritize precision and factual accuracy, whereas
a layperson might prefer readability and simplicity.
Always keep in mind the target user of your system and ensure your evaluators are
trained to get consistent results.

6.5.2 Metrics
Earlier when we discussed RAG improvements, we broke down the system into its different parts: retrieval and generation. We can follow the same approach when it comes to metrics because metrics exist for the different stages. Some are considered
joint metrics. Figure 6.12 shows some RAG evaluation metrics and how they can be clas-
sified into retriever, generator, and joint metrics.

Evaluation
Metrics

Joint
Retriever Generator
(Retriever+Generator)

Context Factual Answer


Precision@k Recall@k F1 Score
Relevance Correctness Relevance

Figure 6.12 RAG Evaluation Metrics

Among retriever metrics, classic classification metrics include precision@k and recall@k. Another metric that we’ll study is context relevance.
Of the generator-based metrics, we want to highlight factual correctness. Finally, joint
metrics try to measure both the retriever and the generator quality in a combined met-
ric. Some examples include answer relevance and the F1 score. Let’s look at some of
these metrics in more detail next.

Retriever Metric: Context Precision


Context precision is a retriever metric for evaluating the ground-truth items and deter-
mining whether they are part of the context information. In perfect cases, all relevant
chunks returned by the retriever can be found at the top ranks.
Context precision uses the user query, a ground truth, and the context information for
its calculation. The resulting score is between 0 and 1, with higher values representing
higher precision:

Context Precision@K = (Σ_{k=1}^{K} Precision@k × v_k) / (number of relevant items in the top K results)

K corresponds to the number of chunks in the context, and v_k corresponds to the relevance indicator at rank k (1 if the chunk at rank k is relevant, 0 otherwise).

Generator Metric: Factual Correctness


Factual correctness measures whether the information in a RAG response aligns with
established facts. Different aspects must be considered in this task, such as the follow-
ing:
 Fact granularity
A response might be partially correct, meaning correct in one aspect, but incorrect in
another. A correct statement is “Shakespeare wrote many plays,” but it would be
incorrect to state “Shakespeare wrote the play Waiting for Godot.”
 Consistency over time
The measure might have a temporal aspect, in that a fact might be correct for a given
point in time, but incorrect for another one. For example, answering the question
“Who is the president of country X?” depends on when the question is asked since
the fact could change after every election.
 Attribution
RAG responses should attribute facts to the correct source. For example, it would be
wrong to attribute the theory of special relativity to Newton instead of Einstein.

Joint Metric: Answer Faithfulness


Answer faithfulness measures the factual consistency of the RAG-generated answer
against the context information. This metric uses the RAG answer and the retrieved
context information for its calculation:

Faithfulness = (number of claims in the answer that are supported by the retrieved context) / (total number of claims in the answer)
The score ranges from 0 to 1 with lower values corresponding to lower faithfulness, and
vice versa.

Joint Metric: Answer Relevance


Answer relevance evaluates the alignment of the generated answer with the user
query. This metric considers the retrieved context and measures how well the retrieval
and generator work together. Figure 6.13 shows the RAG evaluation metric concept
answer relevance.
This approach follows the classic RAG approach to receive a RAG answer. Based on that RAG answer, an LLM creates N questions that could potentially have led to that answer. These N synthetic questions are then embedded and compared to the embedding of the user query. The average of these similarities is the answer relevance.


[Figure: the user query and the retrieved context produce a RAG answer; an LLM generates N synthetic questions (Q1 … Qn) that could have led to that answer; the query embedding is compared to the sentence embeddings of the synthetic questions, and the average similarity yields the answer relevance.]

Figure 6.13 RAG Evaluation Metric: Answer Relevance

In technical terms, the answer relevance is calculated with the following equation:

Answer Relevance = (1/N) × Σ_{i=1}^{N} similarity(E(user query), E(synthetic question i))
This value ranges from 0 to 1, in which low values correspond to answers that are
incomplete or that hold redundant information. High values correspond to more rele-
vant answers.
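To make the idea tangible, here is a small sketch of the final averaging step. The synthetic questions are hard-coded assumptions here; in a real implementation, an LLM would generate them from the RAG answer:

from langchain_openai import OpenAIEmbeddings
from sklearn.metrics.pairwise import cosine_similarity

embeddings = OpenAIEmbeddings()
user_query = "What is the capital of France?"
synthetic_questions = [
    "Which city is the capital of France?",
    "What city serves as France's capital?",
    "Where is the seat of the French government?",
]

# Average cosine similarity between the query embedding and the synthetic question embeddings
query_vec = [embeddings.embed_query(user_query)]
question_vecs = [embeddings.embed_query(q) for q in synthetic_questions]
answer_relevance = cosine_similarity(query_vec, question_vecs).mean()
print(round(float(answer_relevance), 3))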

6.5.3 Coding: Metrics


Now, let’s implement these metrics. For this task, we’ll use a Python framework that
has been developed specifically for evaluating RAG systems, namely, RAGAS (https://
www.ragas.io/). You can easily install it via uv, as follows:

uv add ragas

RAGAS has a nice web interface that you can use to evaluate larger datasets and RAG
responses with derived metrics. If you work on a “real” project with business impact, we
highly recommend diving deeply into RAGAS, starting with its Getting Started section
(https://docs.ragas.io/en/v0.1.21/getstarted/index.html#).
In this book, we’ll show only the basics of how to set up RAGAS to calculate RAG evalu-
ation metrics. What you need is the following:
 user_query or question as an input to the RAG system
 contexts that represent the retrieved documents from the knowledge source
 answer for the RAG response
 ground_truth is required only for some metrics like context precision. This informa-
tion is used to check whether the retrieved chunk is relevant to arrive at the ground
truth.


The complete script can be found in 06_RAG\60_rag_eval.py. As shown in Listing 6.29, we start by loading the required packages.

#%% packages
from datasets import Dataset
from ragas.metrics import context_precision, answer_relevancy, faithfulness
from ragas import evaluate
from langchain_openai import ChatOpenAI

Listing 6.29 RAG Evaluation: Required Packages (Source: 06_RAG\60_rag_eval.py)

We must bring the data to be evaluated into a format that RAGAS can process. For this
step, we load Dataset. We also need some metrics from ragas as well as its evaluate
function.
Now, we must prepare our data. The data preparation step is shown in Listing 6.30. The
data must be passed in the format Dataset, which we can create based on a dictionary.
That dictionary requires the keys question, contexts, answer, and ground_truth.

# %% prepare dataset
my_sample = {
    "question": ["What is the capital of Germany in 1960?"],
    "contexts": [
        [
            "Berlin is the capital of Germany since 1990.",
            "Between 1949 and 1990, East Berlin was the capital of East Germany.",
            "Bonn was the capital of West Germany during the same period."
        ]
    ],  # Nested list for multiple contexts
    "answer": ["In 1960, the capital of Germany was Bonn. East Berlin was the capital of East Germany."],
    "ground_truth": ["Berlin"]
}

dataset = Dataset.from_dict(my_sample)

Listing 6.30 RAG Evaluation: Dataset (Source: 06_RAG\60_rag_eval.py)

Our example is centered around Berlin, the capital of Germany. The question is tricky because of a temporal inconsistency: Berlin was not always the capital. Finally, we can start the evaluation process, as shown in Listing 6.31.

# %% evaluation
llm = ChatOpenAI(model="gpt-4o-mini")
metrics = [context_precision, answer_relevancy, faithfulness]
res = evaluate(dataset=dataset,
               metrics=metrics,
               llm=llm)
res

{'context_precision': 1.0000, 'answer_relevancy': 0.9934, 'faithfulness': 0.5000}

Listing 6.31 RAG Evaluation: Evaluation (Source: 06_RAG\60_rag_eval.py)


We must still set up an LLM instance and pass it to the evaluate function. Other param-
eters that need to be passed include the dataset and the metrics that should be ana-
lyzed. The result holds the scores for these metrics. For a larger dataset, you can analyze
these metrics for all data to find issues in your RAG system, so that you can improve
the system further.
This concludes our short introduction to the evaluation of RAG systems.

6.6 Summary
In this chapter, we delved into the concept of RAG. This powerful tool can greatly assist
you when you need to enhance an LLM response with your own content or documents.
In most cases, this approach is more promising and preferred to fine-tuning a complete
LLM.
We started with understanding the implementation of naïve (or standard) RAG, which
consists of the retrieval, augmentation, and generation process. Later, we studied more
detailed techniques to improve the system at different stages, including improvement
strategies for pre-retrieval, in which the indexing pipeline is adapted in different ways.
In advanced retrieval techniques, you learned how a hybrid RAG system functions.
Other techniques like query expansion and context enrichment were presented.
During advanced post-retrieval techniques, you learned about prompt compression as
a means to remove irrelevant tokens from an LLM request.
Finally, we studied several RAG evaluation metrics, and you learned how to implement
these metrics to measure the quality of your RAG systems.
In conclusion, RAG represents a significant advancement in the field of artificial intel-
ligence. By merging retrieval and generative techniques, RAG opens new possibilities
for creating high-quality, contextually rich outputs. This chapter equipped you with a
comprehensive understanding of the concept, experimental results, and practical
applications, preparing you to implement this powerful tool in your own projects.

Bert Gollnick

Generative AI
with Python
The Developer’s Guide to Pretrained LLMs,
Vector Databases, Retrieval-Augmented
Generation, and Agentic Systems

■ Work with pretrained LLM and NLP


models on Hugging Face and LangChain
■ Create vector databases and implement
retrieval-augmented generation
■ Add an agentic system using frameworks
such as CrewAI and AG2

rheinwerk-computing.com/6057

We hope you have enjoyed this reading sample. You may


recommend or pass it on to others, but only in its entirety,
including all pages. This reading sample and all its parts
are protected by copyright law. All usage and exploitation
rights are reserved by the author and the publisher.

Bert Gollnick is a senior data scientist who specializes in renewable energies.


For many years, he has taught courses about data science and machine learn-
ing, and more recently, about generative AI and natural language processing.

ISBN 978-1-4932-2690-0 • 392 pages • 05/2025


E-book: $54.99 • Print book: $59.95 • Bundle: $69.99
