Chapter 14 - Knowledge Retrieval (RAG)
For AI agents, Retrieval-Augmented Generation (RAG) is crucial because it allows them to ground their actions and responses in
real-time, verifiable data beyond their static training. This capability enables them to
perform complex tasks accurately, such as accessing the latest company policies to
answer a specific question or checking current inventory before placing an order. By
integrating external knowledge, RAG transforms agents from simple conversationalists
into effective, data-driven tools capable of executing meaningful work.
When a user poses a question or gives a prompt to an AI system using RAG, the query
isn't sent directly to the LLM. Instead, the system first scours a vast external
knowledge base—a highly organized library of documents, databases, or web
pages—for relevant information. This search is not a simple keyword match; it's a
"semantic search" that understands the user's intent and the meaning behind their
words. This initial search pulls out the most pertinent snippets or "chunks" of
information. These extracted pieces are then "augmented," or added, to the original
prompt, creating a richer, more informed query. Finally, this enhanced prompt is sent
to the LLM. With this additional context, the LLM can generate a response that is not
only fluent and natural but also factually grounded in the retrieved data.
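Conceptually, the flow can be sketched in a few lines of Python. The retriever and LLM client below are hypothetical placeholders, not a specific library's API; the sketch simply makes the retrieve-augment-generate sequence explicit.

def answer_with_rag(query: str, retriever, llm) -> str:
    # 1. Retrieve: semantic search over the external knowledge base.
    chunks = retriever.search(query, top_k=3)  # hypothetical retriever API

    # 2. Augment: add the retrieved snippets to the original prompt.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generate: the LLM answers, grounded in the retrieved data.
    return llm.generate(prompt)  # hypothetical LLM client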
The RAG framework provides several significant benefits. It allows LLMs to access
up-to-date information, thereby overcoming the constraints of their static training
data. This approach also reduces the risk of "hallucination"—the generation of false
information—by grounding responses in verifiable data. Moreover, LLMs can utilize
specialized knowledge found in internal company documents or wikis. A vital
advantage of this process is the capability to offer "citations," which pinpoint the
exact source of information, thereby enhancing the trustworthiness and verifiability of
the AI's responses.
To fully appreciate how RAG functions, it's essential to understand a few core
concepts (see Fig.1):
Text Similarity: Text similarity refers to the measure of how alike two pieces of text
are. This can be at a surface level, looking at the overlap of words (lexical similarity),
or at a deeper, meaning-based level. In the context of RAG, text similarity is crucial for
finding the most relevant information in the knowledge base that corresponds to a
user's query. For instance, consider the sentences: "What is the capital of France?"
and "Which city is the capital of France?". While the wording is different, they are
asking the same question. A good text similarity model would recognize this and
assign a high similarity score to these two sentences, even though they only share a
few words. This is often calculated using the embeddings of the texts.
Embeddings and Semantic Distance: An embedding is a numerical vector representation of a piece of text that captures its meaning. Given two terms that express the same concept in different words, a good embedding model would
recognize that they refer to the same thing and would consider them to be highly
similar. This is because their embeddings would be very close in the vector space,
indicating a small semantic distance. This is the "smart search" that allows RAG to find
relevant information even when the user's wording doesn't exactly match the text in
the knowledge base.
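As a concrete illustration, the snippet below embeds the two question phrasings from the Text Similarity example and compares them with cosine similarity. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model as one possible choice; any embedding model would behave analogously.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

sentences = [
    "What is the capital of France?",
    "Which city is the capital of France?",
]
emb = model.encode(sentences)

# Cosine similarity: a value close to 1.0 means the sentences are semantically similar,
# even though they share only a few words.
cosine = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(f"Semantic similarity: {cosine:.2f}")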
Chunking: Because whole documents are usually too large to embed precisely or to fit into a prompt, they are first broken into smaller, coherent pieces, or "chunks." Well-chosen chunks make the
retrieval process faster and the information provided to the LLM more focused and
relevant to the user's immediate need. Once documents are chunked, the RAG system
must employ a retrieval technique to find the most relevant pieces for a given query.
The primary method is vector search, which uses embeddings and semantic distance
to find chunks that are conceptually similar to the user's question. An older, but still
valuable, technique is BM25, a keyword-based algorithm that ranks chunks based on
term frequency without understanding semantic meaning. To get the best of both
worlds, hybrid search approaches are often used, combining the keyword precision of
BM25 with the contextual understanding of semantic search. This fusion allows for
more robust and accurate retrieval, capturing both literal matches and conceptual
relevance.
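A hybrid setup of this kind can be sketched with LangChain's EnsembleRetriever, which fuses a BM25 keyword retriever with a vector-store retriever. The source file, the FAISS/OpenAI components, and the equal weights below are illustrative assumptions, not requirements.

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import EnsembleRetriever

# Load and chunk the source documents (hypothetical source file).
documents = TextLoader("./policies.txt").load()
chunks = CharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(documents)

# Keyword-based retriever (BM25) and semantic retriever (vector search).
bm25_retriever = BM25Retriever.from_documents(chunks)
vector_retriever = FAISS.from_documents(chunks, OpenAIEmbeddings()).as_retriever()

# Hybrid search: fuse the two rankings, weighting each strategy equally.
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

results = hybrid_retriever.invoke("What is the remote work policy?")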
RAG's Challenges: Despite its power, the RAG pattern is not without its challenges. A
primary issue arises when the information needed to answer a query is not confined
to a single chunk but is spread across multiple parts of a document or even several
documents. In such cases, the retriever might fail to gather all the necessary context,
leading to an incomplete or inaccurate answer. The system's effectiveness is also
highly dependent on the quality of the chunking and retrieval process; if irrelevant
chunks are retrieved, it can introduce noise and confuse the LLM. Furthermore,
effectively synthesizing information from potentially contradictory sources remains a
significant hurdle for these systems. Another challenge is that RAG
requires the entire knowledge base to be pre-processed and stored in specialized
databases, such as vector or graph databases, which is a considerable undertaking.
Consequently, this knowledge requires periodic reconciliation to remain up-to-date, a
crucial task when dealing with evolving sources like company wikis. This entire
process can have a noticeable impact on performance, increasing latency, operational
costs, and the number of tokens used in the final prompt.
GraphRAG: A variant of this pattern, known as GraphRAG, retrieves from a knowledge graph rather than a flat collection of chunks, allowing it to follow explicit relationships between entities. Use cases include complex financial analysis, connecting companies to market events,
and scientific research for discovering relationships between genes and diseases. The
primary drawback, however, is the significant complexity, cost, and expertise required
to build and maintain a high-quality knowledge graph. This setup is also less flexible
and can introduce higher latency compared to simpler vector search systems. The
system's effectiveness is entirely dependent on the quality and completeness of the
underlying graph structure. Consequently, GraphRAG offers superior contextual
reasoning for intricate questions but at a much higher implementation and
maintenance cost. In summary, it excels where deep, interconnected insights are more
critical than the speed and simplicity of standard RAG.
Agentic RAG: An evolution of this pattern, known as Agentic RAG (see Fig.2),
introduces a reasoning and decision-making layer to significantly enhance the
reliability of information extraction. Instead of just retrieving and augmenting, an
"agent"—a specialized AI component—acts as a critical gatekeeper and refiner of
knowledge. Rather than passively accepting the initially retrieved data, this agent
actively interrogates its quality, relevance, and completeness, as illustrated by the
following scenarios.
First, an agent excels at reflection and source validation. If a user asks, "What is our
company's policy on remote work?" a standard RAG might pull up a 2020 blog post
alongside the official 2025 policy document. The agent, however, would analyze the
documents' metadata, recognize the 2025 policy as the most current and
authoritative source, and discard the outdated blog post before sending the correct
context to the LLM for a precise answer.
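One way such a validation step might look in code is shown below: after retrieval, the agent inspects document metadata and keeps only the most recent authoritative source. The metadata fields and the helper function are hypothetical, standing in for whatever schema your knowledge base exposes.

from typing import Dict, List

def select_authoritative(docs: List[Dict]) -> List[Dict]:
    """Prefer official policy documents and keep only the newest one.

    Each doc is assumed to carry metadata such as
    {"text": ..., "doc_type": "policy" | "blog", "year": 2025}.
    """
    policies = [d for d in docs if d.get("doc_type") == "policy"]
    if policies:
        # Discard outdated material; keep only the most recent official policy.
        return [max(policies, key=lambda d: d.get("year", 0))]
    return docs  # fall back to whatever was retrieved

retrieved = [
    {"text": "Remote work guidance (blog)", "doc_type": "blog", "year": 2020},
    {"text": "Official remote work policy", "doc_type": "policy", "year": 2025},
]
context_docs = select_authoritative(retrieved)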
Fig.2: Agentic RAG introduces a reasoning agent that actively evaluates, reconciles,
and refines retrieved information to ensure a more accurate and trustworthy final
response.
Second and third, an agent can reconcile conflicting information across sources and decompose a complex, multi-part question into focused sub-queries before retrieving. Fourth, an agent can identify knowledge gaps and use external tools. Suppose a user
asks, "What was the market's immediate reaction to our new product launched
yesterday?" The agent searches the internal knowledge base, which is updated
weekly, and finds no relevant information. Recognizing this gap, it can then activate a
tool—such as a live web-search API—to find recent news articles and social media
sentiment. The agent then uses this freshly gathered external information to provide
an up-to-the-minute answer, overcoming the limitations of its static internal database.
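A minimal sketch of that fallback logic follows; internal_search, web_search, and the LLM client are hypothetical functions standing in for the knowledge-base retriever, a live web-search API, and the generation model.

def answer_with_fallback(question: str, internal_search, web_search, llm) -> str:
    # Try the internal knowledge base first.
    chunks = internal_search(question)

    # Knowledge gap detected: nothing relevant in the weekly-updated corpus,
    # so the agent reaches for a live web-search tool instead.
    if not chunks:
        chunks = web_search(question)

    context = "\n\n".join(chunks)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {question}")  # hypothetical LLM client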
Challenges of Agentic RAG: While powerful, the agentic layer introduces its own set
of challenges. The primary drawback is a significant increase in complexity and cost.
Designing, implementing, and maintaining the agent's decision-making logic and tool
integrations requires substantial engineering effort and adds to computational
expenses. This complexity can also lead to increased latency, as the agent's cycles of
reflection, tool use, and multi-step reasoning take more time than a standard, direct
retrieval process. Furthermore, the agent itself can become a new source of error; a
flawed reasoning process could cause it to get stuck in useless loops, misinterpret a
task, or improperly discard relevant information, ultimately degrading the quality of
the final response.
In summary: Agentic RAG represents a sophisticated evolution of the standard
retrieval pattern, transforming it from a passive data pipeline into an active,
problem-solving framework. By embedding a reasoning layer that can evaluate
sources, reconcile conflicts, decompose complex questions, and use external tools,
agents dramatically improve the reliability and depth of the generated answers. This
advancement makes the AI more trustworthy and capable, though it comes with
important trade-offs in system complexity, latency, and cost that must be carefully
managed.
Applications include:
● Enterprise Search and Q&A: Organizations can develop internal chatbots that
respond to employee inquiries using internal documentation such as HR
policies, technical manuals, and product specifications. The RAG system
extracts relevant sections from these documents to inform the LLM's response.
● Customer Support and Helpdesks: RAG-based systems can offer precise and
consistent responses to customer queries by accessing information from
product manuals, frequently asked questions (FAQs), and support tickets. This
can reduce the need for direct human intervention for routine issues.
● Personalized Content Recommendation: Instead of basic keyword matching,
RAG can identify and retrieve content (articles, products) that is semantically
related to a user's preferences or previous interactions, leading to more
relevant recommendations.
● News and Current Events Summarization: LLMs can be integrated with
real-time news feeds. When prompted about a current event, the RAG system
retrieves recent articles, allowing the LLM to produce an up-to-date summary.
Hands-On Code Example (ADK)
To illustrate the Knowledge Retrieval (RAG) pattern, let's look at three examples.
The first shows how to use Google Search to perform RAG and ground an LLM in live search results.
Since RAG involves accessing external information, the Google Search tool is a direct
example of a built-in retrieval mechanism that can augment an LLM's knowledge.
from google.adk.agents import Agent
from google.adk.tools import google_search

# An ADK agent that grounds its answers in live Google Search results.
search_agent = Agent(
    name="research_assistant",
    model="gemini-2.0-flash-exp",
    instruction="You help users research topics. When asked, use the Google Search tool.",
    tools=[google_search],
)
The second example shows how to use Vertex AI RAG capabilities within the
Google ADK. The code provided demonstrates the initialization of
VertexAiRagMemoryService from the ADK. This allows for establishing a connection to
a Google Cloud Vertex AI RAG Corpus. The service is configured by specifying the
corpus resource name and optional parameters such as SIMILARITY_TOP_K and
VECTOR_DISTANCE_THRESHOLD. These parameters influence the retrieval process.
SIMILARITY_TOP_K defines the number of top similar results to be retrieved.
VECTOR_DISTANCE_THRESHOLD sets a limit on the semantic distance for the
retrieved results. This setup enables agents to perform scalable and persistent
semantic knowledge retrieval from the designated RAG Corpus. The process
effectively integrates Google Cloud's RAG functionalities into an ADK agent, thereby
supporting the development of responses grounded in factual data.
RAG_CORPUS_RESOURCE_NAME = (
    "projects/your-gcp-project-id/locations/us-central1/ragCorpora/your-corpus-id"
)

# The number of top similar results to retrieve.
# This controls how many relevant document chunks the RAG service will return.
SIMILARITY_TOP_K = 5
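A minimal sketch of wiring these constants into the memory service described above might look like the following. The VECTOR_DISTANCE_THRESHOLD value is an illustrative assumption, and the constructor parameter names should be verified against your installed ADK version.

from google.adk.memory import VertexAiRagMemoryService

# Maximum semantic distance allowed for a retrieved chunk (assumed value).
VECTOR_DISTANCE_THRESHOLD = 0.7

memory_service = VertexAiRagMemoryService(
    rag_corpus=RAG_CORPUS_RESOURCE_NAME,
    similarity_top_k=SIMILARITY_TOP_K,
    vector_distance_threshold=VECTOR_DISTANCE_THRESHOLD,
)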
The third example closely follows the LangChain and LangGraph implementation referenced at the end of this chapter (reference 4): it loads a document, chunks it, indexes the chunks in a Weaviate vector store, and prepares an LLM for generation. The imports and the embedded Weaviate client shown here are additions needed to make the excerpt runnable.

import dotenv
import weaviate
from weaviate.embedded import EmbeddedOptions
from typing import List, TypedDict
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Weaviate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

dotenv.load_dotenv()  # load OPENAI_API_KEY from a .env file

# Load the source document
loader = TextLoader('./state_of_the_union.txt')
documents = loader.load()

# Chunk documents
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Index the chunks in an embedded Weaviate vector store
client = weaviate.Client(embedded_options=EmbeddedOptions())
vectorstore = Weaviate.from_documents(
    client=client,
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    by_text=False,
)

# Initialize LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
from langchain_core.documents import Document
from langgraph.graph import StateGraph, END

# Shared state passed between the nodes of the RAG graph.
class RAGGraphState(TypedDict):
    question: str  # the question field is assumed; the retrieve node needs it
    documents: List[Document]
    generation: str
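The graph below registers two node functions, retrieve_documents_node and generate_response_node, that the excerpt does not show. A minimal sketch of what they might look like, reusing the vectorstore and llm defined earlier, is given here; the prompt wording and retriever settings are illustrative.

from langchain_core.prompts import ChatPromptTemplate

retriever = vectorstore.as_retriever()

def retrieve_documents_node(state: RAGGraphState) -> dict:
    # Fetch the chunks most relevant to the user's question.
    docs = retriever.invoke(state["question"])
    return {"documents": docs}

def generate_response_node(state: RAGGraphState) -> dict:
    # Augment the prompt with the retrieved context, then call the LLM.
    context = "\n\n".join(doc.page_content for doc in state["documents"])
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    answer = (prompt | llm).invoke({"context": context, "question": state["question"]})
    return {"generation": answer.content}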
# Assemble the LangGraph workflow
workflow = StateGraph(RAGGraphState)

# Add nodes
workflow.add_node("retrieve", retrieve_documents_node)
workflow.add_node("generate", generate_response_node)
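The excerpt stops here, so the edge wiring and compilation are not shown; a plausible completion of the graph, under the same assumptions as above, would be:

# Wire the nodes into a linear retrieve -> generate pipeline and compile it.
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
rag_app = workflow.compile()

result = rag_app.invoke({"question": "What did the president say about the economy?"})
print(result["generation"])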
At a Glance
What: LLMs possess impressive text generation abilities but are fundamentally limited
by their training data. This knowledge is static, meaning it doesn't include real-time
information or private, domain-specific data. Consequently, their responses can be
outdated, inaccurate, or lack the specific context required for specialized tasks. This
gap restricts their reliability for applications demanding current and factual answers.
Rule of thumb: Use this pattern when you need an LLM to answer questions or
generate content based on specific, up-to-date, or proprietary information that was
not part of its original training data. It is ideal for building Q&A systems over internal
documents, customer support bots, and applications requiring verifiable, fact-based
responses with citations.
Visual summary
Knowledge Retrieval pattern: an AI agent to query and retrieve information from
structured databases.
Fig. 3: Knowledge Retrieval pattern: an AI agent to find and synthesize information
from the public internet in response to user queries.
Key Takeaways
● Agentic RAG moves beyond simple information retrieval by using an intelligent
agent to actively reason about, validate, and refine external knowledge, ensuring
a more accurate and reliable answer.
● Practical applications span enterprise search, customer support, legal research,
and personalized recommendations.
Conclusion
In conclusion, Retrieval-Augmented Generation (RAG) addresses the core limitation of
a Large Language Model's static knowledge by connecting it to external, up-to-date
data sources. The process works by first retrieving relevant information snippets and
then augmenting the user's prompt, enabling the LLM to generate more accurate and
contextually aware responses. This is made possible by foundational technologies like
embeddings, semantic search, and vector databases, which find information based on
meaning rather than just keywords. By grounding outputs in verifiable data, RAG
significantly reduces factual errors and allows for the use of proprietary information,
enhancing trust through citations.
References
1. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. https://arxiv.org/abs/2005.11401
2. Google AI for Developers Documentation. Retrieval Augmented Generation. https://cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/rag-overview
3. Retrieval-Augmented Generation with Graphs (GraphRAG). https://arxiv.org/abs/2501.00309
4. LangChain and LangGraph: Leonie Monigatti, "Retrieval-Augmented Generation (RAG): From Theory to LangChain Implementation." https://medium.com/data-science/retrieval-augmented-generation-rag-from-theory-to-langchain-implementation-4e9bd5f6a4f2
5. Google Cloud Vertex AI RAG Corpus. https://cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/manage-your-rag-corpus#corpus-management