A simple Q&A bot for technical documentation designed to test and compare different LLM evaluation frameworks including DeepEval, LangChain Evaluation, RAGAS, and OpenAI Evals.
This project serves as a testbed for comparing how different evaluation frameworks assess the same RAG (Retrieval-Augmented Generation) system.
An evaluation framework is a tool or library that systematically measures how well an AI system performs. In this sandbox, frameworks like DeepEval, LangChain Evaluation, and RAGAS take each question, the bot's answer, and the ground-truth answer, then compute metrics such as correctness, relevance, and hallucination rate. This lets you compare models, prompts, and retrieval strategies using repeatable, objective scores instead of ad-hoc manual judgment.
This project evaluates a RAG (Retrieval-Augmented Generation) system. Here are a few key concepts to help you understand the components:
- RAG (Retrieval-Augmented Generation): This is a technique where a large language model's knowledge is supplemented with information retrieved from other sources (in this case, our local documents). The process has two main steps:
  - Retrieval: A search algorithm (like TF-IDF) finds relevant documents based on the user's query.
  - Generation: A language model takes the retrieved documents and the original query to generate a comprehensive answer.
- Ground Truth: In the context of evaluation, "ground truth" refers to the ideal or perfect answer to a given question. We use the ground truth dataset (`data/ground_truth.json`) as a benchmark to measure how accurate and relevant the Q&A bot's answers are.
- TF-IDF (Term Frequency-Inverse Document Frequency): This is the retrieval algorithm used by the Q&A bot to find relevant documents. It works by assigning a score to each word in a document based on two factors:
  - Term Frequency (TF): How often a word appears in a specific document.
  - Inverse Document Frequency (IDF): How rare or common the word is across all documents.

  This allows the system to prioritize words that are important to a specific document over common words that appear everywhere (like "the" or "and").
- Snippet Selection: After TF-IDF retrieves the most relevant document, the QA bot selects the single line within that document that best matches the user's question. This is done using a simple, domain-agnostic heuristic: it counts how many words from the question appear in each line and returns the line with the highest overlap. No hardcoded patterns (like "pip install") are used, making the bot applicable to any domain: medical, legal, technical, and so on. A simplified sketch of the retrieval and snippet-selection steps follows this list.
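The sketch below illustrates both ideas together: TF-IDF picks the best document, then a word-overlap heuristic picks the best line. It is a simplified stand-in, not the repository's actual `QABot` code; the scikit-learn dependency and helper names are assumptions made for illustration.

```python
# Simplified sketch of TF-IDF retrieval plus word-overlap snippet selection.
# Illustrative only; the repository's QABot may implement these ideas
# differently, and the scikit-learn dependency is an assumption here.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tokenize(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve_best_document(question: str, docs: list[str]) -> str:
    """Rank documents by TF-IDF cosine similarity to the question."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, doc_matrix)[0]
    return docs[int(scores.argmax())]


def select_best_line(question: str, document: str) -> str:
    """Return the document line sharing the most words with the question."""
    question_words = tokenize(question)
    lines = [line for line in document.splitlines() if line.strip()]
    return max(lines, key=lambda line: len(question_words & tokenize(line)))


docs = [
    "Installing packages\nTo install the requests library, run pip install requests.",
    "Branching in Git\nCreate a feature branch with git branch feature-x.",
]
question = "How do you install the Python requests library?"
print(select_best_line(question, retrieve_best_document(question, docs)))
```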
- Clone the repository

  ```bash
  git clone https://github.com/LiteObject/eval-framework-sandbox.git
  cd eval-framework-sandbox
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables

  ```bash
  cp .env.example .env
  # Edit .env with your API keys (optional unless running remote evals)
  ```

- Ask a question

  ```bash
  python -m src.main "How do you install the Python requests library?"
  ```

  The bot will print a synthesized answer and list matching documents.

- Run the unit tests

  ```bash
  pytest
  ```

- (Optional) Try an evaluation framework
  - Update `.env` with the relevant API keys or enable the Ollama flag for a local model (details below).
  - Install extras: `pip install -r requirements.txt` already includes optional libs, or `pip install .[eval]` after editable install.
  - Use the runner scripts in `evaluations/` as starting points; each script writes results into `results/`.
The core QA bot already runs fully offline using TF-IDF retrieval. If you also want LangChain's evaluators to call a local Ollama model instead of OpenAI:
- Install Ollama and pull a model, e.g. `ollama pull llama3`.
- Set the following environment variables (via `.env` or your shell):
  - `LANGCHAIN_USE_OLLAMA=true`
  - `OLLAMA_MODEL=llama3` (or any other pulled model)
  - Optionally `OLLAMA_BASE_URL=http://localhost:11434` if you're running Ollama on a non-default host/port.
- Leave `OPENAI_API_KEY` blank; the LangChain evaluator will detect the Ollama flag and use `ChatOllama`.
If `LANGCHAIN_USE_OLLAMA` is false, the evaluator falls back to `ChatOpenAI` and expects a valid `OPENAI_API_KEY` plus `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
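For reference, a minimal sketch of that selection logic, assuming the `langchain_community` and `langchain_openai` packages are installed; the actual `LangChainEvalRunner` may construct its judge model differently:

```python
# Hedged sketch of the backend selection described above; the real runner in
# evaluations/langchain_eval_runner.py may differ in structure and defaults.
import os

from langchain_community.chat_models import ChatOllama
from langchain_openai import ChatOpenAI


def build_judge_llm():
    """Use a local Ollama chat model when LANGCHAIN_USE_OLLAMA=true, else OpenAI."""
    if os.getenv("LANGCHAIN_USE_OLLAMA", "false").lower() == "true":
        return ChatOllama(
            model=os.getenv("OLLAMA_MODEL", "llama3"),
            base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        )
    return ChatOpenAI(model=os.getenv("LANGCHAIN_OPENAI_MODEL", "gpt-3.5-turbo"))
```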
The evaluation workflow follows four distinct steps:
```python
from src.qa_bot import QABot

bot = QABot(documents_path="data/documents/sample_docs")
```

Load your bot with the technical documentation it will search through.
```python
import json
from pathlib import Path

questions = json.loads(Path("data/test_questions.json").read_text())

predictions = {}
for item in questions:
    answer = bot.answer(item["question"])
    predictions[item["id"]] = answer.response
```

Ask your bot each test question and collect its answers as predictions.
```python
from evaluations.utils import load_dataset_from_files

dataset = load_dataset_from_files(
    questions_path=Path("data/test_questions.json"),
    ground_truth_path=Path("data/ground_truth.json"),
    predictions=predictions,  # Your bot's answers
)
```

This pairs each question with:
- Prediction: Your bot's answer
- Ground Truth: The expected correct answer
```python
from evaluations.langchain_eval_runner import LangChainEvalRunner

runner = LangChainEvalRunner()
result = runner.evaluate(dataset)
print(f"Score: {result.score}")
```

The framework compares your predictions against ground truth and returns a score (0-1 scale).
```mermaid
flowchart TD
    %% Step 1: Initialize
    D[Sample Docs] --> Q[QA Bot]

    %% Step 2: Generate Predictions
    T[Test Questions] --> Q
    Q --> P[Predictions]

    %% Step 3: Build Dataset
    T --> EDS[Evaluation Dataset]
    P --> EDS
    G[Ground Truth] --> EDS

    %% Step 4: Evaluate
    EDS --> LC[LangChain<br/>Evaluator]
    EDS --> DE[DeepEval<br/>Evaluator]
    EDS --> RG[RAGAS<br/>Evaluator]
    EDS --> EM[Embedding<br/>Evaluator]

    LC --> S1[LLM Score]
    DE --> S2[Word Overlap Score]
    RG --> S3[Token Overlap Score]
    EM --> S4[Cosine Similarity Score]

    S1 --> COMP[Score Comparison<br/>& Bar Chart]
    S2 --> COMP
    S3 --> COMP
    S4 --> COMP
```
These integrations are opt-in. Install the additional dependencies with:
```bash
pip install .[eval]
```

Each framework uses a different method to score your bot's answers:
| Framework | Evaluation Method | Requires LLM? | Understands Meaning? | Speed |
|---|---|---|---|---|
| LangChain | LLM-based grading | ✅ Yes | ✅ Yes | Slower |
| DeepEval | Word overlap (Jaccard) | ❌ No | ❌ No | Fast |
| RAGAS | Token overlap (Jaccard) | ❌ No | ❌ No | Fast |
Uses a language model (OpenAI GPT or local Ollama) to judge answer quality by comparing the question, bot prediction, and ground truth. The LLM acts as an intelligent grader that understands semantic similarity and paraphrasing.
Example: If the bot says "utilize pip" and ground truth says "use pip", LangChain recognizes these as equivalent.
Calculates Jaccard similarity between words in the prediction and ground truth:
score = |words in common| / |total unique words|
Example:
- Prediction: "Install the requests library"
- Ground truth: "Use pip install requests"
- Score: 2/6 ≈ 0.333 (only "install" and "requests" match)
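For concreteness, here is a minimal word-level Jaccard function (not the repository's exact implementation) that reproduces the arithmetic above:

```python
import re


def jaccard_words(prediction: str, ground_truth: str) -> float:
    """Word-level Jaccard similarity: |shared words| / |unique words overall|."""
    pred = set(re.findall(r"\w+", prediction.lower()))
    truth = set(re.findall(r"\w+", ground_truth.lower()))
    if not pred and not truth:
        return 0.0
    return len(pred & truth) / len(pred | truth)


# {"install", "requests"} are shared; 6 unique words in total -> 2/6
print(jaccard_words("Install the requests library", "Use pip install requests"))  # 0.333...
```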
Similar to DeepEval, but operates at the token level; the score is again a Jaccard similarity between the prediction's and the ground truth's token sets.
When running the comparison notebook, you might see:
```
LangChain: 0.667
DeepEval:  0.333
RAGAS:     0.250
```
This happens because LangChain understands semantics (recognizes synonyms, paraphrases) while DeepEval/RAGAS only count exact word matches. The bar chart in the notebook helps visualize these differences.
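If you want to reproduce that bar chart outside the notebook, a minimal matplotlib sketch could look like the following; the scores are the illustrative values from the text above, not measured results:

```python
# Minimal bar-chart sketch; the notebook may plot the comparison differently.
import matplotlib.pyplot as plt

scores = {"LangChain": 0.667, "DeepEval": 0.333, "RAGAS": 0.250}
plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel("Score (0-1)")
plt.title("Evaluation framework score comparison")
plt.show()
```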
Recommendation:
- Use LangChain for semantic accuracy evaluation (more human-like judgment)
- Use DeepEval/RAGAS for quick keyword coverage checks and sanity testing
LangChain Evaluation implements the "LLM as a Judge" pattern—a popular evaluation approach where a language model grades another AI system's outputs.
- The judge LLM receives:
  - The original question
  - Your bot's answer (prediction)
  - The correct answer (ground truth)
- The judge evaluates whether the prediction is accurate, relevant, and semantically similar to the ground truth
- Returns a score (0.0 = poor, 1.0 = perfect)
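As a concrete illustration of this loop, LangChain's built-in evaluators can be used roughly as shown below; the evaluator type, criteria, and judge model are assumptions for the example, and the repository's `LangChainEvalRunner` may be wired differently.

```python
# LLM-as-a-judge sketch using LangChain's generic evaluator API. The
# "labeled_criteria"/"correctness" choice and the judge model are illustrative;
# they are not necessarily what LangChainEvalRunner uses internally.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

judge_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=judge_llm)

graded = evaluator.evaluate_strings(
    input="How do you install the Python requests library?",  # original question
    prediction="Utilize pip: run pip install requests",        # bot's answer
    reference="Use pip install requests",                      # ground truth
)
print(graded["score"], graded["reasoning"])
```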
Strengths:

- Semantic understanding: Recognizes that "utilize pip" and "use pip" mean the same thing
- Nuanced judgment: Can evaluate tone, completeness, and style, not just word matches
- Flexible: Works across domains without hardcoded rules

Limitations:

- Requires LLM access: Either OpenAI API key (costs money) or local Ollama (free but slower)
- Non-deterministic: Same input may get slightly different scores on repeated runs
- Judge biases: The evaluator inherits biases from the judge LLM itself
Contrast with Rule-Based Methods:
- DeepEval and RAGAS use deterministic formulas (Jaccard similarity)
- Faster and cheaper, but miss semantic nuances
- Best for quick sanity checks rather than comprehensive evaluation
- Choose your backend:
  - Remote OpenAI models: set `OPENAI_API_KEY` and optionally `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
  - Local Ollama: set `LANGCHAIN_USE_OLLAMA=true`, `OLLAMA_MODEL`, and optionally `OLLAMA_BASE_URL`; no OpenAI key required.
- Invoke the runner:

  ```python
  from evaluations.langchain_eval_runner import LangChainEvalRunner

  runner = LangChainEvalRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  LangChain will call the configured chat model to grade responses and store the output at `results/langchain_result.json`.
DeepEval now uses offline word-overlap scoring (Jaccard similarity) and requires no API keys or LLM calls.
Run the runner programmatically:
```python
from evaluations.deepeval_runner import DeepEvalRunner

runner = DeepEvalRunner()
result = runner.evaluate(dataset)
print(result.score, result.details)
```

The report is written to `results/deepeval_result.json`.
RAGAS now uses offline token-overlap scoring (Jaccard similarity) and requires no API keys or LLM calls.
Evaluate the dataset:
```python
from evaluations.ragas_runner import RagasRunner

runner = RagasRunner()
result = runner.evaluate(dataset)
print(result.score, result.details)
```

The raw metric results are saved to `results/ragas_result.json`.
This repository only prepares the dataset and relies on OpenAI's CLI for the actual evaluation. Ensure `evals` is installed and `OPENAI_API_KEY` is set, then use `evaluations/openai_eval_runner.py` to export a dataset and follow the OpenAI Evals documentation to launch the experiments with `oaieval`.
- `data/`: Test questions, ground truth, and source documents
- `src/`: Core Q&A bot implementation
- `evaluations/`: Framework-specific evaluation scripts
- `results/`: Evaluation results and comparisons (gitignored except for `.gitkeep`)
- Answer Correctness
- Context Relevance
- Faithfulness
- Answer Similarity
- Response Time
- Hallucination Rate