The LLM Evaluation Framework (updated Mar 10, 2026, Python)
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
A framework for few-shot evaluation of language models.
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends.
This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
Data-Driven Evaluation for LLM-Powered Applications
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
Python SDK for running evaluations on LLM generated responses
The official evaluation suite and dynamic data release for MixEval.
Build and benchmark deep research.
A research library for automating experiments on Deep Graph Networks
AI Data Management & Evaluation Platform
MedEvalKit: A Unified Medical Evaluation Framework
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
Expressive is a cross-platform expression parsing and evaluation framework. The cross-platform nature is achieved through compiling for .NET Standard so it will run on practically any platform.
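Most of the LLM evaluation frameworks listed above share one underlying pattern: a dataset of input/reference pairs, a model under test, and one or more metrics aggregated into a score. The sketch below illustrates that pattern only; every name in it (`EvalCase`, `run_eval`, `exact_match`) is hypothetical and does not belong to any specific library's API.

```python
# Minimal sketch of the eval-harness pattern shared by the frameworks
# above. All names are illustrative, not any real library's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    reference: str

def exact_match(output: str, reference: str) -> float:
    """Score 1.0 if the normalized output equals the reference."""
    return float(output.strip().lower() == reference.strip().lower())

def run_eval(model: Callable[[str], str],
             cases: list[EvalCase],
             metric: Callable[[str, str], float]) -> float:
    """Run every case through the model and average the metric."""
    scores = [metric(model(c.prompt), c.reference) for c in cases]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    cases = [
        EvalCase("What is 2 + 2?", "4"),
        EvalCase("Capital of France?", "Paris"),
    ]
    # Stub standing in for a real LLM call.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "paris"}
    score = run_eval(lambda p: canned[p], cases, exact_match)
    print(f"exact_match: {score:.2f}")
```

Real frameworks layer declarative configs, model-graded metrics, and CI/CD reporting on top of this loop, but the core dataset-model-metric shape is the same.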