Build software better, together

mlflow / mlflow

The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

open-source machine-learning ai apache-spark evaluation ml openai agents observability model-management mlops mlflow agentops prompt-engineering ai-governance langchain llmops llm-evaluation

Updated Nov 24, 2025
Python

langfuse / langfuse

Star

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Nov 24, 2025
TypeScript

comet-ml / opik

Star

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

open-source playground evaluation openai hacktoberfest llm prompt-engineering hacktoberfest2025 langchain llmops llama-index llm-evaluation llm-observability

Updated Nov 24, 2025
Python

confident-ai / deepeval

Star

The LLM Evaluation Framework

python hacktoberfest evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Nov 20, 2025
Python

promptfoo / promptfoo

Star

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Nov 24, 2025
TypeScript

Arize-ai / phoenix

Star

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

Updated Nov 24, 2025
Jupyter Notebook

NVIDIA / garak

Star

the LLM vulnerability scanner

ai vulnerability-assessment security-scanners llm-security llm-evaluation

Updated Nov 24, 2025
Python

jeinlee1991 / chinese-llm-benchmark

Star

ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括303个大模型，覆盖chatgpt、gpt-5、o4-mini、谷歌gemini-2.5、Claude4.5、智谱GLM-Z1、文心一言、qwen3-max、百川、讯飞星火、商汤senseChat、minimax等商用模型，以及kimi-k2、ernie4.5、minimax-M1、DeepSeek-R1-0528、deepseek-v3.2、qwen3-2507、llama4、GLM4.5、gemma3、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。

artificial-intelligence llm-agent llm-evaluation agentic-ai

Updated Nov 22, 2025

Giskard-AI / giskard-oss

Sponsor

Star

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Nov 18, 2025
Python

Helicone / helicone

Star

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

open-source playground monitoring analytics evaluation ycombinator openai gpt large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability agent-monitoring llm-cost

Updated Nov 24, 2025
TypeScript

Marker-Inc-Korea / AutoRAG

Star

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

python open-source qa benchmarking ops pipeline analysis optimization evaluation embeddings automl document-parser rag llm retrieval-augmented-generation llm-ops llm-evaluation rag-evaluation

Updated Nov 20, 2025
Python

PacktPublishing / LLM-Engineers-Handbook

Star

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

aws rag mlops llm llmops genai fine-tuning-llm llm-evaluation ml-system-design

Updated Mar 8, 2025
Python

Agenta-AI / agenta

Star

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

prompt-engineering prompt-management llm-tools llm-framework llm-playground llm-platform llm-evaluation rag-evaluation llm-monitoring llm-as-a-judge llm-observability llmops-platform

Updated Nov 24, 2025
Python

truera / trulens

Star

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Nov 24, 2025
Python

lmnr-ai / lmnr

Star

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

Updated Nov 24, 2025
TypeScript

msoedov / agentic_security

Star

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

agent-framework ai-red-team prompt-testing llm-security llm-vulnerabilities llm-evaluation llm-fuzzing llm-evaluation-framework llm-guardrails llm-scanner llm-jailbreaks llm-fuzzer llm-fuzzer-aggregator agent-security

Updated Nov 15, 2025
Python

genieincodebottle / generative-ai

Star

Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.

Updated Nov 24, 2025
Jupyter Notebook

huggingface / aisheets

Star

Build, enrich, and transform datasets using AI models with no code

oss ai synthetic-data nocode llms llm-evaluation

Updated Oct 23, 2025
TypeScript

microsoft / prompty

Star

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

promptengineering llms generative-ai llm-evaluation prompty

Updated Nov 19, 2025
Python

cvs-health / uqlm

Star

UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection

uncertainty-quantification uncertainty-estimation ai-safety confidence-score hallucination confidence-estimation ai-evaluation llm llm-evaluation llm-safety hallucination-evaluation hallucination-detection hallucination-mitigation llm-hallucination

Updated Nov 21, 2025
Python

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm-evaluation

Here are 367 public repositories matching this topic...

mlflow / mlflow

langfuse / langfuse

comet-ml / opik

confident-ai / deepeval

promptfoo / promptfoo

Arize-ai / phoenix

NVIDIA / garak

jeinlee1991 / chinese-llm-benchmark

Giskard-AI / giskard-oss

Helicone / helicone

Marker-Inc-Korea / AutoRAG

PacktPublishing / LLM-Engineers-Handbook

Agenta-AI / agenta

truera / trulens

lmnr-ai / lmnr

msoedov / agentic_security

genieincodebottle / generative-ai

huggingface / aisheets

microsoft / prompty

cvs-health / uqlm

Improve this page

Add this topic to your repo