llms-benchmarking

Here are 65 public repositories matching this topic...

steel-dev / awesome-web-agents

🔥 A list of tools, frameworks, and resources for building AI web agents

ai browser-automation ai-agents llms llms-benchmarking

Updated Apr 8, 2025

karthikv792 / LLMs-Planning

An extensible benchmark for evaluating large language models on planning

planning pddl benchmark-suite llms llms-reasoning llms-benchmarking llms-planning

Updated Apr 24, 2025
PDDL

JonathanChavezTamales / llm-leaderboard

A comprehensive set of LLM benchmark scores and provider prices.

llm llmops llm-evaluation llm-agents llms-benchmarking

Updated May 15, 2025
JavaScript

ChemFoundationModels / ChemLLMBench

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

nlp benchmark chemistry ai4science large-language-models llm llms-benchmarking

Updated Jul 26, 2024
Jupyter Notebook

bboylyg / BackdoorLLM

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

backdoor llms llms-benchmarking

Updated May 1, 2025
Python

lerogo / MMGenBench

Official repository of MMGenBench

mllm llms-benchmarking mmgenbench

Updated Mar 8, 2025
Python

lechmazur / nyt-connections

Benchmark that evaluates LLMs using 651 NYT Connections puzzles extended with extra trick words

testing benchmark evaluation puzzles reasoning llm llms-benchmarking gpt-4o sonnet3-7 gpt-4-5

Updated May 8, 2025
Python

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Feb 13, 2025
Python

lamalab-org / chembench

How good are LLMs at chemistry?

benchmark machine-learning chemistry safety materials-science llm llms llms-benchmarking

Updated May 8, 2025
Python

FSoft-AI4Code / XMainframe

Language Model for Mainframe Modernization

migration cobol mainframe code-summarization codellm llms-benchmarking

Updated Aug 23, 2024
Python

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.

benchmark evaluation generalization llm llms llms-benchmarking sonnet3-7 gpt-4-5

Updated May 7, 2025

RaptorMai / MLLM-CompBench

[NeurIPS'24] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes