"Lord, forgive me, I've been running, running blind in truth."
— Beyoncé / "Freedom" / Lemonade
ADE-bench¹ is a framework for evaluating AI agents on data analyst tasks. Use it to create realistic testing environments with complex evaluation criteria that can be shared as simple files and tested in production warehouses, while making it as easy as possible to create new tasks.
Watch an explainer video from Coalesce 2025
It does this by running each task in a separate temporary sandbox, and then evaluating the agent's work against the expected results.
1. Clone the repository

```bash
git clone https://github.com/dbt-labs/ade-bench.git
cd ade-bench
```

2. Install Docker Compose and uv
You might already have these; check with:

```bash
uv --version
docker compose version
```

If you don't, install them according to the links above.
3. Install ade

Reminder: `cd` into the ade-bench directory first.

```bash
uv venv
source .venv/bin/activate
uv pip install -e .
```

Confirm it is installed by running:

```bash
ade --help
```

Learn more about virtual environments, including activation instructions for Windows.
4. Download the sample DuckDB databases to /shared/databases/duckdb
Or run the following command:

```bash
uv run --with gdown gdown --folder https://drive.google.com/drive/folders/1CNS_8mf81to02868HA-celmcPEFu4BPE -O shared/databases/duckdb
```

5. Run a sample task

```bash
ade run simple001 --db duckdb --project-type dbt --agent sage
```

You should see output like this:
```
$ ade run simple001 --db duckdb --project-type dbt --agent sage
14:58:07 | system | START  | STARTING HARNESS RUN
14:58:07 | system | START  | Run ID: 2025-12-05__14-58-07
14:58:07 | system | START  | Running with duckdb + dbt. Found 1 tasks of 1 requested tasks.
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Time     ┃ Task      ┃ Stage      ┃ Message                                              ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 14:58:32 │ simple001 │ ✓ ✓ ✓ DONE │ PASS - dbt test results - Pass: 2, Fail: 0, Total: 2 │
│ ──────── │ ───────── │ ────────── │ ──────────────────────────────────────────────────── │
│ 14:58:32 │ SUMMARY   │            │ Run 1 of 1 tasks, 1 successful (100.0%)              │
└──────────┴───────────┴────────────┴──────────────────────────────────────────────────────┘
14:58:32 | system | FINISH | Generated HTML dashboard.
14:58:32 | system | FINISH | Harness run completed: 1 of 1 tasks successful

======================================== RESULTS SUMMARY ========================================
+-------------+--------+--------------+-------+--------+----------+----------+-------+--------------+---------------+--------------+-------+
| Task        | Result | Failure Type | Tests | Passed | Passed % | Time (s) | Cost  | Input Tokens | Output Tokens | Cache Tokens | Turns |
|-------------+--------+--------------+-------+--------+----------+----------+-------+--------------+---------------+--------------+-------|
| simple001   | p      |              | 2     | 2      |          | 0        | $0.00 | 0            | 0             | 0            | 0     |
| TOTAL (n=1) | 100%   |              | 2     | 2      | 100%     | 0        | $0.00 | 0            | 0             | 0            | 0     |
+-------------+--------+--------------+-------+--------+----------+----------+-------+--------------+---------------+--------------+-------+

For more details, run the command below:
ade view
```
Make a copy of .env.example and rename it to .env, then provide your preferred API keys in the specified locations. See Configuration for a worked example.
```bash
ade run all --db duckdb --project-type dbt --agent claude
```

- Use the dbt Fusion engine instead of dbt Core with `--project-type dbt-fusion` (set up Snowflake first; see the example below)
- Enable the dbt MCP server with the `--use-mcp` flag (requires Snowflake; see the MCP section)
- Contribute additional tasks or datasets
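For example, assuming Snowflake is configured (see the Snowflake setup below), a Fusion run might look like this:

```bash
ade run all --db snowflake --project-type dbt-fusion --agent claude
```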
ADE-bench has three primary components:
- Tasks
- Shared databases
- Shared dbt projects
Each task represents a request that might be made to an agent. Though each task can include multiple evaluation criteria (e.g., it may require multiple models to be updated, or a model to have the correct SQL query and materialization configuration), the task is the primary unit of evaluation in ADE-bench.
When ADE-bench is asked to solve a task, here is what happens:
- Copy the project into a sandbox. ADE-bench creates a sandbox environment (i.e., a Docker container) for the task. It loads the corresponding project into the container, and creates a sandbox environment for the corresponding database (see "How databases work" below).
- Take a first snapshot. Once the project is set up, ADE-bench takes a snapshot of all the files in the project. This allows it to log the changes made by additional setup tasks and by the agent.
- Run task-specific setup script. After taking a snapshot, ADE-bench runs any of the task's additional setup scripts. These might make changes to the project or update the data in the database, including modifying the project so that it can be run against a different type of database (see "Sharing projects across databases").
- Take another snapshot. ADE-bench takes another snapshot to log the changes made in the step above.
- Ask the agent to do stuff. The environment is handed to the agent, which then tries to resolve the task.
- Take a final snapshot. Once the agent declares itself done, ADE-bench takes a third snapshot.
- Evaluate the result. The changes are evaluated against the tests specified in the task. If all the tests pass, the task passes. Note: ADE-bench includes automatic ways to compare tables to one another. For example, if you want to evaluate a task by checking whether the agent created the correct `dim_users` table, you can define this table in the task configuration, and the comparison test is generated automatically (see the sketch after this list).
- Clean up the sandbox. Once the task has recorded its results, ADE-bench deletes the container.
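As a minimal sketch of that last comparison point: a task expecting a correct `dim_users` table could declare it under `solution_seeds` in its `task.yaml` (the full configuration format is documented below), and ADE-bench generates the existence and equality tests for it.

```yaml
# task.yaml (excerpt; illustrative)
solution_seeds:
  - table_name: dim_users
```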
ADE-bench runs with the ade CLI command. Here is an example command for running tasks:
```bash
ade run \
foo001 foo002 \ # The tasks to run. See the task selection list immediately below for more details.
--db snowflake \ # Which database variant to run
--project-type dbt \ # Which project variant to run (currently dbt or dbt-fusion)
--agent sage \ # Which agent to use (e.g., `sage`, `claude`, `codex`)
--model claude-opus-4-5-20251101 \ # Optional; which specific model to use for different agents.
--exclude_task_ids foo001 \ # Optional; if you run all tasks, you can exclude tasks with this flag.
--n-concurrent-trials 4 \ # Optional; sets trial concurrency limit; defaults to 4.
--n-attempts 1 \ # Optional; sets times to run each task; defaults to 1.
--max-episodes 50 \ # Optional; the maximum number of calls to an agent's LM; defaults to 50.
--seed \ # Optional; flag for creating solution seed CSVs !! DESTRUCTIVE !! RUN WITH CAUTION !! SEE BELOW !!
--no-diffs \ # Optional; disables taking snapshots of diffs for faster performance.
--persist \ # Optional; keeps the container alive after the trial is over or is aborted.
--use-mcp \ # Optional; creates a dbt MCP server for the agent. Note: Not all agents and databases are supported.
--tasks-dir /absolute/path/to/tasks # Optional; path to an external tasks directory. Defaults to 'tasks' in the current directory.
```

See also: complete usage documentation.
There are several ways to select tasks to run:
```bash
ade run airbnb001           # Run a specific task
ade run f1006.hard          # Run a specific task with a specific variant
ade run airbnb001 airbnb002 # Run multiple tasks
ade run all                 # Run all ready tasks
ade run @coalesce           # Run an experiment set
ade run f1+ simple+         # Run wildcard patterns
```

Before getting into the details, it's useful to see how tasks are configured. This will help define the outline for how ADE-bench works, and the sections below will fill in the details.
Each task folder contains a handful of files:
- `task.yaml` - The task's configuration. This is an important file, so all the details are below.
- `setup.sh` - A setup script that runs before the agent is given the task, for modifying files and doing other computer stuff.
- `setup/` - (Optional) A directory containing files that the setup script can use. For example, suppose you want to replace `a_good_working_file.sql` with `one_with_a_bug.sql`. Put the one with a bug here and use `setup.sh` to replace it. This lets you modify the project for the specific task, without having to make changes to the shared project.
- `solution.sh` - A script that solves the task. The sage agent is an agent that just runs this script. See "The sage agent" below for more.
- `solutions/` - (Optional) Files that are available to the solution script. This is exactly analogous to the `setup/` directory for the setup script.
- `tests/` - dbt tests that are used to evaluate the trial. For a trial to pass, all the tests in this directory must pass. You can add manual tests, and if you include `solution_seeds` in the task configuration, tests will get added here automatically when a task is run. Automatically generated tests are prefixed with `AUTO_`. See "How trials are evaluated" below for more.
- `seeds/` - CSVs that are used when evaluating the automatically-generated equality tests.
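Put together, a task folder might look something like this (an illustrative layout; `foo001` is a hypothetical task name):

```
tasks/foo001/
├── task.yaml
├── setup.sh
├── setup/
├── solution.sh
├── solutions/
├── tests/
└── seeds/
```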
The task is defined in the `task.yaml` file:

```yaml
# task.yaml
task_id: The name of the task, which should match the name of the task directory.
status: An indicator of the task's development status. [ready | dev | open]
description: A description of what the task does
notes: |-
  Optional notes about the task, things to add to it, issues with it, etc.

prompts:
  # Tasks can have multiple prompt variants to see how well agents resolve issues
  # with different levels of detail. The base prompt is required; additional prompt
  # variants are optional. Each prompt will run as its own trial in its own sandbox.
  - key: base
    prompt: |-
      The prompt to give to the agent.
  - key: hard
    prompt: |-
      Optional prompt variants.

author_name: The name of the task author.
author_email: The author's email.
difficulty: How hard the task is.
category: The category of problem the task is testing.

tags:
  # Additional tags to help with organization
  - dbt
  - jinja

# An optional script to run after the agent runs, and before the tests are evaluated.
# This should be used if
#   1. the agent might update a file as part of the task
#   2. and might not run dbt,
#   3. which should be considered a success,
#   4. but dbt needs to run to evaluate the tests.
test_setup: |-
  dbt run --select obt_product_inventory

# ADE-bench will automatically do several things with the tables listed here:
#   1. Create solution seeds (called `solution__[table_name]`) after the agent runs.
#   2. Create a dbt test that checks for the existence of this table.
#   3. Create a test to compare the table to the corresponding solution seed.
#
# As is shown with the tables below, you can:
#   1. Exclude some columns from the equality test
#   2. Only include certain columns in the equality test
#   3. Disable the creation of both the existence and equality tests
#   4. Specify alternate solution seeds (for tasks with multiple valid answers)
solution_seeds:
  - table_name: dim_products
  - table_name: fct_sales
    include_columns:
      - id
      - item_name_displayed
      - item_name_actual
  - table_name: fct_transactions
    exclude_columns:
      - bribe_amount_usd
  - table_name: generated_income_statement
    exclude_tests:
      - equality_test # Can also be `existence_test`
  - table_name: sec_filing_report
    alternates:
      - sec_filing_report_with_fudges

# These are the different database and project variants that the task supports.
# If no migration directory is provided, then no migration script will run.
# If one is provided, it will be run against the project prior to the task's
# setup.sh script.
variants:
  - db_type: duckdb
    db_name: enron
    db_dir: /path/to/databases # Optional, directory containing enron.duckdb
    project_type: dbt
    project_name: enron
    project_dir: /path/to/projects # Optional, directory containing 'enron' project folder
  - db_type: snowflake
    db_name: enron
    project_type: dbt
    project_name: enron
    migration_directory: enron__duckdb_to_snowflake
```

We skimmed over lots of details in the section above. This next section explains the painful tedium of those details.
After the agent runs, tasks are graded via dbt tests. You can add whatever manual tests you want to the task's `tests/` directory. Additionally, ADE-bench will automatically create tests for you based on solution seeds in your task.
More specifically, when the agent completes its work, several things happen:
- If there are `solution_seeds` defined in the `task.yaml` file, the `seeds/` directory is copied into the sandbox. ADE-bench then runs `dbt seed` to create these tables in the database environment.
- The existing test directory in the sandbox is replaced with the tests in the `tests/` directory of the task folder. If a task has no solution seeds, then it will only have manual tests; if a task has solution seeds, ADE-bench will automatically generate two tests (`AUTO_{table_name}_equality.sql` and `AUTO_{table_name}_existence.sql`) and copy those. Note: ADE-bench automatically removes and regenerates all `AUTO_` tests prior to evaluating each trial, so if you update these files directly, your changes will probably get wiped away.
- The tests in that directory are run and the results are recorded. If all of them pass, the task passes. If any fail, the task fails.
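To make that concrete, here is a sketch of a manual equality test you might put in `tests/`. A singular dbt test passes when its query returns zero rows; the auto-generated `AUTO_` tests may be structured differently, so treat this only as an illustration:

```sql
-- tests/dim_users_matches_solution.sql (hypothetical manual test)
-- Rows in either table but not the other; zero rows means the tables match.
(
    select * from {{ ref('dim_users') }}
    except
    select * from {{ ref('solution__dim_users') }}
)
union all
(
    select * from {{ ref('solution__dim_users') }}
    except
    select * from {{ ref('dim_users') }}
)
```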
ADE-bench currently supports these database types:
- DuckDB
- Snowflake
DuckDB databases should be stored in shared/databases/duckdb. When a task is run against a DuckDB database, ADE-bench simply copies the .duckdb file from the shared directory into the task container.
Alternatively, you can use the db_dir field in the variant configuration to specify a different directory containing the database file. You can also set the ADE_DATABASES_DIR environment variable to change the default location for all tasks.
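For example, to point all tasks at a DuckDB directory outside the repo (a sketch; the path is hypothetical):

```bash
# Hypothetical path; override the default shared/databases location
export ADE_DATABASES_DIR=/data/my-duckdb-files
ade run simple001 --db duckdb --project-type dbt --agent sage
```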
Snowflake is more complicated.
When ADE-bench runs a task, three DDL scripts (found in /ade_bench/setup/snowflake_setup.py) are executed by the admin user defined in .env (see Snowflake setup):
- Clone the original database, with the name `TEMP_ADE_[name_of_task]_DATABASE`. (If that database already exists, it is first dropped, so this will reset any prior trials of that task.)
- Create a new Snowflake user and role for the trial. The user is `TEMP_ADE_[name_of_task]_USER`; the role is `TEMP_ADE_[name_of_task]_ROLE`. If these already exist, they are also dropped.
- Grant the appropriate permissions to the user and role. Snowflake is very particular about all of this; if you want to see the specifics, go to the `.py` file above. A rough sketch of these steps is below.
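Roughly, the generated DDL for a hypothetical task named foo looks something like this. This is an illustrative sketch only; the authoritative statements live in `snowflake_setup.py` and may differ:

```sql
-- Illustrative sketch; see /ade_bench/setup/snowflake_setup.py for the real DDL.
create or replace database TEMP_ADE_FOO_DATABASE clone ENRON;
create or replace role TEMP_ADE_FOO_ROLE;
create or replace user TEMP_ADE_FOO_USER
  password = '<generated password>'
  default_role = TEMP_ADE_FOO_ROLE
  must_change_password = false;
grant role TEMP_ADE_FOO_ROLE to user TEMP_ADE_FOO_USER;
grant all privileges on database TEMP_ADE_FOO_DATABASE to role TEMP_ADE_FOO_ROLE;
```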
The Snowflake credentials for this new user are added to profiles.yml in the dbt project. (Technically, this happens when the dbt project is set up and not after the Snowflake environment is set up, but, same same.)
The trial is then conducted using the user created in the step above. Once the trial is complete, the users, roles, and databases all persist in the Snowflake account. If you want to clean these up, you can run this query, which will generate a list of drop statements you can run all at once in the Snowflake UI:
Generate drop statements:

```sql
select
'user' as object_type,
'drop user if exists "' || name || '";' as drop_statement
from snowflake.account_usage.users
where deleted_on is null
and name ilike 'TEMP_ADE_%'
union all
select
'role' as object_type,
'drop role if exists "' || name || '";' as drop_statement
from snowflake.account_usage.roles
where deleted_on is null
and name ilike 'TEMP_ADE_%'
union all
select
'database' as object_type,
'drop database if exists "' || database_name || '";' as drop_statement
from snowflake.account_usage.databases
where deleted is null
and database_name ilike 'TEMP_ADE_%'
order by object_type, drop_statement;
```
As ADE-bench runs, it will tick through the progress of each trial. After the entire suite runs, it will print a summary of the results. For a more detailed view of the results, run:
```bash
ade view
```

This will open a local HTML page that includes much more detail, including details for each task:

- Results: Detailed JSON results of the trial.
- Panes: Terminal output from the setup, agent, and test phases of the trial.
- Diffs: File changes during task setup and changes made by the agent. If ADE-bench is run with the `--no-diffs` flag, these will not be available.
ADE-bench currently supports the following agents:
- Claude Code - `--agent claude`
- OpenAI Codex - `--agent codex`
- Gemini CLI - `--agent gemini`
Optionally, you can include a --model {model-id} flag to specify which model to use. For more information on models and their IDs, refer to documentation provided by Claude, Codex, and Gemini.
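For example, to pin Claude Code to a specific model (an illustrative command; model IDs come from the provider docs linked above):

```bash
ade run simple001 --db duckdb --project-type dbt --agent claude --model claude-opus-4-5-20251101
```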
The following commands are run to call each agent:
```bash
# CLAUDE
claude --output-format json -p {task_prompt} --model {model-id} \
--allowedTools Bash Edit Write NotebookEdit WebFetch
# CODEX
printenv OPENAI_API_KEY | codex login --with-api-key && \
codex --ask-for-approval never --model {model-id} exec --sandbox workspace-write --skip-git-repo-check {task_prompt}
# GEMINI
gemini --output-format json --yolo --prompt {task_prompt} --model {model-id}
```

Configuration files for each agent are found in the `/shared/config` directory. You can use `CLAUDE.md` to configure Claude Code, `AGENTS.md` to configure Codex, and `GEMINI.md` to configure Gemini.
If run with the flag `--use-mcp`, ADE-bench will create a dbt MCP server that the agent is allowed to use. The following databases and agents are supported:

- Databases: `snowflake` (DuckDB doesn't support multiple simultaneous connections)
- Agents: `claude`, `codex`, `gemini`

Because the server runs locally, it only has access to the CLI tools; the others are disabled because they require access to the dbt platform.
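For example, an MCP-enabled run might look like this (assuming your Snowflake credentials are configured in `.env`):

```bash
ade run simple001 --db snowflake --project-type dbt --agent claude --use-mcp
```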
The sage agent is not really an agent—or, it's a very dumb agent. It's an agent that runs `bash solution.sh` and then shuts itself down.
In other words, the sage agent is the answer key to the task. When ADE-bench is run with `--agent sage`, all the tests should pass. If they don't, something is wrong with the tasks themselves. So, it's important for tasks to have a solution script, to make sure that each task is correctly configured.
Note: The solution.sh script might be different for different database and project variants. More on how to handle this is discussed in the contributing guide.
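Because of this, a quick way to validate a set of tasks is to run them with sage and confirm everything passes:

```bash
ade run all --db duckdb --project-type dbt --agent sage
```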
Configuration is all managed through the .env file. Create this file (if it doesn't already exist) by making a copy of .env.example. It's fairly self-explanatory, but for completeness, why not:
```
# API Keys for LLM providers
ANTHROPIC_API_KEY=[API key for Anthropic, if you want to run Claude Code agents]
OPENAI_API_KEY=[API key for OpenAI, if you want to run Codex agents]
GEMINI_API_KEY=[API key for Gemini, if you want to run Gemini CLI]
# Snowflake configuration
# This is the admin user as described in the Snowflake setup below.
SNOWFLAKE_ACCOUNT=[ORGANIZATION_NAME-ACCOUNT_NAME]
SNOWFLAKE_USER=ADE_BENCH_ADMIN_USER
SNOWFLAKE_ROLE=ADE_BENCH_ADMIN_ROLE
SNOWFLAKE_PASSWORD=[the user's password]
SNOWFLAKE_WAREHOUSE=[the warehouse you want this user and all ade-bench agents to use]
# Timeout Settings (in seconds)
# These are default timeouts for each stage of the harness.
SETUP_TIMEOUT_SEC=120 # How long setup tasks can run
DEFAULT_AGENT_TIMEOUT_SEC=300 # How long the agent can run
DEFAULT_TEST_TIMEOUT_SEC=180 # How long test scripts can run
CLEANUP_TIMEOUT_SEC=60 # How long cleanup scripts can run
# Directory Configuration (optional, defaults shown)
# ADE_TASKS_DIR=tasks # Path to directory where tasks are located
# ADE_DATABASES_DIR=shared/databases # Path to directory where DuckDB databases are located
# ADE_PROJECTS_DIR=shared/projects # Path to directory where dbt projects are located
# Docker Configuration
DOCKER_DEFAULT_PLATFORM=linux/amd64
# Logging
# This stuff is for nerds.
USE_DYNAMIC_LOGGING=TRUE # Set to FALSE if you want normal logs and not a fancy table.
FILE_DIFF_EXCLUDE_PATHS=/tmp,/logs,/var,/target,/build,/node_modules
LOG_LEVEL=INFO
```

Caution: Do not connect ADE-bench to your production Snowflake account when using shared datasets! It assumes it has free rein to create and drop objects in the database, and runs a lot of dangerous DDL statements in the process.
Sign up for a trial Snowflake account.
ADE-bench will use the admin user created below to create an empty testing environment (database, user, role) at the beginning of each trial.
Run the following commands:
```sql
-- Put a secure password here, inside 'single quotes'.
-- Make sure you keep the semicolon after the quotes.
SET PASSWORD_VARIABLE= ;
-- Create role
CREATE OR REPLACE ROLE ADE_BENCH_ADMIN_ROLE;
-- Configure role
GRANT CREATE DATABASE ON ACCOUNT TO ROLE ADE_BENCH_ADMIN_ROLE;
GRANT MANAGE GRANTS ON ACCOUNT TO ROLE ADE_BENCH_ADMIN_ROLE;
GRANT CREATE USER ON ACCOUNT TO ROLE ADE_BENCH_ADMIN_ROLE;
GRANT CREATE ROLE ON ACCOUNT TO ROLE ADE_BENCH_ADMIN_ROLE;
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE ADE_BENCH_ADMIN_ROLE;
-- Create admin user
CREATE OR REPLACE USER ADE_BENCH_ADMIN_USER
PASSWORD=$PASSWORD_VARIABLE
DEFAULT_ROLE=ADE_BENCH_ADMIN_ROLE
MUST_CHANGE_PASSWORD=FALSE
TYPE=LEGACY_SERVICE
DEFAULT_WAREHOUSE=COMPUTE_WH;
-- Configure admin user
GRANT ROLE ADE_BENCH_ADMIN_ROLE TO USER ADE_BENCH_ADMIN_USER;
SELECT 'SNOWFLAKE_ACCOUNT=' || CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME() as ".env values"
UNION ALL
SELECT 'SNOWFLAKE_USER=ADE_BENCH_ADMIN_USER'
UNION ALL
SELECT 'SNOWFLAKE_PASSWORD=' || $PASSWORD_VARIABLE
UNION ALL
SELECT 'SNOWFLAKE_WAREHOUSE=' || CURRENT_WAREHOUSE()
UNION ALL
SELECT 'SNOWFLAKE_ROLE=ADE_BENCH_ADMIN_ROLE';
```

Add the resulting values to your .env file. This is the same file you created when specifying your API key.
```
SNOWFLAKE_ACCOUNT=[ORGANIZATION_NAME-ACCOUNT_NAME]
SNOWFLAKE_USER=ADE_BENCH_ADMIN_USER
SNOWFLAKE_ROLE=ADE_BENCH_ADMIN_ROLE
SNOWFLAKE_PASSWORD=[the user's password]
SNOWFLAKE_WAREHOUSE=[the warehouse you want this user and all ade-bench agents to use]
```

Snowflake is deprecating password-based authentication, which makes this very annoying. For now, you can get around this by making this user a LEGACY_SERVICE user, though that will eventually go away. A problem for our future selves, who might be able to solve it with the help of these dbt and Snowflake docs.
Caution: Do not connect ADE-bench to your production Snowflake account when using shared datasets! It assumes it has free rein to create and drop objects in the database, and runs a lot of dangerous DDL statements in the process.
ADE-bench includes a command to migrate all of the DuckDB files in `shared/databases/duckdb` into their own databases (matching the name of the DuckDB database) in the Snowflake account specified in your `.env` file.

```bash
ade migrate duckdb-to-snowflake --use-database-export

# Only foo.duckdb and bar.duckdb
ade migrate duckdb-to-snowflake --include foo bar

# All but bar.duckdb
ade migrate duckdb-to-snowflake --exclude bar
```

To create an Anthropic API key:

- Login at https://platform.claude.com
- Go to https://platform.claude.com/settings/keys
- Create an API key:
  - Create in Workspace: Default
  - Name your key: whatever you want :)
- Save the key in a secure location
- Assign the key to `ANTHROPIC_API_KEY` within the `.env` file
- Disable diffs to speed things up. If you're trying to run a bunch of tasks, or testing something where you don't care about looking at diffs, use the `--no-diffs` flag. Diffs can be slow, and ADE-bench will run faster if they're turned off.
- View all the tasks. Running `ade view tasks` will print a summary table of all the tasks in ADE-bench. You can also run `ade view tasks --copy` to copy a more detailed summary of the tasks to your clipboard.
ADE-bench was created by Benn Stancil with support from dbt Labs and Macro. Their contributions to this project are what made it possible; you can thank both by checking out their other work.
ADE-bench also borrowed heavily from Terminal-Bench and Spider 2.0, without which none of this would exist:
- The Terminal-Bench Team. (Apr 2025). Terminal-Bench: A Benchmark for AI Agents in Terminal Environments. Available: https://github.com/laude-institute/terminal-bench
- Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, & Tao Yu. (2024). Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. Available: https://arxiv.org/abs/2411.07763
If you want to learn more or get involved, you can find us in #tools-ade-bench in the dbt Labs community Slack. Also, feel free to reach out to Benn at benn.electronicmail@gmail.com.
Footnotes

1. ADE-bench is short for "Analytics and data engineering benchmark," and is pronounced lemonade-bench and not AYE-DEE-EE-bench, because that is how we should pronounce things around here. ↩