
GitTaskBench

📊 Motivation and Goal

The ultimate vision for AI agents is to enable users to accomplish real-world tasks simply by describing their needs in natural language—leaving all planning and execution to the agent, which delivers the final results autonomously.


Figure: A simple illustration of a general AI agent.

While existing benchmarks evaluate various agent capabilities, few focus on tasks that reflect genuine real-world practicality, especially those requiring comprehensive understanding and use of full-scale project repositories.

To address this gap, we introduce GitTaskBench. Our benchmark focuses on tasks whose complexity and practical value demand leveraging repository-level code, mirroring how developers solve real problems using existing GitHub projects.

We carefully selected 54 representative tasks with real-world economic value, and for each task, searched and identified a corresponding GitHub repository that meets strict selection criteria (the repository for each task is fixed to ensure benchmark completeness, as some agent frameworks do not support searching for appropriate repositories). This setup allows us to systematically evaluate LLM agents' ability to utilize open-source repositories to solve complex, realistic problems.

By doing so, GitTaskBench offers a more authentic and comprehensive assessment of agent performance in practical, repository-driven environments.

👋 Overview

GitTaskBench is a comprehensive benchmark designed to evaluate the capabilities of intelligent agents across multiple modalities and task complexities. It encompasses 54 tasks spanning 7 key domains.

Each domain features a curated set of tasks that reflect real-world applications and research challenges. These tasks assess an agent's ability to autonomously interpret complex instructions, process multi-modal inputs, perform reasoning, understand and explore GitHub repositories, and deliver accurate, meaningful outputs.

✅ Task Distribution

  • Image Processing: Style Transfer, Image Coloring, Image Restoration, Scratch Detection, Image Enhancement, Background Processing, Watermark Embedding
  • Video Processing: Video Action Analysis, Style Transfer, Video Coloring
  • Speech Processing: Speech Recognition, Speech Separation, Speech Enhancement, Noise Reduction, Speech Analysis
  • Physiological Signals Processing: EDA (Electrodermal Activity) Data Analysis, ECG (Electrocardiogram) Data Analysis, EOG (Electrooculogram) Data Analysis
  • Security and Privacy: Data Simulation, Watermark Embedding, Watermark Extraction
  • Web Scraping: Web Content Extraction, Format Transformation
  • Office Document Processing: Excel Document Parsing, PDF Content Extraction, PDF Content Processing

📝 Use Cases

Case 1: PDF Email Extraction

task = """
Extract all email addresses found in the given PDF and save them to a text file.
Input file: /path/to/document.pdf
Output requirement: Save as output.txt
"""

# evaluation metrics = """
# Process: True/False (Data Integrity Check)  
#     -- Confirms prediction file and ground truth file accessibility  
#     -- Validates file parsing success (no read errors)  

# Result: True/False (Performance Threshold)  
#     -- Calculates accuracy: (Correct Emails / Total Ground Truth) ×100%  
#     -- Applies pass criterion: Accuracy ≥98%   
# """

Case 2: Video Coloring

task = """
Colorize the provided black-and-white video to produce a fully colored version.
Input file: /path/to/black_and_white_video
Output requirement: The output video should be named "output"
"""

# evaluation metrics = """
# Process: True/False (Technical Validity Verification)  
#     -- Verifies input video file existence and non-empty status  
#     -- Checks format compatibility (.mp4/.avi/.mov/.mkv)  
#     -- Validates frame extraction capability  

# Result: True/False (Color Intensity Threshold)  
#     -- Samples 30 frames with standardized width (256px)  
#     -- Computes per-frame colorfulness via the Hasler-Süsstrunk metric  
#     -- Aggregates scores to calculate the video-level average  
#     -- Pass/Fail determination (threshold: >10.0)  
# """

Case 3: Image Watermark Embedding

task = """
Embed a blind (invisible) watermark into the given PNG image.
Input file: /path/to/image.png
Output requirement: Save as output.png
"""

# evaluation metrics = """
# Process: True/False (Input Validation)  
#     -- Verifies existence and non-empty status of original/watermarked images  
#     -- Checks image file integrity (readable formats via OpenCV)  

# Result: True/False (Watermark & Quality Compliance)  
#     -- Extracts watermark text using DWT-DCT decoding  
#     -- Matches extracted text against ground truth (100% match required)  
#     -- Computes PSNR between original and watermarked images (≥30.0 dB threshold)  
#     -- Final pass requires both watermark match AND PSNR compliance  
# """

✨ Key Features:

  • Multi-Modal Support: Encompasses vision, language, audio, time-series, and web-based data.
  • Diverse Task Types: Includes generation, recognition, enhancement, analysis, and simulation tasks.
  • Real-World Relevance: Tasks are derived from practical applications in media, healthcare, automation, and data science.
  • Scalability: Designed for future expansion with new tasks and evaluation metrics.

📖 Automated Evaluation

GitTaskBench evaluates two key aspects:

  • Execution Completion Rate: measures whether the agent can leverage the repository to produce any valid output.

  • Task Pass Rate: assesses whether the output meets task-specific evaluation criteria.

Given the diversity of tasks, all evaluation metrics are predefined and tailored to each task, drawing on commonly accepted standards within the developer community. This ensures a comprehensive and fair assessment.
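
Given per-task Process/Result flags like those shown in the use cases above, the two rates aggregate straightforwardly. A minimal sketch, assuming each task report is a dict with boolean "process" and "result" fields (the field names are our assumption):

def aggregate(reports):
    # reports: one {"process": bool, "result": bool} record per task
    if not reports:
        return 0.0, 0.0
    n = len(reports)
    completion_rate = sum(r["process"] for r in reports) / n  # Execution Completion Rate
    pass_rate = sum(r["result"] for r in reports) / n         # Task Pass Rate
    return completion_rate, pass_rate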

🚀 Set Up

First, create a new conda environment:

conda create -n gittaskbench python=3.10 -y
conda activate gittaskbench

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 \
  --extra-index-url https://download.pytorch.org/whl/cu113

Then, you can install gittaskbench with pip:

pip install -e .

Alternatively, you can install the dependencies directly:

pip install -r requirements.txt

🤖 Quick Start

gittaskbench [-v] grade --taskid <taskid> [--output_dir <output_dir>] [--result <result>]

🔧 Options:

  • --taskid: (Required for single-task evaluation) The task identifier, e.g., Trafilatura_01.
  • -v: (Optional) Enable verbose output to display detailed error messages.
  • --output_dir: (Optional) The directory containing the agent's output files. If not specified, the default value is read from task_info.yaml.
  • --result: (Optional) The directory containing the agent's test result files. If not specified, the default value is read from task_info.yaml. See the example below.
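
For instance, to grade a single task verbosely while overriding the defaults from task_info.yaml (the paths below are placeholders):

gittaskbench -v grade --taskid Trafilatura_01 --output_dir ./outputs/Trafilatura_01 --result ./test_results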

gittaskbench eval [--result <result>]

🔧 Options:

  • --result: (Optional) The directory containing the agent's test result files. If not specified, the default is the test_results directory in the repository.

💡 Example:

Single Task Evaluation

To evaluate a single task, use the following command. The example below evaluates the Trafilatura_01 task:

gittaskbench grade --taskid Trafilatura_01

All Tasks Evaluation

To evaluate all tasks, use the --all flag. This command automatically iterates through all tasks and runs each evaluation:

gittaskbench grade --all

Test Results Analysis

After the evaluation completes, you can analyze the test results with the eval command. It reads the evaluation results in the specified directory and outputs an analysis report:

gittaskbench eval

🛠️ How to Adapt to Agent Frameworks

We provide detailed configuration guidelines for integrating GitTaskBench with existing state-of-the-art general-purpose agent frameworks, including OpenHands, SWE-Agent, and Aider. This enables users to seamlessly run batches of benchmark tasks within their agent pipelines.

In fact, the batch runner we provide—designed to enable efficient execution of multiple tasks—is not limited to GitTaskBench, and can be broadly applied to other benchmarks and agent-based task suites as well.

👉 Configuration details for each agent framework are provided in the corresponding files in the repository.

🤝 Contributing

We welcome community contributions! Please refer to the following guidelines:

Development Setup

git clone https://github.com/your-org/GitTaskBench.git
cd GitTaskBench

For environment setup and details on the automated evaluation, please refer to the 🚀 Set Up and 📖 Automated Evaluation sections.

Contribution Types

  • 🐛 Bug fixes
  • ✨ New feature development
  • 📚 Documentation improvements
  • 🧪 Test case additions
  • 🔧 Repos and utilities

Submission Process

  1. Fork the project and create a feature branch
  2. Write code and tests
  3. Ensure all tests pass
  4. Submit Pull Request

⭐ If GitTaskBench helps you, please give us a star!

Made with ❤️ by the GitTaskBench Team
