A comprehensive evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
This LLM-as-a-judge system evaluates computer-use agents by:
- Reading visual evidence: Processes screenshots from agent execution
- Analyzing trajectories: Reviews step-by-step actions and reasoning
- Grounding evaluation: Compares agent outputs against task requirements
- Providing structured feedback: Returns scores, error categories, and improvement tips

Key features:

- Comprehensive Scoring: 0-100 score with success threshold at 70
- Error Categorization: Identifies specific failure modes (captcha, infinite loops, tool failures, etc.)
- Multi-Model Support: Works with OpenAI (GPT-4, GPT-5, etc.) and Google Gemini models
- Batch Processing: Evaluates multiple tasks concurrently
- Structured Output: JSON format with detailed reasoning and actionable feedback

Prerequisites:

- Python 3.8 or higher
- API key for OpenAI or Google Gemini

Installation:

- Clone the repository:

  ```bash
  git clone https://github.com/anaishowland/llm-judge-psai.git
  cd llm-judge-psai
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  Create a `.env` file in the root directory:
  ```bash
  # For OpenAI models (GPT-4, GPT-5, etc.)
  OPENAI_API_KEY=your_openai_api_key_here

  # OR for Google Gemini models
  GOOGLE_API_KEY=your_google_api_key_here

  # Optional: Adjust concurrency (default: 50)
  JUDGE_MAX_CONCURRENCY=100

  # Optional: Set per-task timeout in seconds
  JUDGE_TASK_TIMEOUT_SECONDS=300
  ```

Evaluate a single episode or folder of tasks:
```bash
python evaluate_results.py /path/to/results_folder --model gpt-5
```

With additional options:

```bash
python evaluate_results.py /path/to/results_folder \
    --model gpt-5 \
    --max-images 15 \
    --output custom_output.json \
    --temperature 0.0
```

Or use the wrapper script:

```bash
./run_judge.sh /path/to/results_folder --model gpt-5 --max-images 10
```

Options:

- `eval_folder` (required): Path to the episode folder or single task folder
- `--model MODEL`: LLM model to use (default: `gpt-4o`)
  - Recommended: `gpt-5` for best accuracy
  - Cost-effective: `gpt-5-mini` (faster, lower cost, slightly reduced accuracy)
  - Alternative: `gemini-2.5-pro`, `gemini-2.5-flash`
- `--max-images N`: Maximum screenshots to include per task (default: 10)
- `--output FILE`: Output file path (default: `llm_judge.json` in the eval folder)
- `--temperature FLOAT`: Temperature for LLM sampling (default: 0.0)
GPT-5 (Recommended) ⭐
- Best for: Highest accuracy evaluations
- Pros: Superior reasoning, better at detecting subtle errors
- Cons: Higher cost, slower latency
GPT-5-mini
- Best for: Cost-sensitive deployments
- Pros: Faster, significantly cheaper (~90% cost reduction)
- Cons: Slightly reduced accuracy compared to GPT-5
Gemini 2.5 Pro/Flash
- Best for: Alternative to OpenAI models
- Pros: Good performance, competitive pricing
- Cons: Different API requirements
The judge expects a specific folder structure for each task:
```
task_folder/
├── results.zst          # Compressed agent execution results (or results.json)
├── screenshot_*.png     # Sequential screenshots from agent execution
└── (optional) results.json
```
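Purely as an illustration (this helper is not part of the tool), a task folder matching this layout could be located and validated like so:

```python
from pathlib import Path

def find_task_inputs(task_folder):
    """Locate the result file and screenshots in a task folder (illustrative only)."""
    folder = Path(task_folder)

    # Prefer the compressed results, fall back to plain JSON
    results_file = folder / "results.zst"
    if not results_file.exists():
        results_file = folder / "results.json"
    if not results_file.exists():
        raise FileNotFoundError(f"No results.zst or results.json in {folder}")

    # Screenshots are sequential PNGs named screenshot_*.png
    screenshots = sorted(folder.glob("screenshot_*.png"))
    return results_file, screenshots

# Hypothetical usage:
# results_file, screenshots = find_task_inputs("/path/to/task_folder")
```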
The judge reads the following information from each task result:
```json
{
  "task": "Task description string",
  "taskId": "unique_task_identifier",
  "steps": [
    {
      "model_output": {
        "thinking": "Agent's reasoning for this step",
        "memory": "Agent's memory/context",
        "next_goal": "What the agent plans to do",
        "action": [{"action_type": "parameters"}]
      },
      "result": [{"metadata": "action_result"}],
      "state": {"url": "current_url", "title": "page_title"},
      "screenshot": "path_to_screenshot"
    }
  ],
  "results": "Final output from the agent",
  "error": null
}
```

- task: Task description to understand requirements
- taskId: Unique identifier for the task
- steps: Array of agent actions with thinking/reasoning
- results: Final output/completion message from agent
- error: Any errors encountered during execution
- screenshots: Visual evidence of agent actions
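The judge handles this parsing internally; as a rough sketch (assuming the `zstandard` package for `.zst` files), a result file with this structure could be read like this:

```python
import json
from pathlib import Path

import zstandard  # assumed dependency for reading .zst files

def load_task_result(results_path):
    """Decompress (if needed) and parse a task result file (illustrative only)."""
    path = Path(results_path)
    raw = path.read_bytes()
    if path.suffix == ".zst":
        raw = zstandard.ZstdDecompressor().decompress(raw)
    return json.loads(raw)

# Hypothetical usage:
# result = load_task_result("/path/to/task_folder/results.zst")
# print(result["task"], len(result["steps"]), result["error"])
```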
The judge produces a JSON file (`llm_judge.json` by default) with evaluation results:
```json
{
  "task_id": "unique_task_id",
  "task_description": "Browse and filter products...",
  "llm_success": true,
  "agent_success": true,
  "evaluation": {
    "task_summary": "Filter products on e-commerce site",
    "reasoning": "Agent successfully applied filters and completed task...",
    "error_categories": [],
    "final_score": 95,
    "improvement_tips": []
  }
}
```

Per-task evaluations are aggregated into an episode-level summary:

```json
{
  "evaluation_folder": "/path/to/episode",
  "folder_type": "single_episode",
  "total_tasks": 10,
  "evaluations_completed": 10,
  "evaluations_failed": 0,
  "overall_average_score": 84.0,
  "overall_llm_success_rate": 0.8,
  "overall_agent_success_rate": 0.8,
  "episode_results": [
    {
      "episode": "0",
      "evaluations": [...]
    }
  ]
}
```

Key output fields:

- final_score: Integer 0-100 (≥70 = success, <70 = failure)
- task_summary: One-sentence summary of the task
- reasoning: Detailed explanation of the score
- error_categories: List of detected issues (e.g., `captcha_unsolved`, `infinite_loop`, `tool_failed`)
- improvement_tips: Actionable suggestions for improvement
- llm_success: Whether the judge considers the task successful
- agent_success: Whether the agent reported success
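As an example of consuming this output, a short post-processing script (assuming the default `llm_judge.json` file and the field names shown above) could recompute scores and success rates:

```python
import json

# Load the judge output (assumes the aggregated format shown above)
with open("llm_judge.json") as f:
    report = json.load(f)

evaluations = [
    ev
    for episode in report.get("episode_results", [])
    for ev in episode.get("evaluations", [])
]

if evaluations:
    scores = [ev["evaluation"]["final_score"] for ev in evaluations]
    print(f"Tasks evaluated: {len(evaluations)}")
    print(f"Average score:   {sum(scores) / len(scores):.1f}")
    # A score of 70 or above counts as success
    print(f"Success rate:    {sum(s >= 70 for s in scores) / len(scores):.0%}")
```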
The `examples/` directory contains a basic example with input (`results.json`, screenshots) and output (`llm_judge_output.json`).
Run the judge on this example:
```bash
python evaluate_results.py examples/ --model gpt-5
```

The evaluation pipeline runs in five stages:

- Input Processing: Reads compressed or JSON result files and loads screenshots
- Context Building: Constructs a prompt with:
- Task description
- Agent's step-by-step actions and reasoning
- Screenshots showing visual state
- Final results/output
- LLM Evaluation: Sends to GPT-5 (or chosen model) for assessment
- Structured Output: Parses LLM response into standardized format
- Aggregation: Combines individual task evaluations into episode-level metrics
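Conceptually, the context-building and LLM-evaluation steps resemble the sketch below. This is not the actual implementation; it assumes the `openai` Python client, base64-encoded PNG screenshots, and placeholder prompt text:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_user_content(task, steps_text, screenshot_paths, final_results):
    """Assemble a multimodal message: task text, trajectory summary, and screenshots."""
    prompt = (
        f"Task: {task}\n\n"
        f"Trajectory:\n{steps_text}\n\n"
        f"Final results:\n{final_results}"
    )
    content = [{"type": "text", "text": prompt}]
    for path in screenshot_paths:
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        )
    return content

# Hypothetical call (all inputs are placeholders):
# response = client.chat.completions.create(
#     model="gpt-4o",
#     temperature=0.0,
#     messages=[
#         {"role": "system", "content": "You are a strict judge of computer-use agents ..."},
#         {"role": "user", "content": build_user_content(task, steps_text, screenshots, final_results)},
#     ],
# )
```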
The judge uses a detailed system prompt with:
- Grounding rules: Must validate claims against screenshots
- Error taxonomy: 20+ specific error categories
- Scoring rubric: Clear criteria for 0-100 scale
- Output schema: Enforced JSON structure
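The authoritative schema lives in the system prompt, but the enforced structure corresponds roughly to the shape below, sketched here with Pydantic using the field names from the output format above:

```python
from typing import List
from pydantic import BaseModel, Field

class JudgeEvaluation(BaseModel):
    """Approximate shape of the per-task evaluation the judge returns."""
    task_summary: str                        # one-sentence summary of the task
    reasoning: str                           # detailed explanation of the score
    error_categories: List[str] = []         # e.g. ["captcha_unsolved", "infinite_loop"]
    final_score: int = Field(ge=0, le=100)   # >= 70 counts as success
    improvement_tips: List[str] = []         # actionable suggestions
```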
To run locally:

```bash
# Set up environment
export OPENAI_API_KEY=your_key

# Run evaluation
python evaluate_results.py /path/to/results --model gpt-5
```

Build the Docker image:
```bash
./build-judge.sh
```

Run in container:

```bash
docker run -e OPENAI_API_KEY=your_key \
    -v /path/to/results:/data \
    llm-judge:latest \
    python evaluate_results.py /data --model gpt-5
```

The system was designed to run in Google Cloud with GCS integration but can run anywhere with API access.
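If agent results live in GCS, one possible approach (a sketch assuming the `google-cloud-storage` package and hypothetical bucket and prefix names) is to pull them to a local folder before invoking the judge:

```python
from pathlib import Path
from google.cloud import storage  # assumed dependency; uses application default credentials

def download_results(bucket_name, prefix, dest):
    """Copy all result blobs under a GCS prefix to a local folder (illustrative only)."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):  # skip "directory" placeholder objects
            continue
        target = Path(dest) / Path(blob.name).relative_to(prefix)
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))

# Hypothetical usage:
# download_results("my-agent-results", "episodes/0", "/tmp/results")
# followed by: python evaluate_results.py /tmp/results --model gpt-5
```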
The judge can identify these error types:
- Access: `captcha_unsolved`, `login_failed`, `security_block`
- LLM: `rate_limited`, `llm_call_error`
- Planning: `infinite_loop`, `wrong_output_format`, `navigation_error`, `timeout`
- Browser: `element_interaction_error`, `browser_crashes`, `tool_failed`
- Task: `partial_output`, `impossible_task`
- Data: `file_system_misuse`, `data_not_saved`, `content_not_found`
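As a quick illustration (again assuming the default `llm_judge.json` output format shown above), detected categories can be tallied across an episode to surface recurring failure modes:

```python
import json
from collections import Counter

with open("llm_judge.json") as f:
    report = json.load(f)

# Count how often each error category appears across all evaluated tasks
counts = Counter(
    category
    for episode in report.get("episode_results", [])
    for ev in episode.get("evaluations", [])
    for category in ev["evaluation"].get("error_categories", [])
)

for category, n in counts.most_common():
    print(f"{category}: {n}")
```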
Performance tips:

- Concurrency: Adjust `JUDGE_MAX_CONCURRENCY` based on rate limits
- Image Count: Reduce `--max-images` to lower token usage and cost
- Model Choice: Use `gpt-5-mini` for faster, cheaper evaluations
- Caching: Results are written to disk; re-runs skip completed tasks
Troubleshooting:

Import Errors: Ensure all dependencies are installed

```bash
pip install -r requirements.txt
```

API Key Issues: Verify the `.env` file or environment variables

```bash
echo $OPENAI_API_KEY  # Should print your key
```

Rate Limiting: Reduce `JUDGE_MAX_CONCURRENCY`

```bash
export JUDGE_MAX_CONCURRENCY=10
```

Memory Issues: Reduce `--max-images` or process fewer tasks concurrently
If you use this LLM judge system, please cite:
```bibtex
@software{llm_judge_psai,
  title={LLM as a Judge for Computer-Use Agents},
  author={Anaïs Howland and Ashwin Thinnappan at Paradigm Shift AI},
  year={2025},
  url={https://github.com/anaishowland/llm-judge-psai}
}
```

Created at Paradigm Shift AI.
MIT License - see LICENSE file for details.
Contributions are welcome! Please feel free to submit issues or pull requests.