
LLM as a Judge for Computer-Use Agents

A comprehensive evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.

Overview

This LLM-as-a-judge system evaluates computer-use agents by:

  • Reading visual evidence: Processes screenshots from agent execution
  • Analyzing trajectories: Reviews step-by-step actions and reasoning
  • Grounding evaluation: Compares agent outputs against task requirements
  • Providing structured feedback: Returns scores, error categories, and improvement tips

Key Features

  • Comprehensive Scoring: 0-100 score with success threshold at 70
  • Error Categorization: Identifies specific failure modes (captcha, infinite loops, tool failures, etc.)
  • Multi-Model Support: Works with OpenAI (GPT-4, GPT-5, etc.) and Google Gemini models
  • Batch Processing: Evaluates multiple tasks concurrently
  • Structured Output: JSON format with detailed reasoning and actionable feedback

Installation

Prerequisites

  • Python 3.8 or higher
  • API key for OpenAI or Google Gemini

Setup

  1. Clone the repository:
git clone https://github.com/anaishowland/llm-judge-psai.git
cd llm-judge-psai
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables: Create a .env file in the root directory:
# For OpenAI models (GPT-4, GPT-5, etc.)
OPENAI_API_KEY=your_openai_api_key_here

# OR for Google Gemini models
GOOGLE_API_KEY=your_google_api_key_here

# Optional: Adjust concurrency (default: 50)
JUDGE_MAX_CONCURRENCY=100

# Optional: Set per-task timeout in seconds
JUDGE_TASK_TIMEOUT_SECONDS=300
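
These variables are typically read once at startup. The snippet below is only a minimal sketch of how such configuration might be loaded, assuming python-dotenv is available; it is not the repository's actual startup code.

# Hypothetical configuration loading, assuming the `python-dotenv` package.
# Variable names match the README; the real code may read them differently.
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the current working directory

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
MAX_CONCURRENCY = int(os.getenv("JUDGE_MAX_CONCURRENCY", "50"))
TASK_TIMEOUT = float(os.getenv("JUDGE_TASK_TIMEOUT_SECONDS", "300"))

if not (OPENAI_API_KEY or GOOGLE_API_KEY):
    raise RuntimeError("Set OPENAI_API_KEY or GOOGLE_API_KEY before running the judge.")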

Usage

Basic Usage

Evaluate a single episode or folder of tasks:

python evaluate_results.py /path/to/results_folder --model gpt-5

Advanced Usage

python evaluate_results.py /path/to/results_folder \
  --model gpt-5 \
  --max-images 15 \
  --output custom_output.json \
  --temperature 0.0

Using the Shell Script

./run_judge.sh /path/to/results_folder --model gpt-5 --max-images 10

Command-Line Arguments

  • eval_folder (required): Path to the episode folder or single task folder
  • --model MODEL: LLM model to use (default: gpt-4o)
    • Recommended: gpt-5 for best accuracy
    • Cost-effective: gpt-5-mini (faster, lower cost, slightly reduced accuracy)
    • Alternative: gemini-2.5-pro, gemini-2.5-flash
  • --max-images N: Maximum screenshots to include per task (default: 10)
  • --output FILE: Output file path (default: llm_judge.json in eval folder)
  • --temperature FLOAT: Temperature for LLM sampling (default: 0.0)
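
The flags above correspond to a standard argparse-style interface. The following is an illustrative sketch of such a parser built from the documented defaults, not the repository's actual code:

# Illustrative argparse setup matching the documented flags and defaults.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="LLM judge for computer-use agent results")
    parser.add_argument("eval_folder",
                        help="Episode folder or single task folder")
    parser.add_argument("--model", default="gpt-4o",
                        help="LLM model to use")
    parser.add_argument("--max-images", type=int, default=10,
                        help="Maximum screenshots to include per task")
    parser.add_argument("--output", default=None,
                        help="Output file path (default: llm_judge.json in the eval folder)")
    parser.add_argument("--temperature", type=float, default=0.0,
                        help="Sampling temperature for the judge LLM")
    return parser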

Model Recommendations

GPT-5 (Recommended) ⭐

  • Best for: Highest accuracy evaluations
  • Pros: Superior reasoning, better at detecting subtle errors
  • Cons: Higher cost, higher latency

GPT-5-mini

  • Best for: Cost-sensitive deployments
  • Pros: Faster, significantly cheaper (~90% cost reduction)
  • Cons: Slightly reduced accuracy compared to GPT-5

Gemini 2.5 Pro/Flash

  • Best for: Alternative to OpenAI models
  • Pros: Good performance, competitive pricing
  • Cons: Different API requirements

Input Format

The judge expects a specific folder structure for each task:

task_folder/
├── results.zst          # Compressed agent execution results
├── results.json         # Uncompressed alternative to results.zst (optional when results.zst is present)
└── screenshot_*.png     # Sequential screenshots from agent execution

Required Fields in results.json or results.zst

The judge reads the following information from each task result:

{
  "task": "Task description string",
  "taskId": "unique_task_identifier",
  "steps": [
    {
      "model_output": {
        "thinking": "Agent's reasoning for this step",
        "memory": "Agent's memory/context",
        "next_goal": "What the agent plans to do",
        "action": [{"action_type": "parameters"}]
      },
      "result": [{"metadata": "action_result"}],
      "state": {"url": "current_url", "title": "page_title"},
      "screenshot": "path_to_screenshot"
    }
  ],
  "results": "Final output from the agent",
  "error": null
}

Key Fields Used by Judge:

  • task: Task description to understand requirements
  • taskId: Unique identifier for the task
  • steps: Array of agent actions with thinking/reasoning
  • results: Final output/completion message from agent
  • error: Any errors encountered during execution
  • screenshots: Visual evidence of agent actions
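
As a rough illustration of reading this layout, the sketch below loads a single task folder, assuming the zstandard package for .zst decompression; the function name load_task is hypothetical and the judge's real loading code may differ.

# Hypothetical task-folder loader, assuming the `zstandard` package is installed.
import glob
import json
import os
import zstandard

def load_task(task_folder):
    zst_path = os.path.join(task_folder, "results.zst")
    json_path = os.path.join(task_folder, "results.json")
    if os.path.exists(zst_path):
        with open(zst_path, "rb") as f:
            raw = zstandard.ZstdDecompressor().stream_reader(f).read()
        result = json.loads(raw)
    else:
        with open(json_path) as f:
            result = json.load(f)
    # Lexicographic sort; a numeric sort may be needed if there are >9 screenshots.
    screenshots = sorted(glob.glob(os.path.join(task_folder, "screenshot_*.png")))
    return result, screenshots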

Output Format

The judge produces a JSON file (llm_judge.json by default) with evaluation results:

Single Task Evaluation

{
  "task_id": "unique_task_id",
  "task_description": "Browse and filter products...",
  "llm_success": true,
  "agent_success": true,
  "evaluation": {
    "task_summary": "Filter products on e-commerce site",
    "reasoning": "Agent successfully applied filters and completed task...",
    "error_categories": [],
    "final_score": 95,
    "improvement_tips": []
  }
}

Episode-Level Evaluation (Multiple Tasks)

{
  "evaluation_folder": "/path/to/episode",
  "folder_type": "single_episode",
  "total_tasks": 10,
  "evaluations_completed": 10,
  "evaluations_failed": 0,
  "overall_average_score": 84.0,
  "overall_llm_success_rate": 0.8,
  "overall_agent_success_rate": 0.8,
  "episode_results": [
    {
      "episode": "0",
      "evaluations": [...]
    }
  ]
}

Evaluation Fields

  • final_score: Integer 0-100 (≥70 = success, <70 = failure)
  • task_summary: One-sentence summary of the task
  • reasoning: Detailed explanation of the score
  • error_categories: List of detected issues (e.g., captcha_unsolved, infinite_loop, tool_failed)
  • improvement_tips: Actionable suggestions for improvement
  • llm_success: Whether the judge considers the task successful
  • agent_success: Whether the agent reported success
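
For downstream analysis, the per-task entries can be re-aggregated in a few lines. This is a sketch under the assumption that llm_judge.json follows the episode-level schema shown above:

# Sketch: recompute aggregate metrics from an episode-level llm_judge.json.
import json

with open("llm_judge.json") as f:
    report = json.load(f)

evals = [e for ep in report["episode_results"] for e in ep["evaluations"]]
scores = [e["evaluation"]["final_score"] for e in evals]

avg_score = sum(scores) / len(scores)
llm_success_rate = sum(s >= 70 for s in scores) / len(scores)   # 70 is the success threshold
agent_success_rate = sum(e["agent_success"] for e in evals) / len(evals)

print(f"avg={avg_score:.1f}  llm_success={llm_success_rate:.0%}  agent_success={agent_success_rate:.0%}")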

Examples

The examples/ directory contains a basic example with input (results.json, screenshots) and output (llm_judge_output.json).

Run the judge on this example:

python evaluate_results.py examples/ --model gpt-5

How It Works

  1. Input Processing: Reads compressed or JSON result files and loads screenshots
  2. Context Building: Constructs a prompt with:
    • Task description
    • Agent's step-by-step actions and reasoning
    • Screenshots showing visual state
    • Final results/output
  3. LLM Evaluation: Sends to GPT-5 (or chosen model) for assessment
  4. Structured Output: Parses LLM response into standardized format
  5. Aggregation: Combines individual task evaluations into episode-level metrics
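
As an illustration of step 2, screenshots are typically attached to the prompt as base64-encoded image parts. The helper below is only a sketch; the exact message format depends on the model API used, and image_parts is a hypothetical name.

# Sketch: packaging screenshots as base64 image parts for a multimodal judge prompt.
# Uses the OpenAI-style image_url content part; other APIs expect different formats.
import base64

def image_parts(screenshot_paths, max_images=10):
    parts = []
    for path in screenshot_paths[:max_images]:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return parts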

The judge uses a detailed system prompt with:

  • Grounding rules: Must validate claims against screenshots
  • Error taxonomy: 20+ specific error categories
  • Scoring rubric: Clear criteria for 0-100 scale
  • Output schema: Enforced JSON structure
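
A schema like the one documented under "Evaluation Fields" could be expressed as a Pydantic model, as in the sketch below; this assumes the pydantic package and is not necessarily how the repository enforces its output structure.

# Sketch of the evaluation schema as a Pydantic model, mirroring the documented fields.
from typing import List
from pydantic import BaseModel, Field

class JudgeEvaluation(BaseModel):
    task_summary: str
    reasoning: str
    error_categories: List[str] = Field(default_factory=list)
    final_score: int = Field(ge=0, le=100)
    improvement_tips: List[str] = Field(default_factory=list)

    @property
    def success(self) -> bool:
        return self.final_score >= 70  # success threshold documented above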

Running Locally vs. Cloud

Local Execution

# Set up environment
export OPENAI_API_KEY=your_key

# Run evaluation
python evaluate_results.py /path/to/results --model gpt-5

Docker/Cloud Execution

Build the Docker image:

./build-judge.sh

Run in container:

docker run -e OPENAI_API_KEY=your_key \
  -v /path/to/results:/data \
  llm-judge:latest \
  python evaluate_results.py /data --model gpt-5

The system was designed to run in Google Cloud with GCS integration but can run anywhere with API access.

Error Categories

The judge can identify these error types:

  • Access: captcha_unsolved, login_failed, security_block
  • LLM: rate_limited, llm_call_error
  • Planning: infinite_loop, wrong_output_format, navigation_error, timeout
  • Browser: element_interaction_error, browser_crashes, tool_failed
  • Task: partial_output, impossible_task
  • Data: file_system_misuse, data_not_saved, content_not_found
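
For filtering or reporting, the taxonomy can be represented as a simple mapping; the category names below are copied from the list above, but the repository's internal representation may differ.

# Sketch: the documented error taxonomy as a plain grouping for downstream filtering.
ERROR_TAXONOMY = {
    "access":   ["captcha_unsolved", "login_failed", "security_block"],
    "llm":      ["rate_limited", "llm_call_error"],
    "planning": ["infinite_loop", "wrong_output_format", "navigation_error", "timeout"],
    "browser":  ["element_interaction_error", "browser_crashes", "tool_failed"],
    "task":     ["partial_output", "impossible_task"],
    "data":     ["file_system_misuse", "data_not_saved", "content_not_found"],
}

def group_of(category):
    """Return the taxonomy group an error category belongs to, or None."""
    for group, members in ERROR_TAXONOMY.items():
        if category in members:
            return group
    return None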

Performance Considerations

  • Concurrency: Adjust JUDGE_MAX_CONCURRENCY based on rate limits
  • Image Count: Reduce --max-images to lower token usage and cost
  • Model Choice: Use gpt-5-mini for faster, cheaper evaluations
  • Caching: Results are written to disk; re-runs skip completed tasks
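
One common way to bound concurrency against API rate limits is an asyncio semaphore; the sketch below assumes an async evaluation function (evaluate_one is a placeholder, not the repository's API) and reads the JUDGE_MAX_CONCURRENCY variable described above.

# Sketch: bounding concurrent judge calls with an asyncio semaphore.
import asyncio
import os

MAX_CONCURRENCY = int(os.getenv("JUDGE_MAX_CONCURRENCY", "50"))

async def evaluate_all(task_folders, evaluate_one):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(folder):
        async with sem:
            return await evaluate_one(folder)

    return await asyncio.gather(*(bounded(f) for f in task_folders))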

Troubleshooting

Import Errors: Ensure all dependencies are installed

pip install -r requirements.txt

API Key Issues: Verify .env file or environment variables

echo $OPENAI_API_KEY  # Should print your key

Rate Limiting: Reduce JUDGE_MAX_CONCURRENCY

export JUDGE_MAX_CONCURRENCY=10

Memory Issues: Reduce --max-images or process fewer tasks concurrently

Citation

If you use this LLM judge system, please cite:

@software{llm_judge_psai,
  title={LLM as a Judge for Computer-Use Agents},
  author={Howland, Anaïs and Thinnappan, Ashwin},
  organization={Paradigm Shift AI},
  year={2025},
  url={https://github.com/anaishowland/llm-judge-psai}
}

Created at Paradigm Shift AI.

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.
