A comprehensive evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
This LLM-as-a-judge system evaluates computer-use agents by:
- Reading visual evidence: Processes screenshots from agent execution
- Analyzing trajectories: Reviews step-by-step actions and reasoning
- Grounding evaluation: Compares agent outputs against task requirements
- Providing structured feedback: Returns scores, error categories, and improvement tips

Key features:

- Comprehensive Scoring: 0-100 score with success threshold at 70
- Error Categorization: Identifies specific failure modes (captcha, infinite loops, tool failures, etc.)
- Multi-Model Support: Works with OpenAI (GPT-4, GPT-5, etc.) and Google Gemini models
- Batch Processing: Evaluates multiple tasks concurrently
- Structured Output: JSON format with detailed reasoning and actionable feedback

Prerequisites:

- Python 3.8 or higher
- API key for OpenAI or Google Gemini

Installation:

- Clone the repository:

  ```bash
  git clone https://github.com/anaishowland/llm-judge-psai.git
  cd llm-judge-psai
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  Create a `.env` file in the root directory:
  ```bash
  # For OpenAI models (GPT-4, GPT-5, etc.)
  OPENAI_API_KEY=your_openai_api_key_here

  # OR for Google Gemini models
  GOOGLE_API_KEY=your_google_api_key_here

  # Optional: Adjust concurrency (default: 50)
  JUDGE_MAX_CONCURRENCY=100

  # Optional: Set per-task timeout in seconds
  JUDGE_TASK_TIMEOUT_SECONDS=300
  ```

Evaluate a single episode or folder of tasks:
```bash
python evaluate_results.py /path/to/results_folder --model gpt-5
```

With additional options:

```bash
python evaluate_results.py /path/to/results_folder \
    --model gpt-5 \
    --max-images 15 \
    --output custom_output.json \
    --temperature 0.0
```

Or use the wrapper script:

```bash
./run_judge.sh /path/to/results_folder --model gpt-5 --max-images 10
```

Options:

- `eval_folder` (required): Path to the episode folder or single task folder
- `--model MODEL`: LLM model to use (default: `gpt-4o`)
  - Recommended: `gpt-5` for best accuracy
  - Cost-effective: `gpt-5-mini` (faster, lower cost, slightly reduced accuracy)
  - Alternative: `gemini-2.5-pro`, `gemini-2.5-flash`
- `--max-images N`: Maximum screenshots to include per task (default: 10)
- `--output FILE`: Output file path (default: `llm_judge.json` in the eval folder)
- `--temperature FLOAT`: Temperature for LLM sampling (default: 0.0)
GPT-5 (Recommended) ⭐
- Best for: Highest accuracy evaluations
- Pros: Superior reasoning, better at detecting subtle errors
- Cons: Higher cost, slower latency
GPT-5-mini
- Best for: Cost-sensitive deployments
- Pros: Faster, significantly cheaper (~90% cost reduction)
- Cons: Slightly reduced accuracy compared to GPT-5
Gemini 2.5 Pro/Flash
- Best for: Alternative to OpenAI models
- Pros: Good performance, competitive pricing
- Cons: Different API requirements
The judge expects a specific folder structure for each task:
```
task_folder/
├── results.zst          # Compressed agent execution results (or results.json)
├── screenshot_*.png     # Sequential screenshots from agent execution
└── (optional) results.json
```
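Purely as an illustration (this helper is not part of the tool), a task folder matching this layout could be located and validated like so:

```python
from pathlib import Path

def find_task_inputs(task_folder):
    """Locate the result file and screenshots in a task folder (illustrative only)."""
    folder = Path(task_folder)

    # Prefer the compressed results, fall back to plain JSON
    results_file = folder / "results.zst"
    if not results_file.exists():
        results_file = folder / "results.json"
    if not results_file.exists():
        raise FileNotFoundError(f"No results.zst or results.json in {folder}")

    # Screenshots are sequential PNGs named screenshot_*.png
    screenshots = sorted(folder.glob("screenshot_*.png"))
    return results_file, screenshots

# Hypothetical usage:
# results_file, screenshots = find_task_inputs("/path/to/task_folder")
```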
The judge reads the following information from each task result:
```json
{
  "task": "Task description string",
  "taskId": "unique_task_identifier",
  "steps": [
    {
      "model_output": {
        "thinking": "Agent's reasoning for this step",
        "memory": "Agent's memory/context",
        "next_goal": "What the agent plans to do",
        "action": [{"action_type": "parameters"}]
      },
      "result": [{"metadata": "action_result"}],
      "state": {"url": "current_url", "title": "page_title"},
      "screenshot": "path_to_screenshot"
    }
  ],
  "results": "Final output from the agent",
  "error": null
}
```

- task: Task description to understand requirements
- taskId: Unique identifier for the task
- steps: Array of agent actions with thinking/reasoning
- results: Final output/completion message from agent
- error: Any errors encountered during execution
- screenshots: Visual evidence of agent actions
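The judge handles this parsing internally; as a rough sketch (assuming the `zstandard` package for `.zst` files), a result file with this structure could be read like this:

```python
import json
from pathlib import Path

import zstandard  # assumed dependency for reading .zst files

def load_task_result(results_path):
    """Decompress (if needed) and parse a task result file (illustrative only)."""
    path = Path(results_path)
    raw = path.read_bytes()
    if path.suffix == ".zst":
        raw = zstandard.ZstdDecompressor().decompress(raw)
    return json.loads(raw)

# Hypothetical usage:
# result = load_task_result("/path/to/task_folder/results.zst")
# print(result["task"], len(result["steps"]), result["error"])
```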
The judge produces a JSON file (`llm_judge.json` by default) with evaluation results:
```json
{
  "task_id": "unique_task_id",
  "task_description": "Browse and filter products...",
  "llm_success": true,
  "agent_success": true,
  "evaluation": {
    "task_summary": "Filter products on e-commerce site",
    "reasoning": "Agent successfully applied filters and completed task...",
    "error_categories": [],
    "final_score": 95,
    "improvement_tips": []
  }
}
```

Per-task evaluations are aggregated into an episode-level summary:

```json
{
  "evaluation_folder": "/path/to/episode",
  "folder_type": "single_episode",
  "total_tasks": 10,
  "evaluations_completed": 10,
  "evaluations_failed": 0,
  "overall_average_score": 84.0,
  "overall_llm_success_rate": 0.8,
  "overall_agent_success_rate": 0.8,
  "episode_results": [
    {
      "episode": "0",
      "evaluations": [...]
    }
  ]
}
```

Key output fields:

- final_score: Integer 0-100 (≥70 = success, <70 = failure)
- task_summary: One-sentence summary of the task
- reasoning: Detailed explanation of the score
- error_categories: List of detected issues (e.g., `captcha_unsolved`, `infinite_loop`, `tool_failed`)
- improvement_tips: Actionable suggestions for improvement
- llm_success: Whether the judge considers the task successful
- agent_success: Whether the agent reported success
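As an example of consuming this output, a short post-processing script (assuming the default `llm_judge.json` file and the field names shown above) could recompute scores and success rates:

```python
import json

# Load the judge output (assumes the aggregated format shown above)
with open("llm_judge.json") as f:
    report = json.load(f)

evaluations = [
    ev
    for episode in report.get("episode_results", [])
    for ev in episode.get("evaluations", [])
]

if evaluations:
    scores = [ev["evaluation"]["final_score"] for ev in evaluations]
    print(f"Tasks evaluated: {len(evaluations)}")
    print(f"Average score:   {sum(scores) / len(scores):.1f}")
    # A score of 70 or above counts as success
    print(f"Success rate:    {sum(s >= 70 for s in scores) / len(scores):.0%}")
```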
The `examples/` directory contains a basic example with input (`results.json`, screenshots) and output (`llm_judge_output.json`).
Run the judge on this example:
```bash
python evaluate_results.py examples/ --model gpt-5
```

The evaluation pipeline runs in five stages:

- Input Processing: Reads compressed or JSON result files and loads screenshots
- Context Building: Constructs a prompt with:
- Task description
- Agent's step-by-step actions and reasoning
- Screenshots showing visual state
- Final results/output
- LLM Evaluation: Sends to GPT-5 (or chosen model) for assessment
- Structured Output: Parses LLM response into standardized format
- Aggregation: Combines individual task evaluations into episode-level metrics
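Conceptually, the context-building and LLM-evaluation steps resemble the sketch below. This is not the actual implementation; it assumes the `openai` Python client, base64-encoded PNG screenshots, and placeholder prompt text:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_user_content(task, steps_text, screenshot_paths, final_results):
    """Assemble a multimodal message: task text, trajectory summary, and screenshots."""
    prompt = (
        f"Task: {task}\n\n"
        f"Trajectory:\n{steps_text}\n\n"
        f"Final results:\n{final_results}"
    )
    content = [{"type": "text", "text": prompt}]
    for path in screenshot_paths:
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        )
    return content

# Hypothetical call (all inputs are placeholders):
# response = client.chat.completions.create(
#     model="gpt-4o",
#     temperature=0.0,
#     messages=[
#         {"role": "system", "content": "You are a strict judge of computer-use agents ..."},
#         {"role": "user", "content": build_user_content(task, steps_text, screenshots, final_results)},
#     ],
# )
```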
The judge uses a detailed system prompt with:
- Grounding rules: Must validate claims against screenshots
- Error taxonomy: 20+ specific error categories
- Scoring rubric: Clear criteria for 0-100 scale
- Output schema: Enforced JSON structure
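The authoritative schema lives in the system prompt, but the enforced structure corresponds roughly to the shape below, sketched here with Pydantic using the field names from the output format above:

```python
from typing import List
from pydantic import BaseModel, Field

class JudgeEvaluation(BaseModel):
    """Approximate shape of the per-task evaluation the judge returns."""
    task_summary: str                        # one-sentence summary of the task
    reasoning: str                           # detailed explanation of the score
    error_categories: List[str] = []         # e.g. ["captcha_unsolved", "infinite_loop"]
    final_score: int = Field(ge=0, le=100)   # >= 70 counts as success
    improvement_tips: List[str] = []         # actionable suggestions
```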
To run locally:

```bash
# Set up environment
export OPENAI_API_KEY=your_key

# Run evaluation
python evaluate_results.py /path/to/results --model gpt-5
```

Build the Docker image:
```bash
./build-judge.sh
```

Run in container:

```bash
docker run -e OPENAI_API_KEY=your_key \
    -v /path/to/results:/data \
    llm-judge:latest \
    python evaluate_results.py /data --model gpt-5
```

The system was designed to run in Google Cloud with GCS integration but can run anywhere with API access.
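If agent results live in GCS, one possible approach (a sketch assuming the `google-cloud-storage` package and hypothetical bucket and prefix names) is to pull them to a local folder before invoking the judge:

```python
from pathlib import Path
from google.cloud import storage  # assumed dependency; uses application default credentials

def download_results(bucket_name, prefix, dest):
    """Copy all result blobs under a GCS prefix to a local folder (illustrative only)."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):  # skip "directory" placeholder objects
            continue
        target = Path(dest) / Path(blob.name).relative_to(prefix)
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))

# Hypothetical usage:
# download_results("my-agent-results", "episodes/0", "/tmp/results")
# followed by: python evaluate_results.py /tmp/results --model gpt-5
```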
The judge can identify these error types:
- Access: `captcha_unsolved`, `login_failed`, `security_block`
- LLM: `rate_limited`, `llm_call_error`
- Planning: `infinite_loop`, `wrong_output_format`, `navigation_error`, `timeout`
- Browser: `element_interaction_error`, `browser_crashes`, `tool_failed`
- Task: `partial_output`, `impossible_task`
- Data: `file_system_misuse`, `data_not_saved`, `content_not_found`
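As a quick illustration (again assuming the default `llm_judge.json` output format shown above), detected categories can be tallied across an episode to surface recurring failure modes:

```python
import json
from collections import Counter

with open("llm_judge.json") as f:
    report = json.load(f)

# Count how often each error category appears across all evaluated tasks
counts = Counter(
    category
    for episode in report.get("episode_results", [])
    for ev in episode.get("evaluations", [])
    for category in ev["evaluation"].get("error_categories", [])
)

for category, n in counts.most_common():
    print(f"{category}: {n}")
```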
Performance tips:

- Concurrency: Adjust `JUDGE_MAX_CONCURRENCY` based on rate limits
- Image Count: Reduce `--max-images` to lower token usage and cost
- Model Choice: Use `gpt-5-mini` for faster, cheaper evaluations
- Caching: Results are written to disk; re-runs skip completed tasks
Troubleshooting:

Import Errors: Ensure all dependencies are installed

```bash
pip install -r requirements.txt
```

API Key Issues: Verify the `.env` file or environment variables

```bash
echo $OPENAI_API_KEY  # Should print your key
```

Rate Limiting: Reduce `JUDGE_MAX_CONCURRENCY`

```bash
export JUDGE_MAX_CONCURRENCY=10
```

Memory Issues: Reduce `--max-images` or process fewer tasks concurrently
If you use this LLM judge system, please cite:
```bibtex
@software{llm_judge_psai,
  title={LLM as a Judge for Computer-Use Agents},
  author={Anaïs Howland and Ashwin Thinnappan at Paradigm Shift AI},
  year={2025},
  url={https://github.com/anaishowland/llm-judge-psai}
}
```

Created at Paradigm Shift AI.
MIT License - see LICENSE file for details.
Contributions are welcome! Please feel free to submit issues or pull requests.