A comprehensive benchmarking suite that evaluates ReasoningBank's closed-loop learning capabilities against a baseline agent without memory.
This suite measures the impact of ReasoningBank's 4-phase learning loop (RETRIEVE → JUDGE → DISTILL → CONSOLIDATE) on agent performance across multiple dimensions.
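The four phases can be sketched as a single closed-loop iteration. The sketch below is illustrative only; the interfaces, function names, and the distill step are assumptions, not the suite's actual API:

```typescript
// Hypothetical types -- the real ReasoningBank interfaces may differ.
interface Memory { pattern: string; }

interface MemoryStore {
  retrieve(task: string, k: number): Memory[];
  consolidate(memories: Memory[]): void;
}

interface TaskResult { output: string; success: boolean; }

// One closed-loop iteration: inject retrieved memories, run the task,
// judge the outcome, distill a pattern from a success, and consolidate it.
function runIteration(
  store: MemoryStore,
  task: string,
  execute: (task: string, memories: Memory[]) => TaskResult,
): boolean {
  const memories = store.retrieve(task, 5);               // RETRIEVE
  const result = execute(task, memories);                 // act with injected context
  if (result.success) {                                   // JUDGE
    const distilled: Memory = { pattern: result.output }; // DISTILL (simplified)
    store.consolidate([distilled]);                       // CONSOLIDATE
  }
  return result.success;
}
```

The baseline agent runs the same `execute` call without the retrieve/consolidate steps, which is what the benchmark compares against.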
- Success Rate: Task completion success over time (0% → 100% transformation)
- Learning Velocity: Speed of improvement (tasks until first success)
- Memory Efficiency: Storage and retrieval performance
- Token Efficiency: Reduction in token usage via memory
- Latency Impact: Overhead from memory operations
- Accuracy: Quality of learned patterns
- Generalization: Transfer learning across domains
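One way to represent these dimensions is a per-scenario metrics record. The field names below mirror the sample output shown later in this README; the record shape itself is an illustrative assumption:

```typescript
// Illustrative per-scenario metrics record; the shape is an assumption,
// with field names taken from the sample benchmark output.
interface ScenarioMetrics {
  successRate: number;         // Success Rate: fraction of tasks completed
  tasksToFirstSuccess: number; // Learning Velocity
  avgTokens: number;           // Token Efficiency
  avgLatency: number;          // Latency Impact (ms)
  memoriesUsed: number;        // Memory Efficiency: retrieval hits
  memoriesCreated: number;     // Memory Efficiency: consolidation volume
}

const sample: ScenarioMetrics = {
  successRate: 1.0,
  tasksToFirstSuccess: 3,
  avgTokens: 10311,
  avgLatency: 2756,
  memoriesUsed: 12,
  memoriesCreated: 8,
};
```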
```
bench/
├── scenarios/     # Test scenarios (coding, debugging, API design, etc.)
├── agents/        # Baseline vs. ReasoningBank agents
├── metrics/       # Metrics collection and analysis
├── results/       # Raw benchmark results
├── reports/       # Generated reports and visualizations
├── lib/           # Shared utilities
├── benchmark.ts   # Main orchestrator
└── README.md      # This file
```
```bash
# Run all benchmarks
npm run bench

# Run a specific scenario
npm run bench -- --scenario coding-tasks

# Run with different configurations
npm run bench -- --iterations 100 --agents 5

# Generate report only
npm run bench:report
```

**Coding Tasks**

- Implement functions from specifications
- Fix bugs in existing code
- Refactor code for best practices
- Add error handling
- Write unit tests
**Debugging**

- Identify and fix runtime errors
- Resolve type errors
- Fix logical errors
- Handle edge cases
- Debug async issues
**API Design**

- Design REST endpoints
- Create authentication systems
- Implement rate limiting
- Design database schemas
- Build validation logic
**Problem Solving**

- Algorithm challenges
- Data structure problems
- System design questions
- Optimization problems
- Pattern recognition
**Repeated Tasks**

- Same task repeated with variations
- Measures learning curve
- Tests memory consolidation
- Validates pattern generalization
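A repeated-task run can be produced from a single base task. The helper below is purely illustrative and not part of the suite:

```typescript
// Hypothetical helper: produce n variations of the same base task so the
// learning curve across repetitions can be measured.
function taskVariations(base: string, n: number): string[] {
  return Array.from({ length: n }, (_, i) => `${base} (variation ${i + 1})`);
}
```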
Based on the ReasoningBank paper (https://arxiv.org/html/2509.25140v1):
- Success Rate: 0% → 100% over 20-30 iterations
- Token Reduction: ~32.3% via memory injection
- Latency Overhead: <500ms for retrieval
- Memory Efficiency: 95%+ deduplication rate
- Learning Velocity: 2.8-4.4x faster than baseline
**Cold Start**

- Both agents start from scratch
- Measures raw learning capability
- Tests distillation quality
**Warm Start**

- ReasoningBank has domain memories
- Tests retrieval effectiveness
- Measures transfer learning
**Sequential Learning**

- Sequential task variations
- Tests consolidation
- Measures adaptation speed
**Stress Test**

- High memory load (1000+ patterns)
- Concurrent operations
- Edge cases and failures
```
🧪 Running Benchmark: coding-tasks (1/50)
├─ Baseline:      [=====>     ]  50% (5/10 success)
└─ ReasoningBank: [==========>] 100% (10/10 success)
```
```json
{
  "scenario": "coding-tasks",
  "baseline": {
    "successRate": 0.50,
    "avgTokens": 15230,
    "avgLatency": 2341
  },
  "reasoningbank": {
    "successRate": 1.00,
    "avgTokens": 10311,
    "avgLatency": 2756,
    "memoriesUsed": 12,
    "memoriesCreated": 8
  },
  "improvement": {
    "successRate": "+100%",
    "tokenEfficiency": "+32.3%",
    "latencyOverhead": "+17.7%"
  }
}
```

Generated reports include:

- Executive summary
- Detailed metrics tables
- Comparison charts
- Learning curves
- Recommendations
Edit `bench/config.json`:

```json
{
  "iterations": 50,
  "parallelAgents": 5,
  "scenarios": ["coding", "debugging", "api-design"],
  "enableWarmStart": false,
  "memorySize": 1000,
  "outputFormats": ["json", "markdown", "csv"]
}
```

**Success Rate Improvement**

- \> +50%: ReasoningBank significantly improves learning
- 20-50%: Moderate improvement; memory helps
- < 20%: Minimal impact; the task may not benefit from memory

**Token Efficiency**

- \> +30%: Excellent memory utilization
- 15-30%: Good reduction via context injection
- < 15%: Limited memory reuse

**Latency Overhead**

- < 500ms: Acceptable for production
- 500-1000ms: Noticeable but manageable
- \> 1000ms: Optimization needed
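These bands can be encoded in a small helper for report generation. The function name and labels below are assumptions for illustration:

```typescript
// Map a measured success-rate improvement (in percentage points) to the
// interpretation bands above. Purely illustrative, not part of the suite.
function interpretSuccessImprovement(deltaPct: number): string {
  if (deltaPct > 50) return "significant"; // > +50%
  if (deltaPct >= 20) return "moderate";   // 20-50%
  return "minimal";                        // < 20%
}
```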
Add new scenarios in `bench/scenarios/`:

```typescript
export const myScenario: BenchmarkScenario = {
  name: 'my-scenario',
  description: 'Custom benchmark scenario',
  tasks: [/* tasks */],
  successCriteria: (result) => result.success, // replace with real validation
  metrics: ['success', 'tokens', 'latency']
};
```

- ReasoningBank Paper: https://arxiv.org/html/2509.25140v1
- 4-Factor Scoring Formula: α·sim + β·rec + γ·rel + δ·div
- MaTTS (Memory-aware Test-Time Scaling)
- MMR (Maximal Marginal Relevance) for diversity
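The 4-factor score is a weighted sum of the four components. In the sketch below the weights are placeholders, not the paper's values, and each component is assumed to be normalized to [0, 1]:

```typescript
// 4-factor retrieval score: alpha*sim + beta*rec + gamma*rel + delta*div.
// Weights are illustrative placeholders, not the values from the paper.
interface ScoreFactors {
  sim: number; // similarity to the current task
  rec: number; // recency of the memory
  rel: number; // relevance / past usefulness
  div: number; // diversity contribution (e.g. via MMR)
}

function memoryScore(
  f: ScoreFactors,
  alpha = 0.4, beta = 0.2, gamma = 0.3, delta = 0.1,
): number {
  return alpha * f.sim + beta * f.rec + gamma * f.rel + delta * f.div;
}
```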
To add benchmarks:

- Create a scenario in `scenarios/`
- Add it to the `benchmark.ts` orchestrator
- Update metrics collection
- Run and validate results
- Document in this README