## Overview
Evaluation validates that:

- Patches apply cleanly to the codebase
- Tests pass for each implemented feature
- Patches merge successfully (cooperative mode only)
- No test conflicts occur after merging (cooperative mode)
## Quick start

Evaluating a completed experiment will:

- Discover all completed runs in `logs/my-experiment/`
- Test each patch in isolated sandboxes
- Save results to `eval.json` files
- Display a pass/fail summary
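The discovery step above can be sketched as a small helper. This is an illustration only: it assumes each completed run leaves a `patch.diff` in its task directory under `logs/<experiment>/`, which may not match the real log layout.

```python
from pathlib import Path

def discover_runs(experiment_dir):
    """Return task directories that contain a generated patch.

    Assumes (for illustration) that a completed run is marked by a
    'patch.diff' file in its task directory; the real layout may differ.
    """
    root = Path(experiment_dir)
    return sorted(p.parent for p in root.glob("**/patch.diff"))
```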
## How evaluation works
### Patch extraction
CooperBench extracts generated patches from experiment logs:

- Solo mode: Single patch containing both features
- Cooperative mode: Separate patches from each agent
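The two modes differ only in which patch files a run produces. A minimal sketch, assuming hypothetical file names (`patch.diff` for solo runs, `agent*.diff` for cooperative runs) rather than CooperBench's actual layout:

```python
from pathlib import Path

def extract_patches(run_dir, mode):
    """Collect the patch files produced by one run.

    Solo mode yields a single combined patch; cooperative mode yields
    one patch per agent. File names here are illustrative assumptions.
    """
    run = Path(run_dir)
    if mode == "solo":
        return [run / "patch.diff"]
    return sorted(run.glob("agent*.diff"))
```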
### Patch sanitization
Patches are cleaned before testing:
- Test file changes are filtered out
- Trailing newlines are normalized
- Empty patches are detected
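The three cleaning steps above can be sketched as follows. This is not the actual CooperBench implementation; in particular, matching test files by the substring `test` in the diff header is a simplifying assumption.

```python
def split_file_diffs(patch_text):
    """Split a unified diff into per-file chunks on 'diff --git' lines."""
    chunks, current = [], []
    for line in patch_text.splitlines():
        if line.startswith("diff --git") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def sanitize_patch(patch_text):
    """Clean a patch before testing (sketch).

    - drop per-file diffs whose header mentions a test file
    - normalize to exactly one trailing newline
    - return None when nothing remains (empty patch detected)
    """
    kept = []
    for file_diff in split_file_diffs(patch_text):
        header = file_diff.splitlines()[0]
        if "test" in header:
            continue  # filter out test-file changes
        kept.append(file_diff)
    if not kept:
        return None  # empty patch detected
    return "\n".join(kept).rstrip("\n") + "\n"
```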
### Sandbox creation
Each test runs in an isolated Docker container. The sandbox includes:
- Pre-built Docker image with repository code
- Test infrastructure and dependencies
- Isolated environment per evaluation
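Isolation per evaluation typically comes from running each test in a throwaway container. A sketch of building such an invocation, using standard docker flags (`--rm` discards the container when the run finishes); the image name and entry script are assumptions:

```python
def sandbox_command(image, script="runner.sh"):
    """Build a docker invocation for one isolated evaluation (sketch).

    --rm discards the container afterwards, so every evaluation starts
    fresh from the pre-built image. Image and script names are assumed.
    """
    return ["docker", "run", "--rm", image, "bash", script]
```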
### Patch testing (solo mode)
For solo mode, the single patch is tested against both feature test suites. Both feature tests must pass for the evaluation to succeed.
### Patch merging (cooperative mode)
For cooperative mode, patches are merged before testing:

1. Apply the first patch
2. Attempt the merge
3. Check the merge status:
   - ✓ Clean merge: No conflicts
   - ⚠ Conflict: Overlapping changes detected
   - ✗ Failed: Patch application error
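The three outcomes can be classified from signals a harness would derive from the patch tool's exit codes; the boolean inputs here are assumptions made for illustration:

```python
def merge_status(first_applied, second_applied, has_conflict_markers):
    """Classify the outcome of merging two cooperative patches (sketch).

    Inputs are booleans a caller would derive from applying each patch
    and scanning the result for conflict markers.
    """
    if not (first_applied and second_applied):
        return "failed"  # patch application error
    if has_conflict_markers:
        return "conflict"  # overlapping changes detected
    return "clean"  # no conflicts
```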
### Test execution
Each feature's tests run independently. The `runner.sh` script:

- Applies patches in order
- Runs the test suite
- Parses test results
- Returns exit code 0 if all tests pass
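The result-parsing and exit-code step can be sketched in Python. The pytest-style summary format assumed here (`N passed, M failed in …s`) is an illustration; the real runner may parse different output:

```python
import re

def runner_exit_code(test_summary):
    """Return 0 iff the test run reports no failures or errors (sketch).

    Parses a pytest-style summary line such as
    '3 passed, 1 failed in 0.2s' (format assumed for illustration).
    """
    failed = re.search(r"(\d+) (?:failed|error)", test_summary)
    passed = re.search(r"(\d+) passed", test_summary)
    if failed and int(failed.group(1)) > 0:
        return 1
    return 0 if passed else 1
```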
## Command reference
### Basic usage
The experiment name to evaluate (from the `logs/` directory).

### Filtering options
- Evaluate only tasks from a specific subset
- Evaluate only tasks from a specific repository
- Evaluate a specific task ID
- Evaluate a specific feature pair (comma-separated)
### Execution options
- Number of parallel evaluations
- Evaluation backend: `modal`, `docker`, or `gcp`
- Force re-evaluation even if `eval.json` exists

## Evaluation metrics
### Pass rate
The primary metric is the pass rate: the percentage of tasks where both features pass their tests.

- Passed: Both feature test suites pass
- Failed: One or both feature test suites fail
- Errors: Patch application or test execution errors (excluded from pass rate)
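The metric definition above, with errors excluded from the denominator, can be written as a small sketch:

```python
def pass_rate(results):
    """Compute pass rate = passed / (passed + failed), as a fraction.

    Results with status 'error' are excluded entirely, per the metric
    definition above. Status strings are assumed for illustration.
    """
    scored = [r for r in results if r != "error"]
    if not scored:
        return 0.0
    return sum(r == "passed" for r in scored) / len(scored)
```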
### Per-feature results
Each feature is evaluated independently.

### Merge analysis (cooperative mode)
Cooperative evaluations track merge success.

## Output structure
Evaluation results are saved alongside the experiment logs.

### `eval.json` format
### `eval_summary.json` format
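The summary is an aggregation over the per-task `eval.json` files. A sketch of that step; the `passed` field and the one-directory-per-task layout are assumptions about the schema, shown only to illustrate the aggregation:

```python
import json
from pathlib import Path

def summarize(experiment_dir):
    """Aggregate per-task eval.json files into one summary dict.

    The 'passed' field and directory layout are assumed for
    illustration, not taken from the actual eval.json schema.
    """
    tasks = {}
    for f in Path(experiment_dir).glob("**/eval.json"):
        tasks[f.parent.name] = json.loads(f.read_text())
    passed = sum(1 for r in tasks.values() if r.get("passed"))
    return {"total": len(tasks), "passed": passed}
```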
## Examples
### Evaluate a specific repository
Test only LlamaIndex tasks.

### Force re-evaluation
Re-run evaluation for all tasks.

### Large-scale GCP evaluation
Evaluate 1000+ tasks with high parallelism.

### Single-task evaluation
Debug a specific task with detailed output.

## Understanding evaluation failures
### Patch application failures
If a patch fails to apply, common causes are:

- The agent modified the wrong files
- Patch format is invalid
- Base commit mismatch
### Test failures
If tests fail after applying the patch, common causes are:

- Incomplete implementation
- Logic errors in code
- Side effects on existing tests
### Merge conflicts
If patches conflict in cooperative mode:

- Indicates agents modified overlapping code regions
- Tests whether agents can work independently
- Key metric for cooperation effectiveness
## Next steps

- Running experiments: learn how to run CooperBench experiments
- Backends: choose the right evaluation backend