The `cooperbench eval` command evaluates agent runs by executing test suites in isolated sandboxes and computing success metrics.
## Usage
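The general shape of the command is roughly as follows. The flag spellings (`--subset`, `--repo`, `--task`, `--features`, `--concurrency`, `--backend`, `--force`) are assumptions inferred from the parameters described below, not confirmed names:

```shell
cooperbench eval <experiment> [--subset <name>] [--repo <name>] [--task <id>] \
  [--features <a,b>] [--concurrency <n>] [--backend modal|docker|gcp] [--force]
```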
## Basic examples
### Evaluate an experiment

Reads agent logs from the `logs/my-experiment/` directory.
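A minimal sketch:

```shell
# Evaluates the run stored in logs/my-experiment/
cooperbench eval my-experiment
```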
### Force re-evaluation

Re-runs evaluation even if `eval.json` already exists.
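A sketch, assuming the force option is spelled `--force`:

```shell
# Re-evaluates even though eval.json already exists
cooperbench eval my-experiment --force
```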
### Evaluate specific tasks
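A sketch, assuming the task filter is spelled `--task` (the task ID `8394` is the example used in the parameter list below):

```shell
# Restrict evaluation to a single task
cooperbench eval my-experiment --task 8394
```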
## Parameters

### Required

- Experiment name to evaluate. Must match a directory in `logs/`. Example: `my-experiment` (evaluates `logs/my-experiment/`)

### Task filtering

- Use a predefined task subset. Example: `lite`
- Filter by repository name. Example: `llama_index_task`
- Filter by specific task ID. Example: `8394`
- Specific feature pair to evaluate, comma-separated. Example: `1,2`

### Execution

- Number of parallel evaluations. Default: `10`
- Execution backend for running test suites. Options: `modal` (Modal cloud platform, default), `docker` (local Docker containers), `gcp` (Google Cloud Platform Batch jobs)
- Force re-evaluation even if `eval.json` exists.

## How evaluation works
For each task instance:

1. **Load agent patches** - Reads `patch.diff` from agent logs
2. **Create sandbox** - Spins up an isolated container with the repository
3. **Apply patches** - Applies agent changes to the codebase
4. **Run tests** - Executes the test suite defined in task metadata
5. **Compute results** - Records pass/fail for each test
6. **Save results** - Writes `eval.json` with test outcomes
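The pipeline above can be sketched in Python. Everything here is illustrative: the real logic lives inside cooperbench, and the helper names, sandbox abstraction, and `eval.json` layout are assumptions.

```python
import json
from pathlib import Path


def evaluate_task(task_dir: Path, run_tests) -> dict:
    """Illustrative sketch of one evaluation pass (not cooperbench's actual code).

    task_dir is expected to hold the agent's patch.diff; run_tests stands in
    for "spin up a sandbox, apply the patch, run the task's test suite" and
    returns {test_name: passed} for each test in the task metadata.
    """
    patch = (task_dir / "patch.diff").read_text()       # 1. load agent patch
    outcomes = run_tests(patch)                         # 2-4. sandbox, apply, run
    result = {                                          # 5. record pass/fail
        "resolved": all(outcomes.values()),
        "tests": outcomes,
    }
    (task_dir / "eval.json").write_text(json.dumps(result, indent=2))  # 6. save
    return result
```

The `resolved` flag here is a simple "all tests passed" summary; the real metric computation may be richer.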
## Evaluation output

Results are saved as an `eval.json` file alongside each task's agent logs.

### Example eval.json
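A plausible shape only; the actual field names in `eval.json` are assumptions (the task ID reuses the `8394` example from the parameter list):

```json
{
  "task_id": 8394,
  "resolved": false,
  "tests": {
    "test_feature_one": "passed",
    "test_feature_two": "failed"
  }
}
```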
## Filtering examples
### Evaluate specific subset
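A sketch, assuming the subset filter is spelled `--subset`:

```shell
# Evaluate only the predefined "lite" subset
cooperbench eval my-experiment --subset lite
```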
### Evaluate specific repository
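A sketch, assuming the repository filter is spelled `--repo`:

```shell
# Evaluate only tasks from the llama_index_task repository
cooperbench eval my-experiment --repo llama_index_task
```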
### Evaluate specific task
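A sketch, assuming the task filter is spelled `--task`:

```shell
# Evaluate only task 8394
cooperbench eval my-experiment --task 8394
```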
### Evaluate specific feature pair
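A sketch, assuming the comma-separated pair is passed via a `--features` flag:

```shell
# Evaluate only feature pair 1,2
cooperbench eval my-experiment --features 1,2
```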
### Combine filters
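Filters can be combined; flag spellings here are assumptions:

```shell
# One repository, one feature pair
cooperbench eval my-experiment --repo llama_index_task --features 1,2
```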
## Backend examples
### Evaluate on Modal (cloud)
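A sketch, assuming the backend is selected with `--backend`. Since `modal` is the default, the flag is only needed for explicitness:

```shell
cooperbench eval my-experiment --backend modal
```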
### Evaluate locally with Docker
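A sketch, assuming the backend is selected with `--backend`:

```shell
# Run test suites in local Docker containers instead of the cloud
cooperbench eval my-experiment --backend docker
```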
### Evaluate on GCP Batch

Run `cooperbench config gcp` first.
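A sketch, assuming the backend is selected with `--backend`:

```shell
cooperbench config gcp                        # one-time GCP setup
cooperbench eval my-experiment --backend gcp  # submit as GCP Batch jobs
```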
## Performance tuning
### High concurrency for cloud
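A sketch, assuming parallelism is set with a `--concurrency` flag (the default is 10; the value 50 is an illustrative choice, not a documented limit):

```shell
# Cloud backends can absorb more parallel evaluations
cooperbench eval my-experiment --backend modal --concurrency 50
```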
### Low concurrency for local
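A sketch under the same `--concurrency` assumption; a small value keeps a local machine responsive:

```shell
# Local Docker shares one machine's CPU and memory
cooperbench eval my-experiment --backend docker --concurrency 2
```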
### Skip auto-evaluation

By default, `cooperbench run` automatically evaluates after completion. To disable:
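A sketch; the actual flag name for disabling auto-evaluation is not given here, so `--no-eval` is purely a placeholder:

```shell
# Hypothetical: run the experiment without the automatic eval pass
cooperbench run my-experiment --no-eval
```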
### Incremental evaluation

Evaluation skips tasks that already have `eval.json`:
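This makes repeated invocations cheap, for example:

```shell
cooperbench eval my-experiment   # evaluates every task
cooperbench eval my-experiment   # re-run: only tasks missing eval.json
```

Use the force option described above to discard existing results and redo everything.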