CooperBench evaluates agent submissions by testing patches against feature-specific test suites in isolated sandboxes.

Overview

Evaluation validates that:
  1. Patches apply cleanly to the codebase
  2. Tests pass for each implemented feature
  3. Patches merge successfully (cooperative mode only)
  4. No test conflicts occur after merging (cooperative mode)

Quick start

Evaluate a completed experiment:
cooperbench eval -n my-experiment
This will:
  • Discover all completed runs in logs/my-experiment/
  • Test each patch in isolated sandboxes
  • Save results to eval.json files
  • Display pass/fail summary
By default, experiments are evaluated automatically as tasks complete. Use --no-auto-eval to disable this.

How evaluation works

Step 1: Patch extraction

CooperBench extracts generated patches from experiment logs.
Solo mode: a single patch containing both features
logs/{experiment}/solo/{repo}/{task_id}/f{f1}_f{f2}/solo.patch
Cooperative mode: Separate patches from each agent
logs/{experiment}/coop/{repo}/{task_id}/f{f1}_f{f2}/agent{f1}.patch
logs/{experiment}/coop/{repo}/{task_id}/f{f1}_f{f2}/agent{f2}.patch
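The directory layout above lends itself to simple globbing. A hypothetical discovery sketch (the helper name and return shape are illustrative assumptions, not the actual CooperBench API):

```python
# Illustrative sketch: discover patch files under the logs layout above.
# The glob pattern matches {setting}/{repo}/{task_id}/f{f1}_f{f2}/*.patch.
from pathlib import Path

def discover_patches(logs_dir: str, experiment: str) -> dict:
    """Map each run directory to the patch files it contains."""
    root = Path(logs_dir) / experiment
    runs: dict[str, list[str]] = {}
    for patch in sorted(root.glob("*/*/*/f*_f*/*.patch")):
        runs.setdefault(str(patch.parent), []).append(patch.name)
    return runs
```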
Step 2: Patch sanitization

Patches are cleaned before testing:
  • Test file changes are filtered out
  • Trailing newlines are normalized
  • Empty patches are detected
This ensures agents are evaluated only on implementation code, not test modifications.
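The filtering step can be pictured as dropping whole file sections from a unified diff. A minimal sketch, assuming a simple path heuristic for test files (this is not the actual CooperBench implementation):

```python
# Illustrative sketch of sanitization: drop any file section of a unified
# diff whose target path looks like a test file, then normalize the
# trailing newline. The test-file heuristic here is an assumption.
import re

def strip_test_changes(patch: str) -> str:
    kept, skip = [], False
    for line in patch.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # New file section: skip it if the target path is a test file.
            target = line.split(" b/")[-1]
            skip = bool(re.search(r"(^|/)(tests?/|test_|conftest\.py)", target))
        if not skip:
            kept.append(line)
    # Normalize trailing newlines; an empty result signals an empty patch.
    return "".join(kept).rstrip("\n") + "\n" if kept else ""
```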
Step 3: Sandbox creation

Each test runs in an isolated Docker container:
# From src/cooperbench/eval/sandbox.py
eval_backend = get_backend(backend)
sandbox = eval_backend.create_sandbox(
    image=image,
    timeout=600,
    workdir="/workspace"
)
The sandbox includes:
  • Pre-built Docker image with repository code
  • Test infrastructure and dependencies
  • Isolated environment per evaluation
Step 4: Patch testing (solo mode)

For solo mode, the single patch is tested against both feature test suites:
# Apply test suite and agent patch
bash /usr/local/bin/runner.sh tests.patch agent.patch
Both feature tests must pass for the evaluation to succeed.
Step 5: Patch merging (cooperative mode)

For cooperative mode, patches are merged before testing:
  1. Apply first patch
    git apply agent1.patch
    git add .
    git commit -m "Feature 1"
    
  2. Attempt merge
    git apply agent2.patch
    
  3. Check merge status
    • Clean merge: No conflicts
    • Conflict: Overlapping changes detected
    • Failed: Patch application error
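The three-way classification above can be sketched with plain `git apply` calls. A hedged sketch (status names come from the docs; the helper and the exact detection logic are assumptions):

```python
# Illustrative sketch: classify the second patch's merge status by
# attempting `git apply`, as in the steps above. Not the real implementation.
import subprocess

def merge_status(repo_dir: str, patch_path: str) -> str:
    def git(*args):
        return subprocess.run(["git", *args], cwd=repo_dir,
                              capture_output=True, text=True)

    # Dry-run first: a clean check means no overlap with the first patch.
    if git("apply", "--check", patch_path).returncode == 0:
        git("apply", patch_path)
        return "clean"
    # --3way leaves conflict markers when hunks overlap committed changes;
    # anything else (bad format, wrong base) is an application failure.
    result = git("apply", "--3way", patch_path)
    return "conflict" if "conflict" in result.stderr.lower() else "failed"
```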
Step 6: Test execution

Each feature’s tests run independently:
# Feature 1 tests
bash /usr/local/bin/runner.sh tests1.patch merged.patch

# Feature 2 tests
bash /usr/local/bin/runner.sh tests2.patch merged.patch
The runner.sh script:
  • Applies patches in order
  • Runs the test suite
  • Parses test results
  • Returns exit code 0 if all tests pass
Step 7: Result aggregation

Evaluation results combine all test outcomes:
{
  "both_passed": true,
  "feature1": {
    "passed": true,
    "tests_passed": 8,
    "tests_failed": 0
  },
  "feature2": {
    "passed": true,
    "tests_passed": 12,
    "tests_failed": 0
  },
  "merge": {
    "status": "clean",
    "strategy": "apply"
  }
}
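The `both_passed` flag is simply a conjunction over the per-feature results. A minimal sketch (field names are taken from the example above; the helper name is illustrative):

```python
# Illustrative aggregation rule: a run passes only if every feature's
# test suite passed. Field names follow the eval.json examples.
def aggregate(feature_results: dict) -> dict:
    both = all(r["passed"] for r in feature_results.values())
    return {"both_passed": both, **feature_results}
```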

Command reference

Basic usage

cooperbench eval -n EXPERIMENT_NAME [OPTIONS]
-n, --name
string
required
Experiment name to evaluate (from logs/ directory)
cooperbench eval -n solo-msa-gemini-3-flash-lite

Filtering options

-s, --subset
string
Evaluate only tasks from a specific subset
cooperbench eval -n my-experiment -s lite
-r, --repo
string
Evaluate only tasks from a specific repository
cooperbench eval -n my-experiment -r llama_index_task
-t, --task
integer
Evaluate a specific task ID
cooperbench eval -n my-experiment -r llama_index_task -t 8394
-f, --features
string
Evaluate a specific feature pair (comma-separated)
cooperbench eval -n my-experiment -r llama_index_task -t 8394 -f 1,2

Execution options

-c, --concurrency
integer
default: 10
Number of parallel evaluations
cooperbench eval -n my-experiment --concurrency 50
The GCP backend can handle much higher concurrency (50-100+) than local Docker (5-10).
--backend
enum
default: modal
Evaluation backend: modal, docker, or gcp
# Use GCP for large-scale evaluation
cooperbench eval -n my-experiment --backend gcp --concurrency 100

# Use Docker for local testing
cooperbench eval -n my-experiment --backend docker --concurrency 5
--force
boolean
Force re-evaluation even if eval.json exists
cooperbench eval -n my-experiment --force

Evaluation metrics

Pass rate

The primary metric is pass rate: percentage of tasks where both features pass their tests.
pass_rate = passed / (passed + failed)
  • Passed: Both feature test suites pass
  • Failed: One or both feature test suites fail
  • Errors: Patch application or test execution errors (excluded from pass rate)
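The arithmetic is worth making explicit: errors are excluded from the denominator, so only clean pass/fail runs count. A one-function sketch (the name is illustrative):

```python
# Pass-rate arithmetic from the definition above. Errored runs (patch
# application or test execution errors) never enter the denominator.
def pass_rate(passed: int, failed: int) -> float:
    total = passed + failed
    return passed / total if total else 0.0
```

With the numbers from the eval_summary example below (23 passed, 2 failed), this yields 0.92.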

Per-feature results

Each feature is evaluated independently:
{
  "feature1": {
    "passed": true,
    "tests_passed": 8,
    "tests_failed": 0,
    "tests_total": 8,
    "output": "test_loader.py::test_load ... PASSED\n..."
  }
}
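The `tests_passed` / `tests_failed` counts presumably come from parsing runner output. A hypothetical sketch assuming pytest-style result markers (the parsing logic is an assumption, not CooperBench's actual parser):

```python
# Hypothetical sketch: derive per-feature counts by counting pytest-style
# PASSED/FAILED markers in the runner output.
import re

def count_results(output: str) -> dict:
    passed = len(re.findall(r"\bPASSED\b", output))
    failed = len(re.findall(r"\bFAILED\b", output))
    return {"passed": failed == 0 and passed > 0,
            "tests_passed": passed,
            "tests_failed": failed,
            "tests_total": passed + failed}
```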

Merge analysis (cooperative mode)

Cooperative evaluations track merge success:
{
  "merge": {
    "status": "clean",
    "strategy": "apply"
  }
}

Output structure

Evaluation results are saved alongside experiment logs:
logs/
└── my-experiment/
    ├── eval_summary.json        # Aggregate evaluation stats
    └── solo/
        └── llama_index_task/
            └── 8394/
                └── f1_f2/
                    └── eval.json    # Task-specific evaluation

eval.json format

{
  "repo": "llama_index_task",
  "task_id": 8394,
  "features": [1, 2],
  "setting": "solo",
  "merge": null,
  "feature1": {
    "passed": true,
    "tests_passed": 8,
    "tests_failed": 0,
    "tests_total": 8,
    "output": "..."
  },
  "feature2": {
    "passed": true,
    "tests_passed": 12,
    "tests_failed": 0,
    "tests_total": 12,
    "output": "..."
  },
  "both_passed": true,
  "error": null,
  "evaluated_at": "2024-03-15T12:30:45"
}

eval_summary.json format

{
  "run_name": "my-experiment",
  "evaluated_at": "2024-03-15T12:40:00",
  "total_runs": 25,
  "passed": 23,
  "failed": 2,
  "errors": 0,
  "skipped": 0,
  "pass_rate": 0.92,
  "results": [
    {"run": "llama_index_task/8394/1,2", "status": "pass"},
    {"run": "dspy_task/142/1,3", "status": "fail"}
  ]
}
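These summary counts can be rebuilt from the per-task files. A hedged sketch assuming the directory layout and field names shown above (the helper name is illustrative):

```python
# Illustrative sketch: walk an experiment directory, read each eval.json,
# and recompute the summary counts. Errors are excluded from the pass rate.
import json
from pathlib import Path

def summarize(experiment_dir: str) -> dict:
    passed = failed = errors = 0
    for path in Path(experiment_dir).rglob("eval.json"):
        result = json.loads(path.read_text())
        if result.get("error"):
            errors += 1
        elif result.get("both_passed"):
            passed += 1
        else:
            failed += 1
    total = passed + failed
    return {"passed": passed, "failed": failed, "errors": errors,
            "pass_rate": passed / total if total else 0.0}
```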

Examples

Evaluate specific repository

Test only LlamaIndex tasks:
cooperbench eval -n my-experiment -r llama_index_task
Output:
cooperbench eval my-experiment
runs: 12
backend: modal

✓ pass llama_index_task/8394 [1,2]
✓ pass llama_index_task/8421 [1,3]
✗ fail llama_index_task/8445 [2,3]
...

┌────────────┬─────┐
│ passed     │  11 │
│ failed     │   1 │
│ pass rate  │ 92% │
└────────────┴─────┘

Force re-evaluation

Re-run evaluation for all tasks:
cooperbench eval -n my-experiment --force

Large-scale GCP evaluation

Evaluate 1000+ tasks with high parallelism:
cooperbench eval \
  -n my-large-experiment \
  --backend gcp \
  --concurrency 100
Output:
cooperbench eval my-large-experiment
runs: 1247
backend: gcp

submitting to GCP Batch ━━━━━━━━━━━━━━━━━━━━━ 100%
provisioning VMs        ━━━━━━━━━━━━━━━━━━━━━ 100%
evaluating             ━━━━━━━━━━━━━━━━━━━━━ 100%
collecting results     ━━━━━━━━━━━━━━━━━━━━━ 100%

✓ pass llama_index_task/8394 [1,2]
✓ pass dspy_task/142 [1,3]
...

Single task evaluation

Debug a specific task with detailed output:
cooperbench eval -n my-experiment -r llama_index_task -t 8394 -f 1,2

Understanding evaluation failures

Patch application failures

If a patch fails to apply:
{
  "error": "patch does not apply",
  "both_passed": false
}
Common causes:
  • Agent modified wrong files
  • Patch format is invalid
  • Base commit mismatch

Test failures

If tests fail after the patch is applied:
{
  "feature1": {
    "passed": false,
    "tests_passed": 3,
    "tests_failed": 2,
    "output": "FAILED test_core.py::test_loader - AssertionError"
  }
}
Common causes:
  • Incomplete implementation
  • Logic errors in code
  • Side effects on existing tests

Merge conflicts

If patches conflict in cooperative mode:
{
  "merge": {
    "status": "conflict",
    "conflicted_files": ["src/core.py"]
  },
  "both_passed": false
}
Analysis:
  • Indicates agents modified overlapping code regions
  • Tests whether agents can work independently
  • Key metric for cooperation effectiveness

Next steps

Running experiments

Learn how to run CooperBench experiments

Backends

Choose the right evaluation backend