CooperBench evaluates agent submissions by testing patches against feature-specific test suites in isolated sandboxes.

Overview

Evaluation validates that:
  1. Patches apply cleanly to the codebase
  2. Tests pass for each implemented feature
  3. Patches merge successfully (cooperative mode only)
  4. No test conflicts occur after merging (cooperative mode)

Quick start

Evaluate a completed experiment:
cooperbench eval -n my-experiment
This will:
  • Discover all completed runs in logs/my-experiment/
  • Test each patch in isolated sandboxes
  • Save results to eval.json files
  • Display pass/fail summary
By default, experiments are evaluated automatically as tasks complete. Use --no-auto-eval to disable this.

How evaluation works

Step 1: Patch extraction

CooperBench extracts generated patches from experiment logs.
Solo mode: a single patch containing both features
logs/{experiment}/solo/{repo}/{task_id}/f{f1}_f{f2}/solo.patch
Cooperative mode: Separate patches from each agent
logs/{experiment}/coop/{repo}/{task_id}/f{f1}_f{f2}/agent{f1}.patch
logs/{experiment}/coop/{repo}/{task_id}/f{f1}_f{f2}/agent{f2}.patch
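The directory layout above lends itself to simple globbing. A hypothetical discovery sketch (the helper name and return shape are illustrative assumptions, not the actual CooperBench API):

```python
# Illustrative sketch: discover patch files under the logs layout above.
# The glob pattern matches {setting}/{repo}/{task_id}/f{f1}_f{f2}/*.patch.
from pathlib import Path

def discover_patches(logs_dir: str, experiment: str) -> dict:
    """Map each run directory to the patch files it contains."""
    root = Path(logs_dir) / experiment
    runs: dict[str, list[str]] = {}
    for patch in sorted(root.glob("*/*/*/f*_f*/*.patch")):
        runs.setdefault(str(patch.parent), []).append(patch.name)
    return runs
```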
Step 2: Patch sanitization

Patches are cleaned before testing:
  • Test file changes are filtered out
  • Trailing newlines are normalized
  • Empty patches are detected
This ensures agents are evaluated only on implementation code, not test modifications.
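The filtering step can be pictured as dropping whole file sections from a unified diff. A minimal sketch, assuming a simple path heuristic for test files (this is not the actual CooperBench implementation):

```python
# Illustrative sketch of sanitization: drop any file section of a unified
# diff whose target path looks like a test file, then normalize the
# trailing newline. The test-file heuristic here is an assumption.
import re

def strip_test_changes(patch: str) -> str:
    kept, skip = [], False
    for line in patch.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # New file section: skip it if the target path is a test file.
            target = line.split(" b/")[-1]
            skip = bool(re.search(r"(^|/)(tests?/|test_|conftest\.py)", target))
        if not skip:
            kept.append(line)
    # Normalize trailing newlines; an empty result signals an empty patch.
    return "".join(kept).rstrip("\n") + "\n" if kept else ""
```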
Step 3: Sandbox creation

Each test runs in an isolated Docker container:
# From src/cooperbench/eval/sandbox.py
eval_backend = get_backend(backend)
sandbox = eval_backend.create_sandbox(
    image=image,
    timeout=600,
    workdir="/workspace"
)
The sandbox includes:
  • Pre-built Docker image with repository code
  • Test infrastructure and dependencies
  • Isolated environment per evaluation
Step 4: Patch testing (solo mode)

For solo mode, the single patch is tested against both feature test suites:
# Apply test suite and agent patch
bash /usr/local/bin/runner.sh tests.patch agent.patch
Both feature tests must pass for the evaluation to succeed.
Step 5: Patch merging (cooperative mode)

For cooperative mode, patches are merged before testing:
  1. Apply first patch
    git apply agent1.patch
    git add .
    git commit -m "Feature 1"
    
  2. Attempt merge
    git apply agent2.patch
    
  3. Check merge status
    • Clean merge: No conflicts
    • Conflict: Overlapping changes detected
    • Failed: Patch application error
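The three-way classification above can be sketched with plain `git apply` calls. A hedged sketch (status names come from the docs; the helper and the exact detection logic are assumptions):

```python
# Illustrative sketch: classify the second patch's merge status by
# attempting `git apply`, as in the steps above. Not the real implementation.
import subprocess

def merge_status(repo_dir: str, patch_path: str) -> str:
    def git(*args):
        return subprocess.run(["git", *args], cwd=repo_dir,
                              capture_output=True, text=True)

    # Dry-run first: a clean check means no overlap with the first patch.
    if git("apply", "--check", patch_path).returncode == 0:
        git("apply", patch_path)
        return "clean"
    # --3way leaves conflict markers when hunks overlap committed changes;
    # anything else (bad format, wrong base) is an application failure.
    result = git("apply", "--3way", patch_path)
    return "conflict" if "conflict" in result.stderr.lower() else "failed"
```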
Step 6: Test execution

Each feature’s tests run independently:
# Feature 1 tests
bash /usr/local/bin/runner.sh tests1.patch merged.patch

# Feature 2 tests
bash /usr/local/bin/runner.sh tests2.patch merged.patch
The runner.sh script:
  • Applies patches in order
  • Runs the test suite
  • Parses test results
  • Returns exit code 0 if all tests pass
Step 7: Result aggregation

Evaluation results combine all test outcomes:
{
  "both_passed": true,
  "feature1": {
    "passed": true,
    "tests_passed": 8,
    "tests_failed": 0
  },
  "feature2": {
    "passed": true,
    "tests_passed": 12,
    "tests_failed": 0
  },
  "merge": {
    "status": "clean",
    "strategy": "apply"
  }
}
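The `both_passed` flag is simply a conjunction over the per-feature results. A minimal sketch (field names are taken from the example above; the helper name is illustrative):

```python
# Illustrative aggregation rule: a run passes only if every feature's
# test suite passed. Field names follow the eval.json examples.
def aggregate(feature_results: dict) -> dict:
    both = all(r["passed"] for r in feature_results.values())
    return {"both_passed": both, **feature_results}
```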

Command reference

Basic usage

cooperbench eval -n EXPERIMENT_NAME [OPTIONS]
-n, --name
string
required
Experiment name to evaluate (from logs/ directory)
cooperbench eval -n solo-msa-gemini-3-flash-lite

Filtering options

-s, --subset
string
Evaluate only tasks from a specific subset
cooperbench eval -n my-experiment -s lite
-r, --repo
string
Evaluate only tasks from a specific repository
cooperbench eval -n my-experiment -r llama_index_task
-t, --task
integer
Evaluate a specific task ID
cooperbench eval -n my-experiment -r llama_index_task -t 8394
-f, --features
string
Evaluate a specific feature pair (comma-separated)
cooperbench eval -n my-experiment -r llama_index_task -t 8394 -f 1,2

Execution options

-c, --concurrency
integer
default: 10
Number of parallel evaluations
cooperbench eval -n my-experiment --concurrency 50
The GCP backend can handle much higher concurrency (50-100+) than local Docker (5-10).
--backend
enum
default: modal
Evaluation backend: modal, docker, or gcp
# Use GCP for large-scale evaluation
cooperbench eval -n my-experiment --backend gcp --concurrency 100

# Use Docker for local testing
cooperbench eval -n my-experiment --backend docker --concurrency 5
--force
boolean
Force re-evaluation even if eval.json exists
cooperbench eval -n my-experiment --force

Evaluation metrics

Pass rate

The primary metric is pass rate: percentage of tasks where both features pass their tests.
pass_rate = passed / (passed + failed)
  • Passed: Both feature test suites pass
  • Failed: One or both feature test suites fail
  • Errors: Patch application or test execution errors (excluded from pass rate)
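The arithmetic is worth making explicit: errors are excluded from the denominator, so only clean pass/fail runs count. A one-function sketch (the name is illustrative):

```python
# Pass-rate arithmetic from the definition above. Errored runs (patch
# application or test execution errors) never enter the denominator.
def pass_rate(passed: int, failed: int) -> float:
    total = passed + failed
    return passed / total if total else 0.0
```

With the numbers from the eval_summary example below (23 passed, 2 failed), this yields 0.92.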

Per-feature results

Each feature is evaluated independently:
{
  "feature1": {
    "passed": true,
    "tests_passed": 8,
    "tests_failed": 0,
    "tests_total": 8,
    "output": "test_loader.py::test_load ... PASSED\n..."
  }
}
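The `tests_passed` / `tests_failed` counts presumably come from parsing runner output. A hypothetical sketch assuming pytest-style result markers (the parsing logic is an assumption, not CooperBench's actual parser):

```python
# Hypothetical sketch: derive per-feature counts by counting pytest-style
# PASSED/FAILED markers in the runner output.
import re

def count_results(output: str) -> dict:
    passed = len(re.findall(r"\bPASSED\b", output))
    failed = len(re.findall(r"\bFAILED\b", output))
    return {"passed": failed == 0 and passed > 0,
            "tests_passed": passed,
            "tests_failed": failed,
            "tests_total": passed + failed}
```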

Merge analysis (cooperative mode)

Cooperative evaluations track merge success:
{
  "merge": {
    "status": "clean",
    "strategy": "apply"
  }
}

Output structure

Evaluation results are saved alongside experiment logs:
logs/
└── my-experiment/
    ├── eval_summary.json        # Aggregate evaluation stats
    └── solo/
        └── llama_index_task/
            └── 8394/
                └── f1_f2/
                    └── eval.json    # Task-specific evaluation

eval.json format

{
  "repo": "llama_index_task",
  "task_id": 8394,
  "features": [1, 2],
  "setting": "solo",
  "merge": null,
  "feature1": {
    "passed": true,
    "tests_passed": 8,
    "tests_failed": 0,
    "tests_total": 8,
    "output": "..."
  },
  "feature2": {
    "passed": true,
    "tests_passed": 12,
    "tests_failed": 0,
    "tests_total": 12,
    "output": "..."
  },
  "both_passed": true,
  "error": null,
  "evaluated_at": "2024-03-15T12:30:45"
}

eval_summary.json format

{
  "run_name": "my-experiment",
  "evaluated_at": "2024-03-15T12:40:00",
  "total_runs": 25,
  "passed": 23,
  "failed": 2,
  "errors": 0,
  "skipped": 0,
  "pass_rate": 0.92,
  "results": [
    {"run": "llama_index_task/8394/1,2", "status": "pass"},
    {"run": "dspy_task/142/1,3", "status": "fail"}
  ]
}
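These summary counts can be rebuilt from the per-task files. A hedged sketch assuming the directory layout and field names shown above (the helper name is illustrative):

```python
# Illustrative sketch: walk an experiment directory, read each eval.json,
# and recompute the summary counts. Errors are excluded from the pass rate.
import json
from pathlib import Path

def summarize(experiment_dir: str) -> dict:
    passed = failed = errors = 0
    for path in Path(experiment_dir).rglob("eval.json"):
        result = json.loads(path.read_text())
        if result.get("error"):
            errors += 1
        elif result.get("both_passed"):
            passed += 1
        else:
            failed += 1
    total = passed + failed
    return {"passed": passed, "failed": failed, "errors": errors,
            "pass_rate": passed / total if total else 0.0}
```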

Examples

Evaluate specific repository

Test only LlamaIndex tasks:
cooperbench eval -n my-experiment -r llama_index_task
Output:
cooperbench eval my-experiment
runs: 12
backend: modal

✓ pass llama_index_task/8394 [1,2]
✓ pass llama_index_task/8421 [1,3]
✗ fail llama_index_task/8445 [2,3]
...

┌────────────┬─────┐
│ passed     │  11 │
│ failed     │   1 │
│ pass rate  │ 92% │
└────────────┴─────┘

Force re-evaluation

Re-run evaluation for all tasks:
cooperbench eval -n my-experiment --force

Large-scale GCP evaluation

Evaluate 1000+ tasks with high parallelism:
cooperbench eval \
  -n my-large-experiment \
  --backend gcp \
  --concurrency 100
Output:
cooperbench eval my-large-experiment
runs: 1247
backend: gcp

submitting to GCP Batch ━━━━━━━━━━━━━━━━━━━━━ 100%
provisioning VMs        ━━━━━━━━━━━━━━━━━━━━━ 100%
evaluating             ━━━━━━━━━━━━━━━━━━━━━ 100%
collecting results     ━━━━━━━━━━━━━━━━━━━━━ 100%

✓ pass llama_index_task/8394 [1,2]
✓ pass dspy_task/142 [1,3]
...

Single task evaluation

Debug a specific task with detailed output:
cooperbench eval -n my-experiment -r llama_index_task -t 8394 -f 1,2

Understanding evaluation failures

Patch application failures

If a patch fails to apply:
{
  "error": "patch does not apply",
  "both_passed": false
}
Common causes:
  • Agent modified wrong files
  • Patch format is invalid
  • Base commit mismatch

Test failures

If tests fail after the patch is applied:
{
  "feature1": {
    "passed": false,
    "tests_passed": 3,
    "tests_failed": 2,
    "output": "FAILED test_core.py::test_loader - AssertionError"
  }
}
Common causes:
  • Incomplete implementation
  • Logic errors in code
  • Side effects on existing tests

Merge conflicts

If patches conflict in cooperative mode:
{
  "merge": {
    "status": "conflict",
    "conflicted_files": ["src/core.py"]
  },
  "both_passed": false
}
Analysis:
  • Indicates agents modified overlapping code regions
  • Tests whether agents can work independently
  • Key metric for cooperation effectiveness

Next steps

Running experiments

Learn how to run CooperBench experiments

Backends

Choose the right evaluation backend