The cooperbench eval command evaluates agent runs by executing test suites in isolated sandboxes and computing success metrics.

Usage

cooperbench eval -n <experiment_name> [options]

Basic examples

Evaluate an experiment

cooperbench eval -n my-experiment
Evaluates all tasks in the logs/my-experiment/ directory.

Force re-evaluation

cooperbench eval -n my-experiment --force
Re-evaluates even if eval.json already exists.

Evaluate specific tasks

cooperbench eval -n my-experiment -t 8394
Evaluates only task 8394.

Parameters

Required

-n, --name
string
required
Experiment name to evaluate. Must match a directory in logs/. Example: my-experiment (evaluates logs/my-experiment/)

Task filtering

-s, --subset
string
Use a predefined task subset. Example: lite
-r, --repo
string
Filter by repository name. Example: llama_index_task
-t, --task
integer
Filter by specific task ID. Example: 8394
-f, --features
string
Specific feature pair to evaluate, comma-separated. Example: 1,2

Execution

-c, --concurrency
integer
default:"10"
Number of parallel evaluations.
--backend
choice
default:"modal"
Execution backend for running test suites. Options:
  • modal - Modal cloud platform (default)
  • docker - Local Docker containers
  • gcp - Google Cloud Platform Batch jobs
--force
flag
Force re-evaluation even if eval.json exists.

How evaluation works

For each task instance:
  1. Load agent patches - Reads patch.diff from agent logs
  2. Create sandbox - Spins up isolated container with repository
  3. Apply patches - Applies agent changes to codebase
  4. Run tests - Executes test suite defined in task metadata
  5. Compute results - Records pass/fail for each test
  6. Save results - Writes eval.json with test outcomes
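The steps above can be sketched as a per-task loop. The helper below is a simplified illustration, not the actual cooperbench internals; the sandbox creation and test execution (steps 2-4) are stubbed out with a fixed result:

```python
import json
from pathlib import Path


def evaluate_task(task_dir: Path) -> dict:
    """Sketch of the per-task evaluation loop; sandbox steps are stubbed."""
    # 1. Load agent patches
    patch = (task_dir / "patch.diff").read_text()
    # 2-4. Create sandbox, apply patch, run tests (stubbed as a fixed result)
    test_results = [
        {"test_name": "test_feature_1", "status": "passed"},
        {"test_name": "test_feature_2", "status": "failed"},
    ]
    # 5. Compute pass/fail counts
    passed = sum(r["status"] == "passed" for r in test_results)
    result = {
        "tests_passed": passed,
        "tests_failed": len(test_results) - passed,
        "tests_total": len(test_results),
        "success": passed == len(test_results),
        "test_results": test_results,
    }
    # 6. Save results next to the agent logs
    (task_dir / "eval.json").write_text(json.dumps(result, indent=2))
    return result
```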

Evaluation output

Results are saved to:
logs/{experiment_name}/task_{id}_feature_{f1}_{f2}/eval.json

Example eval.json

{
  "task_id": 8394,
  "features": [1, 2],
  "tests_passed": 12,
  "tests_failed": 2,
  "tests_total": 14,
  "success": false,
  "test_results": [
    {
      "test_name": "test_feature_1",
      "status": "passed",
      "duration": 0.45
    },
    {
      "test_name": "test_feature_2",
      "status": "failed",
      "error": "AssertionError: expected 42, got 41"
    }
  ],
  "duration_seconds": 125.3
}
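Given the field layout above, a small helper can read an eval.json payload, sanity-check that the counts are consistent, and print a one-line summary (assumes the example schema; field names are taken from the sample, not a published spec):

```python
import json


def summarize(eval_json: str) -> str:
    """Summarize an eval.json payload like the example above."""
    r = json.loads(eval_json)
    # The counts should be internally consistent...
    assert r["tests_passed"] + r["tests_failed"] == r["tests_total"]
    # ...and success should mean that no test failed.
    assert r["success"] == (r["tests_failed"] == 0)
    return (f"task {r['task_id']} features {r['features']}: "
            f"{r['tests_passed']}/{r['tests_total']} passed")
```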

Filtering examples

Evaluate specific subset

cooperbench eval -n exp-123 -s lite
Only evaluates tasks in the lite subset.

Evaluate specific repository

cooperbench eval -n exp-123 -r dspy_task

Evaluate specific task

cooperbench eval -n exp-123 -t 8394

Evaluate specific feature pair

cooperbench eval -n exp-123 -t 8394 -f 1,2
Evaluates only features 1 and 2 of task 8394.

Combine filters

cooperbench eval -n exp-123 -s lite -r llama_index_task

Backend examples

Evaluate on Modal (cloud)

cooperbench eval -n my-experiment --backend modal
Default. Runs evaluation sandboxes on Modal.

Evaluate locally with Docker

cooperbench eval -n my-experiment --backend docker
Runs evaluation in local Docker containers. Requires Docker installed.

Evaluate on GCP Batch

cooperbench eval -n my-experiment --backend gcp
Runs evaluation on GCP Batch jobs. Requires cooperbench config gcp first.

Performance tuning

High concurrency for cloud

cooperbench eval -n my-experiment -c 50 --backend modal
Runs 50 evaluations in parallel on Modal.

Low concurrency for local

cooperbench eval -n my-experiment -c 2 --backend docker
Runs only 2 evaluations in parallel locally to avoid resource exhaustion.
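The concurrency flag amounts to a bounded worker pool. The sketch below illustrates the idea with Python's ThreadPoolExecutor; the real backends dispatch sandboxes to Modal, Docker, or GCP rather than local threads:

```python
from concurrent.futures import ThreadPoolExecutor


def run_evals(task_ids, evaluate, concurrency=10):
    """Run evaluate(task_id) for each task, at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() preserves input order, so results line up with task_ids
        return dict(zip(task_ids, pool.map(evaluate, task_ids)))
```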

Skip auto-evaluation

By default, cooperbench run automatically evaluates after completion. To disable:
cooperbench run --no-auto-eval
Then evaluate manually later:
cooperbench eval -n <experiment_name>

Incremental evaluation

Evaluation skips tasks that already have eval.json:
cooperbench eval -n my-experiment
To force re-evaluation:
cooperbench eval -n my-experiment --force
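The skip-unless-forced behavior boils down to an existence check per task directory. A minimal sketch of that decision (not the actual implementation):

```python
from pathlib import Path


def needs_eval(task_dir: Path, force: bool = False) -> bool:
    """A task is (re-)evaluated only if eval.json is missing or --force is set."""
    return force or not (task_dir / "eval.json").exists()
```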

Aggregate results

After evaluation, you can aggregate results across all tasks:
ls logs/my-experiment/*/eval.json | xargs jq '.success' | grep -c true
Or use Python:
import json
from pathlib import Path

exp_dir = Path("logs/my-experiment")
successes = 0
total = 0

for eval_file in exp_dir.glob("*/eval.json"):
    with open(eval_file) as f:
        result = json.load(f)
    total += 1
    if result["success"]:
        successes += 1

if total:
    print(f"Success rate: {successes}/{total} ({100*successes/total:.1f}%)")
else:
    print("No eval.json files found")