Test agent-generated patches against feature tests to measure task success. Evaluation can be run automatically after task execution or separately for completed runs.

Function signature

from cooperbench import evaluate

evaluate(
    run_name: str,
    subset: str | None = None,
    repo: str | None = None,
    task_id: int | None = None,
    features: list[int] | None = None,
    concurrency: int = 10,
    force: bool = False,
    backend: str = "modal",
) -> None

Parameters

run_name
str
required
Name of the run to evaluate. Corresponds to the run name used in run().
subset
str | None
default:"None"
Filter evaluation to a specific subset (e.g., "lite").
repo
str | None
default:"None"
Filter by repository name (e.g., "llama_index_task").
task_id
int | None
default:"None"
Filter to a specific task ID.
features
list[int] | None
default:"None"
Specific feature pair to evaluate (e.g., [1, 2]).
concurrency
int
default:"10"
Number of parallel evaluations to run.
force
bool
default:"False"
If True, re-evaluates even if eval.json already exists.
backend
str
default:"modal"
Evaluation backend: "modal", "docker", or "gcp".

Basic usage

Evaluate all runs

from cooperbench import evaluate

# Evaluate all tasks in a run
evaluate(run_name="my_experiment")

Evaluate a subset

# Evaluate only lite subset tasks
evaluate(
    run_name="my_experiment",
    subset="lite",
)

Evaluate specific tasks

# Evaluate a single task
evaluate(
    run_name="my_experiment",
    repo="llama_index_task",
    task_id=1,
    features=[1, 2],
)

Advanced usage

Force re-evaluation

# Re-run evaluation even if results exist
evaluate(
    run_name="my_experiment",
    force=True,
)

Use different backend

# Evaluate using Docker instead of Modal
evaluate(
    run_name="my_experiment",
    backend="docker",
    concurrency=5,
)

High-concurrency evaluation

# Evaluate many tasks in parallel
evaluate(
    run_name="large_experiment",
    concurrency=50,
    backend="modal",
)

How evaluation works

Cooperative mode

For cooperative runs, evaluation:
  1. Merges patches from both agents (agent1.patch + agent2.patch)
  2. Applies the merged patch to the repository
  3. Applies feature 1 tests and runs them
  4. Applies feature 2 tests and runs them
  5. Reports whether both feature tests pass

Solo mode

For solo runs, evaluation:
  1. Applies the single agent’s patch (solo.patch)
  2. Applies feature 1 tests and runs them
  3. Applies feature 2 tests and runs them
  4. Reports whether both feature tests pass
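The pass/fail decision implied by the steps above can be sketched as follows. This is a simplified illustration of the documented behavior, not cooperbench's actual implementation, and the helper name is hypothetical:

```python
# Sketch of the pass/fail decision described above.
# A cooperative run fails outright if the merge does not succeed;
# otherwise (and for solo runs) both feature test suites must pass.

def both_features_pass(result: dict) -> bool:
    merge = result.get("merge")  # None for solo runs
    if merge is not None and merge.get("status") != "success":
        return False  # a conflicting or failed merge cannot pass
    return result["feature1"]["passed"] and result["feature2"]["passed"]

# Example: a cooperative run with a clean merge and passing tests
coop_result = {
    "merge": {"status": "success"},
    "feature1": {"passed": True},
    "feature2": {"passed": True},
}
print(both_features_pass(coop_result))  # True
```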

Evaluation results

Results are saved to logs/{run_name}/{setting}/{repo}/{task_id}/{features}/eval.json:
{
  "repo": "llama_index_task",
  "task_id": 1,
  "features": [1, 2],
  "setting": "coop",
  "merge": {
    "status": "success",
    "strategy": "recursive"
  },
  "feature1": {
    "passed": true,
    "tests_passed": 5,
    "tests_failed": 0,
    "test_output": "..."
  },
  "feature2": {
    "passed": true,
    "tests_passed": 3,
    "tests_failed": 0,
    "test_output": "..."
  },
  "both_passed": true,
  "error": null,
  "evaluated_at": "2024-03-04T10:30:00"
}
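Since eval.json is plain JSON, results can be inspected with the standard library. The snippet below parses an inline document of the shape shown above for illustration; in practice you would read the file from the logs/{run_name}/... path:

```python
import json

# Parse an eval.json document of the shape shown above.
# In practice, load it from logs/{run_name}/{setting}/{repo}/{task_id}/{features}/eval.json.
raw = '''{
  "repo": "llama_index_task",
  "task_id": 1,
  "features": [1, 2],
  "setting": "coop",
  "merge": {"status": "success", "strategy": "recursive"},
  "feature1": {"passed": true, "tests_passed": 5, "tests_failed": 0},
  "feature2": {"passed": true, "tests_passed": 3, "tests_failed": 0},
  "both_passed": true,
  "error": null
}'''

result = json.loads(raw)
total_passed = result["feature1"]["tests_passed"] + result["feature2"]["tests_passed"]
print(result["both_passed"], total_passed)  # True 8
```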

Result fields

repo
str
Repository name
task_id
int
Task identifier
features
list[int]
Feature pair that was tested
setting
str
Execution mode ("coop" or "solo")
merge
object | null
Merge information (cooperative mode only):
  • status: "success", "conflict", or "failed"
  • strategy: Git merge strategy used
feature1
object
Feature 1 test results:
  • passed: Whether all tests passed
  • tests_passed: Number of passing tests
  • tests_failed: Number of failing tests
  • test_output: Full test output
feature2
object
Feature 2 test results (same structure as feature1)
both_passed
bool
True if both feature tests passed
error
str | null
Error message if evaluation failed
evaluated_at
str
ISO timestamp of evaluation

Summary output

A summary is also saved to logs/{run_name}/eval_summary.json:
{
  "run_name": "my_experiment",
  "evaluated_at": "2024-03-04T10:30:00",
  "total_runs": 100,
  "passed": 75,
  "failed": 20,
  "errors": 5,
  "skipped": 0,
  "pass_rate": 0.75,
  "results": [
    {"run": "llama_index_task/1/f1,f2", "status": "pass"},
    {"run": "django_task/5/f1,f3", "status": "fail"}
  ]
}
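A quick sanity check on a summary of this shape: the four status counts should sum to total_runs, and (judging by the example numbers, 75/100) pass_rate appears to be passed divided by total_runs. The derivation of pass_rate here is an inference from the example, not documented behavior:

```python
import json

# Sanity-check an eval_summary.json document of the shape shown above.
summary = json.loads('''{
  "run_name": "my_experiment",
  "total_runs": 100,
  "passed": 75,
  "failed": 20,
  "errors": 5,
  "skipped": 0,
  "pass_rate": 0.75
}''')

accounted = summary["passed"] + summary["failed"] + summary["errors"] + summary["skipped"]
assert accounted == summary["total_runs"]
assert abs(summary["pass_rate"] - summary["passed"] / summary["total_runs"]) < 1e-9
print(f"pass rate: {summary['pass_rate']:.0%}")  # pass rate: 75%
```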

Low-level testing functions

test_merged()

Test merged patches from two agents (cooperative mode).

from cooperbench import test_merged

result = test_merged(
    repo_name="llama_index_task",
    task_id=1,
    feature1_id=1,
    feature2_id=2,
    patch1="diff --git a/file.py...",
    patch2="diff --git a/other.py...",
    timeout=600,
    backend="modal",
)

print(result["both_passed"])  # True if both features pass
print(result["merge"]["status"])  # "success", "conflict", or "failed"

Parameters

repo_name
str
required
Repository name
task_id
int
required
Task ID
feature1_id
int
required
First feature ID
feature2_id
int
required
Second feature ID
patch1
str | Path | None
default:"None"
First agent’s patch (as string or path to .patch file)
patch2
str | Path | None
default:"None"
Second agent’s patch (as string or path to .patch file)
timeout
int
default:"600"
Maximum execution time in seconds
backend
str
default:"modal"
Evaluation backend

test_solo()

Test a single patch against both feature tests (solo mode).

from cooperbench import test_solo

result = test_solo(
    repo_name="llama_index_task",
    task_id=1,
    feature1_id=1,
    feature2_id=2,
    patch="diff --git a/file.py...",
    timeout=600,
    backend="modal",
)

print(result["both_passed"])  # True if both features pass

Parameters are identical to test_merged(), except that a single patch parameter replaces patch1 and patch2.

run_patch_test()

Test a patch against a single feature’s tests.

from cooperbench import run_patch_test

result = run_patch_test(
    repo_name="llama_index_task",
    task_id=1,
    feature_id=1,
    agent_patch="diff --git a/file.py...",
    timeout=600,
    backend="modal",
)

print(f"Passed: {result['passed']}")
print(f"Tests passed: {result['tests_passed']}/{result['tests_total']}")

Parameters

repo_name
str
required
Repository name
task_id
int
required
Task ID
feature_id
int
required
Feature ID to test
agent_patch
str | Path | None
default:"None"
Agent’s patch (as string or path). If None, uses gold patch from dataset.
timeout
int
default:"600"
Maximum execution time in seconds
backend
str
default:"modal"
Evaluation backend

evaluate_merge()

Wrapper for test_merged() that returns results in a format compatible with training code.

from cooperbench import evaluate_merge

result = evaluate_merge(
    repo_name="llama_index_task",
    task_id=1,
    feature1_id=1,
    feature2_id=2,
    patch1="diff --git a/file.py...",
    patch2="diff --git a/other.py...",
)

print(f"Feature 1: {result['feature1_tests_passed']}/{result['feature1_tests_total']}")
print(f"Feature 2: {result['feature2_tests_passed']}/{result['feature2_tests_total']}")
if result.get('error'):
    print(f"Error: {result['error']}")

Parameters

Parameters are identical to test_merged().

Returns

Returns a dictionary with the following keys:
feature1_tests_passed
int
Number of tests passed for feature 1
feature1_tests_total
int
Total number of tests for feature 1
feature2_tests_passed
int
Number of tests passed for feature 2
feature2_tests_total
int
Total number of tests for feature 2
error
str | None
Error message if evaluation failed, None otherwise
This function is primarily used for training and benchmarking workflows that require a specific result format.
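For such workflows, the returned counts can be reduced to per-feature pass fractions. The helper below is hypothetical (not part of cooperbench); only the field names come from the table above:

```python
# Hypothetical helper: summarize an evaluate_merge()-style result dict
# into per-feature pass fractions. Field names match the table above;
# the helper itself is not part of cooperbench.

def pass_fractions(result: dict) -> tuple[float, float]:
    def frac(passed: int, total: int) -> float:
        return passed / total if total else 0.0
    return (
        frac(result["feature1_tests_passed"], result["feature1_tests_total"]),
        frac(result["feature2_tests_passed"], result["feature2_tests_total"]),
    )

example = {
    "feature1_tests_passed": 5, "feature1_tests_total": 5,
    "feature2_tests_passed": 2, "feature2_tests_total": 3,
    "error": None,
}
print(pass_fractions(example))  # (1.0, 0.6666666666666666)
```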