Test agent-generated patches against feature tests to measure task success. Evaluation can be run automatically after task execution or separately for completed runs.

Function signature

from cooperbench import evaluate

evaluate(
    run_name: str,
    subset: str | None = None,
    repo: str | None = None,
    task_id: int | None = None,
    features: list[int] | None = None,
    concurrency: int = 10,
    force: bool = False,
    backend: str = "modal",
) -> None

Parameters

run_name
str
required
Name of the run to evaluate. Corresponds to the run name used in run().
subset
str | None
default:"None"
Filter evaluation to a specific subset (e.g., "lite").
repo
str | None
default:"None"
Filter by repository name (e.g., "llama_index_task").
task_id
int | None
default:"None"
Filter to a specific task ID.
features
list[int] | None
default:"None"
Specific feature pair to evaluate (e.g., [1, 2]).
concurrency
int
default:"10"
Number of parallel evaluations to run.
force
bool
default:"False"
If True, re-evaluates even if eval.json already exists.
backend
str
default:"modal"
Evaluation backend: "modal", "docker", or "gcp".

Basic usage

Evaluate all runs

from cooperbench import evaluate

# Evaluate all tasks in a run
evaluate(run_name="my_experiment")

Evaluate a subset

# Evaluate only lite subset tasks
evaluate(
    run_name="my_experiment",
    subset="lite",
)

Evaluate specific tasks

# Evaluate a single task
evaluate(
    run_name="my_experiment",
    repo="llama_index_task",
    task_id=1,
    features=[1, 2],
)

Advanced usage

Force re-evaluation

# Re-run evaluation even if results exist
evaluate(
    run_name="my_experiment",
    force=True,
)

Use different backend

# Evaluate using Docker instead of Modal
evaluate(
    run_name="my_experiment",
    backend="docker",
    concurrency=5,
)

High-concurrency evaluation

# Evaluate many tasks in parallel
evaluate(
    run_name="large_experiment",
    concurrency=50,
    backend="modal",
)

How evaluation works

Cooperative mode

For cooperative runs, evaluation:
  1. Merges patches from both agents (agent1.patch + agent2.patch)
  2. Applies the merged patch to the repository
  3. Applies feature 1 tests and runs them
  4. Applies feature 2 tests and runs them
  5. Reports whether both feature tests pass

Solo mode

For solo runs, evaluation:
  1. Applies the single agent’s patch (solo.patch)
  2. Applies feature 1 tests and runs them
  3. Applies feature 2 tests and runs them
  4. Reports whether both feature tests pass
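The pass/fail decision implied by the steps above can be sketched as follows. This is a simplified illustration of the documented behavior, not cooperbench's actual implementation, and the helper name is hypothetical:

```python
# Sketch of the pass/fail decision described above.
# A cooperative run fails outright if the merge does not succeed;
# otherwise (and for solo runs) both feature test suites must pass.

def both_features_pass(result: dict) -> bool:
    merge = result.get("merge")  # None for solo runs
    if merge is not None and merge.get("status") != "success":
        return False  # a conflicting or failed merge cannot pass
    return result["feature1"]["passed"] and result["feature2"]["passed"]

# Example: a cooperative run with a clean merge and passing tests
coop_result = {
    "merge": {"status": "success"},
    "feature1": {"passed": True},
    "feature2": {"passed": True},
}
print(both_features_pass(coop_result))  # True
```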

Evaluation results

Results are saved to logs/{run_name}/{setting}/{repo}/{task_id}/{features}/eval.json:
{
  "repo": "llama_index_task",
  "task_id": 1,
  "features": [1, 2],
  "setting": "coop",
  "merge": {
    "status": "success",
    "strategy": "recursive"
  },
  "feature1": {
    "passed": true,
    "tests_passed": 5,
    "tests_failed": 0,
    "test_output": "..."
  },
  "feature2": {
    "passed": true,
    "tests_passed": 3,
    "tests_failed": 0,
    "test_output": "..."
  },
  "both_passed": true,
  "error": null,
  "evaluated_at": "2024-03-04T10:30:00"
}
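Since eval.json is plain JSON, results can be inspected with the standard library. The snippet below parses an inline document of the shape shown above for illustration; in practice you would read the file from the logs/{run_name}/... path:

```python
import json

# Parse an eval.json document of the shape shown above.
# In practice, load it from logs/{run_name}/{setting}/{repo}/{task_id}/{features}/eval.json.
raw = '''{
  "repo": "llama_index_task",
  "task_id": 1,
  "features": [1, 2],
  "setting": "coop",
  "merge": {"status": "success", "strategy": "recursive"},
  "feature1": {"passed": true, "tests_passed": 5, "tests_failed": 0},
  "feature2": {"passed": true, "tests_passed": 3, "tests_failed": 0},
  "both_passed": true,
  "error": null
}'''

result = json.loads(raw)
total_passed = result["feature1"]["tests_passed"] + result["feature2"]["tests_passed"]
print(result["both_passed"], total_passed)  # True 8
```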

Result fields

repo
str
Repository name
task_id
int
Task identifier
features
list[int]
Feature pair that was tested
setting
str
Execution mode ("coop" or "solo")
merge
object | null
Merge information (cooperative mode only):
  • status: "success", "conflict", or "failed"
  • strategy: Git merge strategy used
feature1
object
Feature 1 test results:
  • passed: Whether all tests passed
  • tests_passed: Number of passing tests
  • tests_failed: Number of failing tests
  • test_output: Full test output
feature2
object
Feature 2 test results (same structure as feature1)
both_passed
bool
True if both feature tests passed
error
str | null
Error message if evaluation failed
evaluated_at
str
ISO timestamp of evaluation

Summary output

A summary is also saved to logs/{run_name}/eval_summary.json:
{
  "run_name": "my_experiment",
  "evaluated_at": "2024-03-04T10:30:00",
  "total_runs": 100,
  "passed": 75,
  "failed": 20,
  "errors": 5,
  "skipped": 0,
  "pass_rate": 0.75,
  "results": [
    {"run": "llama_index_task/1/f1,f2", "status": "pass"},
    {"run": "django_task/5/f1,f3", "status": "fail"}
  ]
}
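A quick sanity check on a summary of this shape: the four status counts should sum to total_runs, and (judging by the example numbers, 75/100) pass_rate appears to be passed divided by total_runs. The derivation of pass_rate here is an inference from the example, not documented behavior:

```python
import json

# Sanity-check an eval_summary.json document of the shape shown above.
summary = json.loads('''{
  "run_name": "my_experiment",
  "total_runs": 100,
  "passed": 75,
  "failed": 20,
  "errors": 5,
  "skipped": 0,
  "pass_rate": 0.75
}''')

accounted = summary["passed"] + summary["failed"] + summary["errors"] + summary["skipped"]
assert accounted == summary["total_runs"]
assert abs(summary["pass_rate"] - summary["passed"] / summary["total_runs"]) < 1e-9
print(f"pass rate: {summary['pass_rate']:.0%}")  # pass rate: 75%
```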

Low-level testing functions

test_merged()

Test merged patches from two agents (cooperative mode).

from cooperbench import test_merged

result = test_merged(
    repo_name="llama_index_task",
    task_id=1,
    feature1_id=1,
    feature2_id=2,
    patch1="diff --git a/file.py...",
    patch2="diff --git a/other.py...",
    timeout=600,
    backend="modal",
)

print(result["both_passed"])  # True if both features pass
print(result["merge"]["status"])  # "success", "conflict", or "failed"

Parameters

repo_name
str
required
Repository name
task_id
int
required
Task ID
feature1_id
int
required
First feature ID
feature2_id
int
required
Second feature ID
patch1
str | Path | None
default:"None"
First agent’s patch (as string or path to .patch file)
patch2
str | Path | None
default:"None"
Second agent’s patch (as string or path to .patch file)
timeout
int
default:"600"
Maximum execution time in seconds
backend
str
default:"modal"
Evaluation backend

test_solo()

Test a single patch against both feature tests (solo mode).

from cooperbench import test_solo

result = test_solo(
    repo_name="llama_index_task",
    task_id=1,
    feature1_id=1,
    feature2_id=2,
    patch="diff --git a/file.py...",
    timeout=600,
    backend="modal",
)

print(result["both_passed"])  # True if both features pass

Parameters are identical to test_merged(), except that a single patch parameter replaces patch1 and patch2.

run_patch_test()

Test a patch against a single feature’s tests.

from cooperbench import run_patch_test

result = run_patch_test(
    repo_name="llama_index_task",
    task_id=1,
    feature_id=1,
    agent_patch="diff --git a/file.py...",
    timeout=600,
    backend="modal",
)

print(f"Passed: {result['passed']}")
print(f"Tests passed: {result['tests_passed']}/{result['tests_total']}")

Parameters

repo_name
str
required
Repository name
task_id
int
required
Task ID
feature_id
int
required
Feature ID to test
agent_patch
str | Path | None
default:"None"
Agent’s patch (as string or path). If None, uses gold patch from dataset.
timeout
int
default:"600"
Maximum execution time in seconds
backend
str
default:"modal"
Evaluation backend

evaluate_merge()

Wrapper for test_merged() that returns results in a format compatible with training code.

from cooperbench import evaluate_merge

result = evaluate_merge(
    repo_name="llama_index_task",
    task_id=1,
    feature1_id=1,
    feature2_id=2,
    patch1="diff --git a/file.py...",
    patch2="diff --git a/other.py...",
)

print(f"Feature 1: {result['feature1_tests_passed']}/{result['feature1_tests_total']}")
print(f"Feature 2: {result['feature2_tests_passed']}/{result['feature2_tests_total']}")
if result.get('error'):
    print(f"Error: {result['error']}")

Parameters

Parameters are identical to test_merged().

Returns

Returns a dictionary with the following keys:
feature1_tests_passed
int
Number of tests passed for feature 1
feature1_tests_total
int
Total number of tests for feature 1
feature2_tests_passed
int
Number of tests passed for feature 2
feature2_tests_total
int
Total number of tests for feature 2
error
str | None
Error message if evaluation failed, None otherwise
This function is primarily used for training and benchmarking workflows that require a specific result format.
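For such workflows, the returned counts can be reduced to per-feature pass fractions. The helper below is hypothetical (not part of cooperbench); only the field names come from the table above:

```python
# Hypothetical helper: summarize an evaluate_merge()-style result dict
# into per-feature pass fractions. Field names match the table above;
# the helper itself is not part of cooperbench.

def pass_fractions(result: dict) -> tuple[float, float]:
    def frac(passed: int, total: int) -> float:
        return passed / total if total else 0.0
    return (
        frac(result["feature1_tests_passed"], result["feature1_tests_total"]),
        frac(result["feature2_tests_passed"], result["feature2_tests_total"]),
    )

example = {
    "feature1_tests_passed": 5, "feature1_tests_total": 5,
    "feature2_tests_passed": 2, "feature2_tests_total": 3,
    "error": None,
}
print(pass_fractions(example))  # (1.0, 0.6666666666666666)
```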