Execute CooperBench tasks with customizable agents, models, and execution settings. This is the primary entry point for running benchmarks.

Function signature

from cooperbench import run

run(
    run_name: str,
    subset: str | None = None,
    repo: str | None = None,
    task_id: int | None = None,
    features: list[int] | None = None,
    model_name: str = "vertex_ai/gemini-3-flash-preview",
    agent: str = "mini_swe_agent",
    concurrency: int = 20,
    force: bool = False,
    redis_url: str = "redis://localhost:6379",
    setting: str = "coop",
    git_enabled: bool = False,
    messaging_enabled: bool = True,
    auto_eval: bool = True,
    eval_concurrency: int = 10,
    backend: str = "modal",
    agent_config: str | None = None,
) -> None

Parameters

run_name
str
required
Experiment name used for organizing logs. Creates a directory at logs/{run_name}/.
subset
str | None
default:"None"
Use a predefined task subset (e.g., "lite" for quick testing). Subsets are defined in dataset/subsets/.
repo
str | None
default:"None"
Filter tasks by repository name (e.g., "llama_index_task"). Runs only tasks from this repository.
task_id
int | None
default:"None"
Filter to a specific task ID. Useful for debugging individual tasks.
features
list[int] | None
default:"None"
Specific feature pair to run (e.g., [1, 2]). If not specified, runs all feature combinations.
model_name
str
default:"vertex_ai/gemini-3-flash-preview"
LLM model identifier. Supports OpenAI models (e.g., "gpt-4o"), Anthropic models (e.g., "claude-3-5-sonnet-20241022"), and Vertex AI models (e.g., "vertex_ai/gemini-3-flash-preview").
agent
str
default:"mini_swe_agent"
Agent framework to use. Available agents: "mini_swe_agent", "swe_agent", "openhands", "mini_swe_agent_v2".
concurrency
int
default:"20"
Maximum number of tasks to run in parallel.
force
bool
default:"False"
If True, reruns tasks even if results already exist.
redis_url
str
default:"redis://localhost:6379"
Redis server URL for agent communication in cooperative mode. Required when messaging_enabled=True.
setting
str
default:"coop"
Execution mode:
  • "coop": Two agents collaborate on different features
  • "solo": Single agent implements both features
git_enabled
bool
default:"False"
Enable git collaboration features (push, pull, merge). Only applies to cooperative mode.
messaging_enabled
bool
default:"True"
Enable agent-to-agent messaging via the send_message command. Requires Redis in cooperative mode.
auto_eval
bool
default:"True"
Automatically evaluate runs after completion. Results are saved to eval.json.
eval_concurrency
int
default:"10"
Maximum number of parallel evaluations when auto_eval=True.
backend
str
default:"modal"
Execution backend: "modal", "docker", or "gcp".
agent_config
str | None
default:"None"
Path to agent-specific configuration file (optional).
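
Since messaging_enabled=True requires a reachable Redis server at redis_url, it can be useful to verify connectivity before launching a run. The helper below is a hypothetical sketch (not part of cooperbench) that does a cheap TCP-level check against the host and port in the URL:

```python
import socket
from urllib.parse import urlparse

def redis_reachable(redis_url: str, timeout: float = 2.0) -> bool:
    """Hypothetical helper: check that the Redis host/port accepts TCP connections.

    This only verifies the socket is open; it does not speak the Redis protocol.
    """
    parsed = urlparse(redis_url)
    host = parsed.hostname or "localhost"
    port = parsed.port or 6379
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `redis_reachable("redis://localhost:6379")` before `run(...)` gives a faster, clearer failure than a mid-run connection error.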

Basic usage

Run a single task

from cooperbench import run

# Run one task in cooperative mode
run(
    run_name="my_experiment",
    repo="llama_index_task",
    task_id=1,
    features=[1, 2],
    model_name="gpt-4o",
)

Run all tasks in a subset

# Run the lite subset (faster iteration)
run(
    run_name="lite_test",
    subset="lite",
    model_name="vertex_ai/gemini-3-flash-preview",
    concurrency=10,
)

Run in solo mode

# Single agent implements both features
run(
    run_name="solo_baseline",
    subset="lite",
    setting="solo",
    model_name="claude-3-5-sonnet-20241022",
)

Advanced usage

Custom agent and backend

# Use SWE-agent with Docker backend
run(
    run_name="swe_agent_test",
    repo="django_task",
    task_id=5,
    agent="swe_agent",
    backend="docker",
    model_name="gpt-4o",
)

Enable git collaboration

# Agents can push/pull/merge via git
run(
    run_name="git_collab",
    subset="lite",
    git_enabled=True,
    messaging_enabled=True,
    model_name="gpt-4o",
)

Force rerun with custom concurrency

# Rerun existing tasks with higher parallelism
run(
    run_name="rerun_experiment",
    subset="lite",
    force=True,
    concurrency=50,
    eval_concurrency=20,
)

Skip automatic evaluation

# Run tasks without evaluating
run(
    run_name="quick_run",
    subset="lite",
    auto_eval=False,
)

# Evaluate later using cooperbench.evaluate()
from cooperbench import evaluate
evaluate(run_name="quick_run", subset="lite")

Output structure

Results are saved to logs/{run_name}/{setting}/{repo}/{task_id}/{features}/:
logs/my_experiment/
├── config.json          # Run configuration
├── summary.json         # Session summary with costs and pass rates
└── coop/
    └── llama_index_task/
        └── 1/
            └── f1_f2/
                ├── result.json      # Task execution results
                ├── agent1.patch     # Agent 1's code changes
                ├── agent2.patch     # Agent 2's code changes
                ├── agent1.log       # Agent 1's execution log
                ├── agent2.log       # Agent 2's execution log
                └── eval.json        # Evaluation results (if auto_eval=True)

Return value

This function returns None. Results are saved to the logs directory and printed to the console.
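
Because run() returns nothing, post-hoc analysis works off the files in the logs directory. A minimal sketch for gathering every result.json from a run, assuming only the directory layout shown above (the fields inside result.json depend on the agent and are not specified here):

```python
import json
from pathlib import Path

def collect_results(run_name: str, logs_dir: str = "logs") -> list[dict]:
    """Gather all result.json files written under logs/{run_name}/.

    Relies only on the documented layout
    logs/{run_name}/{setting}/{repo}/{task_id}/{features}/result.json;
    the contents of each result.json are left opaque.
    """
    results = []
    for path in sorted(Path(logs_dir, run_name).rglob("result.json")):
        with open(path) as f:
            results.append({"path": str(path), "result": json.load(f)})
    return results
```

For example, `collect_results("my_experiment")` returns one entry per completed task/feature combination, which can then be filtered by repository or setting via the recorded path.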