Before starting, ensure you’ve completed the installation and have:
  • An execution backend configured (Modal, GCP, or Docker)
  • Redis running (for cooperative mode)
  • LLM API keys in your .env file
  • Dataset downloaded to dataset/

Your first experiment

Let’s run a cooperative experiment with two agents working on a task from the llama_index repository.
1. Run cooperative agents

Execute a task with two agents, each implementing one feature:
cooperbench run -n my-first-experiment -r llama_index_task -m gpt-4o
This command:
  • Creates an experiment named my-first-experiment
  • Filters tasks to the llama_index_task repository
  • Uses GPT-4o as the LLM model
  • Runs in cooperative mode (default: two agents with messaging)
  • Uses Modal backend (default)
  • Enables automatic evaluation after completion
What happens under the hood:
  1. CooperBench loads tasks from dataset/llama_index_task/
  2. For each task, it spawns two sandboxed environments
  3. Each agent receives one feature description and can:
    • Read/write code files
    • Run tests
    • Send messages to the other agent via Redis
  4. Agents work concurrently until completion or timeout
  5. Generated patches are saved to logs/my-first-experiment/
  6. Evaluation runs automatically (unless --no-auto-eval is specified)
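The messaging in step 3 can be sketched with an in-memory queue standing in for Redis. The class and method names here are illustrative, not CooperBench's internal API:

```python
import queue

# Illustrative stand-in for the per-task Redis channel. In CooperBench the
# transport is Redis; a queue.Queue keeps this sketch self-contained.
class MessageChannel:
    def __init__(self):
        self._inboxes = {"agent1": queue.Queue(), "agent2": queue.Queue()}

    def send(self, sender: str, recipient: str, content: str) -> None:
        self._inboxes[recipient].put({"from": sender, "content": content})

    def receive(self, agent: str):
        # Non-blocking: agents poll between coding steps.
        try:
            return self._inboxes[agent].get_nowait()
        except queue.Empty:
            return None

channel = MessageChannel()
channel.send("agent1", "agent2", "I'm adding a new method to the BaseIndex class")
msg = channel.receive("agent2")
print(msg["from"], "->", msg["content"])
```

Each agent polls its inbox between steps, so communication never blocks code edits or test runs.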
2. Monitor progress

Watch the output as agents work:
[CooperBench] Starting experiment: my-first-experiment
[CooperBench] Setting: coop (2 agents per task)
[CooperBench] Model: gpt-4o
[CooperBench] Backend: modal
[CooperBench] Found 12 tasks in llama_index_task
[CooperBench] Running with concurrency=30

[Task 1/12] llama_index_task/task8394 features_1_2
  Agent 1: Implementing feature 1...
  Agent 2: Implementing feature 2...
  Agent 1 → Agent 2: "I'm adding a new method to the BaseIndex class"
  Agent 2 → Agent 1: "Acknowledged, I'll avoid modifying that class"
[Task 1/12] ✓ Complete in 3m 42s

...
3. View results

After completion, check the results:
# Results are saved in logs/
ls logs/my-first-experiment/
Output structure:
logs/my-first-experiment/
  llama_index_task/
    task8394/
      features_1_2/
        agent1/
          trajectory.json     # Full agent interaction log
          patch.diff          # Generated code changes
        agent2/
          trajectory.json
          patch.diff
        eval.json             # Test results and metrics
    task8395/
      ...
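Because every feature-pair directory contains an eval.json, results are easy to gather programmatically. A minimal sketch (the helper name is ours, and the demo builds a throwaway tree mirroring the layout above):

```python
import json
import tempfile
from pathlib import Path

def collect_evals(run_dir: Path) -> dict[str, dict]:
    """Map each task/feature-pair directory to its parsed eval.json."""
    return {
        str(p.parent.relative_to(run_dir)): json.loads(p.read_text())
        for p in sorted(run_dir.rglob("eval.json"))
    }

# Demo on a temporary tree shaped like logs/my-first-experiment/.
root = Path(tempfile.mkdtemp()) / "my-first-experiment"
pair = root / "llama_index_task" / "task8394" / "features_1_2"
pair.mkdir(parents=True)
(pair / "eval.json").write_text(json.dumps({"overall_success": True}))

results = collect_evals(root)
print(results)
```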

Run a solo experiment

Compare cooperative performance against a single agent handling both features:
cooperbench run -n my-solo-experiment -r llama_index_task -m gpt-4o --setting solo
Solo experiments provide a baseline for measuring the coordination deficit. Running both settings on the same tasks shows how much performance drops when the work is split between two coordinating agents.
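Given the per-task overall_success flags from both runs, the coordination deficit is simply the gap in pass rates. This framing is a convention for this comparison, not an official CooperBench metric:

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks whose overall_success flag is true."""
    return sum(results) / len(results)

# overall_success per task, same task order in both runs (toy data)
solo = [True, True, False, True]   # 75% pass rate
coop = [True, False, False, True]  # 50% pass rate

deficit = pass_rate(solo) - pass_rate(coop)
print(f"coordination deficit: {deficit:.0%}")  # coordination deficit: 25%
```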

Evaluate results

If you disabled auto-evaluation or want to re-evaluate:
cooperbench eval -n my-first-experiment
Evaluation tests each generated patch against the golden test suite:
{
  "agent1": {
    "tests_passed": 12,
    "tests_failed": 1,
    "success": false,
    "error": "AssertionError in test_feature1_edge_case"
  },
  "agent2": {
    "tests_passed": 8,
    "tests_failed": 0,
    "success": true,
    "error": null
  },
  "merged": {
    "tests_passed": 18,
    "tests_failed": 3,
    "success": false,
    "merge_conflict": true,
    "error": "Merge conflict in src/index.py"
  },
  "overall_success": false
}
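The overall verdict can be reproduced from the per-agent and merged fields. A sketch of the rule, assuming the eval.json schema shown above:

```python
eval_result = {
    "agent1": {"success": False},
    "agent2": {"success": True},
    "merged": {"success": False, "merge_conflict": True},
}

# Overall success requires every agent's patch to pass its own tests
# AND the merged patch to pass the combined suite.
agents = [v for k, v in eval_result.items() if k.startswith("agent")]
overall = all(a["success"] for a in agents) and eval_result["merged"]["success"]
print(overall)  # False: agent1 failed and the merge conflicted
```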

Example output structure

Here’s what a complete experiment looks like:
logs/my-first-experiment/
  llama_index_task/
    task8394/
      features_1_2/          # Feature pair 1 and 2
        agent1/
          trajectory.json    # Agent's full interaction history
          patch.diff         # Generated changes
        agent2/
          trajectory.json
          patch.diff
        eval.json            # Test results
      features_1_3/          # Feature pair 1 and 3
        ...
    task8395/
      ...
  dspy_task/               # Another repository
    task1234/
      ...
The trajectory.json file contains the complete interaction history:
{
  "agent_id": "agent1",
  "feature_id": 1,
  "model": "gpt-4o",
  "start_time": "2026-03-04T10:30:00Z",
  "end_time": "2026-03-04T10:33:42Z",
  "steps": [
    {
      "action": "read_file",
      "args": {"path": "src/index.py"},
      "observation": "...",
      "timestamp": "2026-03-04T10:30:05Z"
    },
    {
      "action": "send_message",
      "args": {"content": "I'm adding a new method..."},
      "observation": "Message sent",
      "timestamp": "2026-03-04T10:31:12Z"
    },
    ...
  ],
  "token_usage": {
    "prompt_tokens": 45230,
    "completion_tokens": 3821,
    "total_tokens": 49051
  }
}
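A trajectory like this can be summarized in a few lines, e.g. wall-clock duration, action counts, and token spend (field names as in the example above; the summary code is ours):

```python
from collections import Counter
from datetime import datetime

trajectory = {
    "agent_id": "agent1",
    "start_time": "2026-03-04T10:30:00Z",
    "end_time": "2026-03-04T10:33:42Z",
    "steps": [
        {"action": "read_file"},
        {"action": "send_message"},
        {"action": "read_file"},
    ],
    "token_usage": {"total_tokens": 49051},
}

fmt = "%Y-%m-%dT%H:%M:%SZ"
duration = datetime.strptime(trajectory["end_time"], fmt) - datetime.strptime(
    trajectory["start_time"], fmt
)
actions = Counter(step["action"] for step in trajectory["steps"])
print(duration, dict(actions), trajectory["token_usage"]["total_tokens"])
# 0:03:42 {'read_file': 2, 'send_message': 1} 49051
```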

Advanced usage

Filter by specific task

Run a single task by ID:
cooperbench run -n test-task -r llama_index_task -t 8394 -m gpt-4o

Run specific feature pairs

Test a particular combination of features:
cooperbench run -n test-features -r dspy_task -f 1,2 -m claude-sonnet-4.5

Use different backends

cooperbench run -n exp-modal --backend modal -r llama_index_task -m gpt-4o

Enable git collaboration

Allow agents to push/pull/merge via shared Git remote:
cooperbench run -n git-collab -r llama_index_task -m gpt-4o --git
Git collaboration is experimental and may increase coordination complexity. Start with messaging-only mode (default) first.

Disable messaging

Run agents without inter-agent communication:
cooperbench run -n no-msg -r llama_index_task -m gpt-4o --no-messaging
This reveals how much communication helps reduce conflicts.

Use different models

CooperBench supports any model via LiteLLM:
cooperbench run -n test -r llama_index_task -m gpt-4o
cooperbench run -n test -r llama_index_task -m gpt-5

Run on dataset subsets

Use predefined task subsets for faster iteration:
# Run on the "lite" subset (smaller, faster)
cooperbench run --setting solo -s lite -m gpt-4o

# Experiment name is auto-generated: solo-msa-gpt-4o-lite
Subsets are defined in dataset/subsets/. Check available subsets:
ls dataset/subsets/

Control concurrency

Adjust parallel execution for your backend’s capacity:
# Higher concurrency for cloud backends
cooperbench run -n test -r llama_index_task -m gpt-4o -c 50

# Lower concurrency for local Docker
cooperbench run -n test --backend docker -m gpt-4o -c 5

Python API

Use CooperBench programmatically:
from cooperbench import run, evaluate

# Run cooperative agents
run(
    run_name="my-experiment",
    repo="llama_index_task",
    model_name="gpt-4o",
    setting="coop",
    redis_url="redis://localhost:6379",
    backend="modal",
)

# Run solo baseline
run(
    run_name="my-solo-baseline",
    repo="llama_index_task",
    model_name="gpt-4o",
    setting="solo",
    backend="modal",
)

# Evaluate both
evaluate(run_name="my-experiment")
evaluate(run_name="my-solo-baseline")
The full run() signature:
def run(
    run_name: str,                    # Experiment name (required)
    subset: str | None = None,        # Dataset subset (e.g., "lite")
    repo: str | None = None,          # Filter by repository
    task_id: int | None = None,       # Filter by task ID
    features: list[int] | None = None,  # Specific feature pair
    model_name: str = "vertex_ai/gemini-3-flash-preview",
    agent: str = "mini_swe_agent",    # Agent framework
    concurrency: int = 30,            # Parallel tasks
    setting: str = "coop",            # "coop" or "solo"
    redis_url: str = "redis://localhost:6379",
    force: bool = False,              # Rerun existing results
    git_enabled: bool = False,        # Enable git collaboration
    messaging_enabled: bool = True,   # Enable messaging
    auto_eval: bool = True,           # Auto-evaluate after run
    eval_concurrency: int = 10,
    backend: str = "modal",           # "modal", "gcp", or "docker"
    agent_config: str | None = None,  # Path to agent config file
) -> None:
    ...

Understanding results

Success metrics

Each evaluation provides several success indicators:
  • Individual success: Did each agent’s patch pass its own tests?
  • Merge success: Did the patches merge without conflicts?
  • Overall success: Did the merged result pass all tests?
A task succeeds only if all agents’ patches pass their tests AND the merged result passes all combined tests.

Common failure modes

Agents fail to act on information their partner has shared. For example:
  • Agent 1 adds a new parameter to a function
  • Agent 2 calls that function without the new parameter
  • Tests fail despite no merge conflicts
Questions go unanswered, breaking decision loops:
  • Agent 1 asks: “Should I use sync or async?”
  • Agent 2 doesn’t respond or gives an unclear answer
  • Agent 1 proceeds with a guess that conflicts with Agent 2’s choice
Agents break promises or make unverifiable claims:
  • Agent 1 promises to “add type hints”
  • Agent 1’s patch has incomplete type annotations
  • Agent 2’s code assumes full type coverage and fails
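The first failure mode (an unshared interface change) is easy to reproduce in miniature. The function name and arguments here are invented for illustration:

```python
# Agent 1's patch: adds a required parameter to an existing function.
def build_index(documents, chunk_size):
    return {"docs": documents, "chunk_size": chunk_size}

# Agent 2's code: still calls the old one-argument signature.
try:
    build_index(["doc_a", "doc_b"])
except TypeError as exc:
    # The patches merge cleanly (they touch different files),
    # but the combined test suite fails at runtime.
    print(f"TypeError: {exc}")
```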

Next steps

CLI reference

Explore all CLI commands and options

Configuration

Configure backends, agents, and models

Dataset

Understand the benchmark dataset structure

Evaluation

Deep dive into evaluation metrics