Before starting, ensure you’ve completed the installation and have:
  • An execution backend configured (Modal, GCP, or Docker)
  • Redis running (for cooperative mode)
  • LLM API keys in your .env file
  • Dataset downloaded to dataset/

Your first experiment

Let’s run a cooperative experiment with two agents working on a task from the llama_index repository.
1. Run cooperative agents

Execute a task with two agents, each implementing one feature:
cooperbench run -n my-first-experiment -r llama_index_task -m gpt-4o
This command:
  • Creates an experiment named my-first-experiment
  • Filters tasks to the llama_index_task repository
  • Uses GPT-4o as the LLM model
  • Runs in cooperative mode (default: two agents with messaging)
  • Uses Modal backend (default)
  • Enables automatic evaluation after completion
What happens under the hood:
  1. CooperBench loads tasks from dataset/llama_index_task/
  2. For each task, it spawns two sandboxed environments
  3. Each agent receives one feature description and can:
    • Read/write code files
    • Run tests
    • Send messages to the other agent via Redis
  4. Agents work concurrently until completion or timeout
  5. Generated patches are saved to logs/my-first-experiment/
  6. Evaluation runs automatically (unless --no-auto-eval is specified)
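The messaging in step 3 can be sketched with an in-memory queue standing in for Redis. The class and method names here are illustrative, not CooperBench's internal API:

```python
import queue

# Illustrative stand-in for the per-task Redis channel. In CooperBench the
# transport is Redis; a queue.Queue keeps this sketch self-contained.
class MessageChannel:
    def __init__(self):
        self._inboxes = {"agent1": queue.Queue(), "agent2": queue.Queue()}

    def send(self, sender: str, recipient: str, content: str) -> None:
        self._inboxes[recipient].put({"from": sender, "content": content})

    def receive(self, agent: str):
        # Non-blocking: agents poll between coding steps.
        try:
            return self._inboxes[agent].get_nowait()
        except queue.Empty:
            return None

channel = MessageChannel()
channel.send("agent1", "agent2", "I'm adding a new method to the BaseIndex class")
msg = channel.receive("agent2")
print(msg["from"], "->", msg["content"])
```

Each agent polls its inbox between steps, so communication never blocks code edits or test runs.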
2. Monitor progress

Watch the output as agents work:
[CooperBench] Starting experiment: my-first-experiment
[CooperBench] Setting: coop (2 agents per task)
[CooperBench] Model: gpt-4o
[CooperBench] Backend: modal
[CooperBench] Found 12 tasks in llama_index_task
[CooperBench] Running with concurrency=30

[Task 1/12] llama_index_task/task8394 features_1_2
  Agent 1: Implementing feature 1...
  Agent 2: Implementing feature 2...
  Agent 1 → Agent 2: "I'm adding a new method to the BaseIndex class"
  Agent 2 → Agent 1: "Acknowledged, I'll avoid modifying that class"
[Task 1/12] ✓ Complete in 3m 42s

...
3. View results

After completion, check the results:
# Results are saved in logs/
ls logs/my-first-experiment/
Output structure:
logs/my-first-experiment/
  llama_index_task/
    task8394/
      features_1_2/
        agent1/
          trajectory.json     # Full agent interaction log
          patch.diff          # Generated code changes
        agent2/
          trajectory.json
          patch.diff
        eval.json             # Test results and metrics
    task8395/
      ...
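Because every feature-pair directory contains an eval.json, results are easy to gather programmatically. A minimal sketch (the helper name is ours, and the demo builds a throwaway tree mirroring the layout above):

```python
import json
import tempfile
from pathlib import Path

def collect_evals(run_dir: Path) -> dict[str, dict]:
    """Map each task/feature-pair directory to its parsed eval.json."""
    return {
        str(p.parent.relative_to(run_dir)): json.loads(p.read_text())
        for p in sorted(run_dir.rglob("eval.json"))
    }

# Demo on a temporary tree shaped like logs/my-first-experiment/.
root = Path(tempfile.mkdtemp()) / "my-first-experiment"
pair = root / "llama_index_task" / "task8394" / "features_1_2"
pair.mkdir(parents=True)
(pair / "eval.json").write_text(json.dumps({"overall_success": True}))

results = collect_evals(root)
print(results)
```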

Run a solo experiment

Compare cooperative performance against a single agent handling both features:
cooperbench run -n my-solo-experiment -r llama_index_task -m gpt-4o --setting solo
Solo experiments provide a baseline for measuring the coordination deficit. Running both settings on the same tasks shows how much performance drops when the work is split between two coordinating agents.
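Given the per-task overall_success flags from both runs, the coordination deficit is simply the gap in pass rates. This framing is a convention for this comparison, not an official CooperBench metric:

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks whose overall_success flag is true."""
    return sum(results) / len(results)

# overall_success per task, same task order in both runs (toy data)
solo = [True, True, False, True]   # 75% pass rate
coop = [True, False, False, True]  # 50% pass rate

deficit = pass_rate(solo) - pass_rate(coop)
print(f"coordination deficit: {deficit:.0%}")  # coordination deficit: 25%
```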

Evaluate results

If you disabled auto-evaluation or want to re-evaluate:
cooperbench eval -n my-first-experiment
Evaluation tests each generated patch against the golden test suite:
{
  "agent1": {
    "tests_passed": 12,
    "tests_failed": 1,
    "success": false,
    "error": "AssertionError in test_feature1_edge_case"
  },
  "agent2": {
    "tests_passed": 8,
    "tests_failed": 0,
    "success": true,
    "error": null
  },
  "merged": {
    "tests_passed": 18,
    "tests_failed": 3,
    "success": false,
    "merge_conflict": true,
    "error": "Merge conflict in src/index.py"
  },
  "overall_success": false
}
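The overall verdict can be reproduced from the per-agent and merged fields. A sketch of the rule, assuming the eval.json schema shown above:

```python
eval_result = {
    "agent1": {"success": False},
    "agent2": {"success": True},
    "merged": {"success": False, "merge_conflict": True},
}

# Overall success requires every agent's patch to pass its own tests
# AND the merged patch to pass the combined suite.
agents = [v for k, v in eval_result.items() if k.startswith("agent")]
overall = all(a["success"] for a in agents) and eval_result["merged"]["success"]
print(overall)  # False: agent1 failed and the merge conflicted
```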

Example output structure

Here’s what a complete experiment looks like:
logs/my-first-experiment/
  llama_index_task/
    task8394/
      features_1_2/          # Feature pair 1 and 2
        agent1/
          trajectory.json    # Agent's full interaction history
          patch.diff         # Generated changes
        agent2/
          trajectory.json
          patch.diff
        eval.json            # Test results
      features_1_3/          # Feature pair 1 and 3
        ...
    task8395/
      ...
  dspy_task/               # Another repository
    task1234/
      ...
The trajectory.json file contains the complete interaction history:
{
  "agent_id": "agent1",
  "feature_id": 1,
  "model": "gpt-4o",
  "start_time": "2026-03-04T10:30:00Z",
  "end_time": "2026-03-04T10:33:42Z",
  "steps": [
    {
      "action": "read_file",
      "args": {"path": "src/index.py"},
      "observation": "...",
      "timestamp": "2026-03-04T10:30:05Z"
    },
    {
      "action": "send_message",
      "args": {"content": "I'm adding a new method..."},
      "observation": "Message sent",
      "timestamp": "2026-03-04T10:31:12Z"
    },
    ...
  ],
  "token_usage": {
    "prompt_tokens": 45230,
    "completion_tokens": 3821,
    "total_tokens": 49051
  }
}
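A trajectory like this can be summarized in a few lines, e.g. wall-clock duration, action counts, and token spend (field names as in the example above; the summary code is ours):

```python
from collections import Counter
from datetime import datetime

trajectory = {
    "agent_id": "agent1",
    "start_time": "2026-03-04T10:30:00Z",
    "end_time": "2026-03-04T10:33:42Z",
    "steps": [
        {"action": "read_file"},
        {"action": "send_message"},
        {"action": "read_file"},
    ],
    "token_usage": {"total_tokens": 49051},
}

fmt = "%Y-%m-%dT%H:%M:%SZ"
duration = datetime.strptime(trajectory["end_time"], fmt) - datetime.strptime(
    trajectory["start_time"], fmt
)
actions = Counter(step["action"] for step in trajectory["steps"])
print(duration, dict(actions), trajectory["token_usage"]["total_tokens"])
# 0:03:42 {'read_file': 2, 'send_message': 1} 49051
```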

Advanced usage

Filter by specific task

Run a single task by ID:
cooperbench run -n test-task -r llama_index_task -t 8394 -m gpt-4o

Run specific feature pairs

Test a particular combination of features:
cooperbench run -n test-features -r dspy_task -f 1,2 -m claude-sonnet-4.5

Use different backends

cooperbench run -n exp-modal --backend modal -r llama_index_task -m gpt-4o

Enable git collaboration

Allow agents to push/pull/merge via shared Git remote:
cooperbench run -n git-collab -r llama_index_task -m gpt-4o --git
Git collaboration is experimental and may increase coordination complexity. Start with messaging-only mode (default) first.

Disable messaging

Run agents without inter-agent communication:
cooperbench run -n no-msg -r llama_index_task -m gpt-4o --no-messaging
This reveals how much communication helps reduce conflicts.

Use different models

CooperBench supports any model via LiteLLM:
cooperbench run -n test -r llama_index_task -m gpt-4o
cooperbench run -n test -r llama_index_task -m gpt-5

Run on dataset subsets

Use predefined task subsets for faster iteration:
# Run on the "lite" subset (smaller, faster)
cooperbench run --setting solo -s lite -m gpt-4o

# Experiment name is auto-generated: solo-msa-gpt-4o-lite
Subsets are defined in dataset/subsets/. Check available subsets:
ls dataset/subsets/

Control concurrency

Adjust parallel execution for your backend’s capacity:
# Higher concurrency for cloud backends
cooperbench run -n test -r llama_index_task -m gpt-4o -c 50

# Lower concurrency for local Docker
cooperbench run -n test --backend docker -m gpt-4o -c 5

Python API

Use CooperBench programmatically:
from cooperbench import run, evaluate

# Run cooperative agents
run(
    run_name="my-experiment",
    repo="llama_index_task",
    model_name="gpt-4o",
    setting="coop",
    redis_url="redis://localhost:6379",
    backend="modal",
)

# Run solo baseline
run(
    run_name="my-solo-baseline",
    repo="llama_index_task",
    model_name="gpt-4o",
    setting="solo",
    backend="modal",
)

# Evaluate both
evaluate(run_name="my-experiment")
evaluate(run_name="my-solo-baseline")
The full run() signature:
def run(
    run_name: str,                    # Experiment name (required)
    subset: str | None = None,        # Dataset subset (e.g., "lite")
    repo: str | None = None,          # Filter by repository
    task_id: int | None = None,       # Filter by task ID
    features: list[int] | None = None,  # Specific feature pair
    model_name: str = "vertex_ai/gemini-3-flash-preview",
    agent: str = "mini_swe_agent",    # Agent framework
    concurrency: int = 30,            # Parallel tasks
    setting: str = "coop",            # "coop" or "solo"
    redis_url: str = "redis://localhost:6379",
    force: bool = False,              # Rerun existing results
    git_enabled: bool = False,        # Enable git collaboration
    messaging_enabled: bool = True,   # Enable messaging
    auto_eval: bool = True,           # Auto-evaluate after run
    eval_concurrency: int = 10,
    backend: str = "modal",           # "modal", "gcp", or "docker"
    agent_config: str | None = None,  # Path to agent config file
) -> None:
    ...

Understanding results

Success metrics

Each evaluation provides several success indicators:
  • Individual success: Did each agent’s patch pass its own tests?
  • Merge success: Did the patches merge without conflicts?
  • Overall success: Did the merged result pass all tests?
A task succeeds only if all agents’ patches pass their tests AND the merged result passes all combined tests.

Common failure modes

Agents fail to act on information their partner has shared. For example:
  • Agent 1 adds a new parameter to a function
  • Agent 2 calls that function without the new parameter
  • Tests fail despite no merge conflicts
Questions go unanswered, breaking decision loops:
  • Agent 1 asks: “Should I use sync or async?”
  • Agent 2 doesn’t respond or gives an unclear answer
  • Agent 1 proceeds with a guess that conflicts with Agent 2’s choice
Agents break promises or make unverifiable claims:
  • Agent 1 promises to “add type hints”
  • Agent 1’s patch has incomplete type annotations
  • Agent 2’s code assumes full type coverage and fails
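The first failure mode (an unshared interface change) is easy to reproduce in miniature. The function name and arguments here are invented for illustration:

```python
# Agent 1's patch: adds a required parameter to an existing function.
def build_index(documents, chunk_size):
    return {"docs": documents, "chunk_size": chunk_size}

# Agent 2's code: still calls the old one-argument signature.
try:
    build_index(["doc_a", "doc_b"])
except TypeError as exc:
    # The patches merge cleanly (they touch different files),
    # but the combined test suite fails at runtime.
    print(f"TypeError: {exc}")
```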

Next steps

CLI reference

Explore all CLI commands and options

Configuration

Configure backends, agents, and models

Dataset

Understand the benchmark dataset structure

Evaluation

Deep dive into evaluation metrics