Run benchmark experiments to evaluate AI agents on cooperative software engineering tasks.

Quick start

Run a simple experiment with default settings:
cooperbench run --setting solo -s lite
This will:
  • Run in solo mode (single agent per task)
  • Use the “lite” subset of tasks
  • Auto-generate an experiment name like solo-msa-gemini-3-flash-lite
  • Save results to logs/

Basic usage

1. Choose a setting

CooperBench supports two evaluation settings:
Cooperative (coop): Two agents collaborate on implementing two features
cooperbench run --setting coop -s lite
Solo: Single agent implements both features independently
cooperbench run --setting solo -s lite
2. Select tasks to run

Filter tasks using subsets, repositories, or task IDs.
Use a predefined task collection:
cooperbench run --setting solo -s lite
Or narrow to a single repository and task:
cooperbench run --setting solo -r llama_index_task -t 8394
Common subsets:
  • lite - Small subset for quick testing
  • dev - Development subset
  • Custom subsets in dataset/subsets/
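Subsets correspond to files in dataset/subsets/. As a hedged convenience sketch, the following Python lists the subset names available locally; it assumes the subset name is simply the file stem, so check the directory for the actual naming and file-format convention:

```python
from pathlib import Path

def list_subsets(subsets_dir="dataset/subsets"):
    """Return the names of subset files found under dataset/subsets/.

    Uses each file's stem as the subset name (an assumption; verify
    against the actual files in your checkout).
    """
    return sorted(p.stem for p in Path(subsets_dir).iterdir() if p.is_file())
```

The returned names are what you would pass to -s/--subset.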
3. Name your experiment

Provide a custom name or let CooperBench auto-generate one:
# Custom name
cooperbench run -n my-experiment --setting solo -s lite

# Auto-generated (recommended)
cooperbench run --setting solo -s lite
# → solo-msa-gemini-3-flash-lite
Auto-generated names include the setting, agent, model, and filters, making experiments easy to identify.
4. Run the experiment

Execute and monitor progress:
cooperbench run --setting coop -s lite
Output shows:
  • Task progress with status indicators
  • Cost tracking per task
  • Automatic evaluation results (if enabled)
  • Summary statistics

Command reference

Basic options

-n, --name
string
Experiment name. Auto-generated if not provided.
cooperbench run -n my-experiment --setting solo -s lite
--setting
enum
default:"coop"
Evaluation setting: coop (collaborative) or solo (independent)
cooperbench run --setting solo -s lite
-s, --subset
string
Use a predefined task subset from dataset/subsets/
cooperbench run --setting solo -s lite
-r, --repo
string
Filter by repository name
cooperbench run --setting solo -r llama_index_task
-t, --task
integer
Filter by specific task ID
cooperbench run --setting solo -r llama_index_task -t 8394
-f, --features
string
Specific feature pair to run (comma-separated)
cooperbench run --setting solo -r llama_index_task -t 8394 -f 1,2

Model and agent

-m, --model
string
default:"vertex_ai/gemini-3-flash-preview"
LLM model to use. Supports any LiteLLM-compatible model.
# OpenAI
cooperbench run --setting solo -s lite -m gpt-4o

# Anthropic
cooperbench run --setting solo -s lite -m claude-3-5-sonnet-20241022

# Google Vertex AI
cooperbench run --setting solo -s lite -m vertex_ai/gemini-3-flash-preview
-a, --agent
string
default:"mini_swe_agent"
Agent framework to use
cooperbench run --setting solo -s lite -a mini_swe_agent
See the custom agents guide to implement your own agent.
--agent-config
string
Path to agent-specific configuration file
cooperbench run --setting solo -s lite --agent-config config/custom.yaml

Concurrency

-c, --concurrency
integer
default:"30"
Number of tasks to run in parallel
# High parallelism for faster runs
cooperbench run --setting solo -s lite --concurrency 50

# Low parallelism to reduce costs
cooperbench run --setting solo -s lite --concurrency 5
Higher concurrency increases speed but also API costs and resource usage.

Collaboration features

--git
boolean
Enable git-based collaboration (agents can push/pull/merge)
cooperbench run --setting coop -s lite --git
Only available in cooperative mode. Requires git server setup.
--no-messaging
boolean
Disable inter-agent messaging
cooperbench run --setting coop -s lite --no-messaging
--redis
string
default:"redis://localhost:6379"
Redis URL for inter-agent communication
cooperbench run --setting coop -s lite --redis redis://myhost:6379

Backend selection

--backend
enum
default:"modal"
Execution backend: modal, docker, or gcp
# Modal (cloud, default)
cooperbench run --setting solo -s lite --backend modal

# Docker (local)
cooperbench run --setting solo -s lite --backend docker

# GCP (Google Cloud)
cooperbench run --setting solo -s lite --backend gcp
See the backends guide for details on each option.

Evaluation

--no-auto-eval
boolean
Disable automatic evaluation after task completion
cooperbench run --setting solo -s lite --no-auto-eval
By default, tasks are evaluated automatically as they complete.
--eval-concurrency
integer
default:"10"
Number of parallel evaluations for auto-eval
cooperbench run --setting solo -s lite --eval-concurrency 20

Other options

--force
boolean
Force rerun even if results already exist
cooperbench run --setting solo -s lite --force

Examples

Single task with detailed output

Run one task to see detailed agent output:
cooperbench run --setting solo -r llama_index_task -t 8394 -f 1,2
Output:
cooperbench solo-msa-gemini-3-flash-llama-index-8394 (solo)
task: llama_index_task/8394 features: [1, 2]
agent: mini_swe_agent
model: vertex_ai/gemini-3-flash-preview

┌───────┬──────────┬───────────┬────────┬────────┬───────┐
│ agent │ feature  │ status    │   cost │  steps │ lines │
├───────┼──────────┼───────────┼────────┼────────┼───────┤
│ solo  │ 1,2      │ Submitted │  $0.42 │     18 │    45 │
└───────┴──────────┴───────────┴────────┴────────┴───────┘

total: $0.42 time: 187s

Cooperative experiment

Run multiple tasks with two agents collaborating:
cooperbench run \
  --setting coop \
  -s lite \
  -m gpt-4o \
  --concurrency 10
Output shows progress:
cooperbench coop-msa-gpt-4o-lite (coop)
tasks: 25 concurrency: 10
agent: mini_swe_agent
model: gpt-4o
tools: messaging

✓ done llama_index_task/8394 [1,2]
  ✓ pass llama_index_task/8394 [1,2]
✓ done dspy_task/142 [1,2]
  ✓ pass dspy_task/142 [1,2]
...

runs:  25 completed
evals: 25 evaluated, 23 passed, 2 failed (92.0%)
cost:  $15.30
time:  8m 42s (agent: 6m 15s)

logs: logs/coop-msa-gpt-4o-lite/coop

Git collaboration

Enable git-based collaboration (available in cooperative mode only):
cooperbench run \
  --setting coop \
  -s lite \
  --git \
  --backend gcp

Specific model and high concurrency

Run with Claude and high parallelism:
cooperbench run \
  --setting coop \
  -s lite \
  -m claude-3-5-sonnet-20241022 \
  --concurrency 50 \
  --backend gcp

Filter by repository

Run all tasks from a single repository:
cooperbench run \
  --setting solo \
  -r llama_index_task \
  -m gpt-4o

Output structure

Results are saved to logs/{experiment-name}/:
logs/
└── solo-msa-gemini-3-flash-lite/
    ├── config.json              # Experiment configuration
    ├── summary.json             # Aggregate statistics
    └── solo/                    # Setting-specific results
        └── llama_index_task/
            └── 8394/
                └── f1_f2/       # Feature pair results
                    ├── solo.patch        # Generated code changes
                    ├── result.json       # Task execution details
                    ├── eval.json         # Evaluation results
                    └── trajectory.json   # Agent conversation history
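The per-task files above can be aggregated with a few lines of Python. A minimal sketch, assuming each eval.json contains a boolean `passed` field — verify this against the eval.json files your own runs produce:

```python
import json
from pathlib import Path

def tally_evals(experiment_dir):
    """Walk an experiment directory and count evaluation outcomes.

    Assumes each eval.json has a boolean "passed" field; adjust to the
    actual schema of your eval.json files.
    """
    passed = failed = 0
    for eval_file in Path(experiment_dir).rglob("eval.json"):
        if json.loads(eval_file.read_text()).get("passed"):
            passed += 1
        else:
            failed += 1
    return passed, failed
```

For example, `tally_evals("logs/solo-msa-gemini-3-flash-lite")` would return a (passed, failed) pair for that run.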

Result files

config.json captures the run configuration:
{
  "run_name": "solo-msa-gemini-3-flash-lite",
  "agent_framework": "mini_swe_agent",
  "model": "vertex_ai/gemini-3-flash-preview",
  "setting": "solo",
  "concurrency": 30,
  "total_tasks": 25,
  "started_at": "2024-03-15T10:30:00"
}
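For quick bookkeeping across runs, the config.json fields shown above can be read back programmatically. A short Python sketch (field names are taken from the example; other fields may exist):

```python
import json
from pathlib import Path

def list_experiments(logs_dir="logs"):
    """Yield (run_name, setting, model, total_tasks) for each experiment.

    Reads config.json in each immediate subdirectory of logs/; field
    names follow the example config.json above.
    """
    for config_path in sorted(Path(logs_dir).glob("*/config.json")):
        cfg = json.loads(config_path.read_text())
        yield cfg["run_name"], cfg["setting"], cfg["model"], cfg["total_tasks"]
```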

Next steps

Evaluation

Learn how to evaluate your experiment results

Backends

Choose the right execution backend for your needs

Custom agents

Implement your own agent framework

GCP setup

Set up Google Cloud Platform backend