Before starting, ensure you’ve completed the installation and have:
- An execution backend configured (Modal, GCP, or Docker)
- Redis running (for cooperative mode)
- LLM API keys in your `.env` file
- Dataset downloaded to `dataset/`
Your first experiment
Let’s run a cooperative experiment with two agents working on a task from the llama_index repository.
Run cooperative agents
Execute a task with two agents, each implementing one feature:
This command:
- Creates an experiment named `my-first-experiment`
- Filters tasks to the `llama_index_task` repository
- Uses GPT-4o as the LLM model
- Runs in cooperative mode (default: two agents with messaging)
- Uses Modal backend (default)
- Enables automatic evaluation after completion
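Assembled from the options above, the invocation might look like the following sketch. The `cooperbench` entry point and the flag spellings are assumptions, not verified CLI syntax; check the CLI reference for the exact names.

```bash
# Hypothetical sketch; entry point and flag names are illustrative assumptions
cooperbench run \
  --experiment-name my-first-experiment \
  --repo llama_index_task \
  --model gpt-4o \
  --mode coop \
  --backend modal
```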
What happens during execution
- CooperBench loads tasks from `dataset/llama_index_task/`
- For each task, it spawns two sandboxed environments
- Each agent receives one feature description and can:
- Read/write code files
- Run tests
- Send messages to the other agent via Redis
- Agents work concurrently until completion or timeout
- Generated patches are saved to `logs/my-first-experiment/`
- Evaluation runs automatically (unless `--no-auto-eval` is specified)
Run a solo experiment
Compare cooperative performance against a single agent handling both features:
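A solo run could reuse the same options with a different mode; as before, the command shape is an illustrative assumption rather than verified syntax:

```bash
# Hypothetical sketch: a single agent implements both features
cooperbench run \
  --experiment-name my-solo-experiment \
  --repo llama_index_task \
  --model gpt-4o \
  --mode solo
```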
Evaluate results
If you disabled auto-evaluation or want to re-evaluate:
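A standalone evaluation might be invoked like this sketch (the subcommand name is an assumption):

```bash
# Hypothetical sketch: re-run evaluation on the saved patches
cooperbench evaluate --experiment-name my-first-experiment
```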
Example output structure
Here’s what a complete experiment looks like:
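As an illustration only (the exact layout is not specified here), the log directory could contain one folder per task, each holding the agents’ patches and the interaction history:

```text
logs/my-first-experiment/
├── task_001/
│   ├── agent_1.patch
│   ├── agent_2.patch
│   └── trajectory.json
└── task_002/
    └── ...
```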
Understanding trajectory.json
The `trajectory.json` file contains the complete interaction history:
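The exact schema is not reproduced here. As a sketch, assuming each step records an agent, an action type, and content, the history can be inspected with plain `json` parsing:

```python
import json

# Illustrative schema only; the real trajectory.json fields may differ.
raw = json.dumps([
    {"agent": "agent_1", "action": "message", "content": "Should I use sync or async?"},
    {"agent": "agent_2", "action": "message", "content": "Let's go async."},
    {"agent": "agent_1", "action": "edit", "content": "converted handlers to async"},
])

trajectory = json.loads(raw)

# Count how many steps each agent took
steps_per_agent = {}
for step in trajectory:
    steps_per_agent[step["agent"]] = steps_per_agent.get(step["agent"], 0) + 1

print(steps_per_agent)  # {'agent_1': 2, 'agent_2': 1}
```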
Advanced usage
Filter by specific task
Run a single task by ID:
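A task filter might be expressed like this sketch (the flag name is an assumption):

```bash
# Hypothetical sketch: restrict the run to one task ID
cooperbench run --experiment-name single-task-test --task-id <task_id>
```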
Run specific feature pairs
Test a particular combination of features:
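Pinning a feature pair might look like this sketch (the flag name is an assumption):

```bash
# Hypothetical sketch: fix the two features assigned to the agents
cooperbench run --task-id <task_id> --features <feature_a>,<feature_b>
```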
Use different backends
Select Modal, GCP, or Docker as the execution backend.
Enable git collaboration
Allow agents to push/pull/merge via a shared Git remote:
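A sketch of such a run (the flag name is an assumption):

```bash
# Hypothetical sketch: give agents a shared Git remote for push/pull/merge
cooperbench run --experiment-name git-test --git-collab
```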
Disable messaging
Run agents without inter-agent communication:
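A sketch of a messaging-free run (the flag name is an assumption):

```bash
# Hypothetical sketch: disable the Redis messaging channel between agents
cooperbench run --experiment-name silent-test --no-messaging
```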
Use different models
CooperBench supports any model via LiteLLM:
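LiteLLM identifies models with `provider/model` strings; the invocations below are sketches with assumed flag names:

```bash
# Hypothetical sketches using LiteLLM-style model identifiers
cooperbench run --model anthropic/claude-3-5-sonnet-20241022
cooperbench run --model gemini/gemini-1.5-pro
```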
Run on dataset subsets
Use predefined task subsets for faster iteration:
Subsets are defined in `dataset/subsets/`. Check available subsets:
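Listing subsets only relies on the directory named above; running one is sketched with an assumed flag:

```bash
# List available subsets (directory taken from the text above)
ls dataset/subsets/

# Hypothetical sketch: run against a named subset
cooperbench run --subset <subset_name>
```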
Control concurrency
Adjust parallel execution for your backend’s capacity:
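A concurrency cap might be set like this sketch (the flag name is an assumption):

```bash
# Hypothetical sketch: limit the number of tasks running in parallel
cooperbench run --max-workers 4
```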
Python API
Use CooperBench programmatically:
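A programmatic entry point might look like the following sketch; the module path, function name, and parameters are assumptions that mirror the CLI options above, not a verified interface:

```python
# Hypothetical API sketch; names mirror the CLI options described above
from cooperbench import run_experiment

run_experiment(
    experiment_name="my-first-experiment",
    repo="llama_index_task",
    model="gpt-4o",
    mode="coop",        # or "solo"
    backend="modal",    # or "gcp", "docker"
    auto_eval=True,
)
```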
Available Python API parameters
Understanding results
Success metrics
Each evaluation provides several success indicators:
- Individual success: Did each agent’s patch pass its own tests?
- Merge success: Did the patches merge without conflicts?
- Overall success: Did the merged result pass all tests?
A task succeeds only if all agents’ patches pass their tests AND the merged result passes all combined tests.
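The rule above can be sketched directly; the argument names here are illustrative, not the benchmark’s actual result schema:

```python
# Sketch of the success rule: every agent's patch must pass its own tests,
# the patches must merge cleanly, and the merged result must pass all tests.
def task_success(individual_pass, merge_ok, merged_tests_pass):
    return all(individual_pass) and merge_ok and merged_tests_pass

print(task_success([True, True], True, True))   # True
print(task_success([True, False], True, True))  # False: one agent's tests failed
print(task_success([True, True], True, False))  # False: merged result failed
```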
Common failure modes
Expectation failures (42%)
Agents fail to integrate partner state information. For example:
- Agent 1 adds a new parameter to a function
- Agent 2 calls that function without the new parameter
- Tests fail despite no merge conflicts
Communication failures (26%)
Questions go unanswered, breaking decision loops:
- Agent 1 asks: “Should I use sync or async?”
- Agent 2 doesn’t respond or gives an unclear answer
- Agent 1 proceeds with a guess that conflicts with Agent 2’s choice
Commitment failures (32%)
Agents break promises or make unverifiable claims:
- Agent 1 promises to “add type hints”
- Agent 1’s patch has incomplete type annotations
- Agent 2’s code assumes full type coverage and fails
Next steps
CLI reference
Explore all CLI commands and options
Configuration
Configure backends, agents, and models
Dataset
Understand the benchmark dataset structure
Evaluation
Deep dive into evaluation metrics