CooperBench supports two evaluation settings that together measure the “coordination deficit”: the performance gap between agents working individually and working collaboratively.

Setting comparison

Cooperative setting

2 agents collaborate on separate features with communication

Solo setting

1 agent implements both features sequentially

Quick comparison

Aspect             | Cooperative         | Solo
-------------------|---------------------|----------------
Number of agents   | 2                   | 1
Features per agent | 1                   | 2
Total workload     | Same                | Same
Communication      | Redis messaging     | None
Git collaboration  | Optional            | N/A
Concurrency        | Parallel execution  | Sequential
Complexity         | High (coordination) | Low (isolation)

Cooperative setting

In cooperative mode, two agents work simultaneously on separate features, simulating a team development scenario.

Architecture

How it works

1. Feature assignment
   Each of the two features is assigned to a separate agent. Agents work in isolated sandboxes.

2. Parallel execution
   Both agents start simultaneously and work in parallel, implementing their assigned features.

3. Communication (optional)
   Agents can send messages to each other via Redis:

   send_message agent2 "I'm modifying src/cache.py for feature 1"

4. Git collaboration (optional)
   With the --git flag, agents can push, pull, and merge code:

   git push team agent1
   git fetch team
   git merge team/agent2

5. Patch generation
   Each agent produces a patch file with its changes.

6. Merge and evaluate
   The patches are merged and tested to verify that both features work correctly.
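The merge-and-evaluate step can be sketched in Python. This is an illustrative sketch, not the actual CooperBench implementation: `merge_patches` and `evaluate` are hypothetical helpers showing the pass criterion (a task counts as solved only when the tests for both features pass against the merged changes).

```python
from pathlib import Path

def merge_patches(patch_paths: list[Path]) -> str:
    """Concatenate per-agent unified-diff patches into one combined patch.
    (Hypothetical helper; the real harness may resolve conflicts differently.)"""
    return "\n".join(p.read_text().rstrip("\n") for p in patch_paths) + "\n"

def evaluate(feature_results: dict[str, bool]) -> bool:
    """A task is solved only if every feature's tests pass against the merge."""
    return all(feature_results.values())
```

Under this criterion, one agent's feature breaking the other's tests fails the whole task, which is exactly the failure mode the cooperative setting is designed to surface.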

Running cooperative mode

cooperbench run \
  -n my-experiment \
  -r llama_index_task \
  -m gpt-4o \
  --setting coop

Communication mechanisms

Agents in cooperative mode have two ways to collaborate: Redis messaging and, optionally, git.

Redis messaging is the default communication channel for inter-agent messages. Features:
  • Async message passing
  • Namespaced by run ID
  • Messages appear in agent context
  • Tracked in conversation logs

Example flow:

# Agent 1 sends
send_message agent2 "Working on authentication in auth.py"

# Agent 2 receives (appears in context)
[Message from agent1]: Working on authentication in auth.py

# Agent 2 responds
send_message agent1 "Got it, I'll handle validation in validators.py"

Research shows agents spend up to 20% of their budget on messaging, which reduces conflicts but does not significantly improve success rates.
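The messaging semantics described above (async passing, inboxes namespaced by run ID, messages surfacing in the recipient's context) can be sketched with an in-memory queue. The real channel is Redis; the `MessageBus` class and its method names here are illustrative, not CooperBench's actual API.

```python
from collections import defaultdict, deque

class MessageBus:
    """In-memory stand-in for the Redis channel (sketch only)."""

    def __init__(self, run_id: str):
        self.run_id = run_id              # namespace: one bus per run
        self.queues = defaultdict(deque)  # one inbox per agent

    def send(self, sender: str, recipient: str, text: str) -> None:
        key = f"{self.run_id}:{recipient}"  # inbox key namespaced by run ID
        self.queues[key].append(f"[Message from {sender}]: {text}")

    def receive(self, agent: str) -> list[str]:
        """Drain the agent's inbox; returned messages enter its context."""
        key = f"{self.run_id}:{agent}"
        msgs = list(self.queues[key])
        self.queues[key].clear()
        return msgs

bus = MessageBus("my-experiment")
bus.send("agent1", "agent2", "Working on authentication in auth.py")
print(bus.receive("agent2"))
# ['[Message from agent1]: Working on authentication in auth.py']
```

Namespacing inbox keys by run ID is what keeps concurrent experiments from leaking messages into each other.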

Configuration options

# Enable messaging (default)
cooperbench run -n exp --setting coop

# Disable messaging
cooperbench run -n exp --setting coop --no-messaging

# Enable git collaboration
cooperbench run -n exp --setting coop --git

# Custom Redis URL
cooperbench run -n exp --setting coop --redis redis://custom:6379

Solo setting

In solo mode, a single agent implements both features sequentially, providing a baseline without coordination overhead.

Architecture

How it works

1. Combined task
   Both feature descriptions are combined into a single prompt:

   ## Feature 1
   Add caching support...

   ---

   ## Feature 2
   Add logging functionality...

2. Sequential implementation
   The agent implements both features in a single session, with full context of both requirements.

3. Unified patch
   The agent produces a single patch containing all changes for both features.

4. Evaluation
   Tests for both features are run against the combined patch.
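The combined-task step can be sketched as a small prompt builder following the layout shown above. The function name is hypothetical; this is not the exact CooperBench implementation.

```python
def combine_features(feature1: str, feature2: str) -> str:
    """Join two feature descriptions into one solo prompt,
    using the '## Feature N' / '---' layout shown above (sketch)."""
    return f"## Feature 1\n{feature1}\n\n---\n\n## Feature 2\n{feature2}\n"

prompt = combine_features("Add caching support...", "Add logging functionality...")
print(prompt)
```

Because the solo agent receives this single prompt up front, it can plan around both requirements at once instead of discovering the other feature's changes mid-run.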

Running solo mode

cooperbench run \
  -n my-experiment \
  -r llama_index_task \
  -m gpt-4o \
  --setting solo

Advantages

  • No coordination overhead: the agent doesn't need to communicate or merge with others
  • Full context: the agent sees both feature requirements upfront
  • Simpler execution: no messaging, git servers, or merge conflicts
  • Baseline performance: shows the maximum achievable without coordination

When to use each setting

Use the solo setting to establish a baseline, then run the cooperative setting on the same tasks to measure the coordination deficit against it.

Understanding the coordination deficit

The performance gap between settings reveals coordination challenges:
Coordination deficit formula:
Deficit = (Solo Success Rate - Coop Success Rate) / Solo Success Rate
Example: If solo achieves 50% and coop achieves 25%:
Deficit = (0.50 - 0.25) / 0.50 = 0.50 (50% deficit)
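The formula above is straightforward to encode; a small helper makes the guard against a zero solo rate explicit:

```python
def coordination_deficit(solo_rate: float, coop_rate: float) -> float:
    """Fraction of solo performance lost when agents must coordinate:
    (solo - coop) / solo."""
    if solo_rate <= 0:
        raise ValueError("solo success rate must be positive")
    return (solo_rate - coop_rate) / solo_rate

print(coordination_deficit(0.50, 0.25))  # 0.5
```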

Research findings

  • Solo: ~50% success rate
  • Cooperative: ~25% success rate
  • Deficit: 50% performance loss due to coordination
  • Solo: ~45% success rate
  • Cooperative: ~22% success rate
  • Deficit: 51% performance loss due to coordination
  • Agents use 10-20% of budget on messaging
  • Reduces merge conflicts by ~15%
  • Does not improve overall success rates
  • Indicates communication quality issues

Output structure

Results are organized differently per setting:
logs/my-experiment/coop/llama_index_task/task123/f1_f2/
├── result.json          # Overall task result
├── conversation.json    # Inter-agent messages
├── agent1.patch         # Agent 1's changes
├── agent2.patch         # Agent 2's changes  
├── agent1_traj.json     # Agent 1's trajectory
├── agent2_traj.json     # Agent 2's trajectory
└── eval.json            # Test results
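Given this layout, per-setting success rates can be aggregated by globbing for eval.json files. This assumes the directory structure shown above and a boolean "both_passed" field in each eval.json (the field used in the comparison example below); it is a sketch, not an official CooperBench API.

```python
import json
from pathlib import Path

def success_rate(logs_root: Path, setting: str) -> float:
    """Fraction of tasks under logs/<experiment>/<setting>/... whose
    merged patch passed both features' tests (sketch)."""
    evals = list(logs_root.glob(f"*/{setting}/**/eval.json"))
    if not evals:
        return 0.0
    passed = sum(json.loads(p.read_text())["both_passed"] for p in evals)
    return passed / len(evals)
```

Running this once with setting="solo" and once with "coop" yields the two rates needed for the coordination-deficit formula.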

Comparing results

After running both settings, compare results:
import json
from pathlib import Path

# Load results
coop_result = json.loads(Path("logs/exp/coop/.../result.json").read_text())
solo_result = json.loads(Path("logs/exp/solo/.../result.json").read_text())

# Compare metrics
print(f"Cooperative cost: ${coop_result['total_cost']:.2f}")
print(f"Solo cost: ${solo_result['total_cost']:.2f}")

print(f"Cooperative steps: {coop_result['total_steps']}")
print(f"Solo steps: {solo_result['total_steps']}")

# Load evaluations
coop_eval = json.loads(Path("logs/exp/coop/.../eval.json").read_text())
solo_eval = json.loads(Path("logs/exp/solo/.../eval.json").read_text())

print(f"Cooperative passed: {coop_eval['both_passed']}")
print(f"Solo passed: {solo_eval['both_passed']}")

What’s next?

System architecture

Learn how settings are executed under the hood

Run experiments

Start running benchmarks with different settings

CLI reference

Complete command options for both settings

Dataset overview

Explore the benchmark task structure