CooperBench supports two evaluation settings that together measure the “coordination deficit”: the performance gap between agents working individually and working collaboratively.

Setting comparison

Cooperative setting

2 agents collaborate on separate features with communication

Solo setting

1 agent implements both features sequentially

Quick comparison

Aspect             | Cooperative         | Solo
-------------------|---------------------|----------------
Number of agents   | 2                   | 1
Features per agent | 1                   | 2
Total workload     | Same                | Same
Communication      | Redis messaging     | None
Git collaboration  | Optional            | N/A
Concurrency        | Parallel execution  | Sequential
Complexity         | High (coordination) | Low (isolation)

Cooperative setting

In cooperative mode, two agents work simultaneously on separate features, simulating a team development scenario.

Architecture

How it works

1. Feature assignment
   Each of the two features is assigned to a separate agent. Agents work in isolated sandboxes.

2. Parallel execution
   Both agents start simultaneously and work in parallel, implementing their assigned features.

3. Communication (optional)
   Agents can send messages to each other via Redis:

   send_message agent2 "I'm modifying src/cache.py for feature 1"

4. Git collaboration (optional)
   With the --git flag, agents can push, pull, and merge code:

   git push team agent1
   git fetch team
   git merge team/agent2

5. Patch generation
   Each agent produces a patch file with its changes.

6. Merge and evaluate
   The patches are merged and tested to verify that both features work correctly.
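The merge-and-evaluate step can be sketched in Python. This is an illustrative sketch, not the actual CooperBench implementation: `merge_patches` and `evaluate` are hypothetical helpers showing the pass criterion (a task counts as solved only when the tests for both features pass against the merged changes).

```python
from pathlib import Path

def merge_patches(patch_paths: list[Path]) -> str:
    """Concatenate per-agent unified-diff patches into one combined patch.
    (Hypothetical helper; the real harness may resolve conflicts differently.)"""
    return "\n".join(p.read_text().rstrip("\n") for p in patch_paths) + "\n"

def evaluate(feature_results: dict[str, bool]) -> bool:
    """A task is solved only if every feature's tests pass against the merge."""
    return all(feature_results.values())
```

Under this criterion, one agent's feature breaking the other's tests fails the whole task, which is exactly the failure mode the cooperative setting is designed to surface.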

Running cooperative mode

cooperbench run \
  -n my-experiment \
  -r llama_index_task \
  -m gpt-4o \
  --setting coop

Communication mechanisms

Agents in cooperative mode have two ways to collaborate: Redis messaging and, optionally, git.

Redis messaging is the default communication channel for inter-agent messages. Features:
  • Async message passing
  • Namespaced by run ID
  • Messages appear in agent context
  • Tracked in conversation logs

Example flow:

# Agent 1 sends
send_message agent2 "Working on authentication in auth.py"

# Agent 2 receives (appears in context)
[Message from agent1]: Working on authentication in auth.py

# Agent 2 responds
send_message agent1 "Got it, I'll handle validation in validators.py"

Research shows agents spend up to 20% of their budget on messaging, which reduces conflicts but does not significantly improve success rates.
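The messaging semantics described above (async passing, inboxes namespaced by run ID, messages surfacing in the recipient's context) can be sketched with an in-memory queue. The real channel is Redis; the `MessageBus` class and its method names here are illustrative, not CooperBench's actual API.

```python
from collections import defaultdict, deque

class MessageBus:
    """In-memory stand-in for the Redis channel (sketch only)."""

    def __init__(self, run_id: str):
        self.run_id = run_id              # namespace: one bus per run
        self.queues = defaultdict(deque)  # one inbox per agent

    def send(self, sender: str, recipient: str, text: str) -> None:
        key = f"{self.run_id}:{recipient}"  # inbox key namespaced by run ID
        self.queues[key].append(f"[Message from {sender}]: {text}")

    def receive(self, agent: str) -> list[str]:
        """Drain the agent's inbox; returned messages enter its context."""
        key = f"{self.run_id}:{agent}"
        msgs = list(self.queues[key])
        self.queues[key].clear()
        return msgs

bus = MessageBus("my-experiment")
bus.send("agent1", "agent2", "Working on authentication in auth.py")
print(bus.receive("agent2"))
# ['[Message from agent1]: Working on authentication in auth.py']
```

Namespacing inbox keys by run ID is what keeps concurrent experiments from leaking messages into each other.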

Configuration options

# Enable messaging (default)
cooperbench run -n exp --setting coop

# Disable messaging
cooperbench run -n exp --setting coop --no-messaging

# Enable git collaboration
cooperbench run -n exp --setting coop --git

# Custom Redis URL
cooperbench run -n exp --setting coop --redis redis://custom:6379

Solo setting

In solo mode, a single agent implements both features sequentially, providing a baseline without coordination overhead.

Architecture

How it works

1. Combined task
   Both feature descriptions are combined into a single prompt:

   ## Feature 1
   Add caching support...

   ---

   ## Feature 2
   Add logging functionality...

2. Sequential implementation
   The agent implements both features in a single session, with full context of both requirements.

3. Unified patch
   The agent produces a single patch containing all changes for both features.

4. Evaluation
   Tests for both features are run against the combined patch.
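The combined-task step can be sketched as a small prompt builder following the layout shown above. The function name is hypothetical; this is not the exact CooperBench implementation.

```python
def combine_features(feature1: str, feature2: str) -> str:
    """Join two feature descriptions into one solo prompt,
    using the '## Feature N' / '---' layout shown above (sketch)."""
    return f"## Feature 1\n{feature1}\n\n---\n\n## Feature 2\n{feature2}\n"

prompt = combine_features("Add caching support...", "Add logging functionality...")
print(prompt)
```

Because the solo agent receives this single prompt up front, it can plan around both requirements at once instead of discovering the other feature's changes mid-run.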

Running solo mode

cooperbench run \
  -n my-experiment \
  -r llama_index_task \
  -m gpt-4o \
  --setting solo

Advantages

  • No coordination overhead: the agent doesn't need to communicate or merge with others
  • Full context: the agent sees both feature requirements upfront
  • Simpler execution: no messaging, git servers, or merge conflicts
  • Baseline performance: shows the maximum achievable without coordination

When to use each setting

Use the solo setting to establish a baseline, then run the cooperative setting on the same tasks to measure the coordination deficit against it.

Understanding the coordination deficit

The performance gap between settings reveals coordination challenges:
Coordination deficit formula:
Deficit = (Solo Success Rate - Coop Success Rate) / Solo Success Rate
Example: If solo achieves 50% and coop achieves 25%:
Deficit = (0.50 - 0.25) / 0.50 = 0.50 (50% deficit)
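The formula above is straightforward to encode; a small helper makes the guard against a zero solo rate explicit:

```python
def coordination_deficit(solo_rate: float, coop_rate: float) -> float:
    """Fraction of solo performance lost when agents must coordinate:
    (solo - coop) / solo."""
    if solo_rate <= 0:
        raise ValueError("solo success rate must be positive")
    return (solo_rate - coop_rate) / solo_rate

print(coordination_deficit(0.50, 0.25))  # 0.5
```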

Research findings

  • Solo: ~50% success rate
  • Cooperative: ~25% success rate
  • Deficit: 50% performance loss due to coordination
  • Solo: ~45% success rate
  • Cooperative: ~22% success rate
  • Deficit: 51% performance loss due to coordination
  • Agents use 10-20% of budget on messaging
  • Reduces merge conflicts by ~15%
  • Does not improve overall success rates
  • Indicates communication quality issues

Output structure

Results are organized differently per setting:
logs/my-experiment/coop/llama_index_task/task123/f1_f2/
├── result.json          # Overall task result
├── conversation.json    # Inter-agent messages
├── agent1.patch         # Agent 1's changes
├── agent2.patch         # Agent 2's changes  
├── agent1_traj.json     # Agent 1's trajectory
├── agent2_traj.json     # Agent 2's trajectory
└── eval.json            # Test results
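Given this layout, per-setting success rates can be aggregated by globbing for eval.json files. This assumes the directory structure shown above and a boolean "both_passed" field in each eval.json (the field used in the comparison example below); it is a sketch, not an official CooperBench API.

```python
import json
from pathlib import Path

def success_rate(logs_root: Path, setting: str) -> float:
    """Fraction of tasks under logs/<experiment>/<setting>/... whose
    merged patch passed both features' tests (sketch)."""
    evals = list(logs_root.glob(f"*/{setting}/**/eval.json"))
    if not evals:
        return 0.0
    passed = sum(json.loads(p.read_text())["both_passed"] for p in evals)
    return passed / len(evals)
```

Running this once with setting="solo" and once with "coop" yields the two rates needed for the coordination-deficit formula.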

Comparing results

After running both settings, compare results:
import json
from pathlib import Path

# Load results
coop_result = json.loads(Path("logs/exp/coop/.../result.json").read_text())
solo_result = json.loads(Path("logs/exp/solo/.../result.json").read_text())

# Compare metrics
print(f"Cooperative cost: ${coop_result['total_cost']:.2f}")
print(f"Solo cost: ${solo_result['total_cost']:.2f}")

print(f"Cooperative steps: {coop_result['total_steps']}")
print(f"Solo steps: {solo_result['total_steps']}")

# Load evaluations
coop_eval = json.loads(Path("logs/exp/coop/.../eval.json").read_text())
solo_eval = json.loads(Path("logs/exp/solo/.../eval.json").read_text())

print(f"Cooperative passed: {coop_eval['both_passed']}")
print(f"Solo passed: {solo_eval['both_passed']}")

What’s next?

System architecture

Learn how settings are executed under the hood

Run experiments

Start running benchmarks with different settings

CLI reference

Complete command options for both settings

Dataset overview

Explore the benchmark task structure