The multi-agent coordination problem
As AI coding agents become more capable, the natural progression is to have multiple agents work together on complex software projects. However, coordination introduces unique challenges:

Shared codebase conflicts
Multiple agents modifying the same files simultaneously can create merge conflicts and introduce bugs
State synchronization
Agents must understand what their teammates are doing and integrate that information into their own work
Communication overhead
Agents need to decide when and how to communicate, balancing information sharing with task execution
Commitment reliability
Agents must make reliable promises about their work and follow through on commitments
Key findings
Research using CooperBench has revealed significant coordination deficits:

Coordination deficit
GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.
Expectation failures (42%)
Agents fail to properly integrate information about their partner’s state. They may:
- Ignore messages from teammates
- Make assumptions that conflict with communicated plans
- Fail to update their mental model based on partner actions
Communication failures (26%)
Questions go unanswered and information doesn’t flow properly. Issues include:
- Asymmetric communication (one agent sends, the other doesn’t respond)
- Unclear or ambiguous messages
- Missing critical information about implementation details
Commitment failures (32%)
Agents break promises or make unverifiable claims:
- Promising to implement features in a certain way, then doing something different
- Making commitments without following through
- Creating dependencies that don’t materialize
How CooperBench evaluates agents
CooperBench uses real-world pull requests from open-source repositories, split into independent features that agents must implement simultaneously.

Evaluation methodology
Each benchmark task follows this process:
- Task selection: Each task is derived from a real pull request that introduced multiple features
- Feature assignment: Features are assigned to different agents (cooperative) or one agent (solo)
- Implementation: Agents work in isolated sandboxes with optional communication channels
- Integration: Agent patches are merged together
- Testing: Original test suites verify correctness
- Scoring: Success requires both features to pass all tests without conflicts
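The integration, testing, and scoring steps above can be sketched as a single check: apply both agents' patches, then run the repository's original test suite. This is an illustrative sketch, not CooperBench's actual harness; the function name, arguments, and use of `git apply` are assumptions.

```python
import subprocess

def score_task(repo_dir: str, patch_a: str, patch_b: str, test_cmd: list) -> bool:
    """Illustrative scoring sketch: a task succeeds only if BOTH agents'
    patches apply cleanly (no conflict) and the original tests then pass."""
    for patch in (patch_a, patch_b):
        applied = subprocess.run(
            ["git", "-C", repo_dir, "apply", patch],
            capture_output=True,
        )
        if applied.returncode != 0:  # patch failed to apply or conflicted
            return False
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0  # all original tests must pass
```

Because success is the conjunction of every step, a single conflicting hunk or failing test marks the whole task as failed for both agents.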
Success criteria
A task is considered successful when both agents’ patches merge without conflict and the merged code passes all of the original tests for both features.

Cooperative vs solo settings
CooperBench supports two evaluation modes to measure the coordination deficit:

Cooperative setting
- 2 agents work simultaneously
- Each agent implements one feature
- Agents can communicate via Redis messaging
- Optional git collaboration for code sharing
- Measures real-world coordination challenges
Solo setting
- 1 agent implements both features sequentially
- No communication or coordination needed
- Provides baseline performance without coordination overhead
- Same total workload as cooperative setting
Why compare? The performance gap between the solo and cooperative settings quantifies the “coordination deficit”: how much capability is lost due to coordination challenges.
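Concretely, the deficit can be computed as the fraction of solo capability lost when the same workload is split across two agents. A minimal sketch, using the success rates reported above (~25% cooperative vs ~50% solo):

```python
def coordination_deficit(solo_rate: float, coop_rate: float) -> float:
    """Fraction of solo success rate lost when the same workload
    must be coordinated between two agents."""
    return (solo_rate - coop_rate) / solo_rate

# Reported figures: ~25% cooperative success vs ~50% solo.
print(coordination_deficit(0.50, 0.25))  # → 0.5, i.e. half the capability is lost
```

The same two numbers explain the “roughly 50% lower” framing: cooperation halves the solo success rate.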
Communication and collaboration
Agents in cooperative mode have access to multiple collaboration mechanisms:

Redis messaging
Agents can send structured messages to teammates.

Research shows agents spend up to 20% of their budget on communication, which reduces merge conflicts but doesn’t significantly improve overall success rates.
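As a sketch of what such a structured message might contain, the snippet below builds a JSON payload and shows how it could be pushed to a teammate’s queue with the redis-py client. The key name (`inbox:agent_2`) and the message schema are assumptions for illustration, not CooperBench’s actual wire format.

```python
import json
import time

def make_message(sender: str, recipient: str, body: str) -> str:
    # Illustrative schema: who sent it, who it is for, a free-text body,
    # and a timestamp so the receiver can order incoming messages.
    return json.dumps({
        "from": sender,
        "to": recipient,
        "body": body,
        "ts": time.time(),
    })

# Sending with redis-py would look like this (requires a running Redis server):
#   import redis
#   r = redis.Redis()
#   r.rpush("inbox:agent_2",
#           make_message("agent_1", "agent_2",
#                        "I'll take the parser; you own the CLI."))
```

A structured payload like this is what lets a receiving agent act on a teammate’s plan instead of ignoring it, which is exactly the failure mode behind the 42% expectation-failure rate.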
Git collaboration (optional)
When enabled with `--git`, agents can:
- Push code to a shared repository
- Fetch teammate branches
- Merge changes from other agents
- Resolve conflicts through git
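Assuming each agent works on its own branch against a shared remote, the push/fetch/merge cycle above can be sketched with a thin subprocess wrapper. The wrapper and the branch names are hypothetical, not part of CooperBench’s API.

```python
import subprocess

def git(repo: str, *args: str) -> subprocess.CompletedProcess:
    """Thin wrapper: run a git command inside the agent's sandbox checkout."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True)

# Hypothetical --git collaboration cycle (branch names are illustrative):
#   git(repo, "push", "origin", "agent-1/feature")         # share own work
#   git(repo, "fetch", "origin")                           # see teammate branches
#   result = git(repo, "merge", "origin/agent-2/feature")  # integrate changes
#   if result.returncode != 0:
#       pass  # conflict: resolve files, then `git add` and `git commit`
```

Checking `returncode` after the merge is what lets an agent detect and resolve conflicts inside the sandbox, rather than surfacing them at final patch integration.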
What’s next?
Now that you understand how CooperBench evaluates multi-agent coordination:

Dataset structure
Explore the 652 tasks across 12 repositories
Settings comparison
Deep dive into cooperative vs solo evaluation modes
System architecture
Learn how CooperBench executes and evaluates tasks
Quick start
Run your first benchmark evaluation