The multi-agent coordination problem
As AI coding agents become more capable, the natural progression is to have multiple agents work together on complex software projects. However, coordination introduces unique challenges:

Shared codebase conflicts
Multiple agents modifying the same files simultaneously can create merge conflicts and introduce bugs
State synchronization
Agents must understand what their teammates are doing and integrate that information into their own work
Communication overhead
Agents need to decide when and how to communicate, balancing information sharing with task execution
Commitment reliability
Agents must make reliable promises about their work and follow through on commitments
Key findings
Research using CooperBench has revealed significant coordination deficits:

Coordination deficit
GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.
Expectation failures (42%)
Agents fail to properly integrate information about their partner’s state. They may:
- Ignore messages from teammates
- Make assumptions that conflict with communicated plans
- Fail to update their mental model based on partner actions
Communication failures (26%)
Questions go unanswered and information doesn’t flow properly. Issues include:
- Asymmetric communication (one agent sends, the other doesn’t respond)
- Unclear or ambiguous messages
- Missing critical information about implementation details
Commitment failures (32%)
Agents break promises or make unverifiable claims:
- Promising to implement features in a certain way, then doing something different
- Making commitments without following through
- Creating dependencies that don’t materialize
How CooperBench evaluates agents
CooperBench uses real-world pull requests from open-source repositories, split into independent features that agents must implement simultaneously.

Evaluation methodology
Each benchmark task follows this process:
- Task selection: Each task is derived from a real pull request that introduced multiple features
- Feature assignment: Features are assigned to different agents (cooperative) or one agent (solo)
- Implementation: Agents work in isolated sandboxes with optional communication channels
- Integration: Agent patches are merged together
- Testing: Original test suites verify correctness
- Scoring: Success requires both features to pass all tests without conflicts
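The integration, testing, and scoring steps above can be sketched as a single check: apply both agents' patches, then run the repository's original test suite. This is an illustrative sketch, not CooperBench's actual harness; the function name, arguments, and use of `git apply` are assumptions.

```python
import subprocess

def score_task(repo_dir: str, patch_a: str, patch_b: str, test_cmd: list) -> bool:
    """Illustrative scoring sketch: a task succeeds only if BOTH agents'
    patches apply cleanly (no conflict) and the original tests then pass."""
    for patch in (patch_a, patch_b):
        applied = subprocess.run(
            ["git", "-C", repo_dir, "apply", patch],
            capture_output=True,
        )
        if applied.returncode != 0:  # patch failed to apply or conflicted
            return False
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0  # all original tests must pass
```

Because success is the conjunction of every step, a single conflicting hunk or failing test marks the whole task as failed for both agents.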
Success criteria
A task is considered successful when both agents’ patches merge without conflict and the merged code passes all of the original tests for both features.

Cooperative vs solo settings
CooperBench supports two evaluation modes to measure the coordination deficit:

Cooperative setting
- 2 agents work simultaneously
- Each agent implements one feature
- Agents can communicate via Redis messaging
- Optional git collaboration for code sharing
- Measures real-world coordination challenges
Solo setting
- 1 agent implements both features sequentially
- No communication or coordination needed
- Provides baseline performance without coordination overhead
- Same total workload as cooperative setting
Why compare? The performance gap between the solo and cooperative settings quantifies the “coordination deficit”: how much capability is lost due to coordination challenges.
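Concretely, the deficit can be computed as the fraction of solo capability lost when the same workload is split across two agents. A minimal sketch, using the success rates reported above (~25% cooperative vs ~50% solo):

```python
def coordination_deficit(solo_rate: float, coop_rate: float) -> float:
    """Fraction of solo success rate lost when the same workload
    must be coordinated between two agents."""
    return (solo_rate - coop_rate) / solo_rate

# Reported figures: ~25% cooperative success vs ~50% solo.
print(coordination_deficit(0.50, 0.25))  # → 0.5, i.e. half the capability is lost
```

The same two numbers explain the “roughly 50% lower” framing: cooperation halves the solo success rate.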
Communication and collaboration
Agents in cooperative mode have access to multiple collaboration mechanisms:

Redis messaging
Agents can send structured messages to teammates.

Research shows agents spend up to 20% of their budget on communication, which reduces merge conflicts but doesn’t significantly improve overall success rates.
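As a sketch of what such a structured message might contain, the snippet below builds a JSON payload and shows how it could be pushed to a teammate’s queue with the redis-py client. The key name (`inbox:agent_2`) and the message schema are assumptions for illustration, not CooperBench’s actual wire format.

```python
import json
import time

def make_message(sender: str, recipient: str, body: str) -> str:
    # Illustrative schema: who sent it, who it is for, a free-text body,
    # and a timestamp so the receiver can order incoming messages.
    return json.dumps({
        "from": sender,
        "to": recipient,
        "body": body,
        "ts": time.time(),
    })

# Sending with redis-py would look like this (requires a running Redis server):
#   import redis
#   r = redis.Redis()
#   r.rpush("inbox:agent_2",
#           make_message("agent_1", "agent_2",
#                        "I'll take the parser; you own the CLI."))
```

A structured payload like this is what lets a receiving agent act on a teammate’s plan instead of ignoring it, which is exactly the failure mode behind the 42% expectation-failure rate.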
Git collaboration (optional)
When enabled with `--git`, agents can:
- Push code to a shared repository
- Fetch teammate branches
- Merge changes from other agents
- Resolve conflicts through git
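Assuming each agent works on its own branch against a shared remote, the push/fetch/merge cycle above can be sketched with a thin subprocess wrapper. The wrapper and the branch names are hypothetical, not part of CooperBench’s API.

```python
import subprocess

def git(repo: str, *args: str) -> subprocess.CompletedProcess:
    """Thin wrapper: run a git command inside the agent's sandbox checkout."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True)

# Hypothetical --git collaboration cycle (branch names are illustrative):
#   git(repo, "push", "origin", "agent-1/feature")         # share own work
#   git(repo, "fetch", "origin")                           # see teammate branches
#   result = git(repo, "merge", "origin/agent-2/feature")  # integrate changes
#   if result.returncode != 0:
#       pass  # conflict: resolve files, then `git add` and `git commit`
```

Checking `returncode` after the merge is what lets an agent detect and resolve conflicts inside the sandbox, rather than surfacing them at final patch integration.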
What’s next?
Now that you understand how CooperBench evaluates multi-agent coordination:

Dataset structure
Explore the 652 tasks across 12 repositories
Settings comparison
Deep dive into cooperative vs solo evaluation modes
System architecture
Learn how CooperBench executes and evaluates tasks
Quick start
Run your first benchmark evaluation