Overview
Through extensive evaluation across 652 tasks spanning 12 real-world repositories, CooperBench identifies three major findings about AI agent coordination:
- Coordination deficit: Two-agent cooperation achieves only 25% success, roughly 50% lower than solo agents
- Communication overhead: Agents spend up to 20% of budget on messages that reduce conflicts but not failures
- Three capability gaps: Failures stem from expectation (42%), communication (26%), and commitment (32%) breakdowns
Finding 1: Coordination deficit
Agents perform worse together than alone. GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than when a single agent handles both tasks.

This finding holds across both frontier models tested, suggesting the coordination deficit is a fundamental limitation rather than a model-specific weakness.
Why this matters
The coordination deficit reveals a critical gap between agent capabilities and coordination requirements:
- Solo performance: Single agents can successfully implement features when given the full context
- Cooperative performance: The same models fail when required to coordinate with a partner agent
- Performance gap: Success rates drop by approximately 50% in cooperative settings
Implications
This finding suggests that:
- Scaling through parallelization has limits: simply adding more agents may decrease overall success
- Coordination is a distinct capability: it is not automatically gained from general coding ability
- Current approaches need rethinking: new mechanisms are needed to enable effective multi-agent collaboration
Finding 2: Communication reduces conflicts but not failures
Communication overhead is substantial but ineffective. Agents spend up to 20% of their token budget on inter-agent communication. While this reduces merge conflicts, it does not improve overall success rates.
Detailed breakdown of communication patterns
Analysis of agent conversations reveals:
- Volume: Agents exchange 10-30 messages per task on average
- Budget impact: Communication consumes 15-20% of total token usage
- Merge conflicts: Reduced by ~30% when messaging is enabled
- Success rate: No significant improvement in feature completion
Communication analysis
The communication overhead manifests in several ways.

Token budget allocation:
- Baseline (solo): 100% of budget spent on implementation
- Cooperative: 80% implementation, 20% communication
- Net effect: less budget available per agent for actual work

Outcomes with messaging enabled:
- Merge conflicts decrease
- Test pass rates remain unchanged
- Conclusion: agents coordinate on syntax but miss semantics
Key insight
Agents need better mechanisms for:
- Sharing semantic intent, not just file changes
- Validating mutual understanding
- Coordinating on cross-cutting concerns
Finding 3: Three capability gaps underlie coordination failures
Analysis of 200+ failed cooperative attempts reveals three distinct failure modes:
- Expectation failures (42%): Agents fail to integrate partner state information into their own work
- Communication failures (26%): Questions go unanswered, breaking decision loops
- Commitment failures (32%): Agents break promises or make unverifiable claims
Expectation failures (42%)
Agents fail to correctly model and integrate their partner’s state and changes.

Example scenario:
- Agent A implements a new function `process_data()`
- Agent A tells Agent B: “I added `process_data()` to utils.py”
- Agent B implements a caller but:
  - Assumes the wrong function signature
  - Doesn’t verify the actual implementation
  - Creates an incompatible interface

Missing capabilities:
- Query partner’s actual state
- Verify assumptions about partner’s changes
- Detect incompatibilities before committing
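The missing verification step can be sketched concretely. The snippet below is an illustrative assumption check, not a mechanism from CooperBench: before calling a partner-provided function, the caller confirms it exists and has the expected signature. All names (`verify_partner_signature`, the stand-in `utils` module, the `records`/`strict` parameters) are hypothetical.

```python
import inspect
import types

def verify_partner_signature(module, func_name, expected_params):
    """Check that a partner-announced function actually exists and
    matches the parameter names we plan to call it with."""
    func = getattr(module, func_name, None)
    if func is None:
        return False, f"{func_name} not found"
    actual = list(inspect.signature(func).parameters)
    if actual != expected_params:
        return False, f"expected {expected_params}, found {actual}"
    return True, "ok"

# Hypothetical stand-in for the partner's utils.py
utils = types.SimpleNamespace()
utils.process_data = lambda records, strict: records

# Agent B verifies its assumption instead of guessing the interface
ok, msg = verify_partner_signature(utils, "process_data", ["records", "strict"])
```

A failed check (e.g. expecting a single `records` parameter) would surface the incompatibility before Agent B commits a broken caller.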
Communication failures (26%)
Agents send questions that go unanswered, leading to incorrect assumptions. Example scenario:- Agent A: “Should I use the new or old API format?”
- Agent B: [continues working without responding]
- Agent A: [assumes old format, creates incompatibility]
- No enforcement of message responses
- Agents prioritize task completion over coordination
- Missing protocols for required vs. optional communication
This finding suggests that asynchronous communication alone is insufficient — agents need synchronous checkpoints or enforced response protocols for critical decisions.
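One possible shape of an enforced response protocol is sketched below, under assumptions of our own (the `MessageBus` class and its methods are hypothetical, not part of CooperBench): a question marked as required blocks until answered, and a timeout produces an explicit failure instead of a silent guess.

```python
import queue

class MessageBus:
    """Minimal synchronous-checkpoint sketch: a required question
    blocks until the partner answers, or fails loudly on timeout,
    so the asker can never silently fall back to an assumption."""

    def __init__(self):
        self.inbox = queue.Queue()

    def answer(self, text):
        """Partner posts a reply."""
        self.inbox.put(text)

    def ask_required(self, question, timeout=1.0):
        """Block until a reply arrives; raise instead of guessing."""
        try:
            return self.inbox.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(f"unanswered required question: {question}")

bus = MessageBus()
bus.answer("new API format")  # partner responds in time
reply = bus.ask_required("Should I use the new or old API format?")
```

The design choice is that a critical decision either gets a real answer or becomes a visible error, turning the silent-assumption failure mode into something detectable.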
Commitment failures (32%)
Agents break promises or make claims that can’t be verified. Example scenario:- Agent A: “I’ll add error handling to all edge cases”
- Agent B: [builds on this assumption]
- Agent A: [implements only partial error handling]
- Result: Agent B’s code fails on unhandled errors
- No verification of agent commitments
- Vague promises that can’t be tested
- Missing accountability mechanisms
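A testable commitment could look like the sketch below (all names are illustrative, and the `process` function is a hypothetical partner implementation): each promise is paired with an executable check, so “I’ll handle all edge cases” becomes verifiable rather than taken on faith.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Commitment:
    """A promise paired with an executable check."""
    promise: str
    check: Callable[[], bool]

def verify(commitments):
    """Return the promises whose checks fail."""
    return [c.promise for c in commitments if not c.check()]

# Hypothetical partner implementation that covers only one edge case
def process(value):
    if value is None:
        return 0
    return int(value)

def handles_empty_string():
    try:
        process("")
        return True
    except ValueError:
        return False

broken = verify([
    Commitment("handles None input", lambda: process(None) == 0),
    Commitment("handles empty string", handles_empty_string),
])
# broken now lists the unkept promises, instead of Agent B
# discovering them at runtime
```

Here the partial error handling surfaces as a failed check on the “handles empty string” commitment, giving Agent B a concrete signal before it builds on the assumption.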
Synthesis: What this means for AI collaboration
These three findings reveal fundamental gaps in current AI agents’ ability to collaborate:
Current state: Agents can code but can't coordinate
Modern AI agents demonstrate strong individual coding capabilities:
- Understanding requirements
- Writing correct implementations
- Running tests and debugging

Yet they lack the corresponding coordination capabilities:
- Modeling partner state
- Maintaining semantic alignment
- Verifying shared assumptions
The coordination deficit is not about conflicts
Initial hypotheses suggested coordination failures stemmed from file conflicts or merge issues. The data shows otherwise:
- Messaging reduces conflicts by 30%
- Success rates remain unchanged
- Failures occur even with clean merges
Communication alone is insufficient
Simply enabling agents to message each other doesn’t solve coordination:
- 20% of budget spent on communication
- High message volume (10-30 per task)
- But no improvement in success
Three distinct failure modes require different solutions
Each capability gap needs specific interventions.

For expectation failures:
- Shared state representation
- Assumption verification protocols
- Integration testing before commit

For communication failures:
- Required response mechanisms
- Synchronous checkpoints
- Priority-based message queues

For commitment failures:
- Testable commitment specifications
- Automated verification
- Accountability tracking
Future directions
These findings point to several research directions:
- Coordination-aware training: Models trained explicitly on multi-agent scenarios
- Formal protocols: Structured communication with enforced semantics
- Shared representations: Better state modeling and assumption tracking
- Verification mechanisms: Automated checks for commitments and compatibility
Read the full paper
For detailed analysis, evaluation methodology, and complete results, see our arXiv paper: “CooperBench: Why Coding Agents Cannot be Your Teammates Yet”