# What is CooperBench?
CooperBench is a comprehensive benchmark for evaluating multi-agent coordination in collaborative code generation. It measures how well AI agents work together as teammates when their individual tasks may conflict with one another. While single AI agents can solve increasingly complex programming tasks, coordinating multiple agents presents unique challenges that had not been systematically measured until now.
## The coordination deficit problem
Our research reveals a critical finding: coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit is a fundamental barrier to deploying AI systems that work alongside humans or other agents.

## Key findings
### Performance gap
GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly 50% lower than a single agent handling both tasks.
### Communication cost
Agents spend up to 20% of their budget on communication, which reduces merge conflicts but does not improve overall success.
### Capability gaps
Three types of failures underlie coordination problems: expectation (42%), commitment (32%), and communication (26%).
## Benchmark statistics
CooperBench provides a comprehensive dataset for evaluating agent coordination:

| Metric | Value |
|---|---|
| Tasks | 652 |
| Repositories | 12 |
| Languages | Python, TypeScript, Go, Rust |
| Feature pairs | Multiple per task |
| Settings | Cooperative and solo |
## How it works
### Task assignment
Each task contains two features that need to be implemented in the same codebase. In cooperative mode, two agents each handle one feature. In solo mode, a single agent handles both features sequentially.
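As a rough illustration of this split, a task's two features might be assigned like the sketch below. The class and field names here are illustrative assumptions, not the actual CooperBench API.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    # Hypothetical fields, for illustration only.
    feature_id: str
    description: str


@dataclass
class Task:
    task_id: str
    repo: str
    features: tuple  # exactly two features per task

    def assign(self, mode: str):
        """Return one workload (list of features) per agent for the given setting."""
        if mode == "coop":
            # Cooperative mode: two agents, one feature each.
            return [[self.features[0]], [self.features[1]]]
        if mode == "solo":
            # Solo mode: a single agent implements both features sequentially.
            return [[self.features[0], self.features[1]]]
        raise ValueError(f"unknown mode: {mode}")


task = Task(
    task_id="example-001",
    repo="example/repo",
    features=(
        Feature("feat-a", "Add a caching layer"),
        Feature("feat-b", "Add request logging"),
    ),
)
assert len(task.assign("coop")) == 2  # two agents, one feature each
assert len(task.assign("solo")) == 1  # one agent handles both features
```

Both settings cover the same total workload; only the division of labor differs.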
### Agent execution
Agents work in isolated sandboxes with access to the repository, tests, and feature descriptions. In cooperative mode, agents can communicate via Redis messaging and optionally collaborate via Git.
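To make the messaging channel concrete, here is a minimal sketch of what a coordination message could look like. The schema, field names, and channel name are assumptions for illustration, not the CooperBench wire format; the Redis call itself is shown only in a comment since it needs a running server.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentMessage:
    # Hypothetical fields -- not the actual CooperBench message schema.
    sender: str
    intent: str  # e.g. "claim_file", "done", "question"
    body: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "AgentMessage":
        return AgentMessage(**json.loads(raw))


msg = AgentMessage(sender="agent-1", intent="claim_file", body="I'll edit src/cache.py")

# With a Redis server available, a message like this could be published
# via redis-py, e.g.:
#   import redis
#   redis.Redis().publish("coop-channel", msg.to_json())

# The serialization round-trips losslessly:
assert AgentMessage.from_json(msg.to_json()) == msg
```

Announcing which files an agent intends to touch is one way such messages can reduce merge conflicts.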
## Dataset structure
Each task in CooperBench follows a consistent structure.
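The authoritative schema ships with the dataset; purely as an illustration, a task record could look something like the following. Every key name here is an assumption, not the published format.

```python
# Illustrative only: these keys are assumptions, not the published schema.
example_task = {
    "task_id": "example-repo-pair-01",
    "repository": "example/repo",
    "language": "Python",
    "features": [
        {"id": "feature_a", "description": "First feature to implement"},
        {"id": "feature_b", "description": "Second feature to implement"},
    ],
    "tests": [
        "tests/test_feature_a.py",
        "tests/test_feature_b.py",
    ],
}

# The invariant from the benchmark design: every task pairs
# exactly two features in the same codebase.
assert len(example_task["features"]) == 2
```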
## Experiment settings

CooperBench supports two experimental settings:

### Cooperative (coop)
Two agents work simultaneously, each implementing one feature. They can:
- Send messages via Redis for coordination
- Push/pull/merge changes via shared Git remote (optional)
- Access the same codebase in isolated sandboxes
### Solo
A single agent implements both features sequentially, providing a baseline for comparing against cooperative performance. Solo agents have the advantage of perfect information about both features but may struggle with context-length limitations.
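The headline comparison between the two settings can be expressed as a simple relative gap. Using the approximate figures reported above (about 25% success for two cooperating agents versus about 50% for a solo agent), a small sketch:

```python
def coordination_deficit(coop_rate: float, solo_rate: float) -> float:
    """Relative drop in success rate when two cooperating agents
    replace one agent doing both features."""
    return (solo_rate - coop_rate) / solo_rate


# Approximate figures from the findings above: ~25% coop vs ~50% solo.
deficit = coordination_deficit(0.25, 0.50)
assert abs(deficit - 0.5) < 1e-9  # roughly a 50% relative drop
```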
## Research paper
CooperBench is based on peer-reviewed research published in 2026:

**CooperBench: Why Coding Agents Cannot be Your Teammates Yet**
Read the full paper on arXiv to learn about our methodology, experiments, and detailed findings about agent coordination failures.
## Links and resources
- **Website**: Visit the official CooperBench website
- **Dataset**: Download the full dataset from HuggingFace
- **GitHub**: Explore the source code and contribute
- **PyPI**: Install via pip
## Next steps
- **Installation**: Set up CooperBench with your preferred backend
- **Quick start**: Run your first experiment in minutes