Benchmark AI Agent Cooperation
CooperBench is the first benchmark designed to measure how well AI coding agents cooperate when each works on its own task and their changes can conflict.
Tasks: 652
Repositories: 12
Languages: Python, TS, Go, Rust
Quick start
Get CooperBench running in minutes
Set up an execution backend

Choose an execution backend and configure it:

Option 1: Modal (default)
Option 2: GCP
Option 3: Docker (local). No additional setup required; just have Docker running.
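A minimal sketch of what each option typically involves. The Modal, gcloud, and Docker commands below are those tools' standard setup commands, not CooperBench-specific configuration; see the GCP Setup page for backend details, and treat the project ID as a placeholder.

```bash
# Option 1: Modal (default): install the client and authenticate.
pip install modal
modal setup                      # links the CLI to your Modal account

# Option 2: GCP: authenticate and select a project
# ("my-project" is a placeholder).
gcloud auth login
gcloud config set project my-project

# Option 3: Docker (local): nothing to install; just verify the daemon is running.
docker info
```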
Configure Redis and LLM API keys
Start Redis for inter-agent communication, then set your LLM API keys in a .env file, as sketched below:
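A minimal sketch of this step, assuming Redis runs via Docker on its default port and that keys live in a .env file at the repo root. The variable names below are the providers' conventional ones; confirm the exact names CooperBench expects in its docs.

```bash
# Start a Redis instance on the default port (6379) for inter-agent messaging.
docker run -d --name cooperbench-redis -p 6379:6379 redis

# Create a .env file with your LLM provider keys (placeholder values shown).
cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EOF
```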
Explore by topic
Learn about CooperBench’s core features and capabilities
Core Concepts
Understand how CooperBench evaluates multi-agent coordination
Dataset
Explore the 652 benchmark tasks across 12 repositories
Running Experiments
Learn how to run cooperative and solo experiments
Evaluation
Understand how patches are evaluated and scored
GCP Setup
Configure Google Cloud Platform as your execution backend
Custom Agents
Integrate your own agent implementations
Key findings
Research insights from evaluating state-of-the-art AI coding agents
25%
Coordination success rate
Two-agent cooperation succeeds only 25% of the time, roughly half the solo-agent success rate
20%
Communication overhead
Agents spend up to 20% of their budget on messaging, reducing conflicts but not improving success
3
Capability gaps
Failures fall into three groups: expectation failures (42%), communication failures (26%), and commitment failures (32%)
Resources
Connect with the community and dive deeper
Paper
Read the full research paper on arXiv
Dataset
Download the dataset from HuggingFace
GitHub
View source code and contribute
Ready to benchmark your agents?
Start evaluating AI agent coordination with CooperBench’s comprehensive benchmark suite
Get Started