Benchmark AI Agent Cooperation

CooperBench is the first benchmark designed to measure how well AI coding agents cooperate when working on separate tasks that may conflict.

Tasks: 652
Repositories: 12
Languages: Python, TypeScript, Go, Rust

Quick start

Get CooperBench running in minutes

1. Install CooperBench

Install the package via pip:
pip install cooperbench
For development, clone the repository:
git clone https://github.com/cooperbench/CooperBench.git
cd CooperBench
pip install -e ".[dev]"
2. Set up execution backend

Choose an execution backend and configure it.

Option 1: Modal (default)
modal setup
Option 2: GCP
pip install 'cooperbench[gcp]'
cooperbench config gcp
Option 3: Docker (local)

No additional setup required; just have Docker running.
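One way to decide between the backends locally is to check which CLIs are on your PATH. This is an illustrative sketch only, not part of cooperbench's own configuration logic:

```python
import shutil


def pick_backend() -> str:
    """Pick an execution backend by probing for available CLIs.

    Illustrative only: cooperbench's actual backend selection may differ.
    """
    if shutil.which("modal"):
        return "modal"   # Modal CLI is installed and on PATH
    if shutil.which("docker"):
        return "docker"  # fall back to a local Docker daemon
    return "none"        # no backend found; install one first


if __name__ == "__main__":
    print(f"suggested backend: {pick_backend()}")
```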
3. Configure Redis and LLM API keys

Start Redis for inter-agent communication:
docker run -p 6379:6379 redis:7
Set your LLM API keys in a .env file:
ANTHROPIC_API_KEY=your_key
OPENAI_API_KEY=your_key
GEMINI_API_KEY=your_key
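A `.env` file is just `KEY=value` lines. The sketch below shows the expected format by parsing such a file and exporting the keys; cooperbench presumably loads these itself (for example via a dotenv library), so this is for illustration only:

```python
import os


def load_env(path: str) -> dict[str, str]:
    """Parse KEY=value lines from a .env file and export them to the process."""
    env: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    os.environ.update(env)
    return env
```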
4. Run your first experiment

Run cooperative agents on a task:
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o
Evaluate the results:
cooperbench eval -n my-experiment
Results are saved to logs/my-experiment/:
logs/my-experiment/llama_index_task/task1234/features_1_2/
  agent1/
    trajectory.json
    patch.diff
  agent2/
    trajectory.json
    patch.diff
  eval.json
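Given the directory layout above, you can aggregate results across tasks by collecting every `eval.json`. A minimal sketch assuming only that layout; the fields inside each `eval.json` are not documented here, so the records are returned as-is:

```python
import json
from pathlib import Path


def collect_evals(experiment_dir: str) -> list[dict]:
    """Gather every eval.json under an experiment's log directory.

    Mirrors the layout logs/<experiment>/<repo>/<task>/<feature-pair>/eval.json.
    """
    results = []
    for eval_path in sorted(Path(experiment_dir).rglob("eval.json")):
        with open(eval_path) as f:
            record = json.load(f)
        record["_path"] = str(eval_path.parent)  # keep provenance for debugging
        results.append(record)
    return results
```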

Explore by topic

Learn about CooperBench’s core features and capabilities

Core Concepts

Understand how CooperBench evaluates multi-agent coordination

Dataset

Explore the 652 benchmark tasks across 12 repositories

Running Experiments

Learn how to run cooperative and solo experiments

Evaluation

Understand how patches are evaluated and scored

GCP Setup

Configure Google Cloud Platform as your execution backend

Custom Agents

Integrate your own agent implementations

Key findings

Research insights from evaluating state-of-the-art AI coding agents

25%
Coordination success rate
Two-agent cooperation succeeds only 25% of the time, roughly 50% lower than the solo-agent rate
20%
Communication overhead
Agents spend up to 20% of their budget on messaging, reducing conflicts but not improving success
3
Capability gaps
Expectation failures (42%), communication failures (26%), and commitment failures (32%)

Resources

Connect with the community and dive deeper

Paper

Read the full research paper on arXiv

Dataset

Download the dataset from HuggingFace

GitHub

View source code and contribute

Ready to benchmark your agents?

Start evaluating AI agent coordination with CooperBench’s comprehensive benchmark suite

Get Started