Benchmark AI Agent Cooperation

CooperBench is the first benchmark designed to measure how well AI coding agents cooperate when working on separate tasks that may conflict.

Tasks: 652
Repositories: 12
Languages: Python, TypeScript, Go, Rust

Quick start

Get CooperBench running in minutes

1. Install CooperBench

Install the package via pip:
pip install cooperbench
For development, clone the repository:
git clone https://github.com/cooperbench/CooperBench.git
cd CooperBench
pip install -e ".[dev]"
2. Set up execution backend

Choose an execution backend and configure it.

Option 1: Modal (default)
modal setup
Option 2: GCP
pip install 'cooperbench[gcp]'
cooperbench config gcp
Option 3: Docker (local)

No additional setup required; just have Docker running.
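One way to decide between the backends locally is to check which CLIs are on your PATH. This is an illustrative sketch only, not part of cooperbench's own configuration logic:

```python
import shutil


def pick_backend() -> str:
    """Pick an execution backend by probing for available CLIs.

    Illustrative only: cooperbench's actual backend selection may differ.
    """
    if shutil.which("modal"):
        return "modal"   # Modal CLI is installed and on PATH
    if shutil.which("docker"):
        return "docker"  # fall back to a local Docker daemon
    return "none"        # no backend found; install one first


if __name__ == "__main__":
    print(f"suggested backend: {pick_backend()}")
```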
3. Configure Redis and LLM API keys

Start Redis for inter-agent communication:
docker run -p 6379:6379 redis:7
Set your LLM API keys in a .env file:
ANTHROPIC_API_KEY=your_key
OPENAI_API_KEY=your_key
GEMINI_API_KEY=your_key
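A `.env` file is just `KEY=value` lines. The sketch below shows the expected format by parsing such a file and exporting the keys; cooperbench presumably loads these itself (for example via a dotenv library), so this is for illustration only:

```python
import os


def load_env(path: str) -> dict[str, str]:
    """Parse KEY=value lines from a .env file and export them to the process."""
    env: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    os.environ.update(env)
    return env
```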
4. Run your first experiment

Run cooperative agents on a task:
cooperbench run -n my-experiment -r llama_index_task -m gpt-4o
Evaluate the results:
cooperbench eval -n my-experiment
Results are saved to logs/my-experiment/:
logs/my-experiment/llama_index_task/task1234/features_1_2/
  agent1/
    trajectory.json
    patch.diff
  agent2/
    trajectory.json
    patch.diff
  eval.json
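Given the directory layout above, you can aggregate results across tasks by collecting every `eval.json`. A minimal sketch assuming only that layout; the fields inside each `eval.json` are not documented here, so the records are returned as-is:

```python
import json
from pathlib import Path


def collect_evals(experiment_dir: str) -> list[dict]:
    """Gather every eval.json under an experiment's log directory.

    Mirrors the layout logs/<experiment>/<repo>/<task>/<feature-pair>/eval.json.
    """
    results = []
    for eval_path in sorted(Path(experiment_dir).rglob("eval.json")):
        with open(eval_path) as f:
            record = json.load(f)
        record["_path"] = str(eval_path.parent)  # keep provenance for debugging
        results.append(record)
    return results
```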

Explore by topic

Learn about CooperBench’s core features and capabilities

Core Concepts

Understand how CooperBench evaluates multi-agent coordination

Dataset

Explore the 652 benchmark tasks across 12 repositories

Running Experiments

Learn how to run cooperative and solo experiments

Evaluation

Understand how patches are evaluated and scored

GCP Setup

Configure Google Cloud Platform as your execution backend

Custom Agents

Integrate your own agent implementations

Key findings

Research insights from evaluating state-of-the-art AI coding agents

25%
Coordination success rate
Two-agent cooperation succeeds only 25% of the time, roughly 50% lower than the solo-agent rate
20%
Communication overhead
Agents spend up to 20% of their budget on messaging, reducing conflicts but not improving success
3
Capability gaps
Expectation failures (42%), communication failures (26%), and commitment failures (32%)

Resources

Connect with the community and dive deeper

Paper

Read the full research paper on arXiv

Dataset

Download the dataset from HuggingFace

GitHub

View source code and contribute

Ready to benchmark your agents?

Start evaluating AI agent coordination with CooperBench’s comprehensive benchmark suite

Get Started