
What is CooperBench?

CooperBench is a comprehensive benchmark for evaluating multi-agent coordination in collaborative code generation. It measures how well AI agents can work together as teammates when handling individual tasks that may conflict with each other.
While single AI agents can solve increasingly complex programming tasks, coordinating multiple agents presents unique challenges that haven’t been systematically measured—until now.

The coordination deficit problem

Our research reveals a critical finding: coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents.

Key findings

Performance gap

GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation, roughly half the success rate of a single agent handling both tasks

Communication cost

Agents spend up to 20% of their budget on communication, reducing merge conflicts but not improving overall success

Capability gaps

Three types of failures underlie coordination problems: expectation (42%), communication (26%), and commitment (32%)

Benchmark statistics

CooperBench provides a comprehensive dataset for evaluating agent coordination:
| Metric        | Value                        |
| ------------- | ---------------------------- |
| Tasks         | 652                          |
| Repositories  | 12                           |
| Languages     | Python, TypeScript, Go, Rust |
| Feature pairs | Multiple per task            |
| Settings      | Cooperative and solo         |

How it works

1. Task assignment

Each task contains two features that need to be implemented in the same codebase. In cooperative mode, two agents each handle one feature. In solo mode, a single agent handles both features sequentially.
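The two-features-per-task split can be sketched as a small data structure. The class and method names below are hypothetical illustrations of the assignment logic, not CooperBench's actual API:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    """One of the two features bundled into a CooperBench task."""
    description: str  # contents of feature.md (the golden patch stays hidden)

@dataclass
class Task:
    """A CooperBench task: two features targeting the same codebase."""
    repo: str
    task_id: str
    features: tuple

    def assignments(self, mode: str):
        """Return the per-agent workload for a given setting."""
        if mode == "cooperative":
            # two agents, one feature each
            return [[self.features[0]], [self.features[1]]]
        if mode == "solo":
            # one agent handles both features sequentially
            return [[self.features[0], self.features[1]]]
        raise ValueError(f"unknown mode: {mode}")
```

In this sketch, cooperative mode yields two single-feature workloads while solo mode yields one two-feature workload, matching the description above.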
2. Agent execution

Agents work in isolated sandboxes with access to the repository, tests, and feature descriptions. In cooperative mode, agents can communicate via Redis messaging and optionally collaborate via Git.
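The Redis messaging channel can be approximated with per-agent in-memory inboxes. This is a stand-in to illustrate the coordination pattern, not the benchmark's actual messaging implementation:

```python
from collections import defaultdict, deque

class MessageBus:
    """In-memory stand-in for the Redis channel agents use to coordinate."""

    def __init__(self):
        self._inboxes = defaultdict(deque)  # one inbox per agent

    def send(self, sender: str, recipient: str, body: str) -> None:
        """Deliver a message to the recipient's inbox (analogue of a Redis push)."""
        self._inboxes[recipient].append({"from": sender, "body": body})

    def receive(self, agent: str):
        """Drain and return the agent's pending messages."""
        inbox = self._inboxes[agent]
        messages = list(inbox)
        inbox.clear()
        return messages
```

A real deployment would swap the dictionary for Redis lists or pub/sub so that agents in separate sandboxes can exchange the same messages.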
3. Evaluation

Generated patches are tested against golden test suites to measure correctness, conflicts, and integration quality.
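Evaluation can be thought of as a function over per-feature test results and merge status. The scoring rule below is a simplified sketch for illustration; the actual harness also measures integration quality (see the paper for the real metric):

```python
def score_pair(feature1_pass: bool, feature2_pass: bool, merge_conflict: bool) -> dict:
    """Score one task attempt: both features must pass their golden
    tests and the two patches must merge cleanly to count as a success."""
    success = feature1_pass and feature2_pass and not merge_conflict
    return {
        "feature1": feature1_pass,
        "feature2": feature2_pass,
        "conflict": merge_conflict,
        "success": success,
    }
```

Under this rule an attempt where both features pass individually but the patches conflict still fails, which is exactly the failure mode the cooperative setting is designed to surface.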

Dataset structure

Each task in CooperBench follows a consistent structure:
```
dataset/
  <repo_name>/
    task<id>/
      setup.sh          # Repository setup script
      run_tests.sh      # Test runner script
      feature1/
        feature.md      # Feature description
        feature.patch   # Golden implementation
        tests.patch     # Test cases
      feature2/
        feature.md
        feature.patch
        tests.patch
```
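Given that layout, enumerating tasks is a simple directory walk. This is a sketch assuming the tree shown above, not an official loader:

```python
from pathlib import Path

def list_tasks(dataset_root: str):
    """Yield (repo, task_id, feature_descriptions) for every task directory."""
    root = Path(dataset_root)
    for task_dir in sorted(root.glob("*/task*")):
        features = []
        for feat_dir in sorted(task_dir.glob("feature*")):
            md = feat_dir / "feature.md"
            if md.exists():
                features.append(md.read_text())
        yield task_dir.parent.name, task_dir.name, features
```

Note that only `feature.md` is read here: the golden `feature.patch` and `tests.patch` files are reserved for the evaluation harness.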
The golden patches and test cases are hidden from agents during execution to ensure fair evaluation.

Experiment settings

CooperBench supports two experimental settings:

Cooperative

Two agents work simultaneously, each implementing one feature. They can:
  • Send messages via Redis for coordination
  • Push/pull/merge changes via a shared Git remote (optional)
  • Access the same codebase in isolated sandboxes
This setting reveals coordination challenges such as merge conflicts, communication failures, and integration issues.

Solo

A single agent implements both features sequentially, providing a baseline for comparison against cooperative performance. Solo agents have the advantage of perfect information about both features but may run into context-length limitations.
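The gap between the two settings can be expressed as a relative coordination deficit. The function below is a simple illustration using the approximate success rates reported in the key findings:

```python
def coordination_deficit(solo_rate: float, coop_rate: float) -> float:
    """Relative drop in success when splitting the work across two agents."""
    if solo_rate <= 0:
        raise ValueError("solo success rate must be positive")
    return (solo_rate - coop_rate) / solo_rate

# with roughly 50% solo success and 25% cooperative success,
# cooperation loses about half the performance
print(coordination_deficit(0.50, 0.25))  # → 0.5
```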

Research paper

CooperBench is based on research published as an arXiv preprint in 2026:

CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Read the full paper on arXiv to learn about our methodology, experiments, and detailed findings about agent coordination failures.
Citation:
```bibtex
@article{cooperbench2026,
  title={CooperBench: Why Coding Agents Cannot be Your Teammates Yet},
  author={Khatua*, Arpandeep and Zhu*, Hao and Tran†, Peter and Prabhudesai†, Arya
          and Sadrieh†, Frederic and Lieberwirth†, Johann K. and Yu, Xinkai
          and Fu, Yicheng and Ryan, Michael J. and Pei, Jiaxin and Yang, Diyi},
  journal={arXiv preprint},
  year={2026},
  url={https://arxiv.org/abs/2601.13295},
  note={*Equal contribution (Stanford) · †Equal contribution (SAP Labs)}
}
```

Website

Visit the official CooperBench website

Dataset

Download the full dataset from HuggingFace

GitHub

Explore the source code and contribute

PyPI

Install via pip

Next steps

Installation

Set up CooperBench with your preferred backend

Quick start

Run your first experiment in minutes