CooperBench includes 652 tasks across 12 real-world repositories, spanning multiple programming languages and domains. The dataset is designed to evaluate agent coordination on realistic software engineering scenarios.

Overview

Total tasks

652 tasks. Derived from real GitHub pull requests.

Repositories

12 repositories. Popular open-source projects.

Languages

4 languages: Python, TypeScript, Go, and Rust.

Repository distribution

The benchmark spans 12 high-quality open-source repositories, selected for diversity in domain, size, and complexity.
| Repository | Tasks | Features | Language | Domain |
| --- | --- | --- | --- | --- |
| pallets/jinja | 3 | 30 | Python | Template engine |
| pallets/click | 3 | 27 | Python | CLI framework |
| stanfordnlp/dspy | 4 | 23 | Python | LLM framework |
| dottxt-ai/outlines | 3 | 22 | Python | Structured generation |
| run-llama/llama_index | 3 | 16 | Python | RAG framework |
| python-pillow/Pillow | 3 | 15 | Python | Image processing |
| huggingface/datasets | 3 | 13 | Python | Data loading |
| go-chi/chi | 3 | 13 | Go | HTTP router |
| react-hook-form/react-hook-form | 2 | 11 | TypeScript | Form library |
| openai/tiktoken | 1 | 10 | Python | Tokenizer |
| typst/typst | 1 | 10 | Rust | Typesetting system |
| samuelcolvin/dirty-equals | 1 | 9 | Python | Testing utilities |

Selection criteria

Repositories were chosen based on:
  • Activity: High-quality, actively maintained projects
  • Popularity: Wide user base and real-world usage
  • Diversity: Mix of domains (ML, web, CLI, image processing, etc.)
  • Test coverage: Strong test suites for accurate evaluation
  • Complexity: Range from 10K to 100K+ lines of code

Language breakdown

The dataset includes four programming languages, reflecting real-world development diversity:

Python

9 repositories, ~550 tasks. The dominant language, covering ML frameworks, CLI tools, web libraries, and utilities.

TypeScript

1 repository, ~50 tasks. React form library (react-hook-form).

Go

1 repository, ~40 tasks. HTTP routing library (chi).

Rust

1 repository, ~12 tasks. Typesetting system (typst).
Python dominates the dataset because many popular open-source projects (especially in ML/AI) use Python. This reflects real-world development patterns where multi-agent collaboration is most likely to be deployed.

Task complexity metrics

Each task in CooperBench represents a real pull request that was decomposed into 2+ independent features.

Features per task

  • Minimum: 2 features (required for cooperation)
  • Maximum: 10 features (complex PRs)
  • Average: ~5.1 features per task
  • Total feature pairs: 199 pairs in the full dataset
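As a sanity check on these numbers, a task split into n features admits n(n-1)/2 unordered feature pairs; note the benchmark's 199 pairs are curated from real PRs, so the total need not equal this combinatorial maximum. A minimal sketch (the 4-feature task below is illustrative, not taken from the dataset):

```python
from itertools import combinations

def pairs_in_task(n_features: int) -> int:
    """Unordered feature pairs available in a task with n features."""
    return n_features * (n_features - 1) // 2

# Enumerate the concrete pairs for one hypothetical 4-feature task.
features = ["feature1", "feature2", "feature3", "feature4"]
pairs = list(combinations(features, 2))
assert len(pairs) == pairs_in_task(len(features))  # 6 pairs
```

A 2-feature task (the minimum) yields exactly one pair, which is why every task contributes at least one evaluable cooperation scenario.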

Task characteristics

Lines changed

Range: 10 to 500+ lines per feature:
  • Small features: 10-50 lines (bug fixes, simple additions)
  • Medium features: 50-150 lines (new functions, refactoring)
  • Large features: 150+ lines (major components, API changes)
Most tasks involve 50-100 lines of changes across 2-5 files.
Files modified

Range: 1 to 15 files per feature:
  • Single-file features: Isolated changes to one module
  • Multi-file features: Cross-cutting changes requiring coordination
  • Test files: Every feature includes corresponding test changes
Average: 3-4 files per feature (including tests).
File overlap

Measured by the share of files that both features in a pair touch:
  • High overlap (>50%): 15% of pairs — requires careful coordination
  • Medium overlap (20-50%): 35% of pairs — some shared files
  • Low overlap (<20%): 50% of pairs — mostly independent
This distribution ensures the benchmark tests both conflict-heavy and conflict-light scenarios.
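The overlap buckets above can be computed mechanically from the two feature patches. A minimal sketch, assuming unified-diff patches; `touched_files`, `overlap_ratio`, and `classify` are hypothetical helpers, not part of the CooperBench tooling, and the ratio here is the Jaccard ratio of the two file sets:

```python
import re

def touched_files(patch_text: str) -> set[str]:
    """Collect file paths from unified-diff '+++ b/...' headers."""
    return set(re.findall(r"^\+\+\+ b/(.+)$", patch_text, flags=re.MULTILINE))

def overlap_ratio(patch_a: str, patch_b: str) -> float:
    """Jaccard ratio of the files touched by two patches."""
    a, b = touched_files(patch_a), touched_files(patch_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def classify(ratio: float) -> str:
    """Map a ratio onto the high/medium/low buckets used above."""
    if ratio > 0.5:
        return "high"
    if ratio >= 0.2:
        return "medium"
    return "low"
```

For example, two patches that share one of three distinct files have a ratio of ~0.33 and land in the medium bucket.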
Test coverage

All tasks include comprehensive tests:
  • Unit tests: Function-level validation
  • Integration tests: Cross-module interaction
  • Regression tests: Ensure existing behavior preserved
Average: 5-10 test cases per feature, enabling precise evaluation.

Evaluation subsets

For faster iteration during development and research, CooperBench provides two curated subsets:

Lite subset

26 tasks, 100 pairs, 12 repositories. Quick evaluation subset for rapid experimentation, generated via uniform pair-level sampling with seed=42.
Characteristics:
  • Covers all 12 repositories
  • Balanced distribution across domains
  • ~15% of full dataset
  • Runs in ~2 hours with parallel execution
Usage:
cooperbench run --subset lite --setting coop

Flash subset

20 tasks, 50 pairs, 11 repositories. Minimal dev subset for rapid iteration, sampled from the lite subset.
Characteristics:
  • 11 of 12 repositories (excludes one)
  • Fastest evaluation option
  • ~8% of full dataset
  • Runs in ~1 hour with parallel execution
Usage:
cooperbench run --subset flash --setting coop
Both subsets use uniform pair-level sampling (not task-level) to ensure a representative difficulty distribution. A fixed seed (42) makes the selection reproducible.
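Seeded pair-level sampling of this kind can be reproduced in a few lines. This sketch uses integer indices as stand-in pair IDs (the real pair identifiers are an assumption), and draws the flash subset from the lite subset as described above:

```python
import random

def sample_pairs(pair_ids: list, k: int, seed: int = 42) -> list:
    """Uniform pair-level sample; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(pair_ids, k))

all_pairs = list(range(199))         # stand-in IDs for the 199 feature pairs
lite = sample_pairs(all_pairs, 100)  # same 100 pairs on every run
flash = sample_pairs(lite, 50)       # flash is drawn from the lite sample
```

Because `random.Random(42)` is constructed fresh inside the function, repeated calls with the same inputs always return the same subset.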

Dataset structure

Each task follows a consistent directory structure:
dataset/
  {repo_name}_task/
    task{id}/
      Dockerfile          # Container for isolated testing
      combined.patch      # Full PR with all features
      setup.sh            # Repository setup script
      runner.sh           # Test wrapper
      run_tests.sh        # Test execution
      feature{N}/
        feature.md        # Feature description for agents
        feature.patch     # Golden implementation
        tests.patch       # Test cases for this feature
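The layout above is simple enough to walk with the standard library. `load_task` below is a hypothetical loader, not part of the CooperBench tooling; it reads one task directory into a plain dict:

```python
from pathlib import Path

def load_task(task_dir: Path) -> dict:
    """Read one CooperBench-style task directory into memory."""
    features = []
    for fdir in sorted(task_dir.glob("feature*")):
        features.append({
            "name": fdir.name,
            "spec": (fdir / "feature.md").read_text(),      # agent-facing description
            "patch": (fdir / "feature.patch").read_text(),  # golden implementation
            "tests": (fdir / "tests.patch").read_text(),    # evaluation test cases
        })
    return {
        "combined_patch": (task_dir / "combined.patch").read_text(),
        "features": features,
    }
```

Only `feature.md` would be shown to the agents; the patches stay on the evaluation side.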

File purposes

  • feature.md: Natural language description given to agents as the task specification
  • feature.patch: Ground truth implementation used for evaluation reference
  • tests.patch: Test cases that agents must pass (added to repository before agent runs)
  • combined.patch: The original PR showing how all features integrate

Statistics summary

| Metric | Value |
| --- | --- |
| Total tasks | 652 |
| Total repositories | 12 |
| Programming languages | 4 (Python, TypeScript, Go, Rust) |
| Total features | ~3,300 |
| Possible feature pairs | 199 |
| Lite subset pairs | 100 |
| Flash subset pairs | 50 |
| Average features per task | 5.1 |
| Average lines changed per feature | 75 |
| Average files modified per feature | 3.4 |
| Tasks with high file overlap | 15% |
| Tasks with test coverage | 100% |

Download the dataset

The full CooperBench dataset is available on Hugging Face. Clone it with:
git clone https://huggingface.co/datasets/cooperbench/cooperbench dataset/