CooperBench includes 652 tasks across 12 real-world repositories, spanning multiple programming languages and domains. The dataset is designed to evaluate agent coordination on realistic software engineering scenarios.

Overview

Total tasks

652 tasks. Derived from real GitHub pull requests.

Repositories

12 repositories. Popular open-source projects.

Languages

4 languages: Python, TypeScript, Go, and Rust.

Repository distribution

The benchmark spans 12 high-quality open-source repositories, selected for diversity in domain, size, and complexity.
| Repository | Tasks | Features | Language | Domain |
| --- | --- | --- | --- | --- |
| pallets/jinja | 3 | 30 | Python | Template engine |
| pallets/click | 3 | 27 | Python | CLI framework |
| stanfordnlp/dspy | 4 | 23 | Python | LLM framework |
| dottxt-ai/outlines | 3 | 22 | Python | Structured generation |
| run-llama/llama_index | 3 | 16 | Python | RAG framework |
| python-pillow/Pillow | 3 | 15 | Python | Image processing |
| huggingface/datasets | 3 | 13 | Python | Data loading |
| go-chi/chi | 3 | 13 | Go | HTTP router |
| react-hook-form/react-hook-form | 2 | 11 | TypeScript | Form library |
| openai/tiktoken | 1 | 10 | Python | Tokenizer |
| typst/typst | 1 | 10 | Rust | Typesetting system |
| samuelcolvin/dirty-equals | 1 | 9 | Python | Testing utilities |

Selection criteria

Repositories were chosen based on:
  • Activity: High-quality, actively maintained projects
  • Popularity: Wide user base and real-world usage
  • Diversity: Mix of domains (ML, web, CLI, image processing, etc.)
  • Test coverage: Strong test suites for accurate evaluation
  • Complexity: Range from 10K to 100K+ lines of code

Language breakdown

The dataset includes four programming languages, reflecting real-world development diversity:

Python

9 repositories, ~550 tasks. The dominant language, covering ML frameworks, CLI tools, web libraries, and utilities.

TypeScript

1 repository, ~50 tasks. React form library (react-hook-form).

Go

1 repository, ~40 tasks. HTTP routing library (chi).

Rust

1 repository, ~12 tasks. Typesetting system (typst).
Python dominates the dataset because many popular open-source projects (especially in ML/AI) use Python. This reflects real-world development patterns where multi-agent collaboration is most likely to be deployed.

Task complexity metrics

Each task in CooperBench represents a real pull request that was decomposed into 2+ independent features.

Features per task

  • Minimum: 2 features (required for cooperation)
  • Maximum: 10 features (complex PRs)
  • Average: ~5.1 features per task
  • Total feature pairs: 199 pairs in the full dataset
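As a sanity check on these numbers, a task split into n features admits n(n-1)/2 unordered feature pairs; note the benchmark's 199 pairs are curated from real PRs, so the total need not equal this combinatorial maximum. A minimal sketch (the 4-feature task below is illustrative, not taken from the dataset):

```python
from itertools import combinations

def pairs_in_task(n_features: int) -> int:
    """Unordered feature pairs available in a task with n features."""
    return n_features * (n_features - 1) // 2

# Enumerate the concrete pairs for one hypothetical 4-feature task.
features = ["feature1", "feature2", "feature3", "feature4"]
pairs = list(combinations(features, 2))
assert len(pairs) == pairs_in_task(len(features))  # 6 pairs
```

A 2-feature task (the minimum) yields exactly one pair, which is why every task contributes at least one evaluable cooperation scenario.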

Task characteristics

Lines changed

Range: 10 to 500+ lines per feature:
  • Small features: 10-50 lines (bug fixes, simple additions)
  • Medium features: 50-150 lines (new functions, refactoring)
  • Large features: 150+ lines (major components, API changes)
Most tasks involve 50-100 lines of changes across 2-5 files.
Files modified

Range: 1 to 15 files per feature:
  • Single-file features: Isolated changes to one module
  • Multi-file features: Cross-cutting changes requiring coordination
  • Test files: Every feature includes corresponding test changes
Average: 3-4 files per feature (including tests).
File overlap

Measured by the share of files that both features in a pair touch:
  • High overlap (>50%): 15% of pairs — requires careful coordination
  • Medium overlap (20-50%): 35% of pairs — some shared files
  • Low overlap (<20%): 50% of pairs — mostly independent
This distribution ensures the benchmark tests both conflict-heavy and conflict-light scenarios.
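The overlap buckets above can be computed mechanically from the two feature patches. A minimal sketch, assuming unified-diff patches; `touched_files`, `overlap_ratio`, and `classify` are hypothetical helpers, not part of the CooperBench tooling, and the ratio here is the Jaccard ratio of the two file sets:

```python
import re

def touched_files(patch_text: str) -> set[str]:
    """Collect file paths from unified-diff '+++ b/...' headers."""
    return set(re.findall(r"^\+\+\+ b/(.+)$", patch_text, flags=re.MULTILINE))

def overlap_ratio(patch_a: str, patch_b: str) -> float:
    """Jaccard ratio of the files touched by two patches."""
    a, b = touched_files(patch_a), touched_files(patch_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def classify(ratio: float) -> str:
    """Map a ratio onto the high/medium/low buckets used above."""
    if ratio > 0.5:
        return "high"
    if ratio >= 0.2:
        return "medium"
    return "low"
```

For example, two patches that share one of three distinct files have a ratio of ~0.33 and land in the medium bucket.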
Test coverage

All tasks include comprehensive tests:
  • Unit tests: Function-level validation
  • Integration tests: Cross-module interaction
  • Regression tests: Ensure existing behavior preserved
Average: 5-10 test cases per feature, enabling precise evaluation.

Evaluation subsets

For faster iteration during development and research, CooperBench provides two curated subsets:

Lite subset

26 tasks, 100 pairs, 12 repositories. Quick evaluation subset for rapid experimentation, generated via uniform pair-level sampling with seed=42.
Characteristics:
  • Covers all 12 repositories
  • Balanced distribution across domains
  • ~15% of full dataset
  • Runs in ~2 hours with parallel execution
Usage:
cooperbench run --subset lite --setting coop

Flash subset

20 tasks, 50 pairs, 11 repositories. Minimal dev subset for rapid iteration, sampled from the lite subset.
Characteristics:
  • 11 of 12 repositories (excludes one)
  • Fastest evaluation option
  • ~8% of full dataset
  • Runs in ~1 hour with parallel execution
Usage:
cooperbench run --subset flash --setting coop
Both subsets use uniform pair-level sampling (not task-level) to ensure a representative difficulty distribution. A fixed seed (42) makes the selection reproducible.
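Seeded pair-level sampling of this kind can be reproduced in a few lines. This sketch uses integer indices as stand-in pair IDs (the real pair identifiers are an assumption), and draws the flash subset from the lite subset as described above:

```python
import random

def sample_pairs(pair_ids: list, k: int, seed: int = 42) -> list:
    """Uniform pair-level sample; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(pair_ids, k))

all_pairs = list(range(199))         # stand-in IDs for the 199 feature pairs
lite = sample_pairs(all_pairs, 100)  # same 100 pairs on every run
flash = sample_pairs(lite, 50)       # flash is drawn from the lite sample
```

Because `random.Random(42)` is constructed fresh inside the function, repeated calls with the same inputs always return the same subset.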

Dataset structure

Each task follows a consistent directory structure:
dataset/
  {repo_name}_task/
    task{id}/
      Dockerfile          # Container for isolated testing
      combined.patch      # Full PR with all features
      setup.sh            # Repository setup script
      runner.sh           # Test wrapper
      run_tests.sh        # Test execution
      feature{N}/
        feature.md        # Feature description for agents
        feature.patch     # Golden implementation
        tests.patch       # Test cases for this feature
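The layout above is simple enough to walk with the standard library. `load_task` below is a hypothetical loader, not part of the CooperBench tooling; it reads one task directory into a plain dict:

```python
from pathlib import Path

def load_task(task_dir: Path) -> dict:
    """Read one CooperBench-style task directory into memory."""
    features = []
    for fdir in sorted(task_dir.glob("feature*")):
        features.append({
            "name": fdir.name,
            "spec": (fdir / "feature.md").read_text(),      # agent-facing description
            "patch": (fdir / "feature.patch").read_text(),  # golden implementation
            "tests": (fdir / "tests.patch").read_text(),    # evaluation test cases
        })
    return {
        "combined_patch": (task_dir / "combined.patch").read_text(),
        "features": features,
    }
```

Only `feature.md` would be shown to the agents; the patches stay on the evaluation side.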

File purposes

  • feature.md: Natural language description given to agents as the task specification
  • feature.patch: Ground truth implementation used for evaluation reference
  • tests.patch: Test cases that agents must pass (added to repository before agent runs)
  • combined.patch: The original PR showing how all features integrate

Statistics summary

| Metric | Value |
| --- | --- |
| Total tasks | 652 |
| Total repositories | 12 |
| Programming languages | 4 (Python, TypeScript, Go, Rust) |
| Total features | ~3,300 |
| Possible feature pairs | 199 |
| Lite subset pairs | 100 |
| Flash subset pairs | 50 |
| Average features per task | 5.1 |
| Average lines changed per feature | 75 |
| Average files modified per feature | 3.4 |
| Tasks with high file overlap | 15% |
| Tasks with test coverage | 100% |

Download the dataset

The full CooperBench dataset is available on Hugging Face. Clone it with:
git clone https://huggingface.co/datasets/cooperbench/cooperbench dataset/