The CooperBench dataset contains 652 tasks derived from real pull requests across 12 open-source repositories, spanning Python, TypeScript, Go, and Rust.
The dataset is available on HuggingFace and can be downloaded with:
git clone https://huggingface.co/datasets/cooperbench/cooperbench dataset/

Task construction methodology

Each task in CooperBench is derived from a real-world pull request:
1. PR selection

Pull requests are selected from popular open-source repositories based on:
  • Multiple independent features in a single PR
  • Comprehensive test coverage
  • Clear feature boundaries
2. Feature extraction

Each PR is analyzed to identify independent features that:
  • Can be implemented separately
  • Have distinct test suites
  • May interact when merged together
3. Task packaging

For each task, we extract:
  • Feature descriptions (from PR or commits)
  • Implementation patches (golden reference)
  • Test patches (verification suite)
  • Setup and runner scripts
This ensures every task represents a realistic coordination scenario that developers face in practice.
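
The artifacts listed above can be modeled as a small record type. This is an illustrative sketch, not the dataset's actual schema; the class and field names are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One independent feature extracted from a PR."""
    description: str  # contents of feature.md
    impl_patch: str   # golden implementation (feature.patch)
    test_patch: str   # verification suite (tests.patch)


@dataclass
class Task:
    """One base task packaged from a single pull request."""
    repo: str
    task_id: int
    features: list[Feature] = field(default_factory=list)

    def num_pair_instances(self) -> int:
        # Each unordered pair of features becomes one task instance.
        n = len(self.features)
        return n * (n - 1) // 2
```

A task with five features yields `num_pair_instances() == 10`, matching the pair expansion used by the benchmark.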

Repository coverage

The benchmark spans 12 repositories across multiple languages and domains:
| Repository | Language | Tasks | Features | Domain |
| --- | --- | --- | --- | --- |
| dottxt-ai/outlines | Python | 3 | 22 | LLM framework |
| stanfordnlp/dspy | Python | 4 | 23 | NLP framework |
| go-chi/chi | Go | 3 | 13 | HTTP router |
| huggingface/datasets | Python | 3 | 13 | Dataset library |
| run-llama/llama_index | Python | 3 | 16 | Data framework |
| openai/tiktoken | Python | 1 | 10 | Tokenizer |
| pallets/click | Python | 3 | 27 | CLI framework |
| pallets/jinja | Python | 3 | 30 | Template engine |
| python-pillow/Pillow | Python | 3 | 15 | Image library |
| react-hook-form/react-hook-form | TypeScript | 2 | 11 | React forms |
| samuelcolvin/dirty-equals | Python | 1 | 9 | Testing utility |
| typst/typst | Rust | 1 | 10 | Typesetting |
Total: 30 base tasks, expanded to 652 unique task instances through feature pair combinations.

Directory structure

The dataset follows a consistent hierarchy:
dataset/
├── {repo_name}_task/              # Repository-specific directory
│   └── task{id}/                  # Individual task
│       ├── Dockerfile             # Container environment
│       ├── combined.patch         # Full PR (all features merged)
│       ├── setup.sh               # Repository setup script
│       ├── runner.sh              # Test wrapper
│       ├── run_tests.sh           # Test execution
│       └── feature{N}/            # Per-feature files
│           ├── feature.md         # Feature description
│           ├── feature.patch      # Implementation (gold)
│           └── tests.patch        # Test cases
└── subsets/
    ├── flash.json                 # 20 tasks, 50 pairs (dev)
    └── lite.json                  # 26 tasks, 100 pairs (quick eval)

File descriptions

feature.md

Natural language description of what the agent should implement. This is the “task prompt” given to agents. Example:
Add caching support for expensive operations

Implement a cache decorator that memoizes function results...
feature.patch

The reference implementation extracted from the original PR. Used for comparison and analysis, not given to agents.
diff --git a/src/cache.py b/src/cache.py
new file mode 100644
index 0000000..1234567
--- /dev/null
+++ b/src/cache.py
@@ -0,0 +1,14 @@
+import functools
+
+
+def cache(func):
+    """Memoize results for hashable positional arguments."""
+    memo = {}
+
+    @functools.wraps(func)
+    def wrapper(*args):
+        if args not in memo:
+            memo[args] = func(*args)
+        return memo[args]
+
+    return wrapper
tests.patch

Test cases that verify the feature implementation. Applied before running tests to validate agent output.
diff --git a/tests/test_cache.py b/tests/test_cache.py
new file mode 100644
index 0000000..abcdef
--- /dev/null
+++ b/tests/test_cache.py
@@ -0,0 +1,15 @@
+from cache import cache
+
+
+def test_cache_memoization():
+    call_count = 0
+
+    @cache
+    def expensive():
+        nonlocal call_count
+        call_count += 1
+        return 42
+
+    assert expensive() == 42
+    assert expensive() == 42
+    assert call_count == 1
setup.sh

Prepares the repository environment:
#!/bin/bash
set -e

# Clone repository at specific commit
git clone --depth 1 --branch v2.1.0 https://github.com/org/repo
cd repo

# Install dependencies
pip install -e ".[dev]"
run_tests.sh

Executes the test suite for verification:
#!/bin/bash
cd repo
pytest tests/ -v --tb=short
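
Taken together, these files define a simple verification loop: apply a feature's test patch on top of the agent's workspace, then run the test script. The sketch below is a hypothetical harness under that assumption, not the actual CooperBench runner; `verify_feature` and its layout expectations are illustrative:

```python
import subprocess
from pathlib import Path


def verify_feature(task_dir: Path, workspace: Path, feature: int) -> bool:
    """Apply one feature's test patch to a workspace and run the tests.

    Hypothetical harness: assumes the task layout shown above and that
    the workspace is a git checkout the patch applies cleanly to.
    """
    test_patch = task_dir / f"feature{feature}" / "tests.patch"

    # Overlay the verification suite on the agent's implementation.
    subprocess.run(
        ["git", "apply", str(test_patch)], cwd=workspace, check=True
    )

    # The test script exits non-zero if any test fails.
    result = subprocess.run(
        ["bash", str(task_dir / "run_tests.sh")], cwd=workspace
    )
    return result.returncode == 0
```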

Feature pairs and task instances

Each base task contains multiple features. The benchmark evaluates all possible feature pairs:
Example: A task with 5 features generates C(5,2) = 10 unique task instances, each testing a different pair of features.
This approach:
  • Tests various conflict patterns
  • Increases dataset diversity
  • Provides multiple difficulty levels per repository
from itertools import combinations

# Task with features [1, 2, 3, 4, 5]
features = [1, 2, 3, 4, 5]

# Generate all pairs
pairs = list(combinations(features, 2))
# [(1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5)]

# Each pair is a separate task instance

Dataset subsets

CooperBench provides curated subsets for faster iteration:

flash: 20 tasks, 50 feature pairs across 11 repositories
  • Designed for rapid development and debugging
  • Sampled uniformly from the lite subset
  • Fixed seed (42) for reproducibility
cooperbench run -n dev-test --subset flash

lite: 26 tasks, 100 feature pairs for a quicker full evaluation

Subset generation

Subsets use uniform pair-level sampling to ensure diversity:
import random
from itertools import combinations

random.seed(42)  # Fixed for reproducibility

# Generate all possible feature pairs
all_pairs = []
for task in all_tasks:
    for f1, f2 in combinations(task.features, 2):
        all_pairs.append((task.repo, task.task_id, [f1, f2]))

# Sample uniformly
sampled_pairs = random.sample(all_pairs, 100)
Subset files specify exact feature pairs:
{
  "repo": "pallets_jinja_task",
  "task_id": 1621,
  "pairs": [[1, 6], [2, 9], [3, 4]]
}
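
Assuming a subset file holds a JSON list of entries shaped like the example above, expanding it into concrete task instances takes only a few lines; the `load_subset` helper is illustrative, not part of the CLI:

```python
import json
from pathlib import Path


def load_subset(path):
    """Yield (repo, task_id, [f1, f2]) triples from a subset file.

    Assumes the file is a JSON list of entries like the example above.
    """
    entries = json.loads(Path(path).read_text())
    for entry in entries:
        for pair in entry["pairs"]:
            yield entry["repo"], entry["task_id"], pair
```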

Task difficulty and characteristics

Tasks vary across multiple dimensions:

Code complexity

  • Lines of code changed
  • Number of files modified
  • Depth of call chains

Feature independence

  • Shared vs. separate files
  • Overlapping vs. distinct modules
  • Semantic dependencies

Test coverage

  • Number of test cases
  • Integration vs. unit tests
  • Edge case coverage

Domain knowledge

  • Framework-specific APIs
  • Language idioms
  • Architectural patterns
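
Several of the complexity signals above (files modified, lines changed) can be read directly off a task's unified diffs. A rough sketch, not an official difficulty metric:

```python
def diff_stats(patch_text: str) -> dict:
    """Count files touched and lines added/removed in a unified diff."""
    files = set()
    added = removed = 0
    for line in patch_text.splitlines():
        if line.startswith("diff --git"):
            # e.g. "diff --git a/src/cache.py b/src/cache.py"
            files.add(line.split()[-1][2:])
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"files": len(files), "added": added, "removed": removed}
```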

Accessing task data

You can programmatically access task information:
from pathlib import Path

# Read feature description
task_dir = Path("dataset/llama_index_task/task123")
feature_desc = (task_dir / "feature1/feature.md").read_text()

# Load patches
impl_patch = (task_dir / "feature1/feature.patch").read_text()
test_patch = (task_dir / "feature1/tests.patch").read_text()

# Check available features
features = sorted([int(f.name.replace("feature", "")) 
                   for f in task_dir.glob("feature*")])
print(f"Task has {len(features)} features")
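
Building on the same layout, a task's feature list can be expanded into the pair instances the benchmark evaluates (a sketch assuming Python 3.9+ for `str.removeprefix`):

```python
from itertools import combinations
from pathlib import Path


def task_instances(task_dir: Path) -> list:
    """Enumerate the feature-pair instances derived from one base task."""
    features = sorted(
        int(d.name.removeprefix("feature")) for d in task_dir.glob("feature*")
    )
    return list(combinations(features, 2))
```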

What’s next?

Run tasks

Execute benchmark tasks with your agents

Understand settings

Learn about cooperative vs solo evaluation modes

System architecture

Explore how tasks are executed and evaluated

CLI reference

Complete command-line interface documentation