The dataset is available on HuggingFace and can be downloaded with:
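The exact HuggingFace repo ID is not given on this page; as a sketch using the standard `huggingface_hub` tooling (the repo ID below is a placeholder, not the real one):

```python
from huggingface_hub import snapshot_download

# NOTE: "org/CooperBench" is a placeholder; substitute the real dataset ID.
local_dir = snapshot_download(repo_id="org/CooperBench", repo_type="dataset")
print(local_dir)  # local path containing the downloaded task files
```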
## Task construction methodology
Each task in CooperBench is derived from a real-world pull request.

### PR selection
Pull requests are selected from popular open-source repositories based on:
- Multiple independent features in a single PR
- Comprehensive test coverage
- Clear feature boundaries
### Feature extraction
Each PR is analyzed to identify independent features that:
- Can be implemented separately
- Have distinct test suites
- May interact when merged together
## Repository coverage

The benchmark spans 12 repositories across multiple languages and domains:

| Repository | Language | Tasks | Features | Domain |
|---|---|---|---|---|
| dottxt-ai/outlines | Python | 3 | 22 | LLM framework |
| stanfordnlp/dspy | Python | 4 | 23 | NLP framework |
| go-chi/chi | Go | 3 | 13 | HTTP router |
| huggingface/datasets | Python | 3 | 13 | Dataset library |
| run-llama/llama_index | Python | 3 | 16 | Data framework |
| openai/tiktoken | Python | 1 | 10 | Tokenizer |
| pallets/click | Python | 3 | 27 | CLI framework |
| pallets/jinja | Python | 3 | 30 | Template engine |
| python-pillow/Pillow | Python | 3 | 15 | Image library |
| react-hook-form/react-hook-form | TypeScript | 2 | 11 | React forms |
| samuelcolvin/dirty-equals | Python | 1 | 9 | Testing utility |
| typst/typst | Rust | 1 | 10 | Typesetting |
Total: 30 base tasks, expanded to 652 unique task instances through feature pair combinations.
## Directory structure

The dataset follows a consistent hierarchy.
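One plausible layout consistent with the files described below (a sketch; the directory names, and the placement of the scripts at the task level, are assumptions):

```
tasks/
└── <repository>__<pr>/        # one base task per pull request
    ├── setup.sh               # environment setup
    ├── run_tests.sh           # test runner
    └── features/
        └── <feature-id>/
            ├── feature.md     # task prompt
            ├── feature.patch  # gold implementation
            └── tests.patch    # feature test suite
```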
### File descriptions

#### feature.md - Feature description
Natural language description of what the agent should implement. This is the “task prompt” given to agents. Example:
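A hypothetical illustration (invented here, not taken from the dataset) of the kind of prompt a feature.md might contain:

```markdown
# Feature: add a `strict` flag to the parser

Add a boolean `strict` option to `parse()`. When enabled, unknown keys
should raise a `ValueError` instead of being silently ignored.
Existing callers must keep their current behavior by default.
```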
#### feature.patch - Gold implementation

The reference implementation extracted from the original PR. Used for comparison and analysis; it is not given to agents.
#### tests.patch - Test suite

Test cases that verify the feature implementation. Applied to the repository before running tests to validate agent output.
#### setup.sh - Environment setup

Prepares the repository environment.
#### run_tests.sh - Test execution

Executes the test suite for verification.
## Feature pairs and task instances

Each base task contains multiple features, and the benchmark evaluates all possible feature pairs. For example, a task with 5 features generates C(5, 2) = 10 unique task instances, each testing a different pair of features. This pairwise expansion:
- Tests various conflict patterns
- Increases dataset diversity
- Provides multiple difficulty levels per repository
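The pair expansion described above can be sketched with `itertools.combinations`:

```python
from itertools import combinations

features = ["f1", "f2", "f3", "f4", "f5"]  # a base task with 5 features
pairs = list(combinations(features, 2))    # all unordered feature pairs
print(len(pairs))  # C(5, 2) = 10 task instances
```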
## Dataset subsets

CooperBench provides curated subsets for faster iteration:

- Flash (dev): 20 tasks, 50 feature pairs across 11 repositories
  - Designed for rapid development and debugging
  - Sampled uniformly from the lite subset
  - Fixed seed (42) for reproducibility
- Lite (quick eval)
- Full (complete)
### Subset generation

Subsets use uniform pair-level sampling to ensure diversity.
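A minimal sketch of seeded pair-level sampling (illustrative; the function and parameter names are assumptions, not the benchmark's actual code):

```python
import random
from itertools import combinations

def sample_pairs(all_pairs, k, seed=42):
    """Uniformly sample k feature pairs; a fixed seed makes the subset reproducible."""
    return random.Random(seed).sample(all_pairs, k)

# e.g. draw a 50-pair subset from every pair of 20 hypothetical features
all_pairs = list(combinations(range(20), 2))
subset = sample_pairs(all_pairs, 50)
```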
## Task difficulty and characteristics

Tasks vary across multiple dimensions:

### Code complexity
- Lines of code changed
- Number of files modified
- Depth of call chains
### Feature independence
- Shared vs. separate files
- Overlapping vs. distinct modules
- Semantic dependencies
### Test coverage
- Number of test cases
- Integration vs. unit tests
- Edge case coverage
### Domain knowledge
- Framework-specific APIs
- Language idioms
- Architectural patterns
## Accessing task data

You can programmatically access task information:
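As a sketch, assuming each feature directory holds the three files described above (the helper name and layout are assumptions):

```python
from pathlib import Path

def load_feature(feature_dir):
    """Read the core files for one feature directory into a dict."""
    d = Path(feature_dir)
    return {
        "description": (d / "feature.md").read_text(),    # task prompt
        "gold_patch": (d / "feature.patch").read_text(),  # reference implementation
        "tests_patch": (d / "tests.patch").read_text(),   # verification tests
    }
```

Point it at a feature directory from the downloaded dataset to get the prompt and patches for a single feature.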
## What’s next?

- Run tasks: Execute benchmark tasks with your agents
- Understand settings: Learn about cooperative vs. solo evaluation modes
- System architecture: Explore how tasks are executed and evaluated
- CLI reference: Complete command-line interface documentation