Overview
- Total tasks: 652, derived from real GitHub pull requests
- Repositories: 12 popular open-source projects
- Languages: 4 (Python, TypeScript, Go, and Rust)
Repository distribution
The benchmark spans 12 high-quality open-source repositories, selected for diversity in domain, size, and complexity.

| Repository | Tasks | Features | Language | Domain |
|---|---|---|---|---|
| pallets/jinja | 3 | 30 | Python | Template engine |
| pallets/click | 3 | 27 | Python | CLI framework |
| stanfordnlp/dspy | 4 | 23 | Python | LLM framework |
| dottxt-ai/outlines | 3 | 22 | Python | Structured generation |
| run-llama/llama_index | 3 | 16 | Python | RAG framework |
| python-pillow/Pillow | 3 | 15 | Python | Image processing |
| huggingface/datasets | 3 | 13 | Python | Data loading |
| go-chi/chi | 3 | 13 | Go | HTTP router |
| react-hook-form/react-hook-form | 2 | 11 | TypeScript | Form library |
| openai/tiktoken | 1 | 10 | Python | Tokenizer |
| typst/typst | 1 | 10 | Rust | Typesetting system |
| samuelcolvin/dirty-equals | 1 | 9 | Python | Testing utilities |
Selection criteria
Repositories were chosen based on:

- Activity: High-quality, actively maintained projects
- Popularity: Wide user base and real-world usage
- Diversity: Mix of domains (ML, web, CLI, image processing, etc.)
- Test coverage: Strong test suites for accurate evaluation
- Complexity: Range from 10K to 100K+ lines of code
Language breakdown
The dataset includes four programming languages, reflecting real-world development diversity:

- Python: 9 repositories, ~550 tasks. The dominant language, covering ML frameworks, CLI tools, web libraries, and utilities.
- TypeScript: 1 repository, ~50 tasks. Form library (react-hook-form).
- Go: 1 repository, ~40 tasks. HTTP routing library (chi).
- Rust: 1 repository, ~12 tasks. Typesetting system (typst).
Python dominates the dataset because many popular open-source projects (especially in ML/AI) use Python. This reflects real-world development patterns where multi-agent collaboration is most likely to be deployed.
Task complexity metrics
Each task in CooperBench represents a real pull request that was decomposed into 2+ independent features.

Features per task
- Minimum: 2 features (required for cooperation)
- Maximum: 10 features (complex PRs)
- Average: ~5.1 features per task
- Total feature pairs: 199 in the full dataset
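For a task decomposed into n features, the number of distinct feature pairs it can yield is C(n, 2) = n(n-1)/2. A quick sketch of the arithmetic:

```python
from math import comb

# Distinct feature pairs per task, for the minimum, average-ish,
# and maximum feature counts quoted above.
for n in (2, 5, 10):
    print(n, "features ->", comb(n, 2), "possible pairs")
# 2 features -> 1, 5 features -> 10, 10 features -> 45
```

Note that the dataset ships a curated selection of pairs, not every combinatorially possible one.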
Task characteristics
Lines of code changed
Range: 10 to 500+ lines per feature
- Small features: 10-50 lines (bug fixes, simple additions)
- Medium features: 50-150 lines (new functions, refactoring)
- Large features: 150+ lines (major components, API changes)
Files modified
Range: 1 to 15 files per feature
- Single-file features: Isolated changes to one module
- Multi-file features: Cross-cutting changes requiring coordination
- Test files: Every feature includes corresponding test changes
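The files-modified count for a feature can be read straight off its patch. A minimal sketch, using a hypothetical helper that scans unified-diff headers (not CooperBench tooling):

```python
# Count files touched by a unified-diff patch (e.g. a feature.patch)
# by counting "diff --git" headers, one per modified file.
def files_modified(patch_text: str) -> int:
    return sum(
        1 for line in patch_text.splitlines()
        if line.startswith("diff --git ")
    )

# Illustrative two-file patch fragment.
patch = """diff --git a/mux.go b/mux.go
--- a/mux.go
+++ b/mux.go
diff --git a/tree.go b/tree.go
--- a/tree.go
+++ b/tree.go
"""
print(files_modified(patch))  # 2
```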
Conflict potential
Measured by file overlap between features:
- High overlap (>50%): 15% of pairs — requires careful coordination
- Medium overlap (20-50%): 35% of pairs — some shared files
- Low overlap (<20%): 50% of pairs — mostly independent
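The bucketing above can be sketched as follows; the exact overlap metric CooperBench uses is an assumption here (shared files as a percentage of the smaller feature's file set):

```python
# Sketch: bucket a feature pair by file overlap.
# Metric choice (shared / smaller set) is an assumption, not the
# benchmark's documented formula.
def overlap_pct(files_a: set[str], files_b: set[str]) -> float:
    if not files_a or not files_b:
        return 0.0
    shared = len(files_a & files_b)
    return 100.0 * shared / min(len(files_a), len(files_b))

def conflict_bucket(pct: float) -> str:
    if pct > 50:
        return "high"
    if pct >= 20:
        return "medium"
    return "low"

a = {"router.go", "mux.go", "tree.go"}
b = {"mux.go", "middleware.go"}
print(conflict_bucket(overlap_pct(a, b)))  # 1 shared / min(3, 2) = 50% -> "medium"
```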
Test coverage
All tasks include comprehensive tests:
- Unit tests: Function-level validation
- Integration tests: Cross-module interaction
- Regression tests: Ensure existing behavior preserved
Evaluation subsets
For faster iteration during development and research, CooperBench provides two curated subsets:
Lite subset
26 tasks, 100 pairs, 12 repositories. Quick evaluation subset for rapid experimentation, generated via uniform pair-level sampling with seed=42.
- Covers all 12 repositories
- Balanced distribution across domains
- ~15% of full dataset
- Runs in ~2 hours with parallel execution
Flash subset
20 tasks, 50 pairs, 11 repositories. Minimal dev subset for rapid iteration, sampled from the lite subset.
- Covers 11 of 12 repositories
- Fastest evaluation option
- ~8% of full dataset
- Runs in ~1 hour with parallel execution
Both subsets use uniform pair-level sampling (not task-level) to ensure representative difficulty distribution. A fixed seed (42) ensures reproducibility.
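Pair-level sampling with a fixed seed can be sketched as below; the task and feature names are illustrative, not from the real dataset:

```python
import itertools
import random

# Toy stand-in for the dataset: 10 tasks, 3 features each.
tasks = {f"task_{i}": [f"feat_{j}" for j in range(3)] for i in range(10)}

# Enumerate every feature pair across all tasks, then sample pairs
# uniformly (pair-level, not task-level, sampling).
all_pairs = [
    (task, a, b)
    for task, feats in tasks.items()
    for a, b in itertools.combinations(feats, 2)
]
rng = random.Random(42)  # fixed seed for reproducibility
subset = rng.sample(all_pairs, 5)
print(len(all_pairs), len(subset))  # 30 5
```

Because harder tasks contribute more pairs, sampling at the pair level keeps the subset's difficulty mix close to the full dataset's.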
Dataset structure
Each task follows a consistent directory structure.

File purposes
- feature.md: Natural language description given to agents as the task specification
- feature.patch: Ground truth implementation used for evaluation reference
- tests.patch: Test cases that agents must pass (added to repository before agent runs)
- combined.patch: The original PR showing how all features integrate
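A harness can sanity-check a task directory against this file list. A minimal sketch, assuming a flat layout with all four files directly in the task directory (the real layout may nest them differently):

```python
from pathlib import Path

# The four files documented above.
EXPECTED = {"feature.md", "feature.patch", "tests.patch", "combined.patch"}

def missing_files(task_dir: Path) -> set[str]:
    """Return the documented files absent from a task directory."""
    present = {p.name for p in task_dir.iterdir()} if task_dir.is_dir() else set()
    return EXPECTED - present

# A nonexistent directory reports every documented file as missing.
print(missing_files(Path("no_such_task_dir")))
```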
Statistics summary
| Metric | Value |
|---|---|
| Total tasks | 652 |
| Total repositories | 12 |
| Programming languages | 4 (Python, TypeScript, Go, Rust) |
| Total features | ~3,300 |
| Possible feature pairs | 199 |
| Lite subset pairs | 100 |
| Flash subset pairs | 50 |
| Average features per task | 5.1 |
| Average lines changed per feature | 75 |
| Average files modified per feature | 3.4 |
| Tasks with high file overlap | 15% |
| Tasks with test coverage | 100% |
Download the dataset
The full CooperBench dataset is available on Hugging Face and can be cloned with Git.
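A typical clone command looks like the following; the dataset path is a placeholder, not the real repository name:

```shell
# Substitute the actual CooperBench organization/dataset path on Hugging Face.
git clone https://huggingface.co/datasets/<org>/CooperBench
```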