The dataset is available on HuggingFace and can be downloaded with:
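The exact HuggingFace repo ID is not given on this page; as a sketch using the standard `huggingface_hub` tooling (the repo ID below is a placeholder, not the real one):

```python
from huggingface_hub import snapshot_download

# NOTE: "org/CooperBench" is a placeholder; substitute the real dataset ID.
local_dir = snapshot_download(repo_id="org/CooperBench", repo_type="dataset")
print(local_dir)  # local path containing the downloaded task files
```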
## Task construction methodology
Each task in CooperBench is derived from a real-world pull request.

### PR selection
Pull requests are selected from popular open-source repositories based on:
- Multiple independent features in a single PR
- Comprehensive test coverage
- Clear feature boundaries
### Feature extraction
Each PR is analyzed to identify independent features that:
- Can be implemented separately
- Have distinct test suites
- May interact when merged together
## Repository coverage

The benchmark spans 12 repositories across multiple languages and domains:

| Repository | Language | Tasks | Features | Domain |
|---|---|---|---|---|
| dottxt-ai/outlines | Python | 3 | 22 | LLM framework |
| stanfordnlp/dspy | Python | 4 | 23 | NLP framework |
| go-chi/chi | Go | 3 | 13 | HTTP router |
| huggingface/datasets | Python | 3 | 13 | Dataset library |
| run-llama/llama_index | Python | 3 | 16 | Data framework |
| openai/tiktoken | Python | 1 | 10 | Tokenizer |
| pallets/click | Python | 3 | 27 | CLI framework |
| pallets/jinja | Python | 3 | 30 | Template engine |
| python-pillow/Pillow | Python | 3 | 15 | Image library |
| react-hook-form/react-hook-form | TypeScript | 2 | 11 | React forms |
| samuelcolvin/dirty-equals | Python | 1 | 9 | Testing utility |
| typst/typst | Rust | 1 | 10 | Typesetting |
Total: 30 base tasks, expanded to 652 unique task instances through feature pair combinations.
## Directory structure

The dataset follows a consistent hierarchy.
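One plausible layout consistent with the files described below (a sketch; the directory names, and the placement of the scripts at the task level, are assumptions):

```
tasks/
└── <repository>__<pr>/        # one base task per pull request
    ├── setup.sh               # environment setup
    ├── run_tests.sh           # test runner
    └── features/
        └── <feature-id>/
            ├── feature.md     # task prompt
            ├── feature.patch  # gold implementation
            └── tests.patch    # feature test suite
```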
### File descriptions

#### feature.md - Feature description
Natural language description of what the agent should implement. This is the “task prompt” given to agents. Example:
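A hypothetical illustration (invented here, not taken from the dataset) of the kind of prompt a feature.md might contain:

```markdown
# Feature: add a `strict` flag to the parser

Add a boolean `strict` option to `parse()`. When enabled, unknown keys
should raise a `ValueError` instead of being silently ignored.
Existing callers must keep their current behavior by default.
```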
#### feature.patch - Gold implementation

The reference implementation extracted from the original PR. Used for comparison and analysis; it is not given to agents.
#### tests.patch - Test suite

Test cases that verify the feature implementation. Applied to the repository before running tests to validate agent output.
#### setup.sh - Environment setup

Prepares the repository environment.
#### run_tests.sh - Test execution

Executes the test suite for verification.
## Feature pairs and task instances

Each base task contains multiple features, and the benchmark evaluates all possible feature pairs. For example, a task with 5 features generates C(5, 2) = 10 unique task instances, each testing a different pair of features. This pairwise expansion:
- Tests various conflict patterns
- Increases dataset diversity
- Provides multiple difficulty levels per repository
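The pair expansion described above can be sketched with `itertools.combinations`:

```python
from itertools import combinations

features = ["f1", "f2", "f3", "f4", "f5"]  # a base task with 5 features
pairs = list(combinations(features, 2))    # all unordered feature pairs
print(len(pairs))  # C(5, 2) = 10 task instances
```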
## Dataset subsets

CooperBench provides curated subsets for faster iteration:

- Flash (dev): 20 tasks, 50 feature pairs across 11 repositories
  - Designed for rapid development and debugging
  - Sampled uniformly from the lite subset
  - Fixed seed (42) for reproducibility
- Lite (quick eval)
- Full (complete)
### Subset generation

Subsets use uniform pair-level sampling to ensure diversity.
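A minimal sketch of seeded pair-level sampling (illustrative; the function and parameter names are assumptions, not the benchmark's actual code):

```python
import random
from itertools import combinations

def sample_pairs(all_pairs, k, seed=42):
    """Uniformly sample k feature pairs; a fixed seed makes the subset reproducible."""
    return random.Random(seed).sample(all_pairs, k)

# e.g. draw a 50-pair subset from every pair of 20 hypothetical features
all_pairs = list(combinations(range(20), 2))
subset = sample_pairs(all_pairs, 50)
```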
## Task difficulty and characteristics

Tasks vary across multiple dimensions:

### Code complexity
- Lines of code changed
- Number of files modified
- Depth of call chains
### Feature independence
- Shared vs. separate files
- Overlapping vs. distinct modules
- Semantic dependencies
### Test coverage
- Number of test cases
- Integration vs. unit tests
- Edge case coverage
### Domain knowledge
- Framework-specific APIs
- Language idioms
- Architectural patterns
## Accessing task data

You can programmatically access task information:
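As a sketch, assuming each feature directory holds the three files described above (the helper name and layout are assumptions):

```python
from pathlib import Path

def load_feature(feature_dir):
    """Read the core files for one feature directory into a dict."""
    d = Path(feature_dir)
    return {
        "description": (d / "feature.md").read_text(),    # task prompt
        "gold_patch": (d / "feature.patch").read_text(),  # reference implementation
        "tests_patch": (d / "tests.patch").read_text(),   # verification tests
    }
```

Point it at a feature directory from the downloaded dataset to get the prompt and patches for a single feature.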
## What’s next?

- Run tasks: Execute benchmark tasks with your agents
- Understand settings: Learn about cooperative vs. solo evaluation modes
- System architecture: Explore how tasks are executed and evaluated
- CLI reference: Complete command-line interface documentation