context-bench ships with 42 dataset loaders organized into 10 categories.
HuggingFace-backed datasets additionally require the `datasets` extra to be installed.
Pass any dataset to the CLI with `--dataset <name>`. Multiple datasets can be
combined in a single run:

```shell
context-bench \
  --proxy http://localhost:7878 \
  --dataset hotpotqa --dataset gsm8k --dataset mmlu
```
## Multi-config datasets

Some datasets have sub-configurations selectable with a `:config` suffix:
| Dataset | Example | Notes |
|---|---|---|
| `mmlu` | `--dataset mmlu:anatomy` | Any MMLU subject |
| `mgsm` | `--dataset mgsm:de` | Language code (`de`, `ja`, `zh`, …) |
| `longbench` | `--dataset longbench:qasper` | Any LongBench task |
| `infinitebench` | `--dataset infinitebench:en_qa` | Any InfiniteBench task |
| `bbh` | `--dataset bbh:causal_judgement` | Any BIG-Bench Hard task |
## QA & Reading Comprehension

| CLI name | Dataset | Notes |
|---|---|---|
| `hotpotqa` | HotpotQA | Multi-hop QA |
| `natural-questions` | Natural Questions | Open-domain QA |
| `musique` | MuSiQue | Multi-hop QA (answerable subset) |
| `narrativeqa` | NarrativeQA | Document summaries |
| `triviaqa` | TriviaQA | Search context QA |
| `frames` | FRAMES | Multi-hop factual reasoning |
| `quality` | QuALITY | Long-document multiple-choice QA |
| `qasper` | Qasper | Scientific paper QA |
## Knowledge & Multiple Choice

| CLI name | Dataset | Notes |
|---|---|---|
| `mmlu` | MMLU | 57 subjects; 4-choice; configurable (`mmlu:anatomy`) |
| `mmlu-pro` | MMLU-Pro | 10-choice harder variant |
| `arc-challenge` | ARC-Challenge | Science exam questions |
| `truthfulqa` | TruthfulQA | Factuality (generation mode) |
| `gpqa` | GPQA Diamond | Graduate-level QA (gated dataset) |
| `hellaswag` | HellaSwag | Commonsense sentence completion |
| `winogrande` | WinoGrande | Coreference resolution |
## Reasoning & Math

| CLI name | Dataset | Notes |
|---|---|---|
| `gsm8k` | GSM8K | Grade school math word problems |
| `drop` | DROP | Discrete reasoning over paragraphs |
| `math` | MATH | Competition mathematics |
| `mgsm` | MGSM | Multilingual math; configurable (`mgsm:de`, `mgsm:ja`) |
| `bbh` | BIG-Bench Hard | 23 hard BIG-Bench tasks; configurable (`bbh:causal_judgement`) |
## Code Generation

| CLI name | Dataset | Notes |
|---|---|---|
| `humaneval` | HumanEval | Execution-based; scored by CodeExecution (pass@1) |
| `mbpp` | MBPP | Execution-based; scored by CodeExecution (pass@1) |
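Execution-based scoring runs each generated solution against the problem's unit tests; pass@1 is the fraction of problems whose single sampled solution passes. An illustrative sketch of the idea (not the actual CodeExecution evaluator, which sandboxes execution):

```python
def pass_at_1(solutions: list[str], test_snippets: list[str]) -> float:
    """solutions[i] and test_snippets[i] are Python source strings.
    A problem passes if its solution plus its tests run without error."""
    passed = 0
    for sol, tests in zip(solutions, test_snippets):
        env = {}
        try:
            exec(sol, env)    # define the candidate function(s)
            exec(tests, env)  # run assertions against them
            passed += 1
        except Exception:
            pass              # any failure (syntax, runtime, assertion) counts as fail
    return passed / len(solutions)
```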
## Summarization

| CLI name | Dataset | Notes |
|---|---|---|
| `multi-news` | Multi-News | Multi-document summarization |
| `dialogsum` | DialogSum | Dialogue summarization |
| `qmsum` | QMSum | Query-based meeting summarization (via SCROLLS) |
| `summscreenfd` | SummScreenFD | TV transcript summarization (via SCROLLS) |
| `meetingbank` | MeetingBank | Meeting transcript summarization |
| `govreport` | GovReport | Government report summarization |
Summarization datasets auto-enable the `SummarizationQuality` evaluator, which
computes ROUGE-L precision, recall, and F1.
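ROUGE-L is based on the longest common subsequence (LCS) of candidate and reference tokens: precision is LCS length over candidate length, recall is LCS length over reference length, and F1 is their harmonic mean. An illustrative implementation (not the evaluator's actual code):

```python
def rouge_l(candidate: str, reference: str) -> dict[str, float]:
    c, r = candidate.split(), reference.split()
    # dp[i][j] = LCS length of c[:i] and r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    precision = lcs / len(c) if c else 0.0
    recall = lcs / len(r) if r else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```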
## NLI & Fact Verification

| CLI name | Dataset | Notes |
|---|---|---|
| `contract-nli` | ContractNLI | Legal NLI (via SCROLLS) |
| `scifact` | SciFact | Scientific claim verification |
## Instruction Following

| CLI name | Dataset | Notes |
|---|---|---|
| `ifeval` | IFEval | Programmatic constraint checking (19 constraint types) |
| `alpaca-eval` | AlpacaEval | 805 open-ended instructions; best used with `--judge-url` |
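IFEval-style evaluation verifies each instruction's constraints with deterministic checks rather than a judge model. Two toy checks in the same spirit (illustrative only, not among the actual 19 constraint types' implementations):

```python
def check_min_words(response: str, min_words: int) -> bool:
    """Verify a 'respond with at least N words' constraint."""
    return len(response.split()) >= min_words

def check_ends_with(response: str, suffix: str) -> bool:
    """Verify an 'end your answer with ...' constraint."""
    return response.rstrip().endswith(suffix)
```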
## Multi-Turn

| CLI name | Dataset | Notes |
|---|---|---|
| `mt-bench` | MT-Bench | 80 two-turn conversations; requires `process_conversation()` |
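For MT-Bench the model's first-turn answer must be fed back into the second turn's context. The exact `process_conversation()` interface is an assumption here, but a hook driving two turns might look roughly like this sketch (`generate` is a hypothetical callable mapping a message list to a reply):

```python
def process_conversation(turns: list[str], generate) -> list[str]:
    """Run a multi-turn conversation: append each user turn, generate a
    reply with full history, and carry that reply into the next turn."""
    messages, replies = [], []
    for user_msg in turns:
        messages.append({"role": "user", "content": user_msg})
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```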
## Long Context

| CLI name | Dataset | Notes |
|---|---|---|
| `longbench` | LongBench | Multi-task long-context benchmark; configurable (`longbench:qasper`) |
| `longbench-v2` | LongBench v2 | Harder variant with more difficult tasks |
| `infinitebench` | InfiniteBench | 100K+ token contexts; configurable (`infinitebench:en_qa`) |
| `nolima` | NoLiMa | Needle-in-a-haystack retrieval benchmark |
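Needle-in-a-haystack benchmarks plant a single fact at a controlled depth inside long filler text and ask the model to retrieve it. A schematic of how such an item is constructed (not NoLiMa's actual generator):

```python
def build_haystack(needle: str, filler_sentence: str,
                   n_sentences: int, depth: float) -> str:
    """Insert `needle` at fractional position `depth` (0.0 = start,
    1.0 = end) within n_sentences copies of filler_sentence."""
    sentences = [filler_sentence] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)
```

Sweeping `depth` from 0.0 to 1.0 while growing `n_sentences` probes retrieval accuracy as a function of both context length and needle position.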
## Agent Traces

| CLI name | Dataset | Notes |
|---|---|---|
| `bfcl` | BFCL v3 | Function calling benchmark |
| `apigen` | APIGen | Multi-turn tool use traces |
| `swebench` | SWE-bench | Full coding agent traces |
| `swebench-verified` | SWE-bench Verified | 500 validated problems |
| `swebench-lite` | SWE-bench Lite | 300-problem subset |