context-bench ships with 42 dataset loaders organized into 10 categories. All HuggingFace-backed datasets require the datasets extra:
uv sync --extra datasets
Pass any dataset to the CLI with --dataset <name>. Multiple datasets can be combined in a single run:
context-bench \
  --proxy http://localhost:7878 \
  --dataset hotpotqa --dataset gsm8k --dataset mmlu
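
When runs are scripted, the same invocation can be assembled programmatically. The sketch below builds the argument list with one --dataset flag per dataset. `build_command` is a hypothetical helper, not part of context-bench, and it assumes only the --proxy and --dataset flags shown above.

```python
import shlex


def build_command(proxy, datasets):
    """Return the argv list for a context-bench run (hypothetical helper)."""
    if not datasets:
        raise ValueError("at least one dataset is required")
    argv = ["context-bench", "--proxy", proxy]
    for name in datasets:
        # The flag is repeated once per dataset, as in the command above.
        argv += ["--dataset", name]
    return argv


argv = build_command("http://localhost:7878", ["hotpotqa", "gsm8k", "mmlu"])
print(shlex.join(argv))  # shell-quoted form, handy for logging
```

The resulting list can be handed to `subprocess.run(argv)` without going through a shell.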

Multi-config datasets

Some datasets have sub-configurations selectable with a :config suffix:
| Dataset | Example | Notes |
| --- | --- | --- |
| mmlu | --dataset mmlu:anatomy | Any MMLU subject |
| mgsm | --dataset mgsm:de | Language code (de, ja, zh, …) |
| longbench | --dataset longbench:qasper | Any LongBench task |
| infinitebench | --dataset infinitebench:en_qa | Any InfiniteBench task |
| bbh | --dataset bbh:causal_judgement | Any BIG-Bench Hard task |
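
To illustrate the name:config convention, here is a minimal sketch of how such a spec could be split into its two parts. `split_dataset_spec` is a hypothetical helper written for this example; context-bench's own parsing may differ.

```python
def split_dataset_spec(spec):
    """Split 'name:config' into (name, config); config is None when absent."""
    name, sep, config = spec.partition(":")
    # Reject an empty name ("":anatomy) or a dangling colon ("mmlu:").
    if not name or (sep and not config):
        raise ValueError(f"invalid dataset spec: {spec!r}")
    return name, (config if sep else None)


for spec in ["mmlu:anatomy", "mgsm:de", "gsm8k"]:
    print(split_dataset_spec(spec))
```

A spec without a suffix, such as gsm8k, yields `None` for the config, which a loader could treat as "use the default configuration".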

QA & Reading Comprehension

| CLI name | Dataset | Notes |
| --- | --- | --- |
| hotpotqa | HotpotQA | Multi-hop QA |
| natural-questions | Natural Questions | Open-domain QA |
| musique | MuSiQue | Multi-hop QA (answerable subset) |
| narrativeqa | NarrativeQA | Document summaries |
| triviaqa | TriviaQA | Search context QA |
| frames | FRAMES | Multi-hop factual reasoning |
| quality | QuALITY | Long-document multiple-choice QA |
| qasper | QASPer | Scientific paper QA |

Knowledge & Multiple Choice

| CLI name | Dataset | Notes |
| --- | --- | --- |
| mmlu | MMLU | 57 subjects; 4-choice; configurable (mmlu:anatomy) |
| mmlu-pro | MMLU-Pro | 10-choice harder variant |
| arc-challenge | ARC-Challenge | Science exam questions |
| truthfulqa | TruthfulQA | Factuality (generation mode) |
| gpqa | GPQA Diamond | Graduate-level QA (gated dataset) |
| hellaswag | HellaSwag | Commonsense sentence completion |
| winogrande | WinoGrande | Coreference resolution |

Reasoning & Math

| CLI name | Dataset | Notes |
| --- | --- | --- |
| gsm8k | GSM8K | Grade school math word problems |
| drop | DROP | Discrete reasoning over paragraphs |
| math | MATH | Competition mathematics |
| mgsm | MGSM | Multilingual math; configurable (mgsm:de, mgsm:ja) |
| bbh | BIG-Bench Hard | 23 hard BIG-Bench tasks; configurable (bbh:causal_judgement) |

Code Generation

| CLI name | Dataset | Notes |
| --- | --- | --- |
| humaneval | HumanEval | Execution-based; scored by CodeExecution (pass@1) |
| mbpp | MBPP | Execution-based; scored by CodeExecution (pass@1) |
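
As a rough illustration of execution-based pass@1 scoring, the sketch below runs one candidate completion per problem against that problem's assertions and reports the fraction that pass. It is not the CodeExecution evaluator itself: `passes` and `pass_at_1` are hypothetical names, and a real harness would add sandboxing and timeouts, which this example omits.

```python
def passes(candidate_src, test_src):
    """Return True if the tests run without raising against the candidate."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # assertions raise on failure
    except Exception:
        return False
    return True


def pass_at_1(results):
    """Fraction of problems whose single sampled completion passed."""
    return sum(results) / len(results) if results else 0.0


# Two toy problems: the first completion is correct, the second is not.
problems = [
    ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n"),
    ("def double(x):\n    return x + 1\n", "assert double(4) == 8\n"),
]
results = [passes(candidate, tests) for candidate, tests in problems]
print(pass_at_1(results))
```

Because `exec` runs untrusted code in the current process, this form is only suitable for trusted toy inputs.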

Summarization

| CLI name | Dataset | Notes |
| --- | --- | --- |
| multi-news | Multi-News | Multi-document summarization |
| dialogsum | DialogSum | Dialogue summarization |
| qmsum | QMSum | Query-based meeting summarization (via SCROLLS) |
| summscreenfd | SummScreenFD | TV transcript summarization (via SCROLLS) |
| meetingbank | MeetingBank | Meeting transcript summarization |
| govreport | GovReport | Government report summarization |

Summarization datasets auto-enable the SummarizationQuality evaluator, which computes ROUGE-L precision, recall, and F1.
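
For reference, ROUGE-L is based on the longest common subsequence (LCS) between the candidate and reference token sequences. The sketch below is one minimal way to compute the three numbers, assuming lowercase whitespace tokenization, no stemming, and a plain F1 (equal weight on precision and recall). The SummarizationQuality evaluator may tokenize or weight differently; `lcs_length` and `rouge_l` are names invented for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) based on the token-level LCS."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))
```

In this example the LCS is "the cat on the mat" (5 tokens) against 6-token sequences on both sides, so precision, recall, and F1 all come out to 5/6.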

NLI & Fact Verification

| CLI name | Dataset | Notes |
| --- | --- | --- |
| contract-nli | ContractNLI | Legal NLI (via SCROLLS) |
| scifact | SciFact | Scientific claim verification |

Instruction Following

| CLI name | Dataset | Notes |
| --- | --- | --- |
| ifeval | IFEval | Programmatic constraint checking (19 constraint types) |
| alpaca-eval | AlpacaEval | 805 open-ended instructions; best used with --judge-url |
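
Programmatic constraint checking means each instruction is verified by a deterministic function rather than a judge model. The sketch below shows the general shape with three illustrative checks. The constraint names and helpers are hypothetical and do not correspond to the 19 constraint types mentioned above.

```python
def min_words(n):
    """Constraint: the response contains at least n whitespace-separated words."""
    return lambda text: len(text.split()) >= n


def no_commas():
    """Constraint: the response contains no comma characters."""
    return lambda text: "," not in text


def ends_with(phrase):
    """Constraint: the response ends with the exact phrase (trailing space ignored)."""
    return lambda text: text.rstrip().endswith(phrase)


def check_all(response, constraints):
    """Return a {constraint name: bool} mapping for one response."""
    return {name: check(response) for name, check in constraints.items()}


constraints = {
    "min_words_5": min_words(5),
    "no_commas": no_commas(),
    "ends_with_offer": ends_with("Anything else?"),
}
response = "Tea is brewed from dried leaves. Anything else?"
print(check_all(response, constraints))
```

A strict score would count the response as correct only if every check returns True; a looser score could average the individual results.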

Multi-Turn

| CLI name | Dataset | Notes |
| --- | --- | --- |
| mt-bench | MT-Bench | 80 two-turn conversations; requires process_conversation() |

Long Context

| CLI name | Dataset | Notes |
| --- | --- | --- |
| longbench | LongBench | Multi-task long-context benchmark; configurable (longbench:qasper) |
| longbench-v2 | LongBench v2 | Harder variant of LongBench |
| infinitebench | InfiniteBench | 100K+ token contexts; configurable (infinitebench:en_qa) |
| nolima | NoLiMa | Needle-in-a-haystack retrieval benchmark |

Agent Traces

| CLI name | Dataset | Notes |
| --- | --- | --- |
| bfcl | BFCL v3 | Function calling benchmark |
| apigen | APIGen | Multi-turn tool use traces |
| swebench | SWE-bench | Full coding agent traces |
| swebench-verified | SWE-bench Verified | 500 validated problems |
| swebench-lite | SWE-bench Lite | 300-problem subset |
