A dataset is the simplest possible interface: any `Iterable[dict[str, Any]]` where each dict has an `"id"` key and a `"context"` key. Built-in loaders, local JSONL files, and plain Python lists all work.

## Required keys

| Key | Type | Description |
| --- | --- | --- |
| `id` | `str \| int` | Unique identifier for the example. Used for caching and deduplication. |
| `context` | `str` | The text passed to the system. May be a document, prompt, conversation, or any string. |

Optional keys such as `question`, `answer`, `choices`, and `correct_letter` are used by specific evaluators when present.
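To make the key requirements concrete, here is a minimal sketch of two valid example dicts, reusing the values from the JSONL example later on this page (the keys are the documented ones; the specific values are illustrative):

```python
# A minimal valid example: only "id" and "context" are required.
example = {"id": "ex-001", "context": "The Eiffel Tower is in Paris."}

# Optional keys such as "question" and "answer" are picked up by
# specific evaluators when they are present.
qa_example = {
    "id": "ex-002",
    "context": "Water boils at 100°C at sea level.",
    "question": "At what temperature does water boil?",
    "answer": "100°C",
}
```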

## Built-in datasets (42)

All HuggingFace datasets require the `datasets` extra:

```shell
uv sync --extra datasets
```

### QA & reading comprehension

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `hotpotqa` | HotpotQA | Multi-hop QA |
| `natural-questions` | Natural Questions | Open-domain QA |
| `musique` | MuSiQue | Multi-hop QA (answerable subset) |
| `narrativeqa` | NarrativeQA | Document summaries |
| `triviaqa` | TriviaQA | Search context QA |
| `frames` | FRAMES | Multi-hop factual reasoning |
| `quality` | QuALITY | Long-document multiple-choice QA |
| `qasper` | QASPer | Scientific paper QA |

### Knowledge & multiple choice

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `mmlu` | MMLU | 4-choice; configurable per subject (`mmlu:anatomy`) |
| `mmlu-pro` | MMLU-Pro | 10-choice harder variant |
| `arc-challenge` | ARC-Challenge | Science exam questions |
| `truthfulqa` | TruthfulQA | Factuality (generation) |
| `gpqa` | GPQA Diamond | Graduate-level QA (gated) |
| `hellaswag` | HellaSwag | Commonsense completion |
| `winogrande` | WinoGrande | Coreference resolution |

### Reasoning & math

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `gsm8k` | GSM8K | Grade school math |
| `drop` | DROP | Discrete reasoning over paragraphs |
| `math` | MATH | Competition mathematics |
| `mgsm` | MGSM | Multilingual math; configurable (`mgsm:de`, `mgsm:ja`) |
| `bbh` | BIG-Bench Hard | 23 hard BIG-Bench tasks; configurable (`bbh:causal_judgement`) |

### Code generation

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `humaneval` | HumanEval | Execution-based (pass@1) |
| `mbpp` | MBPP | Execution-based (pass@1) |

### Summarization

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `multi-news` | Multi-News | Multi-document summarization |
| `dialogsum` | DialogSum | Dialogue summarization |
| `qmsum` | QMSum | Query-based meeting summarization (via SCROLLS) |
| `summscreenfd` | SummScreenFD | TV transcript summarization (via SCROLLS) |
| `meetingbank` | MeetingBank | Meeting transcript summarization |
| `govreport` | GovReport | Government report summarization |

### NLI & fact verification

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `contract-nli` | ContractNLI | Legal NLI (via SCROLLS) |
| `scifact` | SciFact | Scientific claim verification |

### Instruction following

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `ifeval` | IFEval | Programmatic constraint checking |
| `alpaca-eval` | AlpacaEval | 805 instructions; best with `--judge-url` |

### Multi-turn

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `mt-bench` | MT-Bench | 80 two-turn conversations; uses `process_conversation()` |

### Long context

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `longbench` | LongBench | Configurable (`longbench:qasper`) |
| `longbench-v2` | LongBench v2 | Harder variant |
| `infinitebench` | InfiniteBench | 100K+ token contexts |
| `nolima` | NoLiMa | Needle retrieval |

### Agent traces

| CLI name | Dataset | Notes |
| --- | --- | --- |
| `bfcl` | BFCL v3 | Function calling |
| `apigen` | APIGen | Multi-turn tool use |
| `swebench` | SWE-bench | Coding agent traces |
| `swebench-verified` | SWE-bench Verified | 500 validated problems |
| `swebench-lite` | SWE-bench Lite | 300-problem subset |

## Multi-config datasets

Some datasets expose multiple configurations. Select a configuration with a `:config` suffix:

```shell
# MMLU by subject
context-bench --proxy http://localhost:7878 --dataset mmlu:anatomy

# Multilingual math
context-bench --proxy http://localhost:7878 --dataset mgsm:de
context-bench --proxy http://localhost:7878 --dataset mgsm:ja

# LongBench by task
context-bench --proxy http://localhost:7878 --dataset longbench:qasper

# BIG-Bench Hard by task
context-bench --proxy http://localhost:7878 --dataset bbh:causal_judgement
```
In the Python API, pass the full name string:

```python
from context_bench.registry import load_dataset

dataset = load_dataset("mmlu:anatomy")
```

## Local JSONL files

Pass a file path instead of a dataset name to load a local JSONL file:

```shell
context-bench --proxy http://localhost:7878 --dataset ./my_data.jsonl
```

Each line must be a valid JSON object with at least `"id"` and `"context"` keys:

```json
{"id": "ex-001", "context": "The Eiffel Tower is in Paris.", "question": "Where is the Eiffel Tower?", "answer": "Paris"}
{"id": "ex-002", "context": "Water boils at 100°C at sea level.", "question": "At what temperature does water boil?", "answer": "100°C"}
```

## Custom datasets

Any iterable of dicts works. A plain Python list is a valid dataset:

```python
from context_bench import evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore

my_dataset = [
    {
        "id": "q1",
        "context": "The Eiffel Tower is located in Paris, France.",
        "question": "Where is the Eiffel Tower?",
        "answer": "Paris",
    },
    {
        "id": "q2",
        "context": "Python was created by Guido van Rossum in 1991.",
        "question": "Who created Python?",
        "answer": "Guido van Rossum",
    },
]

result = evaluate(
    systems=[my_system],
    dataset=my_dataset,
    evaluators=[AnswerQuality()],
    metrics=[MeanScore(score_field="f1")],
)
```
You can also register a loader function so the dataset is accessible by name from the CLI:

```python
from context_bench.registry import register_dataset

def load_my_dataset(**kwargs):
    return [
        {"id": "q1", "context": "...", "answer": "..."},
        ...
    ]

register_dataset("my-data", load_my_dataset)
```

Then use it like any built-in dataset:

```shell
context-bench --proxy http://localhost:7878 --dataset my-data
```
The `dataset` field in each example dict is used by `PerDatasetBreakdown` to slice results per dataset. If you mix examples from multiple sources in one list, set `example["dataset"]` to tag them.
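For instance, when concatenating two sources into one list, you might tag each example before combining (a sketch; the example lists and the `tag` helper are hypothetical, but the `dataset` field name is the one the breakdown reads):

```python
# Hypothetical stand-ins for two real example sources.
qa_examples = [
    {"id": "q1", "context": "The Eiffel Tower is in Paris.", "answer": "Paris"},
]
summ_examples = [
    {"id": "s1", "context": "A long report to be summarized."},
]

def tag(examples, name):
    """Attach the dataset label that PerDatasetBreakdown slices on."""
    return [{**ex, "dataset": name} for ex in examples]

combined = tag(qa_examples, "my-qa") + tag(summ_examples, "my-summ")
```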
