Pipeline overview
Dataset yields examples
The dataset is any Iterable[dict]. Each dict must have an "id" key and a "context" key. Built-in loaders cover 42 datasets; you can also pass a plain list of dicts or a local JSONL file.
System transforms each example
System.process(example) receives the raw dict and returns a new dict. This is the thing you are benchmarking: a compressor, a proxy, a memory manager, or anything else that touches context.
Evaluator scores the transformation
Evaluator.score(original, processed) receives both the input and output dicts and returns a dict[str, float] of named scores such as f1, mc_accuracy, or pass_at_1.
Stage details
Dataset
A dataset is any Iterable[dict[str, Any]]. The minimum required keys are:
| Key | Type | Description |
|---|---|---|
| id | str \| int | Unique identifier for the example |
| context | str | The context text passed to the system |
The optional keys question, answer, choices, and dataset are used by specific evaluators.
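For instance, a minimal in-memory dataset is just a list of dicts carrying the required keys; the question/answer fields below are illustrative values, not a real built-in dataset:

```python
# A minimal in-memory dataset: any Iterable[dict] with "id" and "context".
dataset = [
    {
        "id": "ex-001",
        "context": "The Eiffel Tower was completed in 1889.",
        "question": "When was the Eiffel Tower completed?",  # consumed by QA evaluators
        "answer": "1889",
    },
    {
        "id": "ex-002",
        "context": "Water boils at 100 degrees Celsius at sea level.",
        "question": "At what temperature does water boil at sea level?",
        "answer": "100 degrees Celsius",
    },
]
```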
System
A system implements .name (a string property) and .process(example: dict) -> dict. The returned dict should include a "response" key with the system's output. The runner passes the full example dict so the system can access any field: context, question, prior turns, etc.
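A minimal sketch of a class satisfying this interface; the truncation behavior is a made-up example system, not something context-bench ships:

```python
from typing import Any

class TruncatingSystem:
    """Illustrative system: truncates the context to a fixed character budget."""

    def __init__(self, max_chars: int = 1000) -> None:
        self.max_chars = max_chars

    @property
    def name(self) -> str:
        return f"truncate-{self.max_chars}"

    def process(self, example: dict[str, Any]) -> dict[str, Any]:
        # Return a new dict and include a "response" key with the system's output.
        out = dict(example)
        out["response"] = example["context"][: self.max_chars]
        return out
```

Because the runner receives the whole example dict, a more realistic system could also read question or prior-turn fields before deciding what to keep.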
Evaluator
An evaluator implements .name and .score(original, processed) -> dict[str, float]. It receives both the original example (before the system ran) and the processed dict (after). Evaluators are auto-wired based on which datasets you select — you do not configure them manually.
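Since evaluators are auto-wired, you will rarely write one yourself; the sketch below only shows the shape of the interface, with an exact-match rule chosen for illustration:

```python
from typing import Any

class ExactMatchEvaluator:
    """Illustrative evaluator: 1.0 when the response equals the gold answer."""

    @property
    def name(self) -> str:
        return "exact_match"

    def score(
        self, original: dict[str, Any], processed: dict[str, Any]
    ) -> dict[str, float]:
        # Compare the system's "response" against the example's "answer".
        gold = str(original.get("answer", "")).strip().lower()
        pred = str(processed.get("response", "")).strip().lower()
        return {"exact_match": 1.0 if gold and gold == pred else 0.0}
```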
Metric
A metric implements .name and .compute(rows: list[EvalRow]) -> dict[str, float]. It receives all rows for a single system and returns aggregate statistics. Multiple metrics can run on the same rows.
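A sketch of a metric that averages every named score, assuming each EvalRow exposes a scores: dict[str, float] attribute (the real EvalRow fields may differ):

```python
from typing import Any

class MeanMetric:
    """Illustrative metric: averages every named score across rows.

    Assumes each row has a `scores: dict[str, float]` attribute.
    """

    @property
    def name(self) -> str:
        return "mean"

    def compute(self, rows: list[Any]) -> dict[str, float]:
        totals: dict[str, float] = {}
        counts: dict[str, int] = {}
        for row in rows:
            for key, value in row.scores.items():
                totals[key] = totals.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
        # One aggregate value per score name seen in any row.
        return {key: totals[key] / counts[key] for key in totals}
```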
Result data structures
EvalRow
Each (system, example) pair produces one EvalRow:
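The exact field set lives in the library; the dataclass below is a hypothetical shape for orientation only:

```python
from dataclasses import dataclass, field

@dataclass
class EvalRow:
    """Hypothetical sketch of one evaluation row; consult the real definition."""
    system: str                                           # System.name
    example_id: str                                       # the example's "id"
    scores: dict[str, float] = field(default_factory=dict)  # evaluator output
```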
EvalResult
evaluate() returns an EvalResult that collects all rows and summary statistics:
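Again as a hypothetical shape only, the container could look roughly like this (field names are assumptions, not the library's actual declaration):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvalResult:
    """Hypothetical sketch of the result container; consult the real definition."""
    rows: list[Any] = field(default_factory=list)  # one EvalRow per (system, example)
    summary: dict[str, dict[str, float]] = field(
        default_factory=dict
    )  # system name -> aggregated metric outputs
```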
Data flow in code
The core evaluate() function in runner.py wires all four stages together:
evaluate() materializes the dataset, iterates over each system, calls _process_example() for every (system, example) pair, collects EvalRow objects, then runs each metric over the rows for that system.
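That loop can be sketched roughly as follows. This is simplified illustrative code, not the actual runner.py source; _process_example is reduced to inline calls and rows are plain dicts:

```python
from typing import Any, Iterable

def evaluate_sketch(
    dataset: Iterable[dict[str, Any]],
    systems: list[Any],
    evaluators: list[Any],
    metrics: list[Any],
) -> dict[str, Any]:
    """Simplified re-implementation of the evaluate() flow, for illustration."""
    examples = list(dataset)  # materialize the dataset once
    rows_by_system: dict[str, list[dict[str, Any]]] = {}
    summary: dict[str, dict[str, float]] = {}
    for system in systems:
        rows: list[dict[str, Any]] = []
        for example in examples:
            processed = system.process(example)            # stage 2: transform
            scores: dict[str, float] = {}
            for evaluator in evaluators:                   # stage 3: score
                scores.update(evaluator.score(example, processed))
            rows.append({"system": system.name, "id": example["id"], "scores": scores})
        rows_by_system[system.name] = rows
        agg: dict[str, float] = {}
        for metric in metrics:                             # stage 4: aggregate
            agg.update(metric.compute(rows))
        summary[system.name] = agg
    return {"rows": rows_by_system, "summary": summary}
```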
Registry
The Registry is a name-based plugin system for datasets, metrics, and reporters. It lets you register components by name and look them up in config-driven workflows:
The CLI resolves --dataset names through this registry. Registering your own loader makes it available as a CLI flag.
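A hypothetical sketch of how such a name-based registry behaves; the class and its register/get methods are illustrative, not the real Registry API:

```python
from typing import Any, Callable

class RegistrySketch:
    """Minimal name-based registry; the real Registry API may differ."""

    def __init__(self) -> None:
        self._items: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, factory: Callable[..., Any]) -> None:
        self._items[name] = factory

    def get(self, name: str) -> Callable[..., Any]:
        return self._items[name]

# Register a custom dataset loader under a name, then look it up by that name.
datasets = RegistrySketch()
datasets.register("my_jsonl", lambda path: [{"id": "1", "context": "stub"}])
loader = datasets.get("my_jsonl")
```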
Protocol-based design
All four interfaces (System, Evaluator, Metric, and MemorySystem) are typing.Protocol classes. This means:
- You implement the required methods on any class you choose.
- You never subclass a context-bench base class.
- Structural typing applies: if your object has the right methods, it works.
Duck typing is intentional. context-bench does not use abstract base classes or inheritance. Implement the methods, not a class hierarchy.
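To see structural typing in action, here is a Protocol analogous to the System interface; the Protocol declaration below is a stand-in for illustration, not the library's actual source:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class SystemProtocol(Protocol):
    """Stand-in for context-bench's System protocol."""

    @property
    def name(self) -> str: ...

    def process(self, example: dict[str, Any]) -> dict[str, Any]: ...

class Identity:
    # No inheritance from SystemProtocol: having the right methods is enough.
    @property
    def name(self) -> str:
        return "identity"

    def process(self, example: dict[str, Any]) -> dict[str, Any]:
        return {**example, "response": example["context"]}

# Structural check passes because Identity has the required members.
assert isinstance(Identity(), SystemProtocol)
```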
