Signature
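The signature itself is missing from this page. Based on the parameter descriptions below, it plausibly reads as the following stub. Only `dataset`, `evaluators`, `max_workers`, and `cache_dir` are named elsewhere on this page; the other parameter names and the defaults are inferred and may differ in the actual library:

```python
def evaluate(
    systems,            # list of System objects (.name, .process(example) -> dict)
    dataset,            # iterable of example dicts ("id", "context", ...)
    evaluators,         # list of Evaluator objects (.name, .score(original, processed))
    metrics=None,       # optional list of Metric objects; None skips aggregation
    limit=None,         # cap on number of examples; None evaluates all
    progress=True,      # show progress output during evaluation
    token_fields=None,  # dict keys to count tokens for; None counts all string values
    max_workers=None,   # thread count; None or 1 processes sequentially
    cache_dir=None,     # directory for EvalRow result caching
):
    ...
```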
Parameters
- `systems`: A list of objects implementing the System protocol. Each system must expose a `.name` string property and a `.process(example) -> dict` method. For multi-turn datasets, systems may also implement `.process_conversation(turns) -> list[dict]`.
- `dataset`: Any iterable of example dicts. Each dict should have an `"id"` key for stable cache lookups and a `"context"` key containing the text to process. Additional keys (`"question"`, `"answer"`, `"turns"`, etc.) are passed through to evaluators unchanged.
- `evaluators`: A list of objects implementing the Evaluator protocol. Each evaluator must expose a `.name` string property and a `.score(original, processed) -> dict[str, float]` method.
- `metrics`: Optional list of objects implementing the Metric protocol. Each metric must expose a `.name` string property and a `.compute(rows) -> dict[str, float]` method. If None, no aggregation metrics are computed and `EvalResult.summary` will contain empty dicts per system.
- `limit`: Limit the number of examples processed. Useful for quick smoke tests. If None, all examples in `dataset` are evaluated.
- `progress`: Show progress during evaluation. Uses `rich` progress bars when the package is installed, otherwise falls back to plain stderr output every 10 examples.
- `token_fields`: List of dict keys to count tokens for. If None, tokens are counted across all string values in the example dict. Pass a specific list (e.g. `["context", "response"]`) to restrict token counting to those fields.
- `max_workers`: Maximum number of concurrent threads for processing examples. If None or 1, examples are processed sequentially. Values greater than 1 use a `ThreadPoolExecutor`. Concurrent mode is recommended when your system makes HTTP requests and latency is the bottleneck.
- `cache_dir`: Directory for result caching. When provided, completed `EvalRow` results are saved to disk and reused on subsequent runs with the same configuration. This enables resuming interrupted evaluations without re-running completed examples.
Returns
An `EvalResult` containing all per-row scores and summary statistics.
Example
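A minimal end-to-end sketch. Since this page does not show `evaluate()` itself, the loop below inlines the contract it describes (each system processes each example, each evaluator scores the original against the processed output); the class names and the `"ratio"` score key are illustrative:

```python
class IdentitySystem:
    """Minimal System: exposes .name and .process(example) -> dict."""
    name = "identity"

    def process(self, example):
        # Pass the context through unchanged.
        return {"context": example["context"]}

class LengthRatio:
    """Minimal Evaluator: exposes .name and .score(original, processed)."""
    name = "length_ratio"

    def score(self, original, processed):
        return {"ratio": len(processed["context"]) / len(original["context"])}

dataset = [
    {"id": "ex-0", "context": "The quick brown fox."},
    {"id": "ex-1", "context": "Jumps over the lazy dog."},
]

# Stand-in for the per-row loop evaluate() runs internally.
rows = []
for example in dataset:
    processed = IdentitySystem().process(example)
    scores = LengthRatio().score(example, processed)
    rows.append({"id": example["id"], "system": "identity", **scores})
# The identity system leaves the text untouched, so every ratio is 1.0.
```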
Comparing multiple systems
Pass more than one system to get a side-by-side comparison in a single call.

Custom system
Any class that exposes `.name` and `.process()` works; no subclassing required:
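For example (a hypothetical system; the class and its behavior are illustrative, only the `.name`/`.process()` contract comes from the protocol described above):

```python
class FirstSentenceSystem:
    """A plain class: only .name and .process() are required."""
    name = "first-sentence"

    def process(self, example):
        # Keep only the text up to and including the first period.
        text = example["context"]
        return {"context": text.split(".")[0] + "."}

example = {"id": "ex-0", "context": "Keep this. Drop the rest."}
assert FirstSentenceSystem().process(example) == {"context": "Keep this."}
```

An instance can then be passed in the systems list exactly like a built-in system, including side by side with others for comparison.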
Auto-wiring evaluators in the CLI
When you use the `context-bench` CLI, evaluators are auto-wired based on the datasets you select; you do not need to specify them manually:
| Evaluator | Auto-wired for |
|---|---|
| `AnswerQuality` | All datasets |
| `SummarizationQuality` | Summarization datasets |
| `MultipleChoiceAccuracy` | MMLU, ARC, GPQA, HellaSwag, WinoGrande, MMLU-Pro |
| `CodeExecution` | HumanEval, MBPP |
| `MathEquivalence` | MATH, GSM8K, MGSM |
| `NLILabelMatch` | ContractNLI, SciFact |
| `IFEvalChecker` | IFEval |
| `LLMJudge` | Any dataset via `--judge-url` |
When calling `evaluate()` directly from Python, pass evaluators explicitly via the `evaluators` argument.
evaluate_memory()
For systems implementing the `MemorySystem` protocol (ingest + query loop), use `evaluate_memory` instead:

- `system.reset()`: clear state
- `system.ingest(example.items)`: load conversation turns or document chunks
- For each query: `system.query(query.question)` is scored against the ground-truth answer
`evaluate_memory` accepts both the typed `BenchmarkExample` format and the legacy dict format with `"turns"` and `"qa_pairs"` keys. It returns the same `EvalResult` as `evaluate()`.

`evaluate_memory` does not support `max_workers` or `cache_dir`. For high-throughput memory evaluation, run multiple systems sequentially and combine results manually.
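The reset/ingest/query loop above can be sketched with a toy implementation. The class and its retrieval heuristic are illustrative assumptions, not part of the library; only the three method names come from the protocol:

```python
class RecencyMemory:
    """Toy MemorySystem: stores items verbatim, answers by word overlap."""

    def reset(self):
        # Clear all stored state before a new example.
        self.items = []

    def ingest(self, items):
        # Load conversation turns or document chunks.
        self.items.extend(items)

    def query(self, question):
        # Return the most recently ingested item sharing a word with the question.
        words = question.lower().split()
        for item in reversed(self.items):
            if any(word in item for word in words):
                return item
        return ""

memory = RecencyMemory()
memory.reset()
memory.ingest(["my cat is named mochi", "i moved to berlin in 2021"])
assert "mochi" in memory.query("what is the cat named?")
```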