Quickstart
Benchmark your first proxy in under 5 minutes
CLI reference
Full reference for all CLI flags and subcommands
Python API
Use the Python API for custom evaluators and metrics
42 built-in datasets
QA, reasoning, code, summarization, long context, and more
The four-stage pipeline
Every context-bench evaluation follows the same four stages:

- Dataset — any `Iterable[dict]` with `"id"` and `"context"` keys. Use one of the 42 built-in loaders or pass your own JSONL file.
- System — the thing you are benchmarking. Implements `.name` and `.process(example) -> dict`.
- Evaluator — compares the original example to the processed output. Implements `.name` and `.score(original, processed) -> dict[str, float]`.
- Metric — aggregates per-example scores into a summary. Implements `.name` and `.compute(rows) -> dict[str, float]`.

All interfaces are `typing.Protocol`. Implement the required methods on any class — never subclass context-bench internals.

Key features
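The System and Evaluator stages above can be sketched as plain classes that satisfy the documented protocols. The `TruncateSystem` and `RetentionEvaluator` names below are illustrative toys, not context-bench built-ins; only the `.name` / `.process` / `.score` signatures come from the docs.

```python
from typing import Protocol

# The System and Evaluator interfaces, written as structural protocols
# (a sketch matching the signatures in the docs above).
class System(Protocol):
    name: str
    def process(self, example: dict) -> dict: ...

class Evaluator(Protocol):
    name: str
    def score(self, original: dict, processed: dict) -> dict[str, float]: ...

# A toy system that truncates context to 100 characters.
class TruncateSystem:
    name = "truncate-100"
    def process(self, example: dict) -> dict:
        out = dict(example)
        out["context"] = example["context"][:100]
        return out

# A toy evaluator reporting the fraction of characters retained.
class RetentionEvaluator:
    name = "retention"
    def score(self, original: dict, processed: dict) -> dict[str, float]:
        return {"retained": len(processed["context"]) / max(len(original["context"]), 1)}

example = {"id": "ex-1", "context": "x" * 250}
system: System = TruncateSystem()        # satisfies the Protocol structurally
evaluator: Evaluator = RetentionEvaluator()
row = evaluator.score(example, system.process(example))
print(row)  # {'retained': 0.4}
```

Because the interfaces are `typing.Protocol`, no inheritance is involved: any class with matching members type-checks as a `System` or `Evaluator`.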
42 built-in datasets
QA, reasoning, code generation, summarization, NLI, instruction following, long context, and agent traces — all ready to use out of the box
8 auto-wired evaluators
Evaluators are selected automatically based on the datasets you run — no manual wiring needed
7 aggregation metrics
Mean score, pass rate, compression ratio, cost-of-pass, latency, per-dataset breakdown, and Pareto ranking
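A custom aggregation fits the same Metric shape (`.name` and `.compute(rows) -> dict[str, float]`). Here is an illustrative metric combining two of the aggregations listed above, mean score and pass rate; the class name and the 0.5 pass threshold are assumptions for the sketch, not context-bench identifiers.

```python
from statistics import mean

# Sketch of a Metric: aggregates per-example score rows into a summary.
class MeanAndPassRate:
    name = "mean_and_pass_rate"

    def __init__(self, threshold: float = 0.5):
        # An example passes if its score reaches the threshold (assumed 0.5).
        self.threshold = threshold

    def compute(self, rows: list[dict[str, float]]) -> dict[str, float]:
        scores = [r["score"] for r in rows]
        return {
            "mean_score": mean(scores),
            "pass_rate": sum(s >= self.threshold for s in scores) / len(scores),
        }

rows = [{"score": 0.2}, {"score": 0.7}, {"score": 0.9}]
summary = MeanAndPassRate().compute(rows)
print(summary)
```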
Multi-system comparison
Run multiple systems in a single call and get a Pareto frontier analysis automatically
Resumable runs
Cache results to disk and resume interrupted evaluations without repeating completed examples
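The resume pattern described above can be sketched generically: completed example ids are appended to a JSONL cache on disk, and a restarted run skips anything already present. This is an illustration of the idea only; the `run_resumable` helper and the cache format are my own, and context-bench's actual cache layout may differ.

```python
import json
import tempfile
from pathlib import Path

def run_resumable(examples, process, cache_path: Path) -> dict[str, dict]:
    """Process examples, skipping ids already recorded in the JSONL cache."""
    done: dict[str, dict] = {}
    if cache_path.exists():
        for line in cache_path.read_text().splitlines():
            row = json.loads(line)
            done[row["id"]] = row
    with cache_path.open("a") as f:
        for ex in examples:
            if ex["id"] in done:
                continue  # completed in a previous (interrupted) run
            row = {"id": ex["id"], **process(ex)}
            f.write(json.dumps(row) + "\n")  # append so partial runs persist
            done[ex["id"]] = row
    return done

# Simulate an interrupted run followed by a resumed run with one new example.
calls = []
def process(ex):
    calls.append(ex["id"])
    return {"out": ex["context"].upper()}

with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "cache.jsonl"
    first = [{"id": "a", "context": "x"}, {"id": "b", "context": "y"}]
    run_resumable(first, process, cache)
    results = run_resumable(first + [{"id": "c", "context": "z"}], process, cache)

print(calls)  # ['a', 'b', 'c'] — 'a' and 'b' were not reprocessed
```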
Memory benchmarking
Evaluate stateful memory systems (naive, embedding, Mem0, Zep) on LoCoMo and LongMemEval
DSPy optimizer sweep
Run a factorial sweep of DSPy optimizers across datasets, tracking compile cost and inference scores
