Quickstart
Benchmark your first proxy in under 5 minutes
CLI Reference
Full reference for all CLI flags and subcommands
Python API
Use the Python API for custom evaluators and metrics
42 Built-in Datasets
QA, reasoning, code, summarization, long context, and more
What is context-bench?
You built (or bought) something that modifies LLM context. Now you need to answer:
- Does compression destroy information? Measure quality with F1, exact match, and pass rate against ground-truth QA datasets.
- Is the cost worth it? Track compression ratio and cost-per-successful-completion side by side.
- Which approach wins? Run multiple systems on the same dataset in one call and get a comparison table.
context-bench answers these with a single entry point (one `evaluate()` call) that runs your system against a dataset, scores every example, and aggregates the results — no boilerplate, no framework lock-in.
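The run-score-aggregate flow can be sketched in plain Python. This is a toy illustration of the idea, not the library's real API: the `evaluate()` signature and the class names here are assumptions.

```python
# Hypothetical sketch of the evaluate() flow; names and signatures are
# illustrative, not context-bench's confirmed API.
def evaluate(dataset, system, evaluator, metric):
    rows = []
    for example in dataset:
        processed = system.process(example)               # run the system
        rows.append(evaluator.score(example, processed))  # score each example
    return metric.compute(rows)                           # aggregate results

class TruncateSystem:
    """Toy system: keep only the first 100 characters of context."""
    def process(self, example: dict) -> dict:
        return {**example, "context": example["context"][:100]}

class LengthEvaluator:
    """Toy evaluator: ratio of processed to original context length."""
    def score(self, original: dict, processed: dict) -> dict:
        return {"compression": len(processed["context"]) / len(original["context"])}

class MeanMetric:
    """Toy metric: mean compression ratio across all scored rows."""
    def compute(self, rows: list) -> dict:
        return {"mean_compression": sum(r["compression"] for r in rows) / len(rows)}

dataset = [{"id": "1", "context": "x" * 200}, {"id": "2", "context": "y" * 50}]
result = evaluate(dataset, TruncateSystem(), LengthEvaluator(), MeanMetric())
print(result)  # {'mean_compression': 0.75}
```

The point of the single entry point is that swapping systems, evaluators, or metrics changes one argument, not your harness code.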
How it works
context-bench follows a four-stage pipeline:
- Dataset — any `Iterable[dict]` with `"id"` and `"context"` keys (42 built-in, or bring your own JSONL)
- System — the thing you're benchmarking: implements `.process(example) -> dict`
- Evaluator — scores before/after: implements `.score(original, processed) -> dict[str, float]`
- Metric — aggregates scores: implements `.compute(rows) -> dict[str, float]`

Each stage is a `typing.Protocol` — implement the methods, never subclass.
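To make the duck-typing concrete, here is a minimal sketch of how a `typing.Protocol` contract works. The `System` protocol below is an assumption modeled on the signatures above; context-bench's actual protocol definitions may differ.

```python
from typing import Protocol, runtime_checkable

# Hypothetical protocol modeled on the .process() signature described above.
@runtime_checkable
class System(Protocol):
    def process(self, example: dict) -> dict: ...

class UpperCaser:
    # No subclassing: providing a .process() with the right shape is enough.
    def process(self, example: dict) -> dict:
        return {**example, "context": example["context"].upper()}

system = UpperCaser()
print(isinstance(system, System))  # True: structural check, no inheritance
print(system.process({"id": "1", "context": "hello"})["context"])  # HELLO
```

Because the check is structural, any object with a matching `.process()` method satisfies the protocol, including classes you cannot modify.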
Key features
42 built-in datasets
QA, reasoning, code generation, summarization, NLI, instruction following, long context, and agent traces — all ready to use
8 auto-wired evaluators
Answer quality (F1/EM), math equivalence, code execution, multiple-choice accuracy, NLI, IFEval, and LLM-as-judge
7 aggregation metrics
Mean score, pass rate, compression ratio, cost-of-pass, latency, per-dataset breakdown, and Pareto ranking
Multi-system comparison
Run multiple systems in a single call and get a Pareto frontier analysis automatically
Resumable runs
Cache results to disk and resume interrupted evaluations without re-running completed examples
Memory system benchmarking
Evaluate stateful memory systems (naive, embedding, Mem0, Zep) on LoCoMo and LongMemEval
