Quickstart

Benchmark your first proxy in under 5 minutes

CLI Reference

Full reference for all CLI flags and subcommands

Python API

Use the Python API for custom evaluators and metrics

42 Built-in Datasets

QA, reasoning, code, summarization, long context, and more

What is context-bench?

You built (or bought) something that modifies LLM context. Now you need to answer:
  • Does compression destroy information? Measure quality with F1, exact match, and pass rate against ground-truth QA datasets.
  • Is the cost worth it? Track compression ratio and cost-per-successful-completion side by side.
  • Which approach wins? Run multiple systems on the same dataset in one call and get a comparison table.
context-bench gives you a single CLI command (or Python evaluate() call) that runs your system against a dataset, scores every example, and aggregates the results — no boilerplate, no framework lock-in.
```shell
# Benchmark a proxy in one command
context-bench --proxy http://localhost:7878 --dataset hotpotqa -n 50

# Compare two proxies head-to-head
context-bench \
  --proxy http://localhost:7878 --name kompact \
  --proxy http://localhost:8787 --name headroom \
  --dataset hotpotqa -n 50
```

How it works

context-bench follows a four-stage pipeline:
Dataset → System → Evaluator → Metric
  1. Dataset — any Iterable[dict] with "id" and "context" keys (42 built-in, or bring your own JSONL)
  2. System — the thing you’re benchmarking: implements .process(example) -> dict
  3. Evaluator — scores before/after: implements .score(original, processed) -> dict[str, float]
  4. Metric — aggregates scores: implements .compute(rows) -> dict[str, float]
All interfaces are typing.Protocol: implement the matching methods and your class satisfies the interface, with no subclassing required.
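The four stages above can be sketched as Protocols. This is a minimal illustration, not context-bench's source: the class name TruncateSystem is hypothetical, but the method signatures follow the list above.

```python
from typing import Protocol


class System(Protocol):
    def process(self, example: dict) -> dict: ...


class Evaluator(Protocol):
    def score(self, original: dict, processed: dict) -> dict[str, float]: ...


class Metric(Protocol):
    def compute(self, rows: list[dict]) -> dict[str, float]: ...


# A toy system: keep only the first 100 characters of context.
# No subclassing: any object with a matching .process() satisfies System.
class TruncateSystem:
    def process(self, example: dict) -> dict:
        return {**example, "context": example["context"][:100]}


system: System = TruncateSystem()  # type-checks via structural typing
```

Because the interfaces are structural, you can benchmark any existing object that already happens to have the right methods, without importing anything from context-bench.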

Key features

42 built-in datasets

QA, reasoning, code generation, summarization, NLI, instruction following, long context, and agent traces — all ready to use
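If none of the built-ins fit, a custom dataset is just an Iterable[dict] whose rows carry "id" and "context" keys. A sketch of loading one from a JSONL file (the loader name and path are illustrative):

```python
import json
from typing import Iterator


def load_jsonl(path: str) -> Iterator[dict]:
    """Yield one example per line; each row must carry "id" and "context"."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```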

8 auto-wired evaluators

Answer quality (F1/EM), math equivalence, code execution, multiple-choice accuracy, NLI, IFEval, and LLM-as-judge

7 aggregation metrics

Mean score, pass rate, compression ratio, cost-of-pass, latency, per-dataset breakdown, and Pareto ranking
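As an illustration of the Metric interface, here is a sketch of a custom aggregator computing pass rate and cost-of-pass from scored rows. It assumes each row carries "score" and "cost" fields; those field names, the CostOfPass class, and the threshold default are assumptions for this example, not context-bench's built-in implementation.

```python
class CostOfPass:
    """Aggregate per-example rows into pass rate and cost-per-successful-completion."""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold  # a score at or above this counts as a pass

    def compute(self, rows: list[dict]) -> dict[str, float]:
        passes = sum(1 for r in rows if r["score"] >= self.threshold)
        total_cost = sum(r["cost"] for r in rows)
        pass_rate = passes / len(rows) if rows else 0.0
        # cost_of_pass: total spend divided by the number of passing examples
        cost_of_pass = total_cost / passes if passes else float("inf")
        return {"pass_rate": pass_rate, "cost_of_pass": cost_of_pass}
```

Returning float("inf") when nothing passes keeps the metric well-defined and makes failed systems sort last in a comparison table.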

Multi-system comparison

Run multiple systems in a single call and get a Pareto frontier analysis automatically

Resumable runs

Cache results to disk and resume interrupted evaluations without re-running completed examples

Memory system benchmarking

Evaluate stateful memory systems (naive, embedding, Mem0, Zep) on LoCoMo and LongMemEval

Example output

```shell
$ context-bench \
    --proxy http://localhost:9091 --name Baseline \
    --proxy http://localhost:7878 --name Kompact \
    --proxy http://localhost:7879 --name Headroom \
    --dataset bfcl --model haiku --score-field contains

# Evaluation Results

| System   | mean_score | pass_rate | compression_ratio | cost_of_pass |
|----------|------------|-----------|-------------------|--------------|
| Baseline | 0.2930     | 0.2930    | -0.3264           | 4,291        |
| Kompact  | 0.3640     | 0.3640    | -0.1345           | 2,447        |
| Headroom | 0.3140     | 0.3140    | -0.1793           | 3,815        |

*1,431 examples evaluated*
```