Quickstart
Benchmark your first proxy in under 5 minutes
CLI reference
Full reference for all CLI flags and subcommands
Python API
Use the Python API for custom evaluators and metrics
42 built-in datasets
QA, reasoning, code, summarization, long context, and more
The four-stage pipeline
Every context-bench evaluation follows the same four stages:

- Dataset — any `Iterable[dict]` with `"id"` and `"context"` keys. Use one of the 42 built-in loaders or pass your own JSONL file.
- System — the thing you are benchmarking. Implements `.name` and `.process(example) -> dict`.
- Evaluator — compares the original example to the processed output. Implements `.name` and `.score(original, processed) -> dict[str, float]`.
- Metric — aggregates per-example scores into a summary. Implements `.name` and `.compute(rows) -> dict[str, float]`.

All interfaces are `typing.Protocol`. Implement the required methods on any class — never subclass context-bench internals.

Key features
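The System and Evaluator stages above can be sketched as plain classes that satisfy the documented protocols. The `TruncateSystem` and `RetentionEvaluator` names below are illustrative toys, not context-bench built-ins; only the `.name` / `.process` / `.score` signatures come from the docs.

```python
from typing import Protocol

# The System and Evaluator interfaces, written as structural protocols
# (a sketch matching the signatures in the docs above).
class System(Protocol):
    name: str
    def process(self, example: dict) -> dict: ...

class Evaluator(Protocol):
    name: str
    def score(self, original: dict, processed: dict) -> dict[str, float]: ...

# A toy system that truncates context to 100 characters.
class TruncateSystem:
    name = "truncate-100"
    def process(self, example: dict) -> dict:
        out = dict(example)
        out["context"] = example["context"][:100]
        return out

# A toy evaluator reporting the fraction of characters retained.
class RetentionEvaluator:
    name = "retention"
    def score(self, original: dict, processed: dict) -> dict[str, float]:
        return {"retained": len(processed["context"]) / max(len(original["context"]), 1)}

example = {"id": "ex-1", "context": "x" * 250}
system: System = TruncateSystem()        # satisfies the Protocol structurally
evaluator: Evaluator = RetentionEvaluator()
row = evaluator.score(example, system.process(example))
print(row)  # {'retained': 0.4}
```

Because the interfaces are `typing.Protocol`, no inheritance is involved: any class with matching members type-checks as a `System` or `Evaluator`.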
42 built-in datasets
QA, reasoning, code generation, summarization, NLI, instruction following, long context, and agent traces — all ready to use out of the box
8 auto-wired evaluators
Evaluators are selected automatically based on the datasets you run — no manual wiring needed
7 aggregation metrics
Mean score, pass rate, compression ratio, cost-of-pass, latency, per-dataset breakdown, and Pareto ranking
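A custom aggregation fits the same Metric shape (`.name` and `.compute(rows) -> dict[str, float]`). Here is an illustrative metric combining two of the aggregations listed above, mean score and pass rate; the class name and the 0.5 pass threshold are assumptions for the sketch, not context-bench identifiers.

```python
from statistics import mean

# Sketch of a Metric: aggregates per-example score rows into a summary.
class MeanAndPassRate:
    name = "mean_and_pass_rate"

    def __init__(self, threshold: float = 0.5):
        # An example passes if its score reaches the threshold (assumed 0.5).
        self.threshold = threshold

    def compute(self, rows: list[dict[str, float]]) -> dict[str, float]:
        scores = [r["score"] for r in rows]
        return {
            "mean_score": mean(scores),
            "pass_rate": sum(s >= self.threshold for s in scores) / len(scores),
        }

rows = [{"score": 0.2}, {"score": 0.7}, {"score": 0.9}]
summary = MeanAndPassRate().compute(rows)
print(summary)
```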
Multi-system comparison
Run multiple systems in a single call and get a Pareto frontier analysis automatically
Resumable runs
Cache results to disk and resume interrupted evaluations without repeating completed examples
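The resume pattern described above can be sketched generically: completed example ids are appended to a JSONL cache on disk, and a restarted run skips anything already present. This is an illustration of the idea only; the `run_resumable` helper and the cache format are my own, and context-bench's actual cache layout may differ.

```python
import json
import tempfile
from pathlib import Path

def run_resumable(examples, process, cache_path: Path) -> dict[str, dict]:
    """Process examples, skipping ids already recorded in the JSONL cache."""
    done: dict[str, dict] = {}
    if cache_path.exists():
        for line in cache_path.read_text().splitlines():
            row = json.loads(line)
            done[row["id"]] = row
    with cache_path.open("a") as f:
        for ex in examples:
            if ex["id"] in done:
                continue  # completed in a previous (interrupted) run
            row = {"id": ex["id"], **process(ex)}
            f.write(json.dumps(row) + "\n")  # append so partial runs persist
            done[ex["id"]] = row
    return done

# Simulate an interrupted run followed by a resumed run with one new example.
calls = []
def process(ex):
    calls.append(ex["id"])
    return {"out": ex["context"].upper()}

with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "cache.jsonl"
    first = [{"id": "a", "context": "x"}, {"id": "b", "context": "y"}]
    run_resumable(first, process, cache)
    results = run_resumable(first + [{"id": "c", "context": "z"}], process, cache)

print(calls)  # ['a', 'b', 'c'] — 'a' and 'b' were not reprocessed
```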
Memory benchmarking
Evaluate stateful memory systems (naive, embedding, Mem0, Zep) on LoCoMo and LongMemEval
DSPy optimizer sweep
Run a factorial sweep of DSPy optimizers across datasets, tracking compile cost and inference scores
