Start the proxy

Start the proxy server and note the port it listens on. Wait until the proxy logs that it is ready before proceeding.
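The exact launch command depends on how the proxy is packaged; as a hypothetical sketch (the subcommand name and port flag here are assumptions, not documented options), it might look like:

```shell
# Hypothetical invocation -- substitute your actual proxy entry point and port.
context-bench proxy --port 8000
```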
Run context-bench
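A typical invocation, sketched with placeholder values (the proxy URL and dataset name are assumptions, not values from this guide):

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset -n 50
```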
Point --proxy at the running server and choose a dataset with --dataset. The -n 50 flag limits the run to 50 examples, useful for a quick sanity check before committing to the full dataset.

Read the output table
When the run finishes, context-bench prints a summary table:
| Column | What it measures |
|---|---|
| mean_score | Average score across all examples (default: F1, configurable with --score-field) |
| pass_rate | Fraction of examples scoring above the threshold (default: 0.7, set with --threshold) |
| compression_ratio | 1 - (output_tokens / input_tokens). Positive means the proxy reduced context size. |
| cost_of_pass | Total tokens spent per successful completion. Lower is better. |
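As a small illustration of how the score- and token-based columns combine (the formulas follow the table's definitions; the function names and example numbers are mine, not part of context-bench):

```python
def compression_ratio(input_tokens: int, output_tokens: int) -> float:
    """1 - (output_tokens / input_tokens); positive when the context shrank."""
    return 1 - output_tokens / input_tokens

def pass_rate(scores: list[float], threshold: float = 0.7) -> float:
    """Fraction of examples scoring above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# A proxy that turns 2000 input tokens into 500 compressed the context by 75%.
print(compression_ratio(2000, 500))  # -> 0.75
print(pass_rate([0.9, 0.6, 0.8]))    # 2 of 3 examples exceed 0.7
```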
Python API
For full control over evaluators, metrics, and export, use OpenAIProxy and evaluate() directly.
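The overall shape might look like the sketch below; only the names OpenAIProxy and evaluate come from this guide, so the import path and all arguments are illustrative placeholders, not the documented interface:

```python
# Illustrative sketch only -- the import path and arguments are assumptions.
from context_bench import OpenAIProxy, evaluate

proxy = OpenAIProxy(...)   # point at your running proxy server
results = evaluate(...)    # run the dataset and collect per-example rows
```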
OpenAIProxy constructor options
Export results
Caching and resumable runs
Pass --cache-dir (or cache_dir= in Python) to write completed rows to disk. On a subsequent run with the same arguments, already-completed rows are loaded from cache and skipped, so the run picks up exactly where it left off.
The cache key is derived from the system name, dataset name, example ID, and evaluator list. Changing any of these will cause a cache miss for affected rows.
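The cache-key recipe above can be sketched as follows; the hashing scheme and field order are my assumption for illustration, not context-bench's actual implementation:

```python
import hashlib

def cache_key(system: str, dataset: str, example_id: str, evaluators: list[str]) -> str:
    """Hash the four fields the docs say determine cache identity."""
    material = "|".join([system, dataset, example_id, *sorted(evaluators)])
    return hashlib.sha256(material.encode()).hexdigest()

# Changing any field (here, the evaluator list) changes the key -> cache miss.
a = cache_key("my-proxy", "qa-set", "ex-001", ["f1"])
b = cache_key("my-proxy", "qa-set", "ex-001", ["f1", "judge"])
print(a != b)  # -> True
```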
Cookbook
Smoke test (10 examples)
Run a quick sanity check before committing to a full evaluation:
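Concretely, a smoke-test run could look like this (the URL and dataset name are placeholders):

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset -n 10
```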
Full evaluation with LLM judge
Add --judge-url to score responses with an external LLM on a 1–5 scale (normalized to 0–1 as judge_score).
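A judged run might look like the following; the judge endpoint, proxy URL, and dataset name are placeholders, not values from this guide:

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset \
  --judge-url http://localhost:9000/v1
```

The resulting judge_score column holds the normalized 0–1 value rather than the raw 1–5 rating.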
Resume an interrupted run
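Using the --cache-dir flag described above (the path, URL, and dataset name below are placeholders), rerun the identical command after an interruption and completed rows are skipped:

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset \
  --cache-dir ./bench-cache
```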
