context-bench

context-bench runs one or more OpenAI-compatible proxies against one or more datasets and reports quality, compression, cost, and latency metrics in a comparison table.
Usage
Flags
OpenAI-compatible proxy URL to benchmark. Repeatable — pass multiple --proxy flags to compare systems side by side.

Display name for the corresponding --proxy in the results table. Repeatable, paired positionally with each --proxy. When omitted, the hostname is extracted from the URL automatically.

Dataset to benchmark against. Repeatable — pass multiple --dataset flags to run across several datasets in one command. Accepts a known dataset name or a path to a local .jsonl file. See built-in datasets for the full list of names.

Model name passed through to the proxy in every request.

Maximum number of examples to evaluate per dataset. Useful for quick smoke tests.

Output format. One of table, json, or html.
    table - markdown table printed to stdout
    json - JSON object with full per-example results
    html - self-contained HTML report (redirect to a file with > report.html)

Score field from AnswerQuality to use as the primary metric for MeanScore, PassRate, and CostOfPass. Common values: f1, exact_match, recall, contains, mc_accuracy, pass_at_1, math_equiv.

Score threshold for PassRate and CostOfPass. An example is considered a "pass" when its score meets or exceeds this value.

OpenAI-compatible URL for LLM-as-judge evaluation. When provided, an LLMJudge evaluator is added that rates responses 1-5 (normalized to 0-1 as judge_score). Recommended for open-ended generation tasks such as alpaca-eval and mt-bench.

Model name for the LLM judge when --judge-url is set.

Number of concurrent threads used to call the proxy. When omitted, examples are evaluated one at a time. Increase this value for faster runs when your proxy supports concurrent requests.

Directory for result caching. When set, completed rows are written to disk after each example so an interrupted run can resume without re-evaluating completed examples.
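As a minimal sketch of how the summary metrics relate to the score threshold: MeanScore averages the chosen score field, PassRate is the fraction of examples at or above the threshold, and CostOfPass is taken here as mean cost divided by pass rate. The Row type, field names, and the CostOfPass formula are illustrative assumptions, not context-bench internals.

```python
from dataclasses import dataclass

@dataclass
class Row:
    score: float  # value of the chosen score field, e.g. f1
    cost: float   # dollar cost of the request

def summarize(rows, threshold=0.5):
    """Aggregate per-example rows into the three headline metrics."""
    n = len(rows)
    mean_score = sum(r.score for r in rows) / n
    # An example "passes" when its score meets or exceeds the threshold.
    pass_rate = sum(r.score >= threshold for r in rows) / n
    mean_cost = sum(r.cost for r in rows) / n
    # Assumed definition: expected spend per passing example.
    cost_of_pass = mean_cost / pass_rate if pass_rate else float("inf")
    return {"MeanScore": mean_score, "PassRate": pass_rate, "CostOfPass": cost_of_pass}
```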
Examples
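One illustrative invocation, comparing two proxies on a single dataset; the URLs and the dataset name are placeholders, and only the flags named above are used:

```shell
# Compare two proxies side by side on one dataset (placeholder values)
context-bench \
  --proxy http://localhost:8000/v1 \
  --proxy http://localhost:8001/v1 \
  --dataset squad
```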
Example output
Built-in dataset names
Pass any of these names to --dataset. Datasets are loaded from HuggingFace and require pip install -e ".[datasets]" (or uv sync --extra datasets).
Multi-config datasets
Some datasets accept a :config suffix to select a subset: longbench, infinitebench, bbh, mmlu, mgsm.
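Illustratively, a name:config argument splits on the first colon into a dataset name and an optional config. This is an assumption about the accepted syntax, and split_dataset_arg (with the narrativeqa config used below) is a hypothetical helper, not part of context-bench:

```python
def split_dataset_arg(arg: str):
    """Split 'name:config' into (name, config); config is None when absent."""
    name, _, config = arg.partition(":")
    return name, (config or None)
```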
Local .jsonl files are also accepted. Each record must have "id" and "context" keys.
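A minimal sketch of producing such a file; the filename and example records are illustrative, and only the "id" and "context" keys are required by the format described above:

```python
import json

# One JSON object per line, each with the required "id" and "context" keys.
examples = [
    {"id": "ex-1", "context": "The Eiffel Tower is in Paris."},
    {"id": "ex-2", "context": "Water boils at 100 C at sea level."},
]
with open("my_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```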