context-bench

context-bench runs one or more OpenAI-compatible proxies against one or more datasets and reports quality, compression, cost, and latency metrics in a comparison table.
Usage
Flags
OpenAI-compatible proxy URL to benchmark. Repeatable — pass multiple --proxy flags to compare systems side by side.

Display name for the corresponding --proxy in the results table. Repeatable, paired positionally with each --proxy. When omitted, the hostname is extracted from the URL automatically.

Dataset to benchmark against. Repeatable — pass multiple --dataset flags to run across several datasets in one command. Accepts a known dataset name or a path to a local .jsonl file. See built-in datasets for the full list of names.

Model name passed through to the proxy in every request.

Maximum number of examples to evaluate per dataset. Useful for quick smoke tests.

Output format. One of table, json, or html.
    table - markdown table printed to stdout
    json - JSON object with full per-example results
    html - self-contained HTML report (redirect to a file with > report.html)

Score field from AnswerQuality to use as the primary metric for MeanScore, PassRate, and CostOfPass. Common values: f1, exact_match, recall, contains, mc_accuracy, pass_at_1, math_equiv.

Score threshold for PassRate and CostOfPass. An example is considered a "pass" when its score meets or exceeds this value.

OpenAI-compatible URL for LLM-as-judge evaluation. When provided, an LLMJudge evaluator is added that rates responses 1-5 (normalized to 0-1 as judge_score). Recommended for open-ended generation tasks such as alpaca-eval and mt-bench.

Model name for the LLM judge when --judge-url is set.

Number of concurrent threads used to call the proxy. When omitted, examples are evaluated one at a time. Increase this value for faster runs when your proxy supports concurrent requests.

Directory for result caching. When set, completed rows are written to disk after each example so an interrupted run can resume without re-evaluating completed examples.
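As a minimal sketch of how the summary metrics relate to the score threshold: MeanScore averages the chosen score field, PassRate is the fraction of examples at or above the threshold, and CostOfPass is taken here as mean cost divided by pass rate. The Row type, field names, and the CostOfPass formula are illustrative assumptions, not context-bench internals.

```python
from dataclasses import dataclass

@dataclass
class Row:
    score: float  # value of the chosen score field, e.g. f1
    cost: float   # dollar cost of the request

def summarize(rows, threshold=0.5):
    """Aggregate per-example rows into the three headline metrics."""
    n = len(rows)
    mean_score = sum(r.score for r in rows) / n
    # An example "passes" when its score meets or exceeds the threshold.
    pass_rate = sum(r.score >= threshold for r in rows) / n
    mean_cost = sum(r.cost for r in rows) / n
    # Assumed definition: expected spend per passing example.
    cost_of_pass = mean_cost / pass_rate if pass_rate else float("inf")
    return {"MeanScore": mean_score, "PassRate": pass_rate, "CostOfPass": cost_of_pass}
```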
Examples
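One illustrative invocation, comparing two proxies on a single dataset; the URLs and the dataset name are placeholders, and only the flags named above are used:

```shell
# Compare two proxies side by side on one dataset (placeholder values)
context-bench \
  --proxy http://localhost:8000/v1 \
  --proxy http://localhost:8001/v1 \
  --dataset squad
```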
Example output
Built-in dataset names
Pass any of these names to --dataset. Datasets are loaded from HuggingFace and require pip install -e ".[datasets]" (or uv sync --extra datasets).
Multi-config datasets
Some datasets accept a :config suffix to select a subset: longbench, infinitebench, bbh, mmlu, mgsm.
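Illustratively, a name:config argument splits on the first colon into a dataset name and an optional config. This is an assumption about the accepted syntax, and split_dataset_arg (with the narrativeqa config used below) is a hypothetical helper, not part of context-bench:

```python
def split_dataset_arg(arg: str):
    """Split 'name:config' into (name, config); config is None when absent."""
    name, _, config = arg.partition(":")
    return name, (config or None)
```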
Local .jsonl files are also accepted. Each record must have "id" and "context" keys.
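A minimal sketch of producing such a file; the filename and example records are illustrative, and only the "id" and "context" keys are required by the format described above:

```python
import json

# One JSON object per line, each with the required "id" and "context" keys.
examples = [
    {"id": "ex-1", "context": "The Eiffel Tower is in Paris."},
    {"id": "ex-2", "context": "Water boils at 100 C at sea level."},
]
with open("my_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```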