1. Install context-bench

Install using uv (recommended):
uv sync
To include HuggingFace dataset support:
uv sync --extra datasets
2. Start your proxy

context-bench benchmarks any OpenAI-compatible proxy. Start Kompact as an example:
uv run kompact proxy --port 7878
Or use Headroom:
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
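Before running a benchmark, it can help to confirm that something is actually listening on the proxy port. The following is a minimal sketch using only the Python standard library; the helper name `proxy_is_listening` is hypothetical and not part of context-bench, Kompact, or Headroom. It only checks that a TCP connection can be opened, not that the endpoint speaks the OpenAI API.

```python
import socket


def proxy_is_listening(host: str = "localhost", port: int = 7878, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it does not validate the API itself.
    """
    try:
        # create_connection raises OSError (refused, timed out, unresolved host) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the commands above; the labels are just for display.
for name, port in [("kompact", 7878), ("headroom", 8787)]:
    status = "listening" if proxy_is_listening(port=port) else "not reachable"
    print(f"{name} on port {port}: {status}")
```

A "not reachable" result usually means the proxy has not finished starting or was launched on a different port.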
3. Run the benchmark

Point context-bench at your proxy and choose a dataset:
context-bench --proxy http://localhost:7878 --dataset hotpotqa -n 50
The -n 50 flag limits evaluation to 50 examples for a quick smoke test. Remove it to evaluate the full dataset.
4. Read the output

You’ll see a markdown table with aggregated metrics:
# Evaluation Results

| System    | mean_score | pass_rate | compression_ratio | cost_of_pass |
|-----------|-----------|-----------|-------------------|--------------|
| localhost | 0.3640    | 0.3640    | -0.1345           | 2,447        |

*50 examples evaluated*
| Column            | What it means                                                             |
|-------------------|---------------------------------------------------------------------------|
| mean_score        | Average F1 score across all examples (0–1)                                |
| pass_rate         | Fraction of examples scoring above the threshold (default: 0.7)           |
| compression_ratio | 1 - (output_tokens / input_tokens); positive means tokens were saved      |
| cost_of_pass      | Tokens spent per successful completion                                    |
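To make the column definitions concrete, here is a rough sketch of how the four metrics could be computed from per-example results. It is not context-bench's actual implementation: the `ExampleResult` class and `summarize` function are hypothetical, the aggregation over token totals (rather than per-example averages) is an assumption, and so is the choice to count `output_tokens` as the "tokens spent" in `cost_of_pass`. Note that a negative `compression_ratio` such as -0.1345 follows directly from the formula: the output contained about 13% more tokens than the input.

```python
from dataclasses import dataclass


@dataclass
class ExampleResult:
    """Hypothetical per-example record, for illustration only."""
    score: float        # per-example F1 in [0, 1]
    input_tokens: int   # tokens going into the system
    output_tokens: int  # tokens coming out of the system


def summarize(results, threshold=0.7):
    """Aggregate metrics following the column definitions (assumed formulas)."""
    n = len(results)
    passes = sum(1 for r in results if r.score > threshold)
    total_in = sum(r.input_tokens for r in results)
    total_out = sum(r.output_tokens for r in results)
    return {
        "mean_score": sum(r.score for r in results) / n,
        "pass_rate": passes / n,
        # Positive when output is smaller than input, negative when it grew.
        "compression_ratio": 1 - total_out / total_in,
        # Assumption: total output tokens divided by the number of passing examples.
        "cost_of_pass": total_out / passes if passes else float("inf"),
    }


# Made-up numbers, not real benchmark data.
results = [
    ExampleResult(score=0.9, input_tokens=1000, output_tokens=800),
    ExampleResult(score=0.5, input_tokens=1000, output_tokens=1200),
    ExampleResult(score=0.8, input_tokens=2000, output_tokens=1500),
    ExampleResult(score=0.2, input_tokens=1000, output_tokens=1100),
]
summary = summarize(results)
print(summary)
```

With these made-up inputs, two of the four examples clear the 0.7 threshold, so `pass_rate` is 0.5 and `cost_of_pass` is 4600 / 2 = 2300 tokens.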

Compare two proxies

Run both systems in a single command to get a side-by-side comparison:
context-bench \
  --proxy http://localhost:7878 --name kompact \
  --proxy http://localhost:8787 --name headroom \
  --dataset hotpotqa -n 50
Output:
| System   | mean_score | pass_rate | compression_ratio | cost_of_pass |
|----------|-----------|-----------|-------------------|--------------|
| kompact  | 0.3640    | 0.3640    | -0.1345           | 2,447        |
| headroom | 0.3140    | 0.3140    | -0.1793           | 3,815        |
When two or more systems are compared, Pareto ranking is added automatically to identify which system dominates on both quality and cost.
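The idea of Pareto dominance can be sketched in a few lines. This is an illustration of the general technique, not context-bench's ranking code; the function names are hypothetical, and using `mean_score` as the quality axis and `cost_of_pass` as the cost axis is an assumption, since the text above does not name the columns involved.

```python
def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on at least one.

    Higher mean_score is better; lower cost_of_pass is better (assumed axes).
    """
    no_worse = a["mean_score"] >= b["mean_score"] and a["cost_of_pass"] <= b["cost_of_pass"]
    strictly_better = a["mean_score"] > b["mean_score"] or a["cost_of_pass"] < b["cost_of_pass"]
    return no_worse and strictly_better


def pareto_front(systems):
    """Return the names of systems that no other system dominates."""
    return [
        s["name"]
        for s in systems
        if not any(dominates(other, s) for other in systems if other is not s)
    ]


# Values copied from the comparison table above.
systems = [
    {"name": "kompact", "mean_score": 0.3640, "cost_of_pass": 2447},
    {"name": "headroom", "mean_score": 0.3140, "cost_of_pass": 3815},
]
print(pareto_front(systems))  # ['kompact']
```

In the table above, kompact has both the higher score and the lower cost, so it dominates headroom. When one system scores higher but costs more, neither dominates and both remain on the front.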

What’s next

CLI reference

All CLI flags and subcommands

Core concepts

How the Dataset → System → Evaluator → Metric pipeline works

Built-in datasets

42 datasets ready to use

Custom systems

Benchmark any system, not just proxies
