Quickstart

CLI
Python API

Install context-bench

Install using uv (recommended):

uv sync

To include HuggingFace dataset support:

uv sync --extra datasets

Start your proxy

context-bench benchmarks any OpenAI-compatible proxy. Start Kompact as an example:

uv run kompact proxy --port 7878

Or use Headroom:

pip install "headroom-ai[proxy]"
headroom proxy --port 8787

Run the benchmark

Point context-bench at your proxy and choose a dataset:

context-bench --proxy http://localhost:7878 --dataset hotpotqa -n 50

The -n 50 flag limits evaluation to 50 examples for a quick smoke test. Remove it to evaluate the full dataset.

Read the output

You’ll see a markdown table with aggregated metrics:

# Evaluation Results

| System    | mean_score | pass_rate | compression_ratio | cost_of_pass |
|-----------|-----------|-----------|-------------------|--------------|
| localhost | 0.3640    | 0.3640    | -0.1345           | 2,447        |

*50 examples evaluated*

Column	What it means
`mean_score`	Average F1 score across all examples (0–1)
`pass_rate`	Fraction of examples scoring above the threshold (default: 0.7)
`compression_ratio`	`1 - (output_tokens / input_tokens)` — positive means tokens were saved
`cost_of_pass`	Tokens spent per successful completion

Install context-bench

uv sync --extra datasets

Write your evaluation script

from context_bench import OpenAIProxy, evaluate
from context_bench.evaluators import AnswerQuality
from context_bench.metrics import MeanScore, PassRate, Latency
from context_bench.datasets.huggingface import hotpotqa

# Define the system to benchmark
kompact = OpenAIProxy(
    base_url="http://localhost:7878",
    model="gpt-4",
    name="kompact"
)

# Load a dataset
dataset = hotpotqa(n=50)

# Run evaluation
result = evaluate(
    systems=[kompact],
    dataset=dataset,
    evaluators=[AnswerQuality()],
    metrics=[
        MeanScore(score_field="f1"),
        PassRate(score_field="f1"),
        Latency(),
    ],
)

print(result.summary)

Export results

# JSON string
print(result.to_json())

# pandas DataFrame (requires pandas)
df = result.to_dataframe()
print(df.head())

# Filter to one system
sub = result.filter(system="kompact")

Compare two proxies

Run both systems in a single command to get a side-by-side comparison:

context-bench \
  --proxy http://localhost:7878 --name kompact \
  --proxy http://localhost:8787 --name headroom \
  --dataset hotpotqa -n 50

Output:

| System   | mean_score | pass_rate | compression_ratio | cost_of_pass |
|----------|-----------|-----------|-------------------|--------------|
| kompact  | 0.3640    | 0.3640    | -0.1345           | 2,447        |
| headroom | 0.3140    | 0.3140    | -0.1793           | 3,815        |

When two or more systems are compared, Pareto ranking is added automatically to identify which system dominates on both quality and cost.

What’s next

CLI reference

All CLI flags and subcommands

Core concepts

How the Dataset → System → Evaluator → Metric pipeline works

Built-in datasets

42 datasets ready to use

Custom systems

Benchmark any system, not just proxies

Get Started

CLI Reference

Core Concepts

Guides

Compare two proxies

What’s next

CLI reference

Core concepts

Built-in datasets

Custom systems

Build docs developers (and LLMs) love

Get Started

CLI Reference

Core Concepts

Guides

​Compare two proxies

​What’s next

CLI reference

Core concepts

Built-in datasets

Custom systems

Build docs developers (and LLMs) love

Compare two proxies

What’s next