Start the proxy

Start the proxy server and note the port it listens on. Wait until the proxy logs that it is ready before proceeding.
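The exact launch command depends on how the proxy is packaged; as a hypothetical sketch (the subcommand name and port flag here are assumptions, not documented options), it might look like:

```shell
# Hypothetical invocation -- substitute your actual proxy entry point and port.
context-bench proxy --port 8000
```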
Run context-bench
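A typical invocation, sketched with placeholder values (the proxy URL and dataset name are assumptions, not values from this guide):

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset -n 50
```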
Point --proxy at the running server and choose a dataset with --dataset. The -n 50 flag limits the run to 50 examples, useful for a quick sanity check before committing to the full dataset.

Read the output table
When the run finishes, context-bench prints a summary table:
| Column | What it measures |
|---|---|
| mean_score | Average score across all examples (default: F1, configurable with --score-field) |
| pass_rate | Fraction of examples scoring above the threshold (default: 0.7, set with --threshold) |
| compression_ratio | 1 - (output_tokens / input_tokens). Positive means the proxy reduced context size. |
| cost_of_pass | Total tokens spent per successful completion. Lower is better. |
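As a small illustration of how the score- and token-based columns combine (the formulas follow the table's definitions; the function names and example numbers are mine, not part of context-bench):

```python
def compression_ratio(input_tokens: int, output_tokens: int) -> float:
    """1 - (output_tokens / input_tokens); positive when the context shrank."""
    return 1 - output_tokens / input_tokens

def pass_rate(scores: list[float], threshold: float = 0.7) -> float:
    """Fraction of examples scoring above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# A proxy that turns 2000 input tokens into 500 compressed the context by 75%.
print(compression_ratio(2000, 500))  # -> 0.75
print(pass_rate([0.9, 0.6, 0.8]))    # 2 of 3 examples exceed 0.7
```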
Python API
For full control over evaluators, metrics, and export, use OpenAIProxy and evaluate() directly.
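The overall shape might look like the sketch below; only the names OpenAIProxy and evaluate come from this guide, so the import path and all arguments are illustrative placeholders, not the documented interface:

```python
# Illustrative sketch only -- the import path and arguments are assumptions.
from context_bench import OpenAIProxy, evaluate

proxy = OpenAIProxy(...)   # point at your running proxy server
results = evaluate(...)    # run the dataset and collect per-example rows
```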
OpenAIProxy constructor options
Export results
Caching and resumable runs
Pass --cache-dir (or cache_dir= in Python) to write completed rows to disk. On a subsequent run with the same arguments, already-completed rows are loaded from cache and skipped, so the run picks up exactly where it left off.
The cache key is derived from the system name, dataset name, example ID, and evaluator list. Changing any of these will cause a cache miss for affected rows.
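The cache-key recipe above can be sketched as follows; the hashing scheme and field order are my assumption for illustration, not context-bench's actual implementation:

```python
import hashlib

def cache_key(system: str, dataset: str, example_id: str, evaluators: list[str]) -> str:
    """Hash the four fields the docs say determine cache identity."""
    material = "|".join([system, dataset, example_id, *sorted(evaluators)])
    return hashlib.sha256(material.encode()).hexdigest()

# Changing any field (here, the evaluator list) changes the key -> cache miss.
a = cache_key("my-proxy", "qa-set", "ex-001", ["f1"])
b = cache_key("my-proxy", "qa-set", "ex-001", ["f1", "judge"])
print(a != b)  # -> True
```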
Cookbook
Smoke test (10 examples)
Run a quick sanity check before committing to a full evaluation:
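Concretely, a smoke-test run could look like this (the URL and dataset name are placeholders):

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset -n 10
```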
Full evaluation with LLM judge
Add --judge-url to score responses with an external LLM on a 1–5 scale (normalized to 0–1 as judge_score).
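A judged run might look like the following; the judge endpoint, proxy URL, and dataset name are placeholders, not values from this guide:

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset \
  --judge-url http://localhost:9000/v1
```

The resulting judge_score column holds the normalized 0–1 value rather than the raw 1–5 rating.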
Resume an interrupted run
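Using the --cache-dir flag described above (the path, URL, and dataset name below are placeholders), rerun the identical command after an interruption and completed rows are skipped:

```shell
context-bench --proxy http://localhost:8000 --dataset my-dataset \
  --cache-dir ./bench-cache
```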
