Dataset
The benchmark dataset contains 518 prompts across four categories, stored in benchmarks/testdata/prompts.json.
Simple factual
Factual questions with a single, unambiguous correct answer. The drafter is expected to handle these confidently. Low entropy, high acceptance rate.
Multi-step reasoning
Questions requiring multi-step inference, arithmetic, or logical chaining. More likely to trigger uncertainty in the drafter.
Code generation
Requests to write, complete, or debug code. Drafter confidence varies widely depending on problem complexity.
Ambiguous / creative
Open-ended prompts with no single correct answer. Tests whether the gateway handles uncertainty in creative tasks without over-escalating.
The prompts are consumed by the collect tool, which sends each prompt to both the drafter and heavyweight model and records the per-token logprob sequences and judge verdicts.
LLM-as-judge evaluation
The benchmark uses an LLM to evaluate whether each draft response is acceptable. This avoids the need for hand-labeled ground truth and scales to any prompt domain.
How it works
The collect tool calls gpt-4.1 (the heavyweight model) as the judge. For each prompt, the judge receives:
- The original prompt
- The drafter’s response
- A reference response generated by the heavyweight itself
The judge scores the draft on:
- Factual accuracy compared to the reference
- Completeness of the answer
- Whether it would be acceptable to serve to a user
The judge returns a numeric score and an acceptable field. A score of 3 or above is treated as acceptable.
The acceptable field from the judge is what the sweep uses as ground truth when computing the confusion matrix. See Threshold calibration for how those labels map to TP, TN, FP, and FN.
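Concretely, that mapping follows the convention used under Interpreting results, where an escalation counts as a positive. The sketch below is illustrative only; the function and field names are not taken from the codebase.

```go
package main

import "fmt"

// classify maps one benchmark record to a confusion-matrix cell.
// Convention (consistent with draft accuracy = TN / (TN + FN) under
// "Interpreting results"): an escalation is a "positive".
// escalated is the routing decision at the threshold being evaluated;
// acceptable is the LLM judge's verdict on the draft response.
func classify(escalated, acceptable bool) string {
	switch {
	case escalated && !acceptable:
		return "TP" // correctly escalated a bad draft
	case escalated && acceptable:
		return "FP" // escalated a good draft (unnecessary cost)
	case !escalated && acceptable:
		return "TN" // correctly served a good draft
	default:
		return "FN" // served a bad draft (the error to minimise)
	}
}

func main() {
	fmt.Println(classify(false, true)) // TN: draft accepted and judged acceptable
}
```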
LLM-as-judge evaluation is not perfect. The judge can disagree with human raters, especially on ambiguous or creative prompts. The calibration results should be treated as indicative rather than ground truth for your specific workload. Re-calibrate with your own prompts if you need domain-specific accuracy guarantees.
Benchmark results at T=2.0
These are the measured results from the reference calibration run at the selected threshold T=2.0.
| Metric | Value |
|---|---|
| Draft acceptance rate | 94% |
| Draft accuracy (LLM-as-judge) | 98.2% acceptable |
| Cost reduction vs. all-heavyweight | 91.6% |
| P99 latency (draft path) | 109ms at 50 req/s |
| Proxy overhead | < 5ms P99 |
Tools
The benchmark suite has three tools under benchmarks/cmd/:
collect — gather token records and judge verdicts
Sends every prompt in the dataset to both the drafter and heavyweight, runs LLM-as-judge evaluation, and writes results to a JSONL file.
Flags
| Flag | Default | Description |
|---|---|---|
--prompts | benchmarks/testdata/prompts.json | Path to the prompt dataset |
--output | benchmarks/results/collected.jsonl | Output JSONL path |
--concurrency | 3 | Maximum concurrent API calls |
--rate-delay | 500ms | Delay between prompt dispatches |
--top-logprobs | 5 | Number of top logprobs to request per token |
--max-tokens | 1024 | Maximum tokens per completion |
Requires OPENAI_API_KEY to be set in the environment. The tool is resumable: prompts already present in the output file (matched by ID) are skipped.
The output JSONL has one record per prompt. Each record includes the per-token entropy values from the drafter, the draft and heavyweight responses, token counts, latency measurements, and the judge result.
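The per-token entropy values stored in each record are derived from the drafter's logprobs (--top-logprobs controls how many alternatives are requested per token). The exact formula the collector uses is not reproduced here; the sketch below shows one common approximation, Shannon entropy over the renormalised top-k alternatives, purely as an illustration.

```go
package main

import (
	"fmt"
	"math"
)

// entropyFromTopLogprobs approximates the Shannon entropy (in nats) of a
// token distribution from the top-k logprobs returned by the API.
// Illustration only; the collector's exact formula may differ (for example,
// it may or may not renormalise the truncated distribution).
func entropyFromTopLogprobs(logprobs []float64) float64 {
	// Convert logprobs to probabilities and renormalise over the top-k.
	var total float64
	probs := make([]float64, len(logprobs))
	for i, lp := range logprobs {
		probs[i] = math.Exp(lp)
		total += probs[i]
	}
	var h float64
	for _, p := range probs {
		p /= total
		if p > 0 {
			h -= p * math.Log(p)
		}
	}
	return h
}

func main() {
	confident := []float64{math.Log(0.95), math.Log(0.02), math.Log(0.01), math.Log(0.01), math.Log(0.01)}
	uniform := []float64{math.Log(0.2), math.Log(0.2), math.Log(0.2), math.Log(0.2), math.Log(0.2)}
	fmt.Printf("confident: %.3f nats, uniform: %.3f nats\n",
		entropyFromTopLogprobs(confident), entropyFromTopLogprobs(uniform))
}
```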
sweep — replay token records at multiple thresholds
Replays the collected JSONL offline at each candidate threshold, computing routing decisions and accuracy/cost metrics without making any API calls.
Prints a human-readable summary table to stdout and writes the full per-threshold metrics (including raw TP, FP, TN, FN counts) to the CSV file. The sweep auto-selects the threshold with the highest F1 score where draft accuracy ≥ 95% and prints it at the end of the summary; a sketch of this selection logic follows the flags table.
Flags
| Flag | Default | Description |
|---|---|---|
--input | benchmarks/results/collected.jsonl | Collected JSONL input path |
--output | benchmarks/results/sweep.csv | Sweep CSV output path |
--thresholds | 0.5,0.75,1.0,1.25,1.5 | Comma-separated threshold values to sweep |
--window-size | 10 | Entropy window size (number of tokens) |
--early-exit-count | 10 | Number of tokens above threshold before escalating |
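The auto-selection rule can be made concrete with a small amount of code. The sketch below is a simplified model, not the sweep tool's source: it assumes each collected record carries the drafter's per-token entropies and the judge's acceptable verdict, and it interprets --window-size and --early-exit-count as "escalate once early-exit-count tokens within the sliding window exceed the threshold", which is one plausible reading of those flags.

```go
package main

import "fmt"

// Record is a simplified stand-in for one line of the collected JSONL:
// the drafter's per-token entropies plus the judge's verdict.
type Record struct {
	Entropies  []float64
	Acceptable bool
}

// escalates reports whether a record would be escalated at the given
// threshold: it counts tokens whose entropy exceeds the threshold inside a
// sliding window and escalates once earlyExit of them are seen.
func escalates(entropies []float64, threshold float64, window, earlyExit int) bool {
	for i := range entropies {
		start := i - window + 1
		if start < 0 {
			start = 0
		}
		above := 0
		for _, e := range entropies[start : i+1] {
			if e > threshold {
				above++
			}
		}
		if above >= earlyExit {
			return true
		}
	}
	return false
}

// sweep replays all records at one threshold and returns draft accuracy and
// F1, treating escalation as the positive class.
func sweep(records []Record, threshold float64) (accuracy, f1 float64) {
	var tp, fp, tn, fn float64
	for _, r := range records {
		if escalates(r.Entropies, threshold, 10, 10) {
			if r.Acceptable {
				fp++
			} else {
				tp++
			}
		} else {
			if r.Acceptable {
				tn++
			} else {
				fn++
			}
		}
	}
	if tn+fn > 0 {
		accuracy = tn / (tn + fn)
	}
	if 2*tp+fp+fn > 0 {
		f1 = 2 * tp / (2*tp + fp + fn)
	}
	return accuracy, f1
}

func main() {
	records := []Record{
		{Entropies: []float64{0.1, 0.2, 0.1}, Acceptable: true},
		{Entropies: []float64{2.5, 2.7, 2.6, 2.8, 2.5, 2.9, 2.6, 2.7, 2.5, 2.8}, Acceptable: false},
	}
	best, bestF1 := 0.0, -1.0
	for _, t := range []float64{0.5, 0.75, 1.0, 1.25, 1.5} {
		acc, f1 := sweep(records, t)
		// Auto-select the highest-F1 threshold with draft accuracy >= 95%.
		if acc >= 0.95 && f1 > bestF1 {
			best, bestF1 = t, f1
		}
	}
	fmt.Printf("selected threshold: %.2f (F1=%.2f)\n", best, bestF1)
}
```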
loadtest — measure gateway latency under load
Runs a load test against the gateway using vegeta. Starts an in-process mock OpenAI server that serves synthetic streaming responses with configurable logprobs, then drives traffic against the real gateway at a specified rate.
Flags
| Flag | Default | Description |
|---|---|---|
--target | http://localhost:8080/v1/chat/completions | Gateway endpoint to test |
--rate | 50 | Requests per second |
--duration | 30s | Test duration |
--mock-port | 9999 | Port for the in-process mock OpenAI server |
--scenario | confident | Test scenario: confident, mixed, or cache |
--tokens | 20 | Tokens per mock response |
--chunk-delay | 5ms | Delay between SSE chunks in the mock server |
Scenarios
| Scenario | Behaviour |
|---|---|
confident | All mock responses have low-entropy logprobs (95% probability on the top token). The gateway accepts every draft. |
mixed | 80% of responses are confident, 20% are uncertain (near-uniform logprobs). The gateway escalates ~20% of requests. |
cache | All requests send the same fixed prompt ("What is 2+2?"). After the first request, the semantic cache should serve subsequent requests directly. |
The tool prints P50, P95, P99, and Max latencies, success rate, and status code counts on completion.
The P99 latency of 109ms at 50 req/s quoted in the README was measured with the confident scenario, 20 tokens per response, 5ms chunk delay, and a 30-second test duration. This measures the full draft path through the gateway with the mock upstream providing instant synthetic responses — it captures gateway overhead (entropy analysis, routing logic, HTTP handling) without real model inference latency.
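The percentile figures the tool reports correspond to what the vegeta library exposes. The following is a minimal sketch of driving traffic with vegeta and reading out the same numbers; it is not the loadtest tool's actual source, and the request body and header shown are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// Placeholder target and body; the real tool builds these from
	// --target, --scenario, and --tokens.
	target := vegeta.Target{
		Method: "POST",
		URL:    "http://localhost:8080/v1/chat/completions",
		Body:   []byte(`{"model":"draft","messages":[{"role":"user","content":"What is 2+2?"}]}`),
		Header: http.Header{"Content-Type": []string{"application/json"}},
	}

	rate := vegeta.Rate{Freq: 50, Per: time.Second} // --rate 50
	duration := 30 * time.Second                    // --duration 30s

	attacker := vegeta.NewAttacker()
	var metrics vegeta.Metrics
	for res := range attacker.Attack(vegeta.NewStaticTargeter(target), rate, duration, "draft-path") {
		metrics.Add(res)
	}
	metrics.Close()

	fmt.Printf("P50=%v P95=%v P99=%v Max=%v success=%.2f%%\n",
		metrics.Latencies.P50, metrics.Latencies.P95, metrics.Latencies.P99,
		metrics.Latencies.Max, metrics.Success*100)
	fmt.Println("status codes:", metrics.StatusCodes)
}
```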
Running the full benchmark suite
Start the gateway and infrastructure
The load test requires a running gateway. Start the infrastructure and gateway first.
Collect token records
Run the collector. This step calls the real OpenAI API and takes time proportional to dataset size and --concurrency. With the default 518-prompt dataset and --concurrency 3, expect 20-40 minutes.
Run the threshold sweep
Sweep candidate thresholds over the collected data. This is fast — no API calls, pure computation.
The selected threshold is printed to stdout. Update entropy.threshold in config.yaml and restart the gateway if the selected value differs from the current one.
Interpreting results
Draft acceptance rate
The fraction of requests where the gateway served the drafter’s response directly. At T=2.0 on the reference dataset, this is 94%. A higher acceptance rate means lower cost per request but more exposure to FN errors (bad drafts served).
In production, monitor this with draftthinker_routing_decisions_total{decision="accept"} divided by total routing decisions.
Draft accuracy
The fraction of accepted drafts that the LLM judge rated as acceptable (score ≥ 3). This is TN / (TN + FN). At T=2.0, this is 98.2%.
This figure is specific to the benchmark dataset and the judge model. On a domain-specific workload where the drafter is less reliable, accuracy at the same threshold will be lower. Re-calibrate with representative prompts if your workload differs significantly from the general-purpose benchmark set.
Cost reduction
The reduction in estimated token cost compared to routing every request to the heavyweight. At T=2.0, this is 91.6%. Cost is estimated from token counts and per-model pricing (drafter: $0.80 per million output tokens; heavyweight: $10.00 per million output tokens).
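As a rough model of how that figure can be computed: the exact accounting is an implementation detail of the sweep tool, so the sketch below counts output tokens only and, as an assumption, charges escalated requests for both the discarded draft and the heavyweight response.

```go
package main

import "fmt"

// Per-million output-token prices quoted above.
const (
	drafterPerMTok     = 0.80
	heavyweightPerMTok = 10.00
)

// request is a simplified view of one benchmark record: how many output
// tokens each model produced and whether the gateway escalated.
type request struct {
	draftTokens int
	heavyTokens int
	escalated   bool
}

// costReduction compares the routed cost against an all-heavyweight baseline.
func costReduction(reqs []request) float64 {
	var routed, allHeavy float64
	for _, r := range reqs {
		// Baseline: every request answered by the heavyweight.
		allHeavy += float64(r.heavyTokens) / 1e6 * heavyweightPerMTok
		// Routed: draft tokens are always generated; escalations also pay
		// for the heavyweight response (an assumption, see lead-in).
		routed += float64(r.draftTokens) / 1e6 * drafterPerMTok
		if r.escalated {
			routed += float64(r.heavyTokens) / 1e6 * heavyweightPerMTok
		}
	}
	return 1 - routed/allHeavy
}

func main() {
	reqs := []request{
		{draftTokens: 200, heavyTokens: 220, escalated: false},
		{draftTokens: 180, heavyTokens: 250, escalated: true},
	}
	fmt.Printf("cost reduction vs all-heavyweight: %.1f%%\n", costReduction(reqs)*100)
}
```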
P99 latency (draft path)
The 99th percentile end-to-end latency for requests that go through the draft path. At 50 req/s with a mock upstream and T=2.0, this is 109ms. This measures gateway processing overhead (entropy window computation, routing decision, HTTP stream handling), not upstream model inference time. Real-world latency includes model response time and network round trips to the API provider.
Proxy overhead
The gateway’s own processing time, excluding upstream model inference. Target is < 5ms P99. The entropy window algorithm is pure in-memory computation on a small sliding window of float64 values — overhead is dominated by I/O, not computation.
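To illustrate why the computation is cheap, here is a minimal sketch of a constant-time sliding entropy window (a ring buffer with a running sum). It shows the kind of data structure described above and is not the gateway's actual code.

```go
package main

import "fmt"

// entropyWindow keeps the last N per-token entropies and their running
// mean with O(1) work per token, which is why per-token routing cost is
// dominated by I/O rather than computation.
type entropyWindow struct {
	buf  []float64
	sum  float64
	next int
	full bool
}

func newEntropyWindow(size int) *entropyWindow {
	return &entropyWindow{buf: make([]float64, size)}
}

func (w *entropyWindow) push(e float64) {
	w.sum += e - w.buf[w.next] // evict the oldest value, add the newest
	w.buf[w.next] = e
	w.next = (w.next + 1) % len(w.buf)
	if w.next == 0 {
		w.full = true
	}
}

func (w *entropyWindow) mean() float64 {
	n := w.next
	if w.full {
		n = len(w.buf)
	}
	if n == 0 {
		return 0
	}
	return w.sum / float64(n)
}

func main() {
	w := newEntropyWindow(10)
	for _, e := range []float64{0.1, 0.3, 2.4, 0.2} {
		w.push(e)
	}
	fmt.Printf("window mean entropy: %.2f\n", w.mean())
}
```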