Dataset
The benchmark dataset contains 518 prompts across four categories, stored in benchmarks/testdata/prompts.json.
Simple factual
Factual questions with a single, unambiguous correct answer. The drafter is expected to handle these confidently. Low entropy, high acceptance rate.
Multi-step reasoning
Questions requiring multi-step inference, arithmetic, or logical chaining. More likely to trigger uncertainty in the drafter.
Code generation
Requests to write, complete, or debug code. Drafter confidence varies widely depending on problem complexity.
Ambiguous / creative
Open-ended prompts with no single correct answer. Tests whether the gateway handles uncertainty in creative tasks without over-escalating.
The prompts are consumed by the collect tool, which sends each prompt to both the drafter and heavyweight model and records the per-token logprob sequences and judge verdicts.
LLM-as-judge evaluation
The benchmark uses an LLM to evaluate whether each draft response is acceptable. This avoids the need for hand-labeled ground truth and scales to any prompt domain.
How it works
The collect tool calls gpt-4.1 (the heavyweight model) as the judge. For each prompt, the judge receives:
- The original prompt
- The drafter’s response
- A reference response generated by the heavyweight itself
The judge scores the draft on:
- Factual accuracy compared to the reference
- Completeness of the answer
- Whether it would be acceptable to serve to a user
The judge returns a numeric score and an acceptable field. A score of 3 or above is treated as acceptable.
The acceptable field from the judge is what the sweep uses as ground truth when computing the confusion matrix. See Threshold calibration for how those labels map to TP, TN, FP, and FN.
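Concretely, that mapping follows the convention used under Interpreting results, where an escalation counts as a positive. The sketch below is illustrative only; the function and field names are not taken from the codebase.

```go
package main

import "fmt"

// classify maps one benchmark record to a confusion-matrix cell.
// Convention (consistent with draft accuracy = TN / (TN + FN) under
// "Interpreting results"): an escalation is a "positive".
// escalated is the routing decision at the threshold being evaluated;
// acceptable is the LLM judge's verdict on the draft response.
func classify(escalated, acceptable bool) string {
	switch {
	case escalated && !acceptable:
		return "TP" // correctly escalated a bad draft
	case escalated && acceptable:
		return "FP" // escalated a good draft (unnecessary cost)
	case !escalated && acceptable:
		return "TN" // correctly served a good draft
	default:
		return "FN" // served a bad draft (the error to minimise)
	}
}

func main() {
	fmt.Println(classify(false, true)) // TN: draft accepted and judged acceptable
}
```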
LLM-as-judge evaluation is not perfect. The judge can disagree with human raters, especially on ambiguous or creative prompts. The calibration results should be treated as indicative rather than ground truth for your specific workload. Re-calibrate with your own prompts if you need domain-specific accuracy guarantees.
Benchmark results at T=2.0
These are the measured results from the reference calibration run at the selected threshold T=2.0.
| Metric | Value |
|---|---|
| Draft acceptance rate | 94% |
| Draft accuracy (LLM-as-judge) | 98.2% acceptable |
| Cost reduction vs. all-heavyweight | 91.6% |
| P99 latency (draft path) | 109ms at 50 req/s |
| Proxy overhead | < 5ms P99 |
Tools
The benchmark suite has three tools under benchmarks/cmd/:
collect — gather token records and judge verdicts
Sends every prompt in the dataset to both the drafter and heavyweight, runs LLM-as-judge evaluation, and writes results to a JSONL file.
Flags
| Flag | Default | Description |
|---|---|---|
--prompts | benchmarks/testdata/prompts.json | Path to the prompt dataset |
--output | benchmarks/results/collected.jsonl | Output JSONL path |
--concurrency | 3 | Maximum concurrent API calls |
--rate-delay | 500ms | Delay between prompt dispatches |
--top-logprobs | 5 | Number of top logprobs to request per token |
--max-tokens | 1024 | Maximum tokens per completion |
Requires OPENAI_API_KEY to be set in the environment. The tool is resumable: prompts already present in the output file (matched by ID) are skipped.
The output JSONL has one record per prompt. Each record includes the per-token entropy values from the drafter, the draft and heavyweight responses, token counts, latency measurements, and the judge result.
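The per-token entropy values stored in each record are derived from the drafter's logprobs (--top-logprobs controls how many alternatives are requested per token). The exact formula the collector uses is not reproduced here; the sketch below shows one common approximation, Shannon entropy over the renormalised top-k alternatives, purely as an illustration.

```go
package main

import (
	"fmt"
	"math"
)

// entropyFromTopLogprobs approximates the Shannon entropy (in nats) of a
// token distribution from the top-k logprobs returned by the API.
// Illustration only; the collector's exact formula may differ (for example,
// it may or may not renormalise the truncated distribution).
func entropyFromTopLogprobs(logprobs []float64) float64 {
	// Convert logprobs to probabilities and renormalise over the top-k.
	var total float64
	probs := make([]float64, len(logprobs))
	for i, lp := range logprobs {
		probs[i] = math.Exp(lp)
		total += probs[i]
	}
	var h float64
	for _, p := range probs {
		p /= total
		if p > 0 {
			h -= p * math.Log(p)
		}
	}
	return h
}

func main() {
	confident := []float64{math.Log(0.95), math.Log(0.02), math.Log(0.01), math.Log(0.01), math.Log(0.01)}
	uniform := []float64{math.Log(0.2), math.Log(0.2), math.Log(0.2), math.Log(0.2), math.Log(0.2)}
	fmt.Printf("confident: %.3f nats, uniform: %.3f nats\n",
		entropyFromTopLogprobs(confident), entropyFromTopLogprobs(uniform))
}
```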
sweep — replay token records at multiple thresholds
Replays the collected JSONL offline at each candidate threshold, computing routing decisions and accuracy/cost metrics without making any API calls.
Prints a human-readable summary table to stdout and writes the full per-threshold metrics (including raw TP, FP, TN, FN counts) to the CSV file. The sweep auto-selects the threshold with the highest F1 score where draft accuracy ≥ 95% and prints it at the end of the summary; a sketch of this selection logic follows the flags table.
Flags
| Flag | Default | Description |
|---|---|---|
--input | benchmarks/results/collected.jsonl | Collected JSONL input path |
--output | benchmarks/results/sweep.csv | Sweep CSV output path |
--thresholds | 0.5,0.75,1.0,1.25,1.5 | Comma-separated threshold values to sweep |
--window-size | 10 | Entropy window size (number of tokens) |
--early-exit-count | 10 | Number of tokens above threshold before escalating |
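The auto-selection rule can be made concrete with a small amount of code. The sketch below is a simplified model, not the sweep tool's source: it assumes each collected record carries the drafter's per-token entropies and the judge's acceptable verdict, and it interprets --window-size and --early-exit-count as "escalate once early-exit-count tokens within the sliding window exceed the threshold", which is one plausible reading of those flags.

```go
package main

import "fmt"

// Record is a simplified stand-in for one line of the collected JSONL:
// the drafter's per-token entropies plus the judge's verdict.
type Record struct {
	Entropies  []float64
	Acceptable bool
}

// escalates reports whether a record would be escalated at the given
// threshold: it counts tokens whose entropy exceeds the threshold inside a
// sliding window and escalates once earlyExit of them are seen.
func escalates(entropies []float64, threshold float64, window, earlyExit int) bool {
	for i := range entropies {
		start := i - window + 1
		if start < 0 {
			start = 0
		}
		above := 0
		for _, e := range entropies[start : i+1] {
			if e > threshold {
				above++
			}
		}
		if above >= earlyExit {
			return true
		}
	}
	return false
}

// sweep replays all records at one threshold and returns draft accuracy and
// F1, treating escalation as the positive class.
func sweep(records []Record, threshold float64) (accuracy, f1 float64) {
	var tp, fp, tn, fn float64
	for _, r := range records {
		if escalates(r.Entropies, threshold, 10, 10) {
			if r.Acceptable {
				fp++
			} else {
				tp++
			}
		} else {
			if r.Acceptable {
				tn++
			} else {
				fn++
			}
		}
	}
	if tn+fn > 0 {
		accuracy = tn / (tn + fn)
	}
	if 2*tp+fp+fn > 0 {
		f1 = 2 * tp / (2*tp + fp + fn)
	}
	return accuracy, f1
}

func main() {
	records := []Record{
		{Entropies: []float64{0.1, 0.2, 0.1}, Acceptable: true},
		{Entropies: []float64{2.5, 2.7, 2.6, 2.8, 2.5, 2.9, 2.6, 2.7, 2.5, 2.8}, Acceptable: false},
	}
	best, bestF1 := 0.0, -1.0
	for _, t := range []float64{0.5, 0.75, 1.0, 1.25, 1.5} {
		acc, f1 := sweep(records, t)
		// Auto-select the highest-F1 threshold with draft accuracy >= 95%.
		if acc >= 0.95 && f1 > bestF1 {
			best, bestF1 = t, f1
		}
	}
	fmt.Printf("selected threshold: %.2f (F1=%.2f)\n", best, bestF1)
}
```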
loadtest — measure gateway latency under load
Runs a load test against the gateway using vegeta. Starts an in-process mock OpenAI server that serves synthetic streaming responses with configurable logprobs, then drives traffic against the real gateway at a specified rate.
Flags
| Flag | Default | Description |
|---|---|---|
--target | http://localhost:8080/v1/chat/completions | Gateway endpoint to test |
--rate | 50 | Requests per second |
--duration | 30s | Test duration |
--mock-port | 9999 | Port for the in-process mock OpenAI server |
--scenario | confident | Test scenario: confident, mixed, or cache |
--tokens | 20 | Tokens per mock response |
--chunk-delay | 5ms | Delay between SSE chunks in the mock server |
Scenarios
| Scenario | Behaviour |
|---|---|
confident | All mock responses have low-entropy logprobs (95% probability on the top token). The gateway accepts every draft. |
mixed | 80% of responses are confident, 20% are uncertain (near-uniform logprobs). The gateway escalates ~20% of requests. |
cache | All requests send the same fixed prompt ("What is 2+2?"). After the first request, the semantic cache should serve subsequent requests directly. |
The tool prints P50, P95, P99, and Max latencies, success rate, and status code counts on completion.
The P99 latency of 109ms at 50 req/s quoted in the README was measured with the confident scenario, 20 tokens per response, 5ms chunk delay, and a 30-second test duration. This measures the full draft path through the gateway with the mock upstream providing instant synthetic responses — it captures gateway overhead (entropy analysis, routing logic, HTTP handling) without real model inference latency.
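The percentile figures the tool reports correspond to what the vegeta library exposes. The following is a minimal sketch of driving traffic with vegeta and reading out the same numbers; it is not the loadtest tool's actual source, and the request body and header shown are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// Placeholder target and body; the real tool builds these from
	// --target, --scenario, and --tokens.
	target := vegeta.Target{
		Method: "POST",
		URL:    "http://localhost:8080/v1/chat/completions",
		Body:   []byte(`{"model":"draft","messages":[{"role":"user","content":"What is 2+2?"}]}`),
		Header: http.Header{"Content-Type": []string{"application/json"}},
	}

	rate := vegeta.Rate{Freq: 50, Per: time.Second} // --rate 50
	duration := 30 * time.Second                    // --duration 30s

	attacker := vegeta.NewAttacker()
	var metrics vegeta.Metrics
	for res := range attacker.Attack(vegeta.NewStaticTargeter(target), rate, duration, "draft-path") {
		metrics.Add(res)
	}
	metrics.Close()

	fmt.Printf("P50=%v P95=%v P99=%v Max=%v success=%.2f%%\n",
		metrics.Latencies.P50, metrics.Latencies.P95, metrics.Latencies.P99,
		metrics.Latencies.Max, metrics.Success*100)
	fmt.Println("status codes:", metrics.StatusCodes)
}
```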
Running the full benchmark suite
Start the gateway and infrastructure
The load test requires a running gateway. Start the infrastructure and gateway first.
Collect token records
Run the collector. This step calls the real OpenAI API and takes time proportional to dataset size and --concurrency. With the default 518-prompt dataset and --concurrency 3, expect 20-40 minutes.
Run the threshold sweep
Sweep candidate thresholds over the collected data. This is fast — no API calls, pure computation.
The selected threshold is printed to stdout. Update entropy.threshold in config.yaml and restart the gateway if the selected value differs from the current one.
Interpreting results
Draft acceptance rate
The fraction of requests where the gateway served the drafter’s response directly. At T=2.0 on the reference dataset, this is 94%. A higher acceptance rate means lower cost per request but more exposure to FN errors (bad drafts served).
In production, monitor this with draftthinker_routing_decisions_total{decision="accept"} divided by total routing decisions.
Draft accuracy
The fraction of accepted drafts that the LLM judge rated as acceptable (score ≥ 3). This is TN / (TN + FN). At T=2.0, this is 98.2%.
This figure is specific to the benchmark dataset and the judge model. On a domain-specific workload where the drafter is less reliable, accuracy at the same threshold will be lower. Re-calibrate with representative prompts if your workload differs significantly from the general-purpose benchmark set.
Cost reduction
The reduction in estimated token cost compared to routing every request to the heavyweight. At T=2.0, this is 91.6%. Cost is estimated from token counts and per-model pricing (drafter: $0.80 per million output tokens; heavyweight: $10.00 per million output tokens).
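As a rough model of how that figure can be computed: the exact accounting is an implementation detail of the sweep tool, so the sketch below counts output tokens only and, as an assumption, charges escalated requests for both the discarded draft and the heavyweight response.

```go
package main

import "fmt"

// Per-million output-token prices quoted above.
const (
	drafterPerMTok     = 0.80
	heavyweightPerMTok = 10.00
)

// request is a simplified view of one benchmark record: how many output
// tokens each model produced and whether the gateway escalated.
type request struct {
	draftTokens int
	heavyTokens int
	escalated   bool
}

// costReduction compares the routed cost against an all-heavyweight baseline.
func costReduction(reqs []request) float64 {
	var routed, allHeavy float64
	for _, r := range reqs {
		// Baseline: every request answered by the heavyweight.
		allHeavy += float64(r.heavyTokens) / 1e6 * heavyweightPerMTok
		// Routed: draft tokens are always generated; escalations also pay
		// for the heavyweight response (an assumption, see lead-in).
		routed += float64(r.draftTokens) / 1e6 * drafterPerMTok
		if r.escalated {
			routed += float64(r.heavyTokens) / 1e6 * heavyweightPerMTok
		}
	}
	return 1 - routed/allHeavy
}

func main() {
	reqs := []request{
		{draftTokens: 200, heavyTokens: 220, escalated: false},
		{draftTokens: 180, heavyTokens: 250, escalated: true},
	}
	fmt.Printf("cost reduction vs all-heavyweight: %.1f%%\n", costReduction(reqs)*100)
}
```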
P99 latency (draft path)
The 99th percentile end-to-end latency for requests that go through the draft path. At 50 req/s with a mock upstream and T=2.0, this is 109ms. This measures gateway processing overhead (entropy window computation, routing decision, HTTP stream handling), not upstream model inference time. Real-world latency includes model response time and network round trips to the API provider.
Proxy overhead
The gateway’s own processing time, excluding upstream model inference. Target is < 5ms P99. The entropy window algorithm is pure in-memory computation on a small sliding window of float64 values — overhead is dominated by I/O, not computation.
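To illustrate why the computation is cheap, here is a minimal sketch of a constant-time sliding entropy window (a ring buffer with a running sum). It shows the kind of data structure described above and is not the gateway's actual code.

```go
package main

import "fmt"

// entropyWindow keeps the last N per-token entropies and their running
// mean with O(1) work per token, which is why per-token routing cost is
// dominated by I/O rather than computation.
type entropyWindow struct {
	buf  []float64
	sum  float64
	next int
	full bool
}

func newEntropyWindow(size int) *entropyWindow {
	return &entropyWindow{buf: make([]float64, size)}
}

func (w *entropyWindow) push(e float64) {
	w.sum += e - w.buf[w.next] // evict the oldest value, add the newest
	w.buf[w.next] = e
	w.next = (w.next + 1) % len(w.buf)
	if w.next == 0 {
		w.full = true
	}
}

func (w *entropyWindow) mean() float64 {
	n := w.next
	if w.full {
		n = len(w.buf)
	}
	if n == 0 {
		return 0
	}
	return w.sum / float64(n)
}

func main() {
	w := newEntropyWindow(10)
	for _, e := range []float64{0.1, 0.3, 2.4, 0.2} {
		w.push(e)
	}
	fmt.Printf("window mean entropy: %.2f\n", w.mean())
}
```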