Draft Thinker’s calibration results and performance numbers are generated from a reproducible benchmark suite. Every figure in the README and the calibration table is backed by a run of these tools.

Dataset

The benchmark dataset contains 518 prompts across four categories, stored in benchmarks/testdata/prompts.json.

Simple factual

Factual questions with a single, unambiguous correct answer. The drafter is expected to handle these confidently. Low entropy, high acceptance rate.

Multi-step reasoning

Questions requiring multi-step inference, arithmetic, or logical chaining. More likely to trigger uncertainty in the drafter.

Code generation

Requests to write, complete, or debug code. Drafter confidence varies widely depending on problem complexity.

Ambiguous / creative

Open-ended prompts with no single correct answer. Tests whether the gateway handles uncertainty in creative tasks without over-escalating.
The dataset is the input to the collect tool, which sends each prompt to both the drafter and heavyweight model and records the per-token logprob sequences and judge verdicts.
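The exact record schema of prompts.json isn't documented on this page; based on the categories above and the collector's ID-based resume behaviour, a minimal entry presumably looks something like this (the field names are illustrative, not confirmed by the source):

{
  "id": "code-042",
  "category": "code_generation",
  "prompt": "Write a function that reverses a linked list."
}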

LLM-as-judge evaluation

The benchmark uses an LLM to evaluate whether each draft response is acceptable. This avoids the need for hand-labeled ground truth and scales to any prompt domain.

How it works

The collect tool calls gpt-4.1 (the heavyweight model) as the judge. For each prompt, the judge receives:
  1. The original prompt
  2. The drafter’s response
  3. A reference response generated by the heavyweight itself
The judge scores the draft on three criteria:
  • Factual accuracy compared to the reference
  • Completeness of the answer
  • Whether it would be acceptable to serve to a user
The judge returns a score from 1 to 5 and a boolean acceptable field. A score of 3 or above is treated as acceptable.
  • 1 = completely wrong or incoherent
  • 2 = partially correct but missing critical information
  • 3 = mostly correct, acceptable quality
  • 4 = good quality, minor differences from reference
  • 5 = equivalent or better than reference
The acceptable field from the judge is what the sweep uses as ground truth when computing the confusion matrix. See Threshold calibration for how those labels map to TP, TN, FP, and FN.
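The mapping can be sketched from the formulas quoted later on this page (draft accuracy = TN / (TN + FN), with FN meaning a bad draft served), which imply that "escalate" is the positive class. This is an illustration, not the sweep's actual code:

package sketch

// Verdict mirrors the two judge fields described above.
type Verdict struct {
	Score      int  `json:"score"`      // 1-5 rubric
	Acceptable bool `json:"acceptable"` // ground truth for the sweep
}

// classify labels one prompt's outcome, treating escalation as positive.
func classify(escalated bool, v Verdict) string {
	switch {
	case escalated && !v.Acceptable:
		return "TP" // correctly escalated a bad draft
	case escalated && v.Acceptable:
		return "FP" // unnecessary escalation: wasted heavyweight cost
	case !escalated && v.Acceptable:
		return "TN" // good draft served directly
	default:
		return "FN" // bad draft served to the user
	}
}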
LLM-as-judge evaluation is not perfect. The judge can disagree with human raters, especially on ambiguous or creative prompts. The calibration results should be treated as indicative rather than ground truth for your specific workload. Re-calibrate with your own prompts if you need domain-specific accuracy guarantees.

Benchmark results at T=2.0

These are the measured results from the reference calibration run at the selected threshold T=2.0.
| Metric | Value |
| --- | --- |
| Draft acceptance rate | 94% |
| Draft accuracy (LLM-as-judge) | 98.2% acceptable |
| Cost reduction vs. all-heavyweight | 91.6% |
| P99 latency (draft path) | 109ms at 50 req/s |
| Proxy overhead | < 5ms P99 |
The latency figures come from the load test tool described below. The accuracy and cost figures come from the threshold sweep.

Tools

The benchmark suite has three tools under benchmarks/cmd/: collect, sweep, and loadtest.

collect

Sends every prompt in the dataset to both the drafter and heavyweight, runs LLM-as-judge evaluation, and writes results to a JSONL file.
go run ./benchmarks/cmd/collect/ \
  --prompts benchmarks/testdata/prompts.json \
  --output benchmarks/results/collected.jsonl \
  --concurrency 3 \
  --rate-delay 500ms \
  --top-logprobs 5 \
  --max-tokens 1024
Flags
| Flag | Default | Description |
| --- | --- | --- |
| --prompts | benchmarks/testdata/prompts.json | Path to the prompt dataset |
| --output | benchmarks/results/collected.jsonl | Output JSONL path |
| --concurrency | 3 | Maximum concurrent API calls |
| --rate-delay | 500ms | Delay between prompt dispatches |
| --top-logprobs | 5 | Number of top logprobs to request per token |
| --max-tokens | 1024 | Maximum tokens per completion |
Requires OPENAI_API_KEY to be set in the environment. The tool is resumable: prompts already present in the output file (matched by ID) are skipped.

The output JSONL has one record per prompt. Each record includes the per-token entropy values from the drafter, the draft and heavyweight responses, token counts, latency measurements, and the judge result.
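A collected record might look roughly like the following; the field names are illustrative, since the exact schema is defined by the collect tool rather than documented here:

{
  "id": "code-042",
  "draft_response": "...",
  "heavyweight_response": "...",
  "token_entropies": [0.11, 0.28, 2.74],
  "draft_tokens": 118,
  "heavyweight_tokens": 131,
  "draft_latency_ms": 640,
  "judge": {"score": 4, "acceptable": true}
}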
sweep

Replays the collected JSONL offline at each candidate threshold, computing routing decisions and accuracy/cost metrics without making any API calls.
go run ./benchmarks/cmd/sweep/ \
  --input benchmarks/results/collected.jsonl \
  --output benchmarks/results/sweep.csv \
  --thresholds 0.5,0.75,1.0,1.25,1.5,1.75,2.0,2.25,2.5 \
  --window-size 10 \
  --early-exit-count 10
Flags
| Flag | Default | Description |
| --- | --- | --- |
| --input | benchmarks/results/collected.jsonl | Collected JSONL input path |
| --output | benchmarks/results/sweep.csv | Sweep CSV output path |
| --thresholds | 0.5,0.75,1.0,1.25,1.5 | Comma-separated threshold values to sweep |
| --window-size | 10 | Entropy window size (number of tokens) |
| --early-exit-count | 10 | Number of tokens above threshold before escalating |
Prints a human-readable summary table to stdout and writes the full per-threshold metrics (including raw TP, FP, TN, FN counts) to the CSV file. The sweep auto-selects the threshold with the highest F1 score where draft accuracy ≥ 95% and prints it at the end of the summary.
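The selection rule is simple enough to state as code. A minimal sketch, assuming one metrics row per threshold (these are not the sweep's actual type names):

package sketch

// Row holds the per-threshold metrics the sweep writes to CSV.
type Row struct {
	Threshold float64
	F1        float64
	Accuracy  float64 // draft accuracy, TN / (TN + FN)
}

// selectThreshold picks the highest-F1 row whose draft accuracy
// meets the 95% floor, as described above.
func selectThreshold(rows []Row) (best Row, ok bool) {
	for _, r := range rows {
		if r.Accuracy >= 0.95 && (!ok || r.F1 > best.F1) {
			best, ok = r, true
		}
	}
	return best, ok
}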
loadtest

Runs a load test against the gateway using vegeta. It starts an in-process mock OpenAI server that serves synthetic streaming responses with configurable logprobs, then drives traffic against the real gateway at a specified rate.
go run ./benchmarks/cmd/loadtest/ \
  --target http://localhost:8080/v1/chat/completions \
  --rate 50 \
  --duration 30s \
  --scenario confident \
  --tokens 20 \
  --chunk-delay 5ms
Flags
| Flag | Default | Description |
| --- | --- | --- |
| --target | http://localhost:8080/v1/chat/completions | Gateway endpoint to test |
| --rate | 50 | Requests per second |
| --duration | 30s | Test duration |
| --mock-port | 9999 | Port for the in-process mock OpenAI server |
| --scenario | confident | Test scenario: confident, mixed, or cache |
| --tokens | 20 | Tokens per mock response |
| --chunk-delay | 5ms | Delay between SSE chunks in the mock server |
Scenarios
| Scenario | Behaviour |
| --- | --- |
| confident | All mock responses have low-entropy logprobs (95% probability on the top token). The gateway accepts every draft. |
| mixed | 80% of responses are confident, 20% are uncertain (near-uniform logprobs). The gateway escalates ~20% of requests. |
| cache | All requests send the same fixed prompt ("What is 2+2?"). After the first request, the semantic cache should serve subsequent requests directly. |
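The "low-entropy" and "near-uniform" descriptions refer to Shannon entropy over the mock token distributions. As a rough illustration (not the gateway's actual implementation, which may handle tail probability mass differently), entropy in nats can be computed from a token's returned logprobs like this:

package main

import (
	"fmt"
	"math"
)

// entropy computes Shannon entropy in nats from a token's top logprobs:
// H = -sum(p * ln p), with p = exp(logprob).
func entropy(logprobs []float64) float64 {
	var h float64
	for _, lp := range logprobs {
		h -= math.Exp(lp) * lp
	}
	return h
}

func main() {
	// 95% on the top token, remainder spread over four alternatives,
	// as in the confident scenario.
	confident := []float64{
		math.Log(0.95), math.Log(0.0125), math.Log(0.0125),
		math.Log(0.0125), math.Log(0.0125),
	}
	fmt.Printf("confident: %.2f nats\n", entropy(confident)) // ≈ 0.27, far below T=2.0
}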
The tool prints P50, P95, P99, and Max latencies, success rate, and status code counts on completion.

The P99 latency of 109ms at 50 req/s quoted in the README was measured with the confident scenario, 20 tokens per response, 5ms chunk delay, and a 30-second test duration. This measures the full draft path through the gateway with the mock upstream providing instant synthetic responses: it captures gateway overhead (entropy analysis, routing logic, HTTP handling) without real model inference latency.

Running the full benchmark suite

Step 1: Start the gateway and infrastructure

The load test requires a running gateway. Start the infrastructure and gateway first.
docker compose up -d
export OPENAI_API_KEY=sk-...
go build -o draft-thinker ./cmd/gateway
./draft-thinker --config config.yaml
Step 2: Collect token records

Run the collector. This step calls the real OpenAI API; runtime scales with dataset size and inversely with --concurrency. With the default 518-prompt dataset and --concurrency 3, expect 20-40 minutes.
go run ./benchmarks/cmd/collect/ \
  --prompts benchmarks/testdata/prompts.json \
  --output benchmarks/results/collected.jsonl \
  --concurrency 3
Step 3: Run the threshold sweep

Sweep candidate thresholds over the collected data. This is fast — no API calls, pure computation.
go run ./benchmarks/cmd/sweep/ \
  --input benchmarks/results/collected.jsonl \
  --output benchmarks/results/sweep.csv \
  --thresholds 1.0,1.25,1.5,1.75,2.0,2.25,2.5
The selected threshold is printed to stdout. Update entropy.threshold in config.yaml and restart the gateway if the selected value differs from the current one.
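The exact layout of config.yaml isn't shown on this page; a minimal sketch of the relevant fragment, where only the entropy.threshold key is confirmed by the text:

# config.yaml (fragment; surrounding keys are an assumption)
entropy:
  threshold: 2.0   # value selected by the sweep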
Step 4: Run the load test

Run the load test against the running gateway with the confident scenario to measure draft-path latency.
go run ./benchmarks/cmd/loadtest/ \
  --target http://localhost:8080/v1/chat/completions \
  --rate 50 \
  --duration 30s \
  --scenario confident

Interpreting results

Draft acceptance rate

The fraction of requests where the gateway served the drafter’s response directly. At T=2.0 on the reference dataset, this is 94%. A higher acceptance rate means lower cost per request but more exposure to FN errors (bad drafts served).

In production, monitor this as draftthinker_routing_decisions_total{decision="accept"} divided by total routing decisions.
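In PromQL, assuming the counter exposes the decision label as shown, that ratio over a 5-minute window would be:

sum(rate(draftthinker_routing_decisions_total{decision="accept"}[5m]))
/
sum(rate(draftthinker_routing_decisions_total[5m]))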
Draft accuracy (LLM-as-judge)

The fraction of accepted drafts that the LLM judge rated as acceptable (score ≥ 3). This is TN / (TN + FN). At T=2.0, this is 98.2%.

This figure is specific to the benchmark dataset and the judge model. On a domain-specific workload where the drafter is less reliable, accuracy at the same threshold will be lower. Re-calibrate with representative prompts if your workload differs significantly from the general-purpose benchmark set.
Cost reduction vs. all-heavyweight

The reduction in estimated token cost compared to routing every request to the heavyweight. At T=2.0, this is 91.6%. Cost is estimated from token counts and per-model pricing (drafter: $0.20/M input, $0.80/M output; heavyweight: $2.50/M input, $10.00/M output).
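A sketch of that cost model (whether escalated requests also pay for the abandoned draft tokens is an accounting detail this page doesn't specify):

package sketch

// Pricing is USD per 1M tokens, per the figures quoted above.
type Pricing struct{ In, Out float64 }

var (
	drafter     = Pricing{In: 0.20, Out: 0.80}
	heavyweight = Pricing{In: 2.50, Out: 10.00}
)

// cost estimates the spend for one request from its token counts.
func cost(p Pricing, inTok, outTok int) float64 {
	return float64(inTok)/1e6*p.In + float64(outTok)/1e6*p.Out
}

// reduction compares actual spend to the all-heavyweight baseline.
func reduction(actual, baseline float64) float64 {
	return 1 - actual/baseline
}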
P99 latency (draft path)

The 99th percentile end-to-end latency for requests that go through the draft path. At 50 req/s with a mock upstream and T=2.0, this is 109ms. This measures gateway processing overhead (entropy window computation, routing decision, HTTP stream handling), not upstream model inference time. Real-world latency includes model response time and network round trips to the API provider.
Proxy overhead

The gateway’s own processing time, excluding upstream model inference. Target is < 5ms P99. The entropy window algorithm is pure in-memory computation on a small sliding window of float64 values; overhead is dominated by I/O, not computation.
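One plausible reading of that algorithm, sketched from the sweep's --window-size and --early-exit-count flags (whether the gateway counts within the window or cumulatively is an assumption):

package sketch

// Window tracks the most recent token entropies and signals escalation
// once enough of them exceed the threshold.
type Window struct {
	size, earlyExit int       // defaults per the sweep flags: 10, 10
	threshold       float64   // e.g. the calibrated T=2.0
	values          []float64 // sliding window of per-token entropies
}

// Push records one token's entropy and reports whether to escalate.
func (w *Window) Push(e float64) bool {
	w.values = append(w.values, e)
	if len(w.values) > w.size {
		w.values = w.values[1:]
	}
	above := 0
	for _, v := range w.values {
		if v > w.threshold {
			above++
		}
	}
	return above >= w.earlyExit
}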
The load test always starts an in-process mock OpenAI server, and the gateway must be configured to use it as its upstream (with drafter.base_url and heavyweight.base_url pointing at http://localhost:<mock-port>). This isolates gateway overhead from real model latency: the 109ms P99 figure measures only the gateway’s own processing cost (entropy computation, routing logic, and HTTP stream handling), not real API round-trip time.
