Draft Thinker routes each request by measuring the drafter model’s confidence during generation, not by predicting difficulty beforehand. The mechanism is Shannon entropy computed over token logprobs.

The problem with prompt classifiers

A prompt classifier labels requests as “easy” or “hard” before any model touches them. This works on the training distribution but fails in production, because questions that look syntactically simple can require complex reasoning depending on context. “What’s the current Fed rate?” is three words, yet answering it correctly may depend on fresh data, jurisdiction-specific context, or numerical calculation, none of which a prompt-only classifier can detect. Entropy-based routing avoids this entirely. Rather than predicting difficulty before generation, it measures actual model confidence during generation. A drafter that is certain produces narrow, peaked token distributions; a drafter that is confused spreads probability mass across many candidates. That signal is available token by token and is robust to novel query patterns.

Shannon entropy over logprobs

For each generated token, the drafter returns logprobs for the top-k candidates. The per-token entropy is:
H = -Σ p(x) log₂ p(x)
Low H means the model assigned most probability mass to one token — confident. High H means probability is spread across many alternatives — uncertain. The drafter returns logprobs as natural-log values. ComputeEntropy in internal/entropy/entropy.go converts them to probabilities, normalises, and applies the Shannon formula:
// ComputeEntropy returns the Shannon entropy, in bits, of the
// distribution described by a slice of natural-log probabilities.
func ComputeEntropy(topLogprobs []float64) float64 {
	if len(topLogprobs) == 0 {
		return 0
	}

	// Convert natural-log values back to probabilities.
	probs := make([]float64, len(topLogprobs))
	sum := 0.0
	for i, lp := range topLogprobs {
		p := math.Exp(lp)
		probs[i] = p
		sum += p
	}

	if sum == 0 {
		return 0
	}

	// Normalise the truncated distribution, then apply H = -Σ p log₂ p.
	h := 0.0
	for _, p := range probs {
		p /= sum
		if p > 0 {
			h -= p * math.Log2(p)
		}
	}

	return h
}
The normalisation step (p /= sum) handles cases where the top-k logprobs do not sum to 1.0 — a common property of truncated distributions returned by hosted APIs.
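
To make the scale concrete, here is a quick sanity check of the two extremes, reusing ComputeEntropy from above (the distributions are illustrative, not drawn from the calibration set):

package main

import (
	"fmt"
	"math"
)

func main() {
	// Confident: ~97% of the mass on one token → roughly 0.24 bits.
	confident := []float64{math.Log(0.97), math.Log(0.01), math.Log(0.01), math.Log(0.01)}

	// Uncertain: mass spread evenly over four candidates → exactly 2.0 bits,
	// which is the calibrated threshold T.
	uncertain := []float64{math.Log(0.25), math.Log(0.25), math.Log(0.25), math.Log(0.25)}

	fmt.Printf("confident: %.2f bits\n", ComputeEntropy(confident)) // 0.24
	fmt.Printf("uncertain: %.2f bits\n", ComputeEntropy(uncertain)) // 2.00
}

A useful way to read T = 2.0: escalation fires when the drafter is, on average, as uncertain as a model choosing among four equally likely next tokens.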

Sliding window smoothing

Individual token entropy is noisy. Rare proper nouns, punctuation, and code identifiers all produce momentary entropy spikes that do not indicate reasoning failure. Computing a routing decision on every token would cause spurious escalations. The solution is a sliding window average over the last 10 tokens (WindowConfig.Size = 10). The window tracks a running sum and uses a circular buffer to evict the oldest value on each addition:
func (w *Window) Add(entropy float64) Decision {
	// Evict the oldest sample once the circular buffer is full.
	if w.filled {
		w.sum -= w.buf[w.pos]
	}

	w.buf[w.pos] = entropy
	w.sum += entropy
	w.pos = (w.pos + 1) % w.cfg.Size
	w.count++

	if w.count >= w.cfg.Size {
		w.filled = true
	}

	// Early exit: a single high-entropy token among the first
	// EarlyExitCount tokens escalates immediately.
	if w.count <= w.cfg.EarlyExitCount && entropy > w.cfg.Threshold {
		return Escalate
	}

	// Steady state: escalate only when the windowed average exceeds T.
	if w.filled && w.Average() > w.cfg.Threshold {
		return Escalate
	}

	return Continue
}
A routing decision of Escalate is returned only when the windowed average exceeds the calibrated threshold T. Before the window fills, the early-exit path applies (see below).
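
For reference, here is a minimal sketch of the types Add relies on, reconstructed from the method body above; the actual definitions in internal/entropy may differ:

type Decision int

const (
	Continue Decision = iota
	Escalate
)

type WindowConfig struct {
	Size           int     // tokens in the sliding window (10)
	EarlyExitCount int     // tokens eligible for single-token early exit
	Threshold      float64 // calibrated entropy threshold T (2.0)
}

type Window struct {
	cfg    WindowConfig
	buf    []float64 // circular buffer of per-token entropies
	sum    float64   // running sum of buffered entropies
	pos    int       // next write position in buf
	count  int       // total tokens observed so far
	filled bool      // true once count >= cfg.Size
}

// Average returns the mean entropy over the window. Add only consults
// it once the window is filled, so dividing by Size is safe.
func (w *Window) Average() float64 {
	return w.sum / float64(w.cfg.Size)
}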

Early exit

If any of the first EarlyExitCount tokens individually exceed T, the drafter is aborted immediately. There is no point completing a response that will be discarded. The router drains the drafter channel in a background goroutine to avoid blocking the upstream connection:
if decision == entropy.Escalate {
	result.Decision = entropy.Escalate
	// Drain the remaining drafter chunks off the hot path so the
	// producer can finish without blocking the upstream connection.
	go func() {
		for range chunks {
		}
	}()
	return result, nil
}
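
Putting the pieces together, the per-token loop looks roughly like this. This is a sketch, not the actual router code; the Chunk shape and field names are assumptions:

// Hypothetical chunk shape: one generated token plus its top-k logprobs.
type Chunk struct {
	Token       string
	TopLogprobs []float64
}

func routeDraft(chunks <-chan Chunk, w *entropy.Window) entropy.Decision {
	for chunk := range chunks {
		h := entropy.ComputeEntropy(chunk.TopLogprobs)
		if w.Add(h) == entropy.Escalate {
			// Abort the draft; the caller forwards the request
			// to the heavyweight model.
			return entropy.Escalate
		}
	}
	// Windowed entropy stayed below T for the full response.
	return entropy.Continue
}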

Routing decisions

The router produces one of two outcomes, recorded in the draftthinker_routing_decisions_total counter under the decision label:
| Decision | Meaning |
| --- | --- |
| accept | Windowed entropy stayed below T for the full response. Draft is served to the client. |
| escalate | Windowed entropy exceeded T (or early exit fired). Request is forwarded to the heavyweight model. |
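
The counter could be declared with the standard Prometheus Go client along these lines. This is a sketch; the actual metric registration in Draft Thinker may differ:

import "github.com/prometheus/client_golang/prometheus"

var routingDecisions = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "draftthinker_routing_decisions_total",
		Help: "Routing outcomes, labelled accept or escalate.",
	},
	[]string{"decision"},
)

func init() { prometheus.MustRegister(routingDecisions) }

// Incremented once per completed request, e.g.:
//   routingDecisions.WithLabelValues("escalate").Inc()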

Calibrated threshold T = 2.0

T is not a guess — it is selected empirically. The sweep tool in benchmarks/cmd/sweep/ runs the drafter across a labelled benchmark set at each candidate threshold and records escalation rate, draft accuracy, cost reduction, and F1. Calibration results — 518 prompts, gpt-4.1-nano drafter, gpt-4.1 heavyweight:
| Threshold | Escalation rate | Draft accuracy | Cost reduction | F1 |
| --- | --- | --- | --- | --- |
| 1.00 | 68.9% | 100.0% | 8.2% | 0.06 |
| 1.25 | 49.2% | 99.6% | 31.0% | 0.08 |
| 1.50 | 30.9% | 98.6% | 56.2% | 0.07 |
| 1.75 | 13.9% | 98.4% | 81.2% | 0.10 |
| 2.00 | 6.0% | 98.2% | 91.6% | 0.10 |
| 2.25 | 0.4% | 97.9% | 99.0% | 0.00 |
| 2.50 | 0.0% | 97.9% | 99.2% | 0.00 |
The sweep tool selects the threshold with the highest F1 score among those where draft accuracy is at or above 95%. T = 2.0 achieves 94% draft acceptance, 98.2% accuracy, and 91.6% cost reduction versus an all-heavyweight baseline. At T = 2.25 and above, the escalation rate collapses to near zero: the threshold is so permissive that almost nothing escalates, F1 drops to 0.00, and the marginal cost savings come with a small accuracy loss. T = 2.0 is the last point where the router is still meaningfully discriminating.
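
The selection rule is simple enough to express directly. This is a sketch of the logic the sweep tool applies, not its actual code, and the Result struct and tie-break are assumptions:

type Result struct {
	Threshold     float64
	DraftAccuracy float64 // fraction of accepted drafts judged correct
	CostReduction float64
	F1            float64
}

// selectThreshold returns the highest-F1 threshold whose draft accuracy
// meets the 95% floor, breaking F1 ties toward higher cost reduction.
// (The tie-break is an assumption; it reproduces the choice of T = 2.0
// over T = 1.75, which tie at F1 = 0.10 in the table above.)
func selectThreshold(results []Result) (best Result, ok bool) {
	for _, r := range results {
		if r.DraftAccuracy < 0.95 {
			continue
		}
		better := r.F1 > best.F1 ||
			(r.F1 == best.F1 && r.CostReduction > best.CostReduction)
		if !ok || better {
			best, ok = r, true
		}
	}
	return best, ok
}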

Known failure mode: confident hallucination

The drafter can produce a confidently wrong answer. If the model assigns high probability to an incorrect token sequence, for example a plausible but outdated fact, the window entropy stays low and the routing decision is accept. The response is served without escalation. This is the fundamental limitation of entropy-based routing. Mitigations in Draft Thinker:
  1. Periodic accuracy audits on a sample of draft-accepted responses using an LLM-as-judge evaluation
  2. Downstream feedback loop — clients can flag bad responses for manual review and cache eviction
  3. Conservative threshold — T = 2.0 errs toward escalation rather than acceptance at the calibration boundary
  4. Accepted tradeoff — the system optimizes cost, not error elimination; escalation is a mechanism to catch uncertain responses, not incorrect ones
