
What is Draft Thinker?

Draft Thinker is a cost-aware LLM gateway written in Go. It sits between your application and LLM providers, routing each request through a fast, cheap model first and only escalating to an expensive frontier model when necessary. The result: 91.6% total cost of ownership (TCO) reduction compared to sending all traffic to a heavyweight model, with 98.2% accuracy on the draft path.

The problem it solves

LLM-powered applications typically send 100% of traffic to frontier models regardless of query complexity. A question like “What are your hours?” costs the same as “Explain the tradeoffs between B-tree and LSM-tree storage engines.” This is wasteful in three ways:
  • Cost: 70%+ of queries are answerable by models costing 10–50x less.
  • Latency: Frontier models have 2–5x higher time-to-first-token than small models.
  • Scale: At high throughput, frontier model rate limits become the bottleneck, not your application.
The hard part is knowing when the cheap model is good enough — without already having the correct answer. Prompt classifiers that predict difficulty before generation fail on distribution shift: a syntactically simple question can require complex reasoning depending on context.

The core insight

Draft Thinker solves this by analyzing the drafter model’s own confidence signals during generation. Every token a model produces comes with log-probabilities for its top candidates. High entropy (uncertainty) across those candidates means the model is guessing. Low entropy means it’s confident. The gateway watches these signals in real time as the drafter generates. If confidence stays high throughout, it ships the draft. If confidence drops, it escalates to the heavyweight. This makes routing decisions based on actual model behavior, not predicted query difficulty.

Three core mechanisms

Entropy-based routing

Computes Shannon entropy over the drafter’s token log-probabilities using a sliding window of 10 tokens. If windowed entropy exceeds the calibrated threshold T=2.0 bits at any point, the request is escalated. If the first 10 tokens already exceed T, the draft is aborted immediately to avoid wasting compute.
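The sliding-window rule above can be sketched as a small Go type. One assumption to flag: "windowed entropy" is taken here to be the mean over the last 10 tokens; whether the real gateway aggregates by mean or max over the window is not specified in this overview. Because escalation can fire the moment the window first fills, the early-abort rule (first 10 tokens already over T) falls out of the same check.

```go
package main

import "fmt"

// windowedRouter applies the escalation rule to a stream of per-token
// entropies. Sketch only: mean-over-window is an assumed aggregation.
type windowedRouter struct {
	window    []float64
	size      int     // sliding window length (10 in the calibrated config)
	threshold float64 // T, in bits (2.0 in the calibrated config)
}

// observe ingests one token's entropy and reports whether to escalate.
func (r *windowedRouter) observe(entropy float64) bool {
	r.window = append(r.window, entropy)
	if len(r.window) > r.size {
		r.window = r.window[1:] // drop the oldest token
	}
	if len(r.window) < r.size {
		return false // window not yet full; too early to judge
	}
	var sum float64
	for _, h := range r.window {
		sum += h
	}
	return sum/float64(len(r.window)) > r.threshold
}

func main() {
	r := &windowedRouter{size: 10, threshold: 2.0}
	// Nine confident tokens, then a run of uncertain ones.
	stream := []float64{
		0.3, 0.4, 0.2, 0.5, 0.3, 0.4, 0.2, 0.3, 0.4,
		2.8, 2.9, 3.1, 2.7, 3.0, 2.8, 2.9, 3.2, 2.6, 2.8, 3.0,
	}
	for i, h := range stream {
		if r.observe(h) {
			fmt.Printf("escalate at token %d\n", i)
			return
		}
	}
	fmt.Println("draft accepted")
}
```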

Speculative execution

When early tokens show elevated but not yet critical uncertainty (entropy > 0.8 × T), Draft Thinker fires a parallel request to the heavyweight model. If the drafter recovers, the heavyweight call is canceled. If not, the heavyweight already has a head start — eliminating the full double-latency penalty of naive serial draft-then-verify.
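In Go this pattern falls naturally out of contexts and goroutines. The sketch below simulates both outcomes with timers; callHeavyweight is a hypothetical stand-in for the real escalation request, and the latencies are illustrative only.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callHeavyweight is a hypothetical stand-in for the escalation request;
// the real gateway would call the heavyweight provider here.
func callHeavyweight(ctx context.Context, prompt string) (string, error) {
	select {
	case <-time.After(200 * time.Millisecond): // simulated provider latency
		return "heavyweight answer", nil
	case <-ctx.Done(): // drafter recovered: the call is abandoned
		return "", ctx.Err()
	}
}

// speculate fires the heavyweight call as soon as entropy crosses the
// pre-escalation threshold (0.8*T), then cancels it if the drafter recovers.
func speculate(prompt string, drafterRecovers bool) string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	heavy := make(chan string, 1)
	go func() {
		if res, err := callHeavyweight(ctx, prompt); err == nil {
			heavy <- res
		}
	}()

	time.Sleep(50 * time.Millisecond) // simulated remainder of the draft stream
	if drafterRecovers {
		cancel() // confidence recovered: abandon the head start
		return "draft answer"
	}
	return <-heavy // stayed uncertain: heavyweight is already 50 ms in flight
}

func main() {
	fmt.Println(speculate("...", true))  // draft answer
	fmt.Println(speculate("...", false)) // heavyweight answer
}
```

The buffered channel lets the goroutine exit cleanly whether or not the result is consumed, and the deferred cancel releases the context in both paths.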

Semantic cache

Previously verified prompt–response pairs are stored as embeddings in Qdrant. If an incoming prompt is semantically similar (cosine similarity > 0.95) to a cached entry, the response is returned directly — bypassing the entire draft-verify cycle. Only draft-accepted responses are cached; escalated responses are not.
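In production the nearest-neighbor search happens inside Qdrant; the sketch below only shows the acceptance rule itself — cosine similarity against a candidate embedding, accepted above 0.95. The three-dimensional vectors are toy values for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity between two embedding vectors of equal length.
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// cacheHit applies the semantic-cache acceptance rule to the closest
// cached embedding returned by the vector store.
func cacheHit(query, cached []float64) bool {
	return cosineSimilarity(query, cached) > 0.95
}

func main() {
	fmt.Println(cacheHit([]float64{1, 0, 0}, []float64{1, 0.05, 0})) // near-duplicate prompt: true
	fmt.Println(cacheHit([]float64{1, 0, 0}, []float64{0, 1, 0}))    // unrelated prompt: false
}
```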

OpenAI-compatible API

The gateway exposes a POST /v1/chat/completions endpoint that is a drop-in replacement for the OpenAI API. The model field in the request is overridden internally; your application does not need to know which model handled the request.

Key results

Calibrated on 518 prompts across four categories — simple factual, multi-step reasoning, code generation, and ambiguous/creative — using LLM-as-judge evaluation:
Metric                              Value
TCO reduction vs. all-heavyweight   91.6% (at T=2.0)
Draft acceptance rate               94% of requests served by drafter
Accuracy on draft path              98.2% acceptable (LLM-as-judge)
P99 latency (draft path)            109 ms at 50 req/s
Proxy overhead                      < 5 ms P99
Calibrated threshold                T = 2.0 (Shannon entropy in bits, 10-token sliding window)

Tech stack

Component           Technology
Gateway             Go net/http — goroutines for concurrent I/O, no framework overhead
Entropy engine      Go math — pure math, no cross-language boundary
Drafter model       OpenAI gpt-4.1-nano — fast, cheap, returns logprobs
Heavyweight model   OpenAI gpt-4.1 — escalation target
Vector cache        Qdrant — nearest-neighbor lookup for semantic cache
KV store            Redis — TTLs, metadata, rate counters
Observability       Prometheus + Grafana — cost/request, entropy distributions, cache hit rate
Deployment          Docker Compose — single command spins up all services
No Python is in the hot path. The draft-verify state machine is a Go switch statement. Cross-language IPC would add latency that contradicts the project’s core value proposition.
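The overview says the draft-verify state machine is a Go switch statement; the sketch below shows what such a machine could look like. State and event names here are illustrative assumptions, not the project's actual identifiers.

```go
package main

import "fmt"

// state models one request's position in the draft-verify lifecycle.
type state int

const (
	stateCacheLookup state = iota
	stateDrafting
	stateSpeculating
	stateEscalated
	stateDone
)

// next advances the lifecycle on an event. Unknown events leave the
// state unchanged.
func next(s state, event string) state {
	switch s {
	case stateCacheLookup:
		if event == "cache_hit" {
			return stateDone
		}
		return stateDrafting
	case stateDrafting:
		switch event {
		case "entropy_elevated": // > 0.8*T: fire speculative heavyweight call
			return stateSpeculating
		case "entropy_high": // > T: escalate outright
			return stateEscalated
		case "draft_complete":
			return stateDone
		}
	case stateSpeculating:
		switch event {
		case "drafter_recovered": // cancel the heavyweight call
			return stateDrafting
		case "entropy_high":
			return stateEscalated
		}
	case stateEscalated:
		if event == "heavyweight_complete" {
			return stateDone
		}
	}
	return s
}

func main() {
	s := stateCacheLookup
	for _, ev := range []string{"cache_miss", "entropy_elevated", "entropy_high", "heavyweight_complete"} {
		s = next(s, ev)
	}
	fmt.Println(s == stateDone) // true
}
```

Keeping the whole machine as a single switch over in-process values is what makes the sub-5 ms proxy overhead plausible: no serialization, no IPC, no cross-language boundary.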
Known limitation: confident hallucination. The drafter can produce a wrong answer with low entropy — meaning the routing decision is “accept” but the output is incorrect. This is the fundamental limitation of entropy-based routing. It is mitigated by periodic accuracy audits, downstream feedback loops, and a conservative initial threshold. It is a documented tradeoff, not a bug.

Architecture overview

                        Client
                          │
                          ▼
┌──────────────────────────────────────────────────┐
│                   GATEWAY (Go)                   │
│                                                  │
│  ┌───────────┐   ┌──────────┐   ┌───────────┐    │
│  │  Ingress  │──▶│ Semantic │──▶│  Drafter  │    │
│  │  (Auth,   │   │  Cache   │   │   Pool    │    │
│  │  Validate)│   │ (Qdrant) │   │ (OpenAI)  │    │
│  └───────────┘   └────┬─────┘   └─────┬─────┘    │
│                   HIT │        tokens │          │
│                       │         ┌─────▼─────┐    │
│                       │         │  Entropy  │    │
│                       │         │  Analyzer │    │
│                       │         └─────┬─────┘    │
│                       │     LOW  ┌────┴────┐ HIGH│
│                       │          │         │     │
│                       │          ▼         ▼     │
│                       │      ACCEPT   ESCALATE   │
│                       │          │   ┌─────▼───┐ │
│                       │          │   │Heavy API│ │
│                       │          │   │(OpenAI) │ │
│                       │          │   └─────┬───┘ │
│                       ▼          ▼         ▼     │
│              ┌─────────────────────────────────┐ │
│              │            Response             │ │
│              └────────────────┬────────────────┘ │
└───────────────────────────────┬──────────────────┘
                                │
                                ▼
                             Client

Next steps

Ready to run Draft Thinker locally?

Quick start

Get Draft Thinker running in under five minutes.
