Quick Start
Build and run the gateway in under five minutes
Configuration
Configure models, entropy thresholds, cache, and more
How It Works
Understand the entropy-based routing algorithm
API Reference
Full endpoint documentation with request and response schemas
Key capabilities
Entropy-based routing
Computes Shannon entropy over drafter token logprobs in a 10-token sliding window. Routes to heavyweight only when the drafter is genuinely uncertain.
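A minimal sketch of that routing test, assuming access to the drafter's per-token top-k logprobs (the class and function names are illustrative; the 10-token window comes from this section and the threshold T=2.0 from the calibration notes below):

```python
import math
from collections import deque

WINDOW = 10      # sliding window of recent drafter tokens (from the text)
THRESHOLD = 2.0  # entropy threshold T=2.0 (from the calibration section)

def token_entropy(top_logprobs):
    """Shannon entropy (nats) of one token's top-k logprobs,
    renormalized so the truncated distribution sums to 1."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs)

class EntropyRouter:
    def __init__(self):
        self.window = deque(maxlen=WINDOW)

    def observe(self, top_logprobs):
        """Feed one drafted token's logprobs; return True when the
        windowed mean entropy says the heavyweight should take over."""
        self.window.append(token_entropy(top_logprobs))
        return sum(self.window) / len(self.window) > THRESHOLD
```

A confident token (one dominant logprob) contributes near-zero entropy; a near-uniform distribution over k choices contributes roughly log(k), so sustained uncertainty pushes the windowed mean over the threshold.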
Speculative execution
Fires the heavyweight model in parallel when early tokens show elevated uncertainty. Cancels if the drafter recovers — eliminates serial draft-then-verify latency.
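The fire-in-parallel-then-cancel shape can be sketched with asyncio, assuming the drafter streams tokens with an uncertainty signal (`draft`, `heavy`, and `is_uncertain` are illustrative stand-ins, not the gateway's actual interfaces):

```python
import asyncio

async def speculative_generate(draft, heavy, is_uncertain):
    """Stream from the drafter; on elevated uncertainty, start the
    heavyweight in parallel. Cancel it if the drafter recovers, so the
    draft path never pays serial draft-then-verify latency."""
    heavy_task = None
    tokens = []
    async for tok, signal in draft():
        tokens.append(tok)
        if heavy_task is None and is_uncertain(signal):
            # Elevated uncertainty: speculatively fire the heavyweight.
            heavy_task = asyncio.create_task(heavy())
        elif heavy_task is not None and not is_uncertain(signal):
            # Drafter recovered: cancel the speculative call.
            heavy_task.cancel()
            heavy_task = None
    if heavy_task is not None:
        # Drafter stayed uncertain to the end: take the heavyweight's answer.
        return await heavy_task
    return "".join(tokens)
```

A production version would also stream the heavyweight's output and handle mid-stream handoff; this sketch only shows the fire/cancel decision.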
Semantic cache
Embeds incoming prompts and looks up semantically similar past responses via Qdrant (cosine similarity > 0.95). Bypasses the entire inference pipeline on cache hits.
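The lookup logic reduces to a nearest-neighbor search over prompt embeddings with the 0.95 cosine cutoff; an in-memory sketch (the real gateway uses Qdrant and an embedding model for this, both omitted here):

```python
import math

SIM_THRESHOLD = 0.95  # cosine similarity cutoff from the text

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """In-memory stand-in for the Qdrant lookup: stores
    (embedding, response) pairs and returns a cached response
    when a new prompt embedding is similar enough."""

    def __init__(self):
        self.entries = []

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) > SIM_THRESHOLD:
            return best[1]  # cache hit: the inference pipeline is skipped
        return None         # cache miss: fall through to routing
```

On a hit, no drafter or heavyweight call happens at all, which is why cache performance gets its own metrics.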
OpenAI-compatible
Implements the same POST /v1/chat/completions interface. Point your existing OpenAI client at localhost:8080 and it works immediately.
Prometheus + Grafana
Ships with 11 custom Prometheus metrics covering throughput, entropy distributions, routing decisions, speculative execution, and cache performance. Pre-built Grafana dashboard included.
Single-command deploy
Docker Compose spins up the gateway, Redis, Qdrant, Prometheus, and Grafana together. One command to a fully instrumented local environment.
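Because the gateway speaks the OpenAI-compatible interface described above, a request is just a standard chat-completions POST to localhost:8080. A sketch using only the standard library (the model name is illustrative; the route and port come from the text):

```python
import json
import urllib.request

# Standard OpenAI chat-completions body -- unchanged for this gateway.
url = "http://localhost:8080/v1/chat/completions"
body = {
    "model": "drafter",  # illustrative model name
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
}
req = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted so this sketch
# does not require a running gateway.
```

An existing OpenAI SDK client works the same way: set its base URL to http://localhost:8080/v1 and leave the rest of the code untouched.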
Measured results
Calibrated on 518 prompts across four categories (simple factual, multi-step reasoning, code generation, ambiguous/creative) using LLM-as-judge evaluation at threshold T=2.0:

| Metric | Result |
|---|---|
| TCO reduction vs all-heavyweight baseline | 91.6% |
| Draft acceptance rate | 94% |
| Draft accuracy (LLM-as-judge) | 98.2% acceptable |
| P99 latency (draft path) | 109ms at 50 req/s |
| Proxy overhead | < 5ms P99 |