This guide walks you through building the gateway, starting the supporting infrastructure, and sending your first request.

Prerequisites

  • Go 1.22 or later
  • Docker and Docker Compose
  • An OpenAI API key with access to gpt-4.1-nano and gpt-4.1

Get up and running

1. Clone and build

Clone the repository and compile the gateway binary.
git clone https://github.com/trnahnh/draft-thinker.git
cd draft-thinker
go build -o draft-thinker ./cmd/gateway
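During development you can also skip the separate build step; standard Go tooling compiles and runs the gateway in one command, passing flags through unchanged (config.yaml is the default configuration file used in step 4):
go run ./cmd/gateway --config config.yaml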
2. Start infrastructure

Docker Compose starts Redis, Qdrant, Prometheus, and Grafana in the background.
docker compose up -d
Service      Port                       Purpose
Redis        6379                       Cache metadata and TTLs
Qdrant       6333 (HTTP), 6334 (gRPC)   Vector similarity store
Prometheus   9090                       Metrics scraper
Grafana      3000                       Dashboards
Qdrant persists data to a named Docker volume (qdrant_data). Cache entries survive container restarts.
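To confirm everything came up, list the running services and check for the volume. Note that Docker Compose prefixes named volumes with the project name, so qdrant_data may appear as draft-thinker_qdrant_data:
docker compose ps
docker volume ls | grep qdrant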
3. Set your API key

The gateway requires your OpenAI API key at startup. Export it in the same shell where you’ll run the binary.
export OPENAI_API_KEY=sk-...
The gateway refuses to start without OPENAI_API_KEY. Redis and Qdrant URLs default to localhost:6379 and http://localhost:6333 when running outside Docker Compose.
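If startup fails with a connection error, verify both services are reachable first. Redis answers PING, and Qdrant reports its version at the HTTP root (the redis-cli call runs inside the container and assumes the Compose service is named redis):
docker compose exec redis redis-cli ping   # expect: PONG
curl http://localhost:6333/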
4. Run the gateway

Start the gateway using the default configuration file. It listens on port 8080.
./draft-thinker --config config.yaml
You should see:
2024/01/01 00:00:00 semantic cache enabled (threshold=0.95, ttl=3600s)
2024/01/01 00:00:00 starting gateway on :8080
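A quick way to confirm the gateway is up is to hit its Prometheus metrics endpoint (the same endpoint Prometheus scrapes in step 7):
curl -s http://localhost:8080/metrics | head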
5. Send a request

The gateway implements the same interface as OpenAI’s chat completions endpoint. Point any OpenAI-compatible client at http://localhost:8080.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'
The model field is ignored — routing is determined entirely by the entropy analysis.
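Because the response mirrors OpenAI's chat completions schema, you can extract just the answer with jq:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2?"}]}' \
  | jq -r '.choices[0].message.content'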
6. Inspect routing decisions

Every response includes headers showing how the request was routed and how long it took.
curl -si http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | grep -E "X-Routing|X-Request"
Example output:
X-Routing-Decision: accept
X-Request-Duration-Ms: 312
X-Request-ID: a3f2b8c1d4e5f607
The X-Routing-Decision header takes one of three values:
  • accept — the drafter’s response was served directly (low entropy, confident answer)
  • escalate — entropy exceeded the threshold; the heavyweight model responded
  • cache_hit — a semantically similar prompt was found in cache; no model was called
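To see cache_hit yourself, send the identical request twice. The first call is served by a model; the second should be answered from the semantic cache, assuming the cache is enabled as in the startup log above:
for i in 1 2; do
  curl -si http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2?"}]}' \
    | grep "X-Routing-Decision"
done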
7. View the Grafana dashboard

Open http://localhost:3000 in your browser and log in with admin / admin. The pre-built dashboard shows request rates, routing decision breakdown, latency percentiles, entropy distribution, and cache hit rate, all updating in real time as you send requests.
Prometheus scrapes the gateway at http://gateway:8080/metrics every 15 seconds. Give it a few seconds after your first request before data appears in the dashboard.
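You can also ask Prometheus directly whether the scrape is healthy; its standard HTTP query API reports an up value of 1 for each reachable target:
curl -s 'http://localhost:9090/api/v1/query?query=up'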

Try a harder request

Send a multi-step reasoning question to see escalation in action:
curl -si http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Explain the tradeoffs between B-tree and LSM-tree storage engines in the context of write-heavy workloads with infrequent reads."}
    ]
  }' | grep "X-Routing-Decision"
This type of query typically produces higher token-level entropy in the drafter, causing escalation to the heavyweight model. You should see X-Routing-Decision: escalate.
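For a side-by-side comparison, a small shell helper (a convenience sketch, not part of the repo) prints the routing decision and latency for any prompt:
ask() {
  curl -si http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"auto\", \"messages\": [{\"role\": \"user\", \"content\": \"$1\"}]}" \
    | grep -E "X-Routing-Decision|X-Request-Duration-Ms"
}
ask "What is 2+2?"
ask "Explain the tradeoffs between B-tree and LSM-tree storage engines."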

Next steps

  • Configuration: tune entropy threshold, speculative execution, and cache settings
  • How It Works: deep dive into the entropy routing algorithm
  • API Reference: full endpoint documentation and request schemas
  • Metrics reference: all Prometheus metrics exposed by the gateway
