This guide walks you through building the gateway, starting the supporting infrastructure, and sending your first request.

Prerequisites

  • Go 1.22 or later
  • Docker and Docker Compose
  • An OpenAI API key with access to gpt-4.1-nano and gpt-4.1

Get up and running

1. Clone and build

Clone the repository and compile the gateway binary.
git clone https://github.com/trnahnh/draft-thinker.git
cd draft-thinker
go build -o draft-thinker ./cmd/gateway
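During development you can also skip the separate build step; standard Go tooling compiles and runs the gateway in one command, passing flags through unchanged (config.yaml is the default configuration file used in step 4):
go run ./cmd/gateway --config config.yaml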
2. Start infrastructure

Docker Compose starts Redis, Qdrant, Prometheus, and Grafana in the background.
docker compose up -d
Service      Port                       Purpose
Redis        6379                       Cache metadata and TTLs
Qdrant       6333 (HTTP), 6334 (gRPC)   Vector similarity store
Prometheus   9090                       Metrics scraper
Grafana      3000                       Dashboards
Qdrant persists data to a named Docker volume (qdrant_data). Cache entries survive container restarts.
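To confirm everything came up, list the running services and check for the volume. Note that Docker Compose prefixes named volumes with the project name, so qdrant_data may appear as draft-thinker_qdrant_data:
docker compose ps
docker volume ls | grep qdrant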
3. Set your API key

The gateway requires your OpenAI API key at startup. Export it in the same shell where you’ll run the binary.
export OPENAI_API_KEY=sk-...
The gateway refuses to start without OPENAI_API_KEY. Redis and Qdrant URLs default to localhost:6379 and http://localhost:6333 when running outside Docker Compose.
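If startup fails with a connection error, verify both services are reachable first. Redis answers PING, and Qdrant reports its version at the HTTP root (the redis-cli call runs inside the container and assumes the Compose service is named redis):
docker compose exec redis redis-cli ping   # expect: PONG
curl http://localhost:6333/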
4. Run the gateway

Start the gateway using the default configuration file. It listens on port 8080.
./draft-thinker --config config.yaml
You should see:
2024/01/01 00:00:00 semantic cache enabled (threshold=0.95, ttl=3600s)
2024/01/01 00:00:00 starting gateway on :8080
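A quick way to confirm the gateway is up is to hit its Prometheus metrics endpoint (the same endpoint Prometheus scrapes in step 7):
curl -s http://localhost:8080/metrics | head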
5. Send a request

The gateway implements the same interface as OpenAI’s chat completions endpoint. Point any OpenAI-compatible client at http://localhost:8080.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }'
The model field is ignored — routing is determined entirely by the entropy analysis.
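Because the response mirrors OpenAI's chat completions schema, you can extract just the answer with jq:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2?"}]}' \
  | jq -r '.choices[0].message.content'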
6. Inspect routing decisions

Every response includes headers showing how the request was routed and how long it took.
curl -si http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | grep -E "X-Routing|X-Request"
Example output:
X-Routing-Decision: accept
X-Request-Duration-Ms: 312
X-Request-ID: a3f2b8c1d4e5f607
The X-Routing-Decision header takes one of three values:
  • accept — the drafter’s response was served directly (low entropy, confident answer)
  • escalate — entropy exceeded the threshold; the heavyweight model responded
  • cache_hit — a semantically similar prompt was found in cache; no model was called
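To see cache_hit yourself, send the identical request twice. The first call is served by a model; the second should be answered from the semantic cache, assuming the cache is enabled as in the startup log above:
for i in 1 2; do
  curl -si http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2?"}]}' \
    | grep "X-Routing-Decision"
done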
7. View the Grafana dashboard

Open http://localhost:3000 in your browser and log in with admin / admin. The pre-built dashboard shows request rates, routing decision breakdown, latency percentiles, entropy distribution, and cache hit rate, all updating in real time as you send requests.
Prometheus scrapes the gateway at http://gateway:8080/metrics every 15 seconds. Give it a few seconds after your first request before data appears in the dashboard.
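You can also ask Prometheus directly whether the scrape is healthy; its standard HTTP query API reports an up value of 1 for each reachable target:
curl -s 'http://localhost:9090/api/v1/query?query=up'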

Try a harder request

Send a multi-step reasoning question to see escalation in action:
curl -si http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Explain the tradeoffs between B-tree and LSM-tree storage engines in the context of write-heavy workloads with infrequent reads."}
    ]
  }' | grep "X-Routing-Decision"
This type of query typically produces higher token-level entropy in the drafter, causing escalation to the heavyweight model. You should see X-Routing-Decision: escalate.
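For a side-by-side comparison, a small shell helper (a convenience sketch, not part of the repo) prints the routing decision and latency for any prompt:
ask() {
  curl -si http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"auto\", \"messages\": [{\"role\": \"user\", \"content\": \"$1\"}]}" \
    | grep -E "X-Routing-Decision|X-Request-Duration-Ms"
}
ask "What is 2+2?"
ask "Explain the tradeoffs between B-tree and LSM-tree storage engines."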

Next steps

  • Configuration: tune entropy threshold, speculative execution, and cache settings
  • How It Works: deep dive into the entropy routing algorithm
  • API Reference: full endpoint documentation and request schemas
  • Metrics reference: all Prometheus metrics exposed by the gateway
