`--config` flag when starting the gateway (defaults to `config.yaml`).
## Default config.yaml
### server

HTTP server settings.

- **Port**: Port the gateway listens on. The gateway exposes `POST /v1/chat/completions` on this port.
- **Read timeout**: Maximum duration in seconds for reading the full request, including the body. Connections that take longer are closed.
- **Write timeout**: Maximum duration in seconds for writing the response. Set this higher than the heavyweight model's timeout to avoid cutting off slow responses.
- **Idle timeout**: Maximum duration in seconds to wait for the next request on a keep-alive connection before closing it.
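A sketch of what this block might look like; the key names and values are illustrative assumptions, not the shipped defaults:

```yaml
# Illustrative only — key names and values are assumptions.
server:
  port: 8080           # exposes POST /v1/chat/completions
  read_timeout: 30     # seconds to read the full request, including body
  write_timeout: 120   # set above the heavyweight timeout so slow responses finish
  idle_timeout: 60     # seconds to wait on a keep-alive connection
```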
### drafter

Configuration for the fast, cheap model that handles all requests first.

- **Provider**: Model provider. Currently `openai` is supported. The client uses the OpenAI-compatible API format.
- **Base URL**: Base URL for the drafter's API. Override this to point at a compatible local or third-party endpoint.
- **Model**: Model name sent in requests to the drafter. The model must support the `logprobs` and `top_logprobs` parameters for entropy analysis to function.
- **Timeout**: Per-request timeout in seconds for the drafter. If the drafter does not complete within this duration, the request is escalated.
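An illustrative drafter block; the key names, model name, and timeout below are assumptions, not documented defaults:

```yaml
# Illustrative only — key names, model, and timeout are assumptions.
drafter:
  provider: openai
  base_url: https://api.openai.com/v1
  model: gpt-4o-mini   # must support logprobs / top_logprobs
  timeout: 10          # seconds before the request is escalated
```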
### heavyweight

Configuration for the frontier model used when the drafter's entropy exceeds the threshold.

- **Provider**: Model provider. Currently `openai` is supported.
- **Base URL**: Base URL for the heavyweight model's API.
- **Model**: Model name sent in requests to the heavyweight. This is the escalation target, so choose a model with strong reasoning capabilities.
- **Timeout**: Per-request timeout in seconds for the heavyweight. This should be larger than `drafter.timeout` because frontier models have higher latency.
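An illustrative heavyweight block; key names, model name, and timeout are assumptions:

```yaml
# Illustrative only — key names, model, and timeout are assumptions.
heavyweight:
  provider: openai
  base_url: https://api.openai.com/v1
  model: gpt-4o        # escalation target: pick a strong reasoner
  timeout: 60          # keep larger than drafter.timeout
```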
### entropy

Controls the Shannon entropy algorithm that drives routing decisions.

- **Threshold**: Calibrated entropy threshold `T` in bits. If windowed entropy exceeds this value during drafter generation, the request is escalated to the heavyweight. The value 2.0 was determined by sweeping a 518-prompt benchmark dataset and finding the knee of the accuracy-cost curve. Lower values escalate more aggressively (higher accuracy, higher cost); higher values escalate less (lower accuracy, lower cost).
- **Window**: Number of tokens in the sliding window used to compute windowed average entropy. A window smooths noise from individual uncertain tokens (for example, rare proper nouns) that do not indicate reasoning failure.
- **Early exit count**: Number of initial tokens to evaluate before triggering an early exit. If the first `early_exit_count` tokens produce entropy above the threshold, the draft is aborted immediately and the request is escalated without completing the draft, avoiding wasted compute.
- **Top logprobs**: Number of top token candidates (with log-probabilities) to request from the drafter per token. Used to compute per-token Shannon entropy: `H = -Σ p(x) log₂ p(x)`. The OpenAI API supports values 0–20.
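The windowed check can be sketched as follows. Only the 2.0-bit threshold comes from the documented defaults; the window size here is an assumed value for illustration, and entropy computed over just the top-k candidates slightly underestimates the true distribution entropy.

```python
import math
from collections import deque

THRESHOLD_BITS = 2.0  # documented default for entropy.threshold
WINDOW = 5            # assumed window size, for illustration only

def token_entropy(top_logprobs):
    """Shannon entropy in bits over one token's top-k candidates.

    `top_logprobs` holds natural-log probabilities, as returned in the
    OpenAI API's `top_logprobs` field.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def should_escalate(per_token_top_logprobs):
    """True if the windowed average entropy ever exceeds the threshold."""
    window = deque(maxlen=WINDOW)
    for top_lps in per_token_top_logprobs:
        window.append(token_entropy(top_lps))
        if sum(window) / len(window) > THRESHOLD_BITS:
            return True
    return False
```

A uniform distribution over four candidates works out to exactly 2.0 bits, which is why the 2.0-bit threshold roughly corresponds to "the drafter is torn between more than four plausible continuations."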
### speculative

Controls speculative execution, the parallel heavyweight pre-fetch that reduces latency on escalated requests.

- **Enabled**: Enable or disable speculative execution. When enabled, a parallel heavyweight request is fired as soon as early tokens indicate elevated uncertainty. When disabled, draft-then-verify is strictly serial.
- **Soft threshold multiplier**: Multiplier applied to `entropy.threshold` to compute the soft threshold. When windowed entropy exceeds `soft_threshold_mult × threshold` (default: 0.8 × 2.0 = 1.6 bits) during the first tokens, the gateway fires a speculative parallel call to the heavyweight model. If the drafter's entropy subsequently drops below the threshold, the heavyweight call is canceled and the draft is accepted. If entropy stays elevated, the heavyweight response is used, and the additional latency is `heavyweight_total - drafter_abort_time` rather than the full heavyweight latency.
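An illustrative speculative block; `soft_threshold_mult` and its 0.8 default come from the text above, while the enable key name is an assumption:

```yaml
# Illustrative only — the enable key name is an assumption.
speculative:
  enabled: true
  soft_threshold_mult: 0.8   # soft threshold = 0.8 × entropy.threshold = 1.6 bits
```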
### cache

Controls the semantic cache backed by Qdrant (vector store) and Redis (metadata/TTLs). The cache requires the `REDIS_URL` and `QDRANT_URL` environment variables at runtime; they default to `localhost:6379` and `http://localhost:6333` respectively if not set. Only draft-accepted responses are cached; escalated responses indicate drafter uncertainty and are not stored.

- **Enabled**: Enable or disable the semantic cache. When disabled, every request goes through the full draft-verify pipeline.
- **Similarity threshold**: Minimum cosine similarity between the incoming prompt's embedding and a cached entry for a cache hit to be returned. The value 0.95 is intentionally conservative to avoid serving stale or semantically drifted responses.
- **TTL**: Time-to-live in seconds for cached entries. After this duration, entries expire and subsequent similar prompts go through the draft-verify pipeline again.
- **Embedding model**: OpenAI embedding model used to convert prompts to vectors. This model is called on every request (for both cache lookup and cache population) and must be consistent with the `embedding_dimensions` value.
- **Embedding dimensions**: Dimensionality of the embedding vectors. Must match the output dimensions of `embedding_model`. For `text-embedding-3-small`, this is 1536.
- **Collection**: Name of the Qdrant collection used to store and query cached embeddings. The collection is created automatically on first startup if it does not exist.
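An illustrative cache block. The 0.95 similarity threshold, `text-embedding-3-small`, and 1536 dimensions come from the text above; the key names, TTL, and collection name are assumptions:

```yaml
# Illustrative only — key names, TTL, and collection name are assumptions.
cache:
  enabled: true
  similarity_threshold: 0.95
  ttl: 3600                              # seconds; actual default not stated here
  embedding_model: text-embedding-3-small
  embedding_dimensions: 1536             # must match the embedding model's output
  collection: semantic_cache             # created automatically on first startup
```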
### metrics

Controls Prometheus metrics exposure.

- **Enabled**: Enable or disable the Prometheus metrics endpoint. When disabled, a no-op recorder is used internally and no metrics are exported.
- **Path**: HTTP path on which Prometheus metrics are served. Prometheus is configured by default to scrape this endpoint every 15 seconds.
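An illustrative metrics block; both the key names and the `/metrics` path are assumptions (the conventional Prometheus path), not documented defaults:

```yaml
# Illustrative only — key names and path are assumptions.
metrics:
  enabled: true
  path: /metrics   # scraped by Prometheus every 15 seconds by default
```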
## Environment variables

The following environment variables are read at startup and are not part of `config.yaml`:
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | Yes | — | API key for OpenAI. Used for both model calls and embeddings. The gateway exits immediately if this is not set. |
| `REDIS_URL` | No | `localhost:6379` | Address of the Redis instance used for cache metadata and TTLs. |
| `QDRANT_URL` | No | `http://localhost:6333` | Base URL of the Qdrant instance used for vector similarity search. |
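The lookup behavior in the table can be sketched in Python; this mirrors the documented defaults and required-key check, not the gateway's actual implementation:

```python
import os
import sys

def load_env():
    """Read the gateway's environment variables, applying the documented defaults."""
    if "OPENAI_API_KEY" not in os.environ:
        # Required: the gateway exits immediately when the key is missing.
        sys.exit("OPENAI_API_KEY is not set")
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],
        "redis_url": os.environ.get("REDIS_URL", "localhost:6379"),
        "qdrant_url": os.environ.get("QDRANT_URL", "http://localhost:6333"),
    }
```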