`--config` flag when starting the gateway (defaults to `config.yaml`).
## Default config.yaml
### server

HTTP server settings.

- **Port**: Port the gateway listens on. The gateway exposes `POST /v1/chat/completions` on this port.
- **Read timeout**: Maximum duration in seconds for reading the full request, including the body. Connections that take longer are closed.
- **Write timeout**: Maximum duration in seconds for writing the response. Set this higher than the heavyweight model's timeout to avoid cutting off slow responses.
- **Idle timeout**: Maximum duration in seconds to wait for the next request on a keep-alive connection before closing it.
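A sketch of what this block might look like; the key names and values are illustrative assumptions, not the shipped defaults:

```yaml
# Illustrative only — key names and values are assumptions.
server:
  port: 8080           # exposes POST /v1/chat/completions
  read_timeout: 30     # seconds to read the full request, including body
  write_timeout: 120   # set above the heavyweight timeout so slow responses finish
  idle_timeout: 60     # seconds to wait on a keep-alive connection
```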
### drafter

Configuration for the fast, cheap model that handles all requests first.

- **Provider**: Model provider. Currently `openai` is supported. The client uses the OpenAI-compatible API format.
- **Base URL**: Base URL for the drafter's API. Override this to point at a compatible local or third-party endpoint.
- **Model**: Model name sent in requests to the drafter. The model must support the `logprobs` and `top_logprobs` parameters for entropy analysis to function.
- **Timeout**: Per-request timeout in seconds for the drafter. If the drafter does not complete within this duration, the request is escalated.
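An illustrative drafter block; the key names, model name, and timeout below are assumptions, not documented defaults:

```yaml
# Illustrative only — key names, model, and timeout are assumptions.
drafter:
  provider: openai
  base_url: https://api.openai.com/v1
  model: gpt-4o-mini   # must support logprobs / top_logprobs
  timeout: 10          # seconds before the request is escalated
```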
### heavyweight

Configuration for the frontier model used when the drafter's entropy exceeds the threshold.

- **Provider**: Model provider. Currently `openai` is supported.
- **Base URL**: Base URL for the heavyweight model's API.
- **Model**: Model name sent in requests to the heavyweight. This is the escalation target, so choose a model with strong reasoning capabilities.
- **Timeout**: Per-request timeout in seconds for the heavyweight. This should be larger than `drafter.timeout` because frontier models have higher latency.
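An illustrative heavyweight block; key names, model name, and timeout are assumptions:

```yaml
# Illustrative only — key names, model, and timeout are assumptions.
heavyweight:
  provider: openai
  base_url: https://api.openai.com/v1
  model: gpt-4o        # escalation target: pick a strong reasoner
  timeout: 60          # keep larger than drafter.timeout
```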
### entropy

Controls the Shannon entropy algorithm that drives routing decisions.

- **Threshold**: Calibrated entropy threshold `T` in bits. If windowed entropy exceeds this value during drafter generation, the request is escalated to the heavyweight. The value 2.0 was determined by sweeping a 518-prompt benchmark dataset and finding the knee of the accuracy-cost curve. Lower values escalate more aggressively (higher accuracy, higher cost); higher values escalate less (lower accuracy, lower cost).
- **Window**: Number of tokens in the sliding window used to compute windowed average entropy. A window smooths noise from individual uncertain tokens (for example, rare proper nouns) that do not indicate reasoning failure.
- **Early exit count**: Number of initial tokens to evaluate before triggering an early exit. If the first `early_exit_count` tokens produce entropy above the threshold, the draft is aborted immediately and the request is escalated without completing the draft, avoiding wasted compute.
- **Top logprobs**: Number of top token candidates (with log-probabilities) to request from the drafter per token. Used to compute per-token Shannon entropy: `H = -Σ p(x) log₂ p(x)`. The OpenAI API supports values 0–20.
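The windowed check can be sketched as follows. Only the 2.0-bit threshold comes from the documented defaults; the window size here is an assumed value for illustration, and entropy computed over just the top-k candidates slightly underestimates the true distribution entropy.

```python
import math
from collections import deque

THRESHOLD_BITS = 2.0  # documented default for entropy.threshold
WINDOW = 5            # assumed window size, for illustration only

def token_entropy(top_logprobs):
    """Shannon entropy in bits over one token's top-k candidates.

    `top_logprobs` holds natural-log probabilities, as returned in the
    OpenAI API's `top_logprobs` field.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def should_escalate(per_token_top_logprobs):
    """True if the windowed average entropy ever exceeds the threshold."""
    window = deque(maxlen=WINDOW)
    for top_lps in per_token_top_logprobs:
        window.append(token_entropy(top_lps))
        if sum(window) / len(window) > THRESHOLD_BITS:
            return True
    return False
```

A uniform distribution over four candidates works out to exactly 2.0 bits, which is why the 2.0-bit threshold roughly corresponds to "the drafter is torn between more than four plausible continuations."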
### speculative

Controls speculative execution, the parallel heavyweight pre-fetch that reduces latency on escalated requests.

- **Enabled**: Enable or disable speculative execution. When enabled, a parallel heavyweight request is fired as soon as early tokens indicate elevated uncertainty. When disabled, draft-then-verify is strictly serial.
- **Soft threshold multiplier**: Multiplier applied to `entropy.threshold` to compute the soft threshold. When windowed entropy exceeds `soft_threshold_mult × threshold` (default: 0.8 × 2.0 = 1.6 bits) during the first tokens, the gateway fires a speculative parallel call to the heavyweight model. If the drafter's entropy subsequently drops below the threshold, the heavyweight call is canceled and the draft is accepted. If entropy stays elevated, the heavyweight response is used, and the additional latency is `heavyweight_total - drafter_abort_time` rather than the full heavyweight latency.
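An illustrative speculative block; `soft_threshold_mult` and its 0.8 default come from the text above, while the enable key name is an assumption:

```yaml
# Illustrative only — the enable key name is an assumption.
speculative:
  enabled: true
  soft_threshold_mult: 0.8   # soft threshold = 0.8 × entropy.threshold = 1.6 bits
```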
### cache

Controls the semantic cache backed by Qdrant (vector store) and Redis (metadata/TTLs). The cache requires the `REDIS_URL` and `QDRANT_URL` environment variables at runtime; they default to `localhost:6379` and `http://localhost:6333` respectively if not set. Only draft-accepted responses are cached; escalated responses indicate drafter uncertainty and are not stored.

- **Enabled**: Enable or disable the semantic cache. When disabled, every request goes through the full draft-verify pipeline.
- **Similarity threshold**: Minimum cosine similarity between the incoming prompt's embedding and a cached entry for a cache hit to be returned. The value 0.95 is intentionally conservative to avoid serving stale or semantically drifted responses.
- **TTL**: Time-to-live in seconds for cached entries. After this duration, entries expire and subsequent similar prompts go through the draft-verify pipeline again.
- **Embedding model**: OpenAI embedding model used to convert prompts to vectors. This model is called on every request (for both cache lookup and cache population) and must be consistent with the `embedding_dimensions` value.
- **Embedding dimensions**: Dimensionality of the embedding vectors. Must match the output dimensions of `embedding_model`. For `text-embedding-3-small`, this is 1536.
- **Collection**: Name of the Qdrant collection used to store and query cached embeddings. The collection is created automatically on first startup if it does not exist.
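An illustrative cache block. The 0.95 similarity threshold, `text-embedding-3-small`, and 1536 dimensions come from the text above; the key names, TTL, and collection name are assumptions:

```yaml
# Illustrative only — key names, TTL, and collection name are assumptions.
cache:
  enabled: true
  similarity_threshold: 0.95
  ttl: 3600                              # seconds; actual default not stated here
  embedding_model: text-embedding-3-small
  embedding_dimensions: 1536             # must match the embedding model's output
  collection: semantic_cache             # created automatically on first startup
```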
### metrics

Controls Prometheus metrics exposure.

- **Enabled**: Enable or disable the Prometheus metrics endpoint. When disabled, a no-op recorder is used internally and no metrics are exported.
- **Path**: HTTP path on which Prometheus metrics are served. Prometheus is configured by default to scrape this endpoint every 15 seconds.
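An illustrative metrics block; both the key names and the `/metrics` path are assumptions (the conventional Prometheus path), not documented defaults:

```yaml
# Illustrative only — key names and path are assumptions.
metrics:
  enabled: true
  path: /metrics   # scraped by Prometheus every 15 seconds by default
```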
## Environment variables

The following environment variables are read at startup and are not part of `config.yaml`:
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | Yes | — | API key for OpenAI. Used for both model calls and embeddings. The gateway exits immediately if this is not set. |
| `REDIS_URL` | No | `localhost:6379` | Address of the Redis instance used for cache metadata and TTLs. |
| `QDRANT_URL` | No | `http://localhost:6333` | Base URL of the Qdrant instance used for vector similarity search. |
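The lookup behavior in the table can be sketched in Python; this mirrors the documented defaults and required-key check, not the gateway's actual implementation:

```python
import os
import sys

def load_env():
    """Read the gateway's environment variables, applying the documented defaults."""
    if "OPENAI_API_KEY" not in os.environ:
        # Required: the gateway exits immediately when the key is missing.
        sys.exit("OPENAI_API_KEY is not set")
    return {
        "openai_api_key": os.environ["OPENAI_API_KEY"],
        "redis_url": os.environ.get("REDIS_URL", "localhost:6379"),
        "qdrant_url": os.environ.get("QDRANT_URL", "http://localhost:6333"),
    }
```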