Quick Start
Build and run the gateway in under five minutes
Configuration
Configure models, entropy thresholds, cache, and more
How It Works
Understand the entropy-based routing algorithm
API Reference
Full endpoint documentation with request and response schemas
Key capabilities
Entropy-based routing
Computes Shannon entropy over drafter token logprobs in a 10-token sliding window. Routes to heavyweight only when the drafter is genuinely uncertain.
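A minimal sketch of that routing test, assuming access to the drafter's per-token top-k logprobs (the class and function names are illustrative; the 10-token window comes from this section and the threshold T=2.0 from the calibration notes below):

```python
import math
from collections import deque

WINDOW = 10      # sliding window of recent drafter tokens (from the text)
THRESHOLD = 2.0  # entropy threshold T=2.0 (from the calibration section)

def token_entropy(top_logprobs):
    """Shannon entropy (nats) of one token's top-k logprobs,
    renormalized so the truncated distribution sums to 1."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs)

class EntropyRouter:
    def __init__(self):
        self.window = deque(maxlen=WINDOW)

    def observe(self, top_logprobs):
        """Feed one drafted token's logprobs; return True when the
        windowed mean entropy says the heavyweight should take over."""
        self.window.append(token_entropy(top_logprobs))
        return sum(self.window) / len(self.window) > THRESHOLD
```

A confident token (one dominant logprob) contributes near-zero entropy; a near-uniform distribution over k choices contributes roughly log(k), so sustained uncertainty pushes the windowed mean over the threshold.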
Speculative execution
Fires the heavyweight model in parallel when early tokens show elevated uncertainty. Cancels if the drafter recovers — eliminates serial draft-then-verify latency.
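The fire-in-parallel-then-cancel shape can be sketched with asyncio, assuming the drafter streams tokens with an uncertainty signal (`draft`, `heavy`, and `is_uncertain` are illustrative stand-ins, not the gateway's actual interfaces):

```python
import asyncio

async def speculative_generate(draft, heavy, is_uncertain):
    """Stream from the drafter; on elevated uncertainty, start the
    heavyweight in parallel. Cancel it if the drafter recovers, so the
    draft path never pays serial draft-then-verify latency."""
    heavy_task = None
    tokens = []
    async for tok, signal in draft():
        tokens.append(tok)
        if heavy_task is None and is_uncertain(signal):
            # Elevated uncertainty: speculatively fire the heavyweight.
            heavy_task = asyncio.create_task(heavy())
        elif heavy_task is not None and not is_uncertain(signal):
            # Drafter recovered: cancel the speculative call.
            heavy_task.cancel()
            heavy_task = None
    if heavy_task is not None:
        # Drafter stayed uncertain to the end: take the heavyweight's answer.
        return await heavy_task
    return "".join(tokens)
```

A production version would also stream the heavyweight's output and handle mid-stream handoff; this sketch only shows the fire/cancel decision.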
Semantic cache
Embeds incoming prompts and looks up semantically similar past responses via Qdrant (cosine similarity > 0.95). Bypasses the entire inference pipeline on cache hits.
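The lookup logic reduces to a nearest-neighbor search over prompt embeddings with the 0.95 cosine cutoff; an in-memory sketch (the real gateway uses Qdrant and an embedding model for this, both omitted here):

```python
import math

SIM_THRESHOLD = 0.95  # cosine similarity cutoff from the text

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """In-memory stand-in for the Qdrant lookup: stores
    (embedding, response) pairs and returns a cached response
    when a new prompt embedding is similar enough."""

    def __init__(self):
        self.entries = []

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) > SIM_THRESHOLD:
            return best[1]  # cache hit: the inference pipeline is skipped
        return None         # cache miss: fall through to routing
```

On a hit, no drafter or heavyweight call happens at all, which is why cache performance gets its own metrics.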
OpenAI-compatible
Implements the same POST /v1/chat/completions interface. Point your existing OpenAI client at localhost:8080 and it works immediately.
Prometheus + Grafana
Ships with 11 custom Prometheus metrics covering throughput, entropy distributions, routing decisions, speculative execution, and cache performance. Pre-built Grafana dashboard included.
Single-command deploy
Docker Compose spins up the gateway, Redis, Qdrant, Prometheus, and Grafana together. One command to a fully instrumented local environment.
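Because the gateway speaks the OpenAI-compatible interface described above, a request is just a standard chat-completions POST to localhost:8080. A sketch using only the standard library (the model name is illustrative; the route and port come from the text):

```python
import json
import urllib.request

# Standard OpenAI chat-completions body -- unchanged for this gateway.
url = "http://localhost:8080/v1/chat/completions"
body = {
    "model": "drafter",  # illustrative model name
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
}
req = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted so this sketch
# does not require a running gateway.
```

An existing OpenAI SDK client works the same way: set its base URL to http://localhost:8080/v1 and leave the rest of the code untouched.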
Measured results
Calibrated on 518 prompts across four categories (simple factual, multi-step reasoning, code generation, ambiguous/creative) using LLM-as-judge evaluation at threshold T=2.0:

| Metric | Result |
|---|---|
| TCO reduction vs all-heavyweight baseline | 91.6% |
| Draft acceptance rate | 94% |
| Draft accuracy (LLM-as-judge) | 98.2% acceptable |
| P99 latency (draft path) | 109ms at 50 req/s |
| Proxy overhead | < 5ms P99 |