Prerequisites
- Go 1.22 or later
- Docker and Docker Compose
- An OpenAI API key with access to `gpt-4.1-nano` and `gpt-4.1`
Get up and running
Start infrastructure
Docker Compose starts Redis, Qdrant, Prometheus, and Grafana in the background.
| Service | Port | Purpose |
|---|---|---|
| Redis | 6379 | Cache metadata and TTLs |
| Qdrant | 6333 (HTTP), 6334 (gRPC) | Vector similarity store |
| Prometheus | 9090 | Metrics scraper |
| Grafana | 3000 | Dashboards |
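A compose file wiring up these four services might look like the sketch below. The service names and image tags are assumptions, not the repository's actual file; only the ports are taken from the table above.

```yaml
# docker-compose.yml (sketch; image tags are assumptions)
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333", "6334:6334"]   # HTTP and gRPC
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```

Bring everything up in the background with `docker compose up -d`, and check container health with `docker compose ps`.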
Set your API key
The gateway requires your OpenAI API key at startup. Export it in the same shell where you’ll run the binary.
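Assuming the gateway reads the conventional `OPENAI_API_KEY` environment variable (the variable name is an assumption; the repository's configuration reference will confirm it), that looks like:

```shell
# Set the key in the same shell that will launch the gateway binary.
# Replace the placeholder with your real key.
export OPENAI_API_KEY="sk-your-key"
```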
Run the gateway
Start the gateway using the default configuration file. It listens on port 8080 and logs a startup message once it is ready to accept requests.
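The invocation below guesses at a typical Go repository layout (`./cmd/gateway` is an assumption); substitute the actual command or prebuilt binary from the repository.

```shell
# Build and run the gateway with its default configuration;
# it should begin listening on :8080.
go run ./cmd/gateway
```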
Send a request
The gateway implements the same interface as OpenAI's chat completions endpoint. Point any OpenAI-compatible client at http://localhost:8080. The `model` field is ignored; routing is determined entirely by the entropy analysis.

Inspect routing decisions

Every response includes headers showing how the request was routed and how long it took.
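A minimal request sketch, assuming the gateway mirrors OpenAI's path scheme at `/v1/chat/completions` (the path is an assumption). The `-D -` flag dumps the response headers to stdout so the routing headers are visible alongside the body:

```shell
# Send a simple chat completion; print response headers, then the body.
# The model value is arbitrary: the gateway ignores it and routes by entropy.
curl -s -D - http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ignored",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
      }'
```

Look for a header such as `X-Routing-Decision` in the output.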
View the Grafana dashboard
Open http://localhost:3000 in your browser and log in with admin / admin. The pre-built dashboard shows request rates, routing decision breakdown, latency percentiles, entropy distribution, and cache hit rate, all updating in real time as you send requests.

Prometheus scrapes the gateway at http://gateway:8080/metrics every 15 seconds, so give it a few seconds after your first request before data appears in the dashboard.

Try a harder request
Send a multi-step reasoning question to see escalation in action. The response headers should report X-Routing-Decision: escalate.
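A sketch of such a request, again assuming the `/v1/chat/completions` path; `-o /dev/null` discards the body and `grep` keeps only the routing header:

```shell
# A multi-step reasoning prompt; the gateway should escalate to the larger model.
curl -s -D - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ignored",
        "messages": [{"role": "user", "content": "A train leaves at 60 mph; an hour later a second train follows at 80 mph. How long until the second catches the first, and how far has each traveled?"}]
      }' | grep -i 'x-routing-decision'
```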
Next steps
- Configuration: tune entropy threshold, speculative execution, and cache settings
- How It Works: deep dive into the entropy routing algorithm
- API Reference: full endpoint documentation and request schemas
- Metrics Reference: all Prometheus metrics exposed by the gateway