Metrics are served at `/metrics` by default. The path is configurable via `metrics.path` in `config.yaml`.

All metric names carry the `draftthinker_` prefix.
Phase 1 — Foundation metrics
These three instruments measure baseline gateway behavior: total throughput, upstream response time, and error volume.

Total number of requests processed by the gateway, partitioned by model name and HTTP status code.

Labels

| Label | Values |
|---|---|
| `model` | Model name string (e.g. `gpt-4.1-nano`, `gpt-4.1`) |
| `status` | HTTP status code as a string (e.g. `200`, `429`, `500`) |
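Because the request counter is partitioned by `model` and `status`, a per-model error share can be derived from its samples. The following is a minimal sketch with made-up sample values; the `samples` dictionary and the `server_error_share` helper are hypothetical and not part of the gateway.

```python
from collections import defaultdict

# Hypothetical counter samples keyed by (model, status); the numbers are made up.
samples = {
    ("gpt-4.1-nano", "200"): 940,
    ("gpt-4.1-nano", "429"): 40,
    ("gpt-4.1-nano", "500"): 20,
    ("gpt-4.1", "200"): 95,
    ("gpt-4.1", "500"): 5,
}


def server_error_share(samples):
    """Return the fraction of 5xx responses for each model."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for (model, status), count in samples.items():
        totals[model] += count
        # Status labels are strings, so test the leading character.
        if status.startswith("5"):
            errors[model] += count
    return {model: errors[model] / totals[model] for model in totals}


shares = server_error_share(samples)
```

Note that `429` responses count toward the total but not toward the 5xx share, since the `status` label keeps them separate.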
End-to-end latency for upstream LLM provider calls, in seconds. Observe this per provider to compare drafter vs. heavyweight response times. Histogram buckets are in seconds.

Labels

| Label | Values |
|---|---|
| `provider` | `drafter` or `heavyweight` |
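To compare drafter and heavyweight latency, one option is to estimate a quantile from each provider's histogram. The sketch below assumes the histogram exposes cumulative bucket counts, as Prometheus-style histograms do, and interpolates linearly inside the bucket that contains the target rank. The bucket boundaries and counts are invented for illustration; they are not the gateway's actual buckets.

```python
def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    The pairs must be sorted by upper_bound. The value is interpolated
    linearly inside the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Hypothetical cumulative counts per provider (upper bound in seconds, count).
drafter = [(0.25, 60), (0.5, 90), (1.0, 100)]
heavyweight = [(0.5, 10), (1.0, 50), (2.0, 100)]

p50_drafter = estimate_quantile(0.5, drafter)
p50_heavyweight = estimate_quantile(0.5, heavyweight)
```

The estimate is only as fine as the bucket layout, so quantiles that fall in a wide bucket carry more uncertainty.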
Total number of errors, partitioned by error type.

Labels

| Label | Values |
|---|---|
| `type` | `invalid_request`, `routing_error`, `upstream_error`, `upstream_timeout`, `stream_error`, `internal_error` |
Phase 2 — Entropy routing metrics
These instruments track the entropy-based routing engine added in Phase 2. Entropy is computed as per-token Shannon entropy in bits; requests above the configured threshold are escalated to the heavyweight model.

Distribution of per-token Shannon entropy values observed during drafter inference, measured in bits. Use this histogram to understand how requests distribute across the entropy scale and to validate your escalation threshold.

Labels: none. Histogram buckets are in bits.
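For reference, Shannon entropy in bits for one token's probability distribution is H = -Σ p·log2(p). The sketch below computes it and applies a threshold comparison. The threshold value, the `decide` helper, and the way a single entropy value stands in for a whole request are assumptions for illustration; the source does not say how per-token values are aggregated per request.

```python
import math


def token_entropy_bits(probs):
    """Shannon entropy, in bits, of one token's probability distribution."""
    # Zero-probability outcomes contribute nothing and are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def decide(entropy_bits, threshold_bits):
    """Escalate when entropy is above the threshold, otherwise accept."""
    return "escalate" if entropy_bits > threshold_bits else "accept"


THRESHOLD_BITS = 1.5  # hypothetical value, not the gateway default

two_way = token_entropy_bits([0.5, 0.5])    # two equally likely tokens
four_way = token_entropy_bits([0.25] * 4)   # four equally likely tokens

decision_low = decide(two_way, THRESHOLD_BITS)
decision_high = decide(four_way, THRESHOLD_BITS)
```

A uniform choice among 2^k tokens yields k bits, which gives a rough intuition for where a threshold sits on the histogram's scale.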
Total routing decisions made by the gateway, partitioned by outcome. Phase 5 added a third decision value (`cache_hit`).

Labels

| Label | Value | Meaning |
|---|---|---|
| `decision` | `accept` | Drafter response served directly |
| `decision` | `escalate` | Request forwarded to heavyweight model |
| `decision` | `cache_hit` | Response served from semantic cache (Phase 5) |
Phase 4 — Speculative execution metrics
Speculative execution fires a parallel heavyweight call when the drafter’s entropy crosses a soft threshold, giving the heavyweight a head start before the hard escalation threshold is reached. These three instruments track when that mechanism engages, how often it is wasted, and how much latency it saves.

Total number of times the gateway fired a speculative heavyweight call in parallel (soft threshold exceeded).

Labels: none
Total number of speculative heavyweight calls that were cancelled because the drafter recovered before the hard escalation threshold was reached. A high cancellation ratio indicates the soft threshold may be set too aggressively.

Labels: none
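The relationship between the fired and cancelled counters can be illustrated with a small simulation. This is a sketch under stated assumptions, not the gateway's implementation: the `run_request` helper, the threshold values, and the entropy traces are hypothetical, and "recovered" is interpreted here as the drafter finishing without ever crossing the hard threshold.

```python
def run_request(entropy_trace, soft, hard):
    """Walk one drafter entropy trace; return (fired, cancelled, decision)."""
    fired = False
    for entropy in entropy_trace:
        if entropy > hard:
            # Hard threshold crossed: escalate and use the speculative call, if any.
            return fired, False, "escalate"
        if entropy > soft:
            # Soft threshold crossed: a speculative heavyweight call is running.
            fired = True
    # The drafter finished below the hard threshold, so a fired call is wasted.
    return fired, fired, "accept"


SOFT, HARD = 1.0, 2.0  # hypothetical thresholds, in bits

traces = [
    [0.3, 0.5, 0.4],       # never uncertain: nothing fired
    [0.3, 1.2, 0.6],       # fired, then recovered: cancelled
    [0.8, 1.4, 2.3],       # fired, then escalated: head start used
    [0.4, 1.1, 0.9, 0.5],  # fired, then recovered: cancelled
]

results = [run_request(trace, SOFT, HARD) for trace in traces]
fired_total = sum(1 for fired, _, _ in results if fired)
cancelled_total = sum(1 for _, cancelled, _ in results if cancelled)
cancellation_ratio = cancelled_total / fired_total if fired_total else 0.0
```

In this toy data two of three speculative calls are cancelled; raising `SOFT` toward `HARD` would fire fewer calls at the cost of a shorter head start.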
The head-start latency saved on escalated requests that had a running speculative heavyweight call, in seconds. Higher values indicate more user-facing latency was eliminated by speculative execution.

Labels: none. Histogram buckets are in seconds.
Phase 5 — Cache metrics
The semantic cache layer introduced in Phase 5 uses embedding similarity to serve repeated or near-duplicate prompts without hitting either upstream model. These instruments measure cache effectiveness and lookup cost.

Total number of cache hits: a semantically similar prompt was found in the vector index and the cached response was returned.

Labels: none
Total number of cache misses: no sufficiently similar prompt was found, or the matching Redis entry had expired.

Labels: none
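The hit and miss conditions described above can be sketched with in-memory stand-ins. Everything here is hypothetical: the `SemanticCache` class, the cosine similarity measure, and the 0.9 similarity floor are assumptions, with a plain list standing in for the vector index and a dictionary standing in for Redis.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, min_similarity):
        self.min_similarity = min_similarity
        self.index = []   # (embedding, key) pairs; stands in for the vector index
        self.store = {}   # key -> (response, expires_at); stands in for Redis
        self.hits = 0
        self.misses = 0

    def put(self, key, embedding, response, ttl_seconds, now):
        self.index.append((embedding, key))
        self.store[key] = (response, now + ttl_seconds)

    def get(self, embedding, now):
        best_key, best_score = None, -1.0
        for candidate, key in self.index:
            score = cosine(embedding, candidate)
            if score > best_score:
                best_key, best_score = key, score
        entry = self.store.get(best_key)
        # A hit needs both a close enough match and an unexpired entry.
        if best_score >= self.min_similarity and entry and entry[1] > now:
            self.hits += 1
            return entry[0]
        self.misses += 1
        return None


cache = SemanticCache(min_similarity=0.9)
cache.put("k1", [1.0, 0.0], "cached answer", ttl_seconds=60, now=0)

near_duplicate = cache.get([0.99, 0.05], now=10)   # similar and fresh
unrelated = cache.get([0.0, 1.0], now=10)          # not similar enough
expired = cache.get([1.0, 0.0], now=120)           # similar, but entry expired

hit_ratio = cache.hits / (cache.hits + cache.misses)
```

The `now` argument is passed explicitly so the expiry path is deterministic; a real lookup would read the clock and rely on the store's own expiry.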
End-to-end cache lookup latency in seconds, including the embedding API call and the Qdrant vector search. The target is under 50 ms (0.05 s) for the full lookup.

Labels: none. Histogram buckets are in seconds.
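One way to check the 50 ms target is to read the share of lookups that fall at or below a 0.05 s bucket boundary. The sketch below assumes cumulative bucket counts and assumes that a 0.05 s boundary exists; the bucket layout and counts are invented, and `share_within` is a hypothetical helper.

```python
import math


def share_within(target_seconds, cumulative_buckets):
    """Return the fraction of observations at or below target_seconds.

    cumulative_buckets holds (upper_bound, cumulative_count) pairs sorted by
    bound; the last pair is the +Inf bucket and carries the total count.
    """
    total = cumulative_buckets[-1][1]
    if total == 0:
        return None
    for upper_bound, count in cumulative_buckets:
        if math.isclose(upper_bound, target_seconds):
            return count / total
    raise ValueError("no bucket boundary matches the target")


# Hypothetical cumulative counts (upper bound in seconds, count).
lookup_buckets = [
    (0.01, 300),
    (0.025, 700),
    (0.05, 930),
    (0.1, 990),
    (math.inf, 1000),
]

share_under_target = share_within(0.05, lookup_buckets)
```

If no boundary sits exactly on the target, the share can only be bracketed by the neighbouring buckets, which is why the helper raises instead of guessing.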