Draft Thinker exposes Prometheus metrics at /metrics by default. The path is configurable via metrics.path in config.yaml.
config.yaml
metrics:
  path: /metrics
Metrics are grouped by the phase that introduced them. All metric names use the draftthinker_ prefix.
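As a quick way to inspect what the endpoint returns, the sketch below parses Prometheus text-exposition lines and keeps only samples carrying the draftthinker_ prefix. The `parse_samples` helper and the sample payload are hypothetical, and the values are invented for illustration. The regular expressions handle only simple lines (no escaped quotes in label values, no trailing timestamps), so treat this as a reading aid rather than a full parser.

```python
import re

# Matches lines such as: name{label="value",other="x"} 12.5
# Simplified on purpose: no escaped quotes, no trailing timestamps.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text, prefix="draftthinker_"):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        if not name.startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload; the numbers are invented for illustration.
EXAMPLE = """\
# TYPE draftthinker_requests_total counter
draftthinker_requests_total{model="gpt-4.1-nano",status="200"} 42
draftthinker_cache_hits_total 7
process_cpu_seconds_total 1.5
"""

samples = parse_samples(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

If metrics.path is changed in config.yaml, the text fed to a helper like this would come from the new path instead of /metrics.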

Phase 1 — Foundation metrics

These three instruments measure baseline gateway behavior: total throughput, upstream response time, and error volume.
draftthinker_requests_total
Counter
Total number of requests processed by the gateway, partitioned by model name and HTTP status code.
Labels
| Label | Values |
| --- | --- |
| model | Model name string (e.g. gpt-4.1-nano, gpt-4.1) |
| status | HTTP status code as a string (e.g. 200, 429, 500) |
draftthinker_upstream_latency_seconds
Histogram
End-to-end latency for upstream LLM provider calls, in seconds. Observe this per provider to compare drafter vs. heavyweight response times.
Labels
| Label | Values |
| --- | --- |
| provider | drafter or heavyweight |
Buckets (seconds)
0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30
draftthinker_errors_total
Counter
Total number of errors, partitioned by error type.
Labels
| Label | Values |
| --- | --- |
| type | invalid_request, routing_error, upstream_error, upstream_timeout, stream_error, internal_error |

Phase 2 — Entropy routing metrics

These instruments track the entropy-based routing engine added in Phase 2. Entropy is computed as per-token Shannon entropy in bits; requests above the configured threshold are escalated to the heavyweight model.
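To make the unit concrete, here is a minimal sketch of Shannon entropy in bits for a single token position, followed by the threshold comparison. It assumes the drafter yields a probability distribution over candidate tokens; how the gateway actually obtains those probabilities and aggregates per-token values into one routing decision is not described here. The function names and the threshold value are hypothetical.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)) of one token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def should_escalate(entropy_bits, threshold_bits):
    """Escalate when entropy is above the threshold (strict comparison assumed)."""
    return entropy_bits > threshold_bits


confident = [0.97, 0.01, 0.01, 0.01]  # drafter strongly prefers one token
uncertain = [0.25, 0.25, 0.25, 0.25]  # four equally likely tokens

THRESHOLD_BITS = 1.0  # hypothetical value, not a documented default

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    h = shannon_entropy_bits(dist)
    print(name, round(h, 3), should_escalate(h, THRESHOLD_BITS))
```

A uniform distribution over four tokens gives exactly 2 bits, while a distribution dominated by one token stays well below 1 bit, which is why low entropy is read as drafter confidence.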
draftthinker_entropy_distribution
Histogram
Distribution of per-token Shannon entropy values observed during drafter inference, measured in bits. Use this histogram to understand how requests distribute across the entropy scale and to validate your escalation threshold.
Labels: none
Buckets (bits)
0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0
draftthinker_routing_decisions_total
Counter
Total routing decisions made by the gateway, partitioned by outcome. Phase 5 added a third decision value (cache_hit).
Labels
| Label | Value | Meaning |
| --- | --- | --- |
| decision | accept | Drafter response served directly |
| decision | escalate | Request forwarded to heavyweight model |
| decision | cache_hit | Response served from semantic cache (Phase 5) |

Phase 4 — Speculative execution metrics

Speculative execution fires a parallel heavyweight call when the drafter’s entropy crosses a soft threshold, giving the heavyweight a head start before the hard escalation threshold is reached. These three instruments track when that mechanism engages, how often it is wasted, and how much latency it saves.
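The interplay of the two thresholds can be sketched as a small simulation over one drafter response. This is an illustration under stated assumptions, not the gateway's implementation: entropy is observed as a sequence of (time, bits) pairs, the speculative call fires once when the soft threshold is first exceeded, and the latency saved is taken to be the time elapsed between that trigger and the hard escalation. The function name, the dictionary keys, and the threshold values are hypothetical.

```python
def simulate_request(observations, soft_bits, hard_bits):
    """Walk (time_seconds, entropy_bits) observations for one drafter response.

    Returns which speculative counters would be incremented, plus the assumed
    head start in seconds (None unless a speculative call was running when
    the request escalated).
    """
    speculative_started_at = None
    for t, entropy in observations:
        if entropy > hard_bits:
            # Hard threshold exceeded: escalate, reusing any running speculative call.
            running = speculative_started_at is not None
            saved = t - speculative_started_at if running else None
            return {"triggered": running, "cancelled": False,
                    "escalated": True, "latency_saved": saved}
        if entropy > soft_bits and speculative_started_at is None:
            speculative_started_at = t  # fire the parallel heavyweight call once
    # Drafter finished without exceeding the hard threshold:
    # a running speculative call is cancelled.
    triggered = speculative_started_at is not None
    return {"triggered": triggered, "cancelled": triggered,
            "escalated": False, "latency_saved": None}


SOFT_BITS, HARD_BITS = 1.0, 2.0  # hypothetical thresholds

recovered = simulate_request([(0.0, 0.4), (0.25, 1.2), (0.5, 0.6)], SOFT_BITS, HARD_BITS)
escalated = simulate_request([(0.0, 0.4), (0.25, 1.2), (0.75, 2.3)], SOFT_BITS, HARD_BITS)
print(recovered)
print(escalated)
```

In the first run the drafter recovers, so the trigger and cancellation counters would both increase. In the second, the trigger counter increases and the head start would be observed in the latency-saved histogram.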
draftthinker_speculative_triggers_total
Counter
Total number of times the gateway fired a speculative heavyweight call in parallel (soft threshold exceeded).
Labels: none
draftthinker_speculative_cancellations_total
Counter
Total number of speculative heavyweight calls that were cancelled because the drafter recovered before the hard escalation threshold was reached. A high cancellation ratio indicates the soft threshold may be set too aggressively.
Labels: none
draftthinker_speculative_latency_saved_seconds
Histogram
The head-start latency saved on escalated requests that had a running speculative heavyweight call, in seconds. Higher values indicate more user-facing latency was eliminated by speculative execution.
Labels: none
Buckets (seconds)
0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30

Phase 5 — Cache metrics

The semantic cache layer introduced in Phase 5 uses embedding similarity to serve repeated or near-duplicate prompts without hitting either upstream model. These instruments measure cache effectiveness and lookup cost.
draftthinker_cache_hits_total
Counter
Total number of cache hits: a semantically similar prompt was found in the vector index and the cached response was returned.
Labels: none
draftthinker_cache_misses_total
Counter
Total number of cache misses: no sufficiently similar prompt was found, or the matching Redis entry had expired.
Labels: none
draftthinker_cache_lookup_latency_seconds
Histogram
End-to-end cache lookup latency in seconds, including the embedding API call and the Qdrant vector search. The target is under 50 ms (0.05 s) for the full lookup.
Labels: none
Buckets (seconds)
0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5
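Because 0.05 is one of the bucket boundaries, the share of lookups that meet the 50 ms target can be read directly from the cumulative bucket counts. The sketch below assumes the standard Prometheus histogram layout, in which each bucket counts all observations at or below its upper bound and the +Inf bucket holds the total. The helper name and the counts are hypothetical.

```python
def fraction_within_target(cumulative_buckets, target_seconds=0.05):
    """Share of observations at or below target_seconds.

    cumulative_buckets maps each upper bound (the `le` label) to its cumulative
    count, with float("inf") holding the total number of observations.
    """
    if target_seconds not in cumulative_buckets:
        raise ValueError("target must coincide with a bucket boundary")
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return None  # no lookups observed in this window
    return cumulative_buckets[target_seconds] / total


# Invented counts for illustration, keyed by the documented bucket bounds.
lookup_buckets = {
    0.005: 10, 0.01: 40, 0.025: 120, 0.05: 180,
    0.1: 195, 0.25: 199, 0.5: 200, float("inf"): 200,
}

within_target = fraction_within_target(lookup_buckets)
print(within_target)
```

With these invented counts, 180 of 200 lookups finish within the target. A target that falls between two boundaries cannot be read exactly from the buckets, which is why the helper rejects it.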

Derived metrics

These ratios are the primary signals for evaluating gateway health and efficiency. None of them require additional instrumentation — they are computed from the counters above at query time.

Draft acceptance rate

The fraction of routed requests where the drafter’s response was served directly, without escalation.
sum(rate(draftthinker_routing_decisions_total{decision="accept"}[5m]))
/
sum(rate(draftthinker_routing_decisions_total[5m]))

Escalation rate

The fraction of routed requests forwarded to the heavyweight model.
sum(rate(draftthinker_routing_decisions_total{decision="escalate"}[5m]))
/
sum(rate(draftthinker_routing_decisions_total[5m]))

Cache hit rate

The fraction of all requests served from the semantic cache, bypassing the entire draft pipeline.
sum(rate(draftthinker_cache_hits_total[5m]))
/
(
  sum(rate(draftthinker_cache_hits_total[5m]))
  + sum(rate(draftthinker_cache_misses_total[5m]))
)

Speculative cancellation ratio

The fraction of speculative heavyweight calls that were triggered but ultimately cancelled. This represents wasted compute; the target is to keep it below 10% of total escalation cost.
sum(rate(draftthinker_speculative_cancellations_total[5m]))
/
sum(rate(draftthinker_speculative_triggers_total[5m]))
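For checking dashboards by hand, the four ratios in this section can also be computed from counter increases over a single window. The sketch below is a plain-arithmetic equivalent of the queries, not a replacement for them. The dictionary keys are hypothetical shorthand for the counters named above, and the numbers are invented.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator


def derived_metrics(delta):
    """Compute the four derived ratios from counter increases over one window.

    accept, escalate and cache_hit are the per-decision increases of
    draftthinker_routing_decisions_total; the other keys map to the cache
    and speculative counters.
    """
    decisions = delta["accept"] + delta["escalate"] + delta["cache_hit"]
    lookups = delta["cache_hits"] + delta["cache_misses"]
    return {
        "draft_acceptance_rate": ratio(delta["accept"], decisions),
        "escalation_rate": ratio(delta["escalate"], decisions),
        "cache_hit_rate": ratio(delta["cache_hits"], lookups),
        "speculative_cancellation_ratio": ratio(
            delta["spec_cancellations"], delta["spec_triggers"]
        ),
    }


# Invented counter increases for one window.
window = {
    "accept": 60, "escalate": 20, "cache_hit": 20,
    "cache_hits": 20, "cache_misses": 80,
    "spec_triggers": 30, "spec_cancellations": 3,
}

metrics = derived_metrics(window)
for key, value in metrics.items():
    print(key, value)
```

Returning None for an empty window keeps a quiet period from being mistaken for a ratio of zero.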
