Metrics reference

Draft Thinker exposes Prometheus metrics at /metrics by default. The path is configurable via metrics.path in config.yaml.

config.yaml

metrics:
  path: /metrics

Metrics are grouped by the phase that introduced them. All metric names use the draftthinker_ prefix.

Phase 1 — Foundation metrics

These three instruments measure baseline gateway behavior: total throughput, upstream response time, and error volume.

draftthinker_requests_total

Counter

Total number of requests processed by the gateway, partitioned by model name and HTTP status code.

Labels

Label	Values
`model`	Model name string (e.g. `gpt-4.1-nano`, `gpt-4.1`)
`status`	HTTP status code as a string (e.g. `200`, `429`, `500`)
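As a rough illustration of how these label combinations appear to a consumer, the sketch below sums `draftthinker_requests_total` samples by `status` from Prometheus text exposition output. The function name `requests_by_status` and the sample values are made up for illustration; the parser is deliberately simplified and assumes samples carry no trailing timestamp and label values contain no escaped quotes.

```python
import re
from collections import defaultdict

# Matches lines of the form: metric_name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Made-up sample output, for illustration only.
SAMPLE = """\
# TYPE draftthinker_requests_total counter
draftthinker_requests_total{model="gpt-4.1-nano",status="200"} 120
draftthinker_requests_total{model="gpt-4.1",status="200"} 30
draftthinker_requests_total{model="gpt-4.1",status="429"} 5
draftthinker_errors_total{type="upstream_timeout"} 2
"""


def requests_by_status(exposition_text: str) -> dict:
    """Sum draftthinker_requests_total across models, keyed by status label."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != "draftthinker_requests_total":
            continue  # ignore other metrics and unparsed lines
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("status", "unknown")] += float(match.group("value"))
    return dict(totals)
```

Because `status` is a string label, grouping by status class (for example all `5xx` codes) would need a prefix match on the label value rather than a numeric comparison.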

draftthinker_upstream_latency_seconds

Histogram

End-to-end latency for upstream LLM provider calls, in seconds. Observe this per provider to compare drafter vs. heavyweight response times.

Labels

Label	Values
`provider`	`drafter` or `heavyweight`

Buckets (seconds)

0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30
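Prometheus histograms expose cumulative bucket counts, so percentiles are estimated rather than measured exactly. The sketch below approximates the linear interpolation that PromQL's `histogram_quantile` performs within a bucket; the function name `estimate_quantile` and its input shape are assumptions chosen for illustration, and it ignores edge cases such as negative bucket bounds.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the (math.inf, total_count) bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bound,
                # so that bound is the best available answer.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # not reached when the last bucket holds the total
```

The estimate can never be finer than the bucket layout: with the bounds above, any latency between 10 s and 30 s is interpolated across that whole 20-second range.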

draftthinker_errors_total

Counter

Total number of errors, partitioned by error type.

Labels

Label	Values
`type`	`invalid_request`, `routing_error`, `upstream_error`, `upstream_timeout`, `stream_error`, `internal_error`

Phase 2 — Entropy routing metrics

These instruments track the entropy-based routing engine added in Phase 2. Entropy is computed as per-token Shannon entropy in bits; requests above the configured threshold are escalated to the heavyweight model.

draftthinker_entropy_distribution

Histogram

Distribution of per-token Shannon entropy values observed during drafter inference, measured in bits. Use this histogram to understand how requests distribute across the entropy scale and to validate your escalation threshold.

Labels: none

Buckets (bits)

0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0
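To make the bucket scale concrete, here is a minimal sketch of the quantity being observed: Shannon entropy in bits for a single token's probability distribution. The function name `shannon_entropy_bits` is hypothetical, and the sketch assumes the drafter exposes a normalized probability vector per token, which this page does not specify.

```python
import math
from typing import Sequence


def shannon_entropy_bits(probs: Sequence[float]) -> float:
    """Return H = -sum(p * log2(p)) in bits for one token distribution."""
    if not math.isclose(sum(probs), 1.0, rel_tol=1e-6, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing to the entropy.
    return sum(-p * math.log2(p) for p in probs if p > 0.0)
```

For orientation, a token the drafter is certain about has 0 bits, an even choice between two candidates has 1 bit, and a uniform choice among eight candidates has 3 bits, which matches the highest finite bucket bound.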

draftthinker_routing_decisions_total

Counter

Total routing decisions made by the gateway, partitioned by outcome. Phase 5 added a third decision value (cache_hit).

Labels

Label	Value	Meaning
`decision`	`accept`	Drafter response served directly
`decision`	`escalate`	Request forwarded to heavyweight model
`decision`	`cache_hit`	Response served from semantic cache (Phase 5)
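The three decision values can be read as the outcome of a simple classification. The sketch below is an illustration only: `routing_decision` is a hypothetical function, and it assumes a single request-level entropy value is compared against the threshold, whereas the way Draft Thinker aggregates per-token entropy is not described on this page.

```python
from collections import Counter


def routing_decision(entropy_bits: float, threshold_bits: float,
                     cache_hit: bool = False) -> str:
    """Return the `decision` label value for one request."""
    if cache_hit:
        return "cache_hit"  # served from the semantic cache
    if entropy_bits > threshold_bits:
        return "escalate"   # forwarded to the heavyweight model
    return "accept"         # drafter response served directly


def tally(decisions) -> Counter:
    """Count decisions the way the counter partitions them by label."""
    return Counter(decisions)
```

For example, `tally(routing_decision(h, 1.5) for h in [0.4, 2.1, 0.9])` would count two `accept` decisions and one `escalate`.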

Phase 4 — Speculative execution metrics

Speculative execution fires a parallel heavyweight call when the drafter’s entropy crosses a soft threshold, giving the heavyweight a head start before the hard escalation threshold is reached. These three instruments track when that mechanism engages, how often it is wasted, and how much latency it saves.
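The interplay between the two thresholds can be sketched as a small simulation over a stream of per-token entropy values. Everything here is an assumption made for illustration: the name `simulate_speculation`, the strict greater-than comparison, and the reading of "recovered" as the drafter finishing without ever crossing the hard threshold.

```python
from typing import Iterable, NamedTuple


class SpeculationOutcome(NamedTuple):
    decision: str    # "accept" or "escalate"
    triggered: bool  # a speculative heavyweight call was fired
    cancelled: bool  # the speculative call was fired but not used


def simulate_speculation(entropies: Iterable[float], soft: float,
                         hard: float) -> SpeculationOutcome:
    """Walk per-token entropies and report what speculation would do."""
    if soft >= hard:
        raise ValueError("soft threshold must be below hard threshold")
    triggered = False
    for bits in entropies:
        if bits > hard:
            # Crossing the hard threshold implies the soft one was crossed
            # too, so the speculative call is in flight and gets used.
            return SpeculationOutcome("escalate", True, False)
        if bits > soft:
            triggered = True  # fire the speculative call once
    # The drafter finished below the hard threshold: any speculative
    # call that was fired is cancelled.
    return SpeculationOutcome("accept", triggered, triggered)
```

Under these assumptions, each outcome with `triggered` set would add one to the triggers counter, and each outcome with `cancelled` set would add one to the cancellations counter.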

draftthinker_speculative_triggers_total

Counter

Total number of times the gateway fired a speculative heavyweight call in parallel (soft threshold exceeded).

Labels: none

draftthinker_speculative_cancellations_total

Counter

Total number of speculative heavyweight calls that were cancelled because the drafter recovered before the hard escalation threshold was reached. A high cancellation ratio indicates the soft threshold may be set too aggressively.

Labels: none

draftthinker_speculative_latency_saved_seconds

Histogram

The head-start latency saved on escalated requests that had a running speculative heavyweight call, in seconds. Higher values indicate more user-facing latency was eliminated by speculative execution.

Labels: none

Buckets (seconds)

0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30

Phase 5 — Cache metrics

The semantic cache layer introduced in Phase 5 uses embedding similarity to serve repeated or near-duplicate prompts without hitting either upstream model. These instruments measure cache effectiveness and lookup cost.

draftthinker_cache_hits_total

Counter

Total number of cache hits: a semantically similar prompt was found in the vector index and the cached response was returned.

Labels: none

draftthinker_cache_misses_total

Counter

Total number of cache misses: no sufficiently similar prompt was found, or the matching Redis entry had expired.

Labels: none

draftthinker_cache_lookup_latency_seconds

Histogram

End-to-end cache lookup latency in seconds, including the embedding API call and the Qdrant vector search. The target is under 50 ms (0.05 s) for the full lookup.

Labels: none

Buckets (seconds)

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5
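Because 0.05 is one of the bucket bounds, the share of lookups that meet the 50 ms target can be read directly from the cumulative bucket counts. The sketch below shows that arithmetic; the function name `fraction_within_target` and the dictionary keyed by `le` label values are assumptions for illustration.

```python
import math
from typing import Dict


def fraction_within_target(cumulative: Dict[str, int],
                           target_le: str = "0.05") -> float:
    """Share of observations at or below the target bucket bound.

    cumulative maps each `le` label value to its cumulative count,
    including the "+Inf" bucket, which holds the total.
    """
    total = cumulative["+Inf"]
    if total == 0:
        return math.nan  # no lookups observed yet
    return cumulative[target_le] / total
```

This only works exactly when the target coincides with a bucket bound; a target such as 40 ms would need interpolation between the 0.025 and 0.05 buckets.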

Derived metrics

These ratios are the primary signals for evaluating gateway health and efficiency. None of them require additional instrumentation — they are computed from the counters above at query time.

Draft acceptance rate

The fraction of routed requests where the drafter’s response was served directly, without escalation.

sum(rate(draftthinker_routing_decisions_total{decision="accept"}[5m]))
/
sum(rate(draftthinker_routing_decisions_total[5m]))

Escalation rate

The fraction of routed requests forwarded to the heavyweight model.

sum(rate(draftthinker_routing_decisions_total{decision="escalate"}[5m]))
/
sum(rate(draftthinker_routing_decisions_total[5m]))

Cache hit rate

The fraction of all requests served from the semantic cache, bypassing the entire draft pipeline.

sum(rate(draftthinker_cache_hits_total[5m]))
/
(
  sum(rate(draftthinker_cache_hits_total[5m]))
  + sum(rate(draftthinker_cache_misses_total[5m]))
)

Speculative cancellation ratio

The fraction of triggered speculative heavyweight calls that were ultimately cancelled rather than used. Cancelled calls represent wasted compute; the target is to keep this below 10% of total escalation cost.

sum(rate(draftthinker_speculative_cancellations_total[5m]))
/
sum(rate(draftthinker_speculative_triggers_total[5m]))
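The same four ratios can be reproduced outside Prometheus from two snapshots of the raw counters, since the time interval cancels out of each ratio. The sketch below is illustrative only: the function names and the flat snapshot keys (`decision_accept`, `cache_hits`, and so on) are hypothetical, and it assumes no counter reset occurred between the two snapshots.

```python
import math
from typing import Dict


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN instead of raising when there is no traffic."""
    return numerator / denominator if denominator > 0 else math.nan


def derived_metrics(start: Dict[str, float],
                    end: Dict[str, float]) -> Dict[str, float]:
    """Compute the derived ratios from counter increases between snapshots."""
    delta = {key: end[key] - start[key] for key in end}
    decisions = (delta["decision_accept"] + delta["decision_escalate"]
                 + delta["decision_cache_hit"])
    lookups = delta["cache_hits"] + delta["cache_misses"]
    return {
        "draft_acceptance_rate": safe_ratio(delta["decision_accept"], decisions),
        "escalation_rate": safe_ratio(delta["decision_escalate"], decisions),
        "cache_hit_rate": safe_ratio(delta["cache_hits"], lookups),
        "speculative_cancellation_ratio": safe_ratio(
            delta["spec_cancellations"], delta["spec_triggers"]),
    }
```

Note that the acceptance and escalation denominators sum every `decision` value, so `cache_hit` decisions count toward the total just as they do in the queries above.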
