How it works
Cache lookup is based on semantic similarity, not exact string matching. Two prompts that ask the same question in different words will hit the same cache entry.

Embed the prompt
The prompt text (system and user messages) is embedded using text-embedding-3-small, producing a 1536-dimensional vector. This call goes to the OpenAI API and is the dominant source of lookup latency, accounting for roughly 10–15 ms of the total ~10–15 ms overhead.

Nearest-neighbor search in Qdrant
The embedding vector is sent to Qdrant for a nearest-neighbor lookup. Qdrant returns the closest stored vector along with its cosine similarity score and a payload containing the Redis key for the cached response.
Similarity threshold check
If the cosine similarity is above 0.95, the match is considered close enough to serve. If the score is at or below 0.95, the lookup is a miss and the request proceeds to the drafter pipeline.
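The comparison itself happens server-side in Qdrant, but the score the threshold check operates on is ordinary cosine similarity. A minimal stand-alone sketch (cosineSimilarity and isHit are illustrative helpers, not project code):

```go
package main

import "math"

// cosineSimilarity computes dot(a, b) / (|a| * |b|) over two embedding
// vectors. Qdrant computes this server-side; this helper only illustrates
// the score that the 0.95 threshold is applied to.
func cosineSimilarity(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// isHit applies the threshold: strictly above 0.95 serves from cache,
// at or below 0.95 is a miss.
func isHit(score float64) bool {
	return score > 0.95
}
```

Note that the check is strict: a score of exactly 0.95 is treated as a miss.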
Redis retrieval
On a vector hit, the Redis key from the Qdrant payload is used to fetch the full serialised response from Redis. Redis holds the actual response bytes and enforces the TTL.
Lazy cleanup on TTL expiry
If Qdrant returns a match but the Redis key no longer exists (the TTL has expired), the Qdrant point is an orphan. The cache performs lazy cleanup: it deletes the orphaned Qdrant point and records a miss.
The Lookup method in internal/cache/store.go implements this flow directly.
The lazy cleanup path (log.Printf("cache: lazy cleanup of orphaned qdrant point %s", result.ID)) fires when Qdrant finds a vector match but Redis has no corresponding entry because the TTL expired. The orphaned Qdrant point is deleted inline and the request is treated as a miss. This avoids a separate background cleanup job and keeps the vector index consistent over time without additional operational overhead.

What gets cached
Only draft-accepted responses are inserted into the cache. Escalated responses, where the drafter demonstrated high uncertainty, are not cached. This ensures that cached responses come from a model path that expressed confidence in the answer. Caching an escalated response would defeat the purpose: the drafter was uncertain about that question, so future semantically similar prompts should not be served a stale heavyweight response without going through the pipeline.

Two-store architecture

The cache uses two separate stores for different responsibilities:

| Store | Role | Technology |
|---|---|---|
| Qdrant | Vector similarity index for nearest-neighbor lookup | Self-hosted, purpose-built vector database |
| Redis | TTL-based metadata and response storage | In-memory KV store with native TTL support |
Each Qdrant point carries a payload field (redis_key) pointing to the actual response data in Redis. When the Redis entry expires, the next lookup against the Qdrant point triggers lazy cleanup.
Design decisions
| Decision | Choice | Rationale |
|---|---|---|
| Embedding model | text-embedding-3-small | 1536-dim via OpenAI API, consistent with hosted model philosophy |
| Similarity threshold | 0.95 | Intentionally conservative to avoid serving stale or semantically drifted answers |
| Vector store | Qdrant | Self-hosted, lightweight, purpose-built for nearest-neighbor |
| Metadata store | Redis | TTLs, eviction tracking, rate counters |
| Cache key | Prompt embedding | Semantic similarity, not exact match |
| Invalidation | TTL + manual eviction API | Entries expire by time; bad responses can be evicted immediately via API |
Phase 5 metrics
| Metric | Type | Description |
|---|---|---|
| draftthinker_cache_hits_total | Counter | Total cache hits (response returned from cache) |
| draftthinker_cache_misses_total | Counter | Total cache misses (no similar prompt or expired Redis entry) |
| draftthinker_cache_lookup_latency_seconds | Histogram | End-to-end cache lookup latency including embedding and vector search |
The cache_lookup_latency_seconds histogram uses buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5.
Phase 5 also adds a new decision label value to draftthinker_routing_decisions_total:
- cache_hit: response served from the semantic cache, skipping the entire draft pipeline
Interpreting cache metrics
- Hit rate = cache_hits_total / (cache_hits_total + cache_misses_total). The design target is above 15% at steady state over a one-hour window. A low hit rate early in deployment is expected; the cache warms over time as draft-accepted responses accumulate.
- Lookup latency is dominated by the embedding API call. The target is below 50 ms end-to-end (embed + Qdrant search + Redis get). Watch the cache_lookup_latency_seconds p99 against the 0.05 s bucket to confirm the target is met. If p99 is consistently above 50 ms, the embedding API is the first thing to investigate.
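The hit-rate arithmetic can be sanity-checked with a tiny helper (plain Go, not project code; function names are illustrative):

```go
package main

// hitRate computes cache_hits_total / (cache_hits_total + cache_misses_total);
// it returns 0 before any lookups have happened.
func hitRate(hits, misses float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return hits / total
}

// meetsTarget reports whether the steady-state design target (above 15%)
// is met for the given counter values.
func meetsTarget(hits, misses float64) bool {
	return hitRate(hits, misses) > 0.15
}
```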