How it works
Cache lookup is based on semantic similarity, not exact string matching. Two prompts that ask the same question in different words will hit the same cache entry.

Embed the prompt
The prompt text (system and user messages) is embedded using text-embedding-3-small, producing a 1536-dimensional vector. This call goes to the OpenAI API and is the dominant source of lookup latency, accounting for roughly 10–15 ms of the total ~10–15 ms overhead.

Nearest-neighbor search in Qdrant
The embedding vector is sent to Qdrant for a nearest-neighbor lookup. Qdrant returns the closest stored vector along with its cosine similarity score and a payload containing the Redis key for the cached response.
Similarity threshold check
If the cosine similarity is above 0.95, the match is considered close enough to serve. If the score is at or below 0.95, the lookup is a miss and the request proceeds to the drafter pipeline.
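The comparison itself happens server-side in Qdrant, but the score the threshold check operates on is ordinary cosine similarity. A minimal stand-alone sketch (cosineSimilarity and isHit are illustrative helpers, not project code):

```go
package main

import "math"

// cosineSimilarity computes dot(a, b) / (|a| * |b|) over two embedding
// vectors. Qdrant computes this server-side; this helper only illustrates
// the score that the 0.95 threshold is applied to.
func cosineSimilarity(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// isHit applies the threshold: strictly above 0.95 serves from cache,
// at or below 0.95 is a miss.
func isHit(score float64) bool {
	return score > 0.95
}
```

Note that the check is strict: a score of exactly 0.95 is treated as a miss.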
Redis retrieval
On a vector hit, the Redis key from the Qdrant payload is used to fetch the full serialised response from Redis. Redis holds the actual response bytes and enforces the TTL.
Lazy cleanup on TTL expiry
If Qdrant returns a match but the Redis key no longer exists (the TTL has expired), the Qdrant point is an orphan. The cache performs lazy cleanup: it deletes the orphaned Qdrant point and records a miss.
The Lookup method in internal/cache/store.go implements this flow directly.
The lazy cleanup path (log.Printf("cache: lazy cleanup of orphaned qdrant point %s", result.ID)) fires when Qdrant finds a vector match but Redis has no corresponding entry because the TTL expired. The orphaned Qdrant point is deleted inline and the request is treated as a miss. This avoids a separate background cleanup job and keeps the vector index consistent over time without additional operational overhead.

What gets cached
Only draft-accepted responses are inserted into the cache. Escalated responses, where the drafter demonstrated high uncertainty, are not cached. This ensures that cached responses come from a model path that expressed confidence in the answer. Caching an escalated response would defeat the purpose: the drafter was uncertain about that question, so future semantically similar prompts should not be served a stale heavyweight response without going through the pipeline.

Two-store architecture

The cache uses two separate stores for different responsibilities:

| Store | Role | Technology |
|---|---|---|
| Qdrant | Vector similarity index for nearest-neighbor lookup | Self-hosted, purpose-built vector database |
| Redis | TTL-based metadata and response storage | In-memory KV store with native TTL support |
Each Qdrant point carries a payload field (redis_key) pointing to the actual response data in Redis. When the Redis entry expires, the next lookup against the Qdrant point triggers lazy cleanup.
Design decisions
| Decision | Choice | Rationale |
|---|---|---|
| Embedding model | text-embedding-3-small | 1536-dim via OpenAI API, consistent with hosted model philosophy |
| Similarity threshold | 0.95 | Intentionally conservative to avoid serving stale or semantically drifted answers |
| Vector store | Qdrant | Self-hosted, lightweight, purpose-built for nearest-neighbor |
| Metadata store | Redis | TTLs, eviction tracking, rate counters |
| Cache key | Prompt embedding | Semantic similarity, not exact match |
| Invalidation | TTL + manual eviction API | Entries expire by time; bad responses can be evicted immediately via API |
Phase 5 metrics
| Metric | Type | Description |
|---|---|---|
| draftthinker_cache_hits_total | Counter | Total cache hits (response returned from cache) |
| draftthinker_cache_misses_total | Counter | Total cache misses (no similar prompt or expired Redis entry) |
| draftthinker_cache_lookup_latency_seconds | Histogram | End-to-end cache lookup latency including embedding and vector search |
The cache_lookup_latency_seconds histogram uses buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5.
Phase 5 also adds a new decision label value to draftthinker_routing_decisions_total:
- cache_hit: response served from the semantic cache, skipping the entire draft pipeline
Interpreting cache metrics
- Hit rate = cache_hits_total / (cache_hits_total + cache_misses_total). The design target is above 15% at steady state over a one-hour window. A low hit rate early in deployment is expected; the cache warms over time as draft-accepted responses accumulate.
- Lookup latency is dominated by the embedding API call. The target is below 50 ms end-to-end (embed + Qdrant search + Redis get). Watch the cache_lookup_latency_seconds p99 against the 0.05 s bucket to confirm the target is met. If p99 is consistently above 50 ms, the embedding API is the first thing to investigate.
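The hit-rate arithmetic can be sanity-checked with a tiny helper (plain Go, not project code; function names are illustrative):

```go
package main

// hitRate computes cache_hits_total / (cache_hits_total + cache_misses_total);
// it returns 0 before any lookups have happened.
func hitRate(hits, misses float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return hits / total
}

// meetsTarget reports whether the steady-state design target (above 15%)
// is met for the given counter values.
func meetsTarget(hits, misses float64) bool {
	return hitRate(hits, misses) > 0.15
}
```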