Pipeline Overview
From src/cems/retrieval.py:1-16:
Pipeline Stages
| Stage | LLM Calls | Purpose |
|---|---|---|
| Query Understanding | 1 | Route to optimal strategy (vector vs hybrid) |
| Query Synthesis | 1 | Expand query into 2-5 search terms |
| HyDE | 1 | Generate hypothetical ideal answer |
| Candidate Retrieval | 0 | Fetch from pgvector (HNSW) + tsvector (BM25) |
| RRF Fusion | 0 | Combine results from multiple retrievers |
| LLM Re-ranking | 1 (optional) | Re-rank by actual relevance |
| Relevance Filtering | 0 | Remove results below threshold |
| Score Adjustments | 0 | Time decay, priority, project scoring |
| Token-Budgeted Assembly | 0 | Select results within token budget |
Stage 1: Query Understanding
Purpose: Analyze query intent to select the optimal retrieval strategy.
Implementation
From src/cems/retrieval.py:extract_query_intent():
Routing Logic
From src/cems/retrieval.py:route_to_strategy():
- Vector mode (fast, 0 LLM calls):
- Simple queries without reasoning requirements
- High-confidence single-domain queries
- Hybrid mode (thorough, 3-4 LLM calls):
- Complex queries requiring reasoning
- Multi-domain queries (2+ domains)
- Moderate/complex complexity
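The routing rules above can be sketched as a small decision function. The `QueryIntent` fields, threshold values, and default-to-hybrid fallback are illustrative assumptions, not the actual CEMS implementation:

```python
from dataclasses import dataclass

@dataclass
class QueryIntent:
    complexity: str          # "simple" | "moderate" | "complex" (assumed labels)
    domains: list[str]       # domains the query touches
    confidence: float        # routing confidence in [0, 1]
    needs_reasoning: bool

def route_to_strategy(intent: QueryIntent) -> str:
    """Route to 'vector' (fast) or 'hybrid' (thorough) per the rules above."""
    if intent.needs_reasoning or intent.complexity in ("moderate", "complex"):
        return "hybrid"
    if len(intent.domains) >= 2:
        return "hybrid"              # multi-domain queries go hybrid
    if intent.confidence >= 0.8 and intent.complexity == "simple":
        return "vector"              # fast path: 0 LLM calls
    return "hybrid"                  # default to the thorough path when unsure
```

Defaulting to hybrid on low confidence trades latency for recall, which matches the "thorough" framing above.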
Stage 2: Query Synthesis
Purpose: Expand the query into multiple search terms to improve recall.
Standard Expansion
From src/cems/retrieval.py:synthesize_query():
- Stay within the SAME specific domain/topic
- No generalizing to broader categories
- Prefer specific technical terms over generic words
Temporal Queries
Detected by patterns like: “first”, “last”, “before”, “after”, “when”
- Focus on events, dates, sequences
- Include date-related terms
- Search for BOTH events in comparison queries
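A minimal sketch of this kind of pattern detection, assuming it is keyword-based; the real pattern list in CEMS may be longer or use a different mechanism:

```python
import re

# Illustrative temporal-query check based on the patterns listed above.
TEMPORAL_PATTERNS = re.compile(r"\b(first|last|before|after|when)\b", re.IGNORECASE)

def is_temporal_query(query: str) -> bool:
    """Return True if the query matches any temporal keyword."""
    return bool(TEMPORAL_PATTERNS.search(query))
```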
Preference Queries
Detected by patterns like: “recommend”, “suggest”, “resources”, “what should”
- Question phrasing: “recommend video editing resources?”
- Answer phrasing: “I use Adobe Premiere Pro”
- Synthesis generates declarative user statements
RAP (Retrieval-Augmented Prompting)
From src/cems/retrieval.py:extract_profile_context():
For preference queries, the system first performs a quick profile probe:
- Search existing memories for user preferences
- Extract 5 key phrases (“I use X”, “I prefer Y”)
- Include in synthesis prompt as dynamic examples
- LLM generates domain-specific expansions
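The phrase-extraction step of the profile probe could look like the following. The helper name, regex, and limit handling are hypothetical; the real extract_profile_context() likely differs:

```python
import re

# Pull "I use X" / "I prefer Y" phrases out of retrieved memories so they
# can be fed into the synthesis prompt as dynamic examples.
def extract_profile_phrases(memories: list[str], limit: int = 5) -> list[str]:
    phrases: list[str] = []
    pattern = re.compile(r"\bI (?:use|prefer) [^.,;]+", re.IGNORECASE)
    for text in memories:
        phrases.extend(m.group(0).strip() for m in pattern.finditer(text))
        if len(phrases) >= limit:
            break
    return phrases[:limit]
```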
Stage 3: HyDE (Hypothetical Document Embeddings)
Purpose: Generate what an ideal answer would look like, then search for documents similar to that answer.
From src/cems/retrieval.py:generate_hypothetical_memory():
Standard HyDE
Temporal HyDE
Preference HyDE
Stage 4: Candidate Retrieval
Purpose: Fetch candidates from PostgreSQL using vector and full-text search.
Vector Search (HNSW)
From src/cems/memory/search.py:_search_raw_async():
Full-Text Search (BM25)
From src/cems/memory/search.py:_search_lexical_raw_async():
Hybrid Search
Combines vector and BM25 results:
Stage 5: RRF Fusion (Reciprocal Rank Fusion)
Purpose: Combine results from multiple retrievers (original query, expansions, HyDE) into a single ranked list.
From src/cems/retrieval.py:reciprocal_rank_fusion():
RRF Formula
QMD Enhancements
List weights:
- Original query: 2.0x weight
- Query expansions: 1.0x weight
- HyDE: 1.0x weight
Rank bonuses:
- Rank 1: +0.05 bonus
- Ranks 2-3: +0.02 bonus
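Standard RRF scores each document as the sum over result lists of 1/(k + rank); the QMD enhancements add per-list weights and rank bonuses on top. A sketch combining both, treating k=60 as the conventional RRF constant rather than CEMS's confirmed value:

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],   # each inner list is doc ids, best first
    weights: list[float],            # per-list weight (e.g. 2.0 for original query)
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked lists with weighted RRF plus top-rank bonuses."""
    scores: dict[str, float] = {}
    for results, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(results, start=1):
            contribution = weight / (k + rank)   # weighted RRF term
            if rank == 1:
                contribution += 0.05             # rank-1 bonus
            elif rank in (2, 3):
                contribution += 0.02             # ranks 2-3 bonus
            scores[doc_id] = scores.get(doc_id, 0.0) + contribution
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```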
Score Normalization and Blending
Stage 6: LLM Re-ranking (Optional)
Purpose: Use an LLM to re-rank candidates by actual relevance, catching semantic mismatches.
From src/cems/retrieval.py:rerank_with_llm():
When It Runs
- Enabled in hybrid mode for complex queries
- Optional (can be disabled for performance)
- Operates on top 40 candidates (configurable)
Prompt
Output
Score Blending
Stage 7: Relevance Filtering
Purpose: Remove results below a relevance threshold.
Stage 8: Score Adjustments
Purpose: Apply metadata-based scoring adjustments.
From src/cems/retrieval.py:apply_score_adjustments():
Priority Boost
Time Decay
Pinned Boost
Project-Scoped Scoring
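The four adjustments above can be sketched as one pass over a result's score. All constants here (boost factors, the 90-day half-life) are illustrative assumptions, not CEMS's actual values:

```python
def apply_score_adjustments(
    score: float,
    *,
    priority: int = 0,           # user-assigned priority level
    age_days: float = 0.0,       # days since the memory was written
    pinned: bool = False,
    in_active_project: bool = False,
    half_life_days: float = 90.0,
) -> float:
    """Apply priority boost, time decay, pinned boost, and project scoring."""
    adjusted = score
    adjusted *= 1.0 + 0.1 * priority               # priority boost
    adjusted *= 0.5 ** (age_days / half_life_days)  # exponential time decay
    if pinned:
        adjusted *= 1.5                            # pinned boost
    if in_active_project:
        adjusted *= 1.2                            # project-scoped boost
    return adjusted
```

An exponential half-life means a memory's contribution halves every `half_life_days`, so recent memories dominate without old ones ever hitting zero.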
Stage 9: Token-Budgeted Assembly
Purpose: Select results that fit within the token budget for context injection.
Standard Assembly
From src/cems/retrieval.py:assemble_context():
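A greedy sketch of token-budgeted assembly. The whitespace tokenizer is a stand-in for whatever tokenizer assemble_context() actually uses:

```python
def assemble_context(
    results: list[tuple[str, float]],   # (text, score), sorted by score desc
    token_budget: int,
    token_count=lambda text: len(text.split()),  # placeholder tokenizer
) -> list[str]:
    """Greedily select high-scoring results that fit the token budget."""
    selected: list[str] = []
    used = 0
    for text, _score in results:
        cost = token_count(text)
        if used + cost > token_budget:
            continue  # skip results that would overflow the budget
        selected.append(text)
        used += cost
    return selected
```

Skipping (rather than stopping at) an oversized result lets smaller, lower-ranked results still fill the remaining budget.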
MMR Assembly (for Aggregation Queries)
From src/cems/retrieval.py:assemble_context_diverse():
For queries requiring information from multiple sessions (e.g., “How many doctors did I visit?”):
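Diverse assembly is typically done with Maximal Marginal Relevance, which trades relevance against similarity to already-selected results. The similarity interface and the lambda value below are illustrative assumptions:

```python
def mmr_select(
    candidates: list[tuple[str, float]],   # (text, relevance score)
    similarity,                            # similarity(a, b) -> float in [0, 1]
    k: int = 5,
    lambda_: float = 0.7,                  # relevance vs. diversity trade-off
) -> list[str]:
    """Greedy MMR: repeatedly pick the candidate maximizing
    lambda * relevance - (1 - lambda) * max similarity to selected."""
    remaining = list(candidates)
    selected: list[str] = []
    while remaining and len(selected) < k:
        def mmr_score(item):
            text, relevance = item
            max_sim = max((similarity(text, s) for s in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best[0])
        remaining.remove(best)
    return selected
```

For a "How many doctors did I visit?" query, this penalizes picking two near-duplicate chunks from the same visit, so evidence from different sessions survives the budget cut.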
Search Modes
CEMS supports three search modes:
Vector Mode
- LLM calls: 0
- Strategy: Vector search only (HNSW)
- Use case: Fast, simple queries with high confidence
- Latency: ~50ms
Hybrid Mode
- LLM calls: 3-4
- Strategy: Full pipeline (synthesis + HyDE + RRF + reranking)
- Use case: Complex queries, preference queries, multi-domain
- Latency: ~800ms
Auto Mode (Default)
- LLM calls: 1 (for routing) + 0 or 3-4
- Strategy: Query understanding routes to vector or hybrid
- Use case: General-purpose (balances speed and accuracy)
- Latency: ~100ms (vector) or ~900ms (hybrid)
Performance Optimizations
Lexical Signal Detection
From src/cems/retrieval.py:is_strong_lexical_signal():
If BM25 returns a strong match with a large gap:
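A hedged sketch of this short-circuit check: if the top BM25 hit far outscores the runner-up, trust the lexical match and skip the expensive hybrid stages. The 2x gap ratio is an assumed threshold:

```python
def is_strong_lexical_signal(bm25_scores: list[float], gap_ratio: float = 2.0) -> bool:
    """True if the top BM25 score dominates the runner-up by gap_ratio."""
    if not bm25_scores:
        return False
    if len(bm25_scores) == 1:
        return bm25_scores[0] > 0    # a lone positive hit counts as strong
    top, runner_up = bm25_scores[0], bm25_scores[1]
    if runner_up <= 0:
        return top > 0
    return top / runner_up >= gap_ratio
```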
Batch Embedding
From src/cems/embedding.py:embed_batch():
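The idea behind batching is to embed texts in chunks rather than one API round trip per text. A sketch, where `embed_fn` is a stand-in for the real embedding client:

```python
def embed_batch(texts: list[str], embed_fn, batch_size: int = 64) -> list[list[float]]:
    """Embed texts in batches of batch_size, preserving input order."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))  # one call embeds the whole batch
    return vectors
```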
Chunk-Level Deduplication
From src/cems/memory/search.py:_dedupe_by_document():
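Chunk-level deduplication keeps only the best-scoring chunk per parent document. A sketch, assuming each result carries a document id and a score:

```python
def dedupe_by_document(results: list[dict]) -> list[dict]:
    """Keep the highest-scoring chunk for each document_id."""
    best: dict[str, dict] = {}
    for result in results:
        doc_id = result["document_id"]
        if doc_id not in best or result["score"] > best[doc_id]["score"]:
            best[doc_id] = result
    # preserve score ordering among the survivors
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)
```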
Configuration
Related Concepts
- Memory Types - How categories affect search
- How It Works - Integration with IDE hooks
- Architecture - Storage and indexing details