Retrieval pipeline overview
The retrieval process follows five steps for each query. This happens in real time for every question asked.

Step 1: Embed the query
Component: EmbeddingManager (src/rag/embedding_manager.py)
The user’s natural language query is converted to the same 384-dimensional vector space as the code chunks:
Same model requirement

Critical: The query must be embedded using the exact same model (all-MiniLM-L6-v2) that was used during data ingestion. Different models produce incompatible vector spaces.

Embedding a query involves four steps:

- Tokenize the query text
- Pass through the neural network
- Extract the 384-dimensional sentence embedding
- Normalize the vector for cosine similarity
Output format
Produces a single numpy array of shape (384,).
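A minimal sketch of what this output looks like, using a random vector as a stand-in for the real all-MiniLM-L6-v2 embedding (the normalization mirrors step 4 in the list above):

```python
import numpy as np

def normalize(vec: np.ndarray) -> np.ndarray:
    """L2-normalize so that dot products equal cosine similarity."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# A raw 384-dimensional sentence embedding (random stand-in here;
# the real vector comes from the embedding model).
raw = np.random.default_rng(0).normal(size=384)
query_embedding = normalize(raw)

print(query_embedding.shape)  # (384,)
print(round(float(np.linalg.norm(query_embedding)), 6))  # 1.0
```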
Step 2: Similarity search
Component: RAGRetriever (src/rag/rag_retriever.py)
The query embedding is compared against all stored document embeddings using cosine similarity:
Cosine similarity explained
Cosine similarity measures the angle between two vectors, ranging from -1 to 1:
Formula:

cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

where A is the query vector and B is a document vector.
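A small numpy sketch of the formula (the vectors here are 2-D for readability; the real ones are 384-D):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (||A|| * ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(round(cosine_similarity(a, a), 3))   # 1.0   (identical direction)
print(round(cosine_similarity(a, b), 3))   # 0.707 (45-degree angle)
print(round(cosine_similarity(a, -a), 3))  # -1.0  (opposite direction)
```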
ChromaDB distance to similarity
ChromaDB returns cosine distance (not similarity), so it is converted to similarity as 1 − distance (rag_retriever.py:27):

Distance = 0.2 → Similarity = 0.8 (80% match)
Distance = 0.5 → Similarity = 0.5 (50% match)
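The conversion is a one-liner; this sketch reproduces the two examples above:

```python
def distance_to_similarity(distance: float) -> float:
    # ChromaDB's cosine distance is defined as 1 - cosine similarity.
    return 1.0 - distance

print(distance_to_similarity(0.2))  # 0.8
print(distance_to_similarity(0.5))  # 0.5
```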
Top-K retrieval
By default, the top 5 most similar chunks are retrieved.

Step 3: Filter by threshold
Component: RAGRetriever (src/rag/rag_retriever.py:29)
Retrieved results can be filtered by minimum similarity score:
Default threshold: 0.0
The default threshold of 0.0 accepts all results, relying on top-k ranking instead.
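A sketch of threshold filtering over a ChromaDB-style query result. The nested-list dict shape follows ChromaDB's `query()` return format; the helper name and the stand-in data below are illustrative, not the actual rag_retriever.py code:

```python
def filter_results(results: dict, score_threshold: float = 0.0) -> list[dict]:
    """Convert a ChromaDB-style query result into scored chunks,
    keeping only those at or above the similarity threshold."""
    chunks = []
    for doc_id, content, meta, dist in zip(
        results["ids"][0],
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        similarity = 1.0 - dist  # cosine distance -> cosine similarity
        if similarity >= score_threshold:
            chunks.append({"id": doc_id, "content": content,
                           "metadata": meta, "similarity": similarity})
    return chunks

# Stand-in results for illustration (real ones come from collection.query()).
fake = {
    "ids": [["doc_a_0", "doc_a_1"]],
    "documents": [["def foo(): ...", "def bar(): ..."]],
    "metadatas": [[{"file": "a.py"}, {"file": "b.py"}]],
    "distances": [[0.2, 0.6]],
}
print(len(filter_results(fake, 0.0)))  # 2 -> default keeps everything
print(len(filter_results(fake, 0.5)))  # 1 -> only the 0.8-similarity chunk
```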
Step 4: Retrieve document chunks
Component: RAGRetriever (src/rag/rag_retriever.py:1-48)
Each retrieved result contains:

Document ID

Unique identifier in the format doc_{uuid}_{index}.

Content

The full text of the code chunk.

Metadata

Original file information.

Similarity metrics

Relevance scores.
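For illustration, one retrieved result might look like the following (all field values are hypothetical; only the doc_{uuid}_{index} id format comes from this document):

```python
# Hypothetical shape of one retrieved result (values are illustrative).
result = {
    "id": "doc_3f2a9c1e_0",                       # doc_{uuid}_{index}
    "content": "def load_config(path):\n    ...",  # full text of the chunk
    "metadata": {
        "file_path": "src/config.py",              # original file information
        "chunk_index": 0,
    },
    "similarity": 0.82,                            # relevance score
    "distance": 0.18,                              # as returned by ChromaDB
}
print(result["id"].startswith("doc_"))  # True
```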
Complete retrieval flow
Implementation: rag_retriever.py:7-47.
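A sketch of how the whole flow could be structured, including the documented catch-and-return-empty error handling (the class shape and the stub collection are illustrative, not the actual rag_retriever.py code; the `query()` call follows ChromaDB's API):

```python
class RAGRetriever:
    """Sketch of the retrieval flow: embed, search, convert, filter."""

    def __init__(self, embedder, collection, top_k=5, score_threshold=0.0):
        self.embedder = embedder
        self.collection = collection
        self.top_k = top_k
        self.score_threshold = score_threshold

    def retrieve(self, query: str) -> list[dict]:
        try:
            query_embedding = self.embedder(query)
            results = self.collection.query(
                query_embeddings=[query_embedding], n_results=self.top_k
            )
            chunks = []
            for doc_id, content, meta, dist in zip(
                results["ids"][0], results["documents"][0],
                results["metadatas"][0], results["distances"][0],
            ):
                similarity = 1.0 - dist  # cosine distance -> similarity
                if similarity >= self.score_threshold:
                    chunks.append({"id": doc_id, "content": content,
                                   "metadata": meta, "similarity": similarity})
            return chunks
        except Exception as exc:
            # Mirror the documented behavior: swallow errors, return nothing.
            print(f"Retrieval error: {exc}")
            return []

# Stub collection for illustration only.
class StubCollection:
    def query(self, query_embeddings, n_results):
        return {"ids": [["doc_x_0"]], "documents": [["print('hi')"]],
                "metadatas": [[{"file": "x.py"}]], "distances": [[0.3]]}

retriever = RAGRetriever(embedder=lambda q: [0.0] * 384,
                         collection=StubCollection())
print(len(retriever.retrieve("how do I print?")))  # 1
```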
Step 5: Generate LLM answer
Component: GroqLLM (src/rag/groq_llm.py)
Retrieved chunks are combined with the query and sent to an LLM for answer generation:
Context building
Retrieved documents are formatted into context (groq_llm.py:36-42).
Example context format
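A possible context format, shown as a small helper (the exact layout used in groq_llm.py:36-42 may differ):

```python
def build_context(chunks: list[dict]) -> str:
    """Join retrieved chunks into a single context string.
    (Illustrative format, not the verbatim groq_llm.py output.)"""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk["metadata"].get("file_path", "unknown")
        parts.append(f"[Document {i}] (source: {source})\n{chunk['content']}")
    return "\n\n".join(parts)

chunks = [
    {"content": "def add(a, b):\n    return a + b",
     "metadata": {"file_path": "src/math_utils.py"}},
]
print(build_context(chunks))
```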
Prompt construction
The final prompt combines context and query (groq_llm.py:44-56).
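A sketch of a prompt template in this style (the real wording in groq_llm.py:44-56 may differ):

```python
def build_prompt(context: str, query: str) -> str:
    """Combine retrieved context with the user's question.
    (Illustrative template, not the verbatim groq_llm.py prompt.)"""
    return (
        "Answer the question using only the context below.\n"
        "If the context is not relevant, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_prompt("def add(a, b): return a + b", "What does add do?")
print(prompt)
```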
LLM configuration
The Groq LLM is initialized with specific parameters.

Temperature = 0.1: Produces deterministic, focused answers by reducing randomness. Higher values (0.7+) would generate more creative but potentially less accurate responses.
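The request parameters can be pictured as follows. Only temperature = 0.1 and the default model name come from this document; the rest of the payload shape follows the standard chat-completions format and is an assumption:

```python
# Sketch of a Groq chat-completion request payload (not an API call).
request = {
    "model": "llama-3.3-70b-versatile",  # default model
    "temperature": 0.1,                  # low randomness -> focused answers
    "messages": [
        {"role": "user", "content": "…prompt with context and query…"},
    ],
}
print(request["temperature"])  # 0.1
```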
Supported models
Groq supports multiple LLM options:

- llama-3.3-70b-versatile (default)
- llama-3.1-70b-versatile
- mixtral-8x7b-32768
- gemma-7b-it
Complete retrieval flow
Here’s the end-to-end process as implemented in src/main.py:49-54.

The llm.rag() method orchestrates:
- Embedding the query
- Retrieving relevant chunks
- Building context
- Generating the answer
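The four orchestration steps above can be sketched end-to-end with stand-in components (every name below is illustrative, not the real llm.rag() signature):

```python
def rag(query, embedder, retriever, build_context, build_prompt, llm_call):
    """End-to-end sketch of the orchestration (hypothetical names)."""
    chunks = retriever(embedder(query))            # 1-2: embed + retrieve
    if not chunks:
        return "No relevant context found."        # documented edge case
    context = build_context(chunks)                # 3: build context
    return llm_call(build_prompt(context, query))  # 4: generate the answer

# Wire up trivial stand-ins to show the flow.
answer = rag(
    "what does foo do?",
    embedder=lambda q: [0.0] * 384,
    retriever=lambda emb: [{"content": "def foo(): pass", "metadata": {}}],
    build_context=lambda chunks: chunks[0]["content"],
    build_prompt=lambda ctx, q: f"{ctx}\n\nQ: {q}",
    llm_call=lambda prompt: "foo is a no-op function.",
)
print(answer)  # foo is a no-op function.
```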
Performance characteristics

Query embedding

~10 ms on modern CPUs. A single query embedding is nearly instantaneous.

Vector search

~50-200 ms for 10k chunks. The HNSW index provides O(log n) search time.

LLM generation

~1-5 seconds, depending on context length and model choice.

Total latency

~2-6 seconds from query input to answer display.
Handling edge cases

The pipeline gracefully handles various scenarios.

No results found

Returns a clear message when no relevant context exists. Implementation: groq_llm.py:33-34

Retrieval errors
Catches exceptions and returns empty results. Implementation: rag_retriever.py:45-47

Optimization strategies
Adjust top-k for context quality

Fewer results (k=3):

- Faster retrieval
- More focused context
- Risk of missing relevant information

More results (higher k):

- Broader context
- Better recall
- May exceed token limits
- Slower LLM processing

The default of k=5 balances these trade-offs.

Use score thresholds for precision

Filter out low-quality matches. This returns fewer but higher-quality results.
Experiment with temperature

Low temperature (0.0-0.3):

- More factual
- Deterministic
- Better for code questions

High temperature (0.7+):

- More creative
- Varied responses
- Better for brainstorming
Query examples
Here’s how different queries are processed.

Debugging retrieval
The system prints detailed logs during retrieval.

Next steps
How it works
Review the complete two-pipeline architecture
Data ingestion
Learn how the vector database is built