The RAG (Retrieval-Augmented Generation) retrieval pipeline powers RepoRAGX’s question-answering capabilities. It converts user queries into embeddings, searches for similar code chunks, and uses an LLM to generate accurate, contextual answers.

Retrieval pipeline overview

The retrieval process follows five steps for each query, and it runs in real time for every question asked.

Step 1: Embed the query

Component: EmbeddingManager (src/rag/embedding_manager.py)

The user’s natural language query is converted to the same 384-dimensional vector space as the code chunks:
query = "How does authentication work?"
query_embedding = embedding_manager.generate_embeddings([query])[0]

Same model requirement

Critical: The query must be embedded using the exact same model (all-MiniLM-L6-v2) that was used during data ingestion. Different models produce incompatible vector spaces.
The embedding process is identical to document embedding:
  1. Tokenize the query text
  2. Pass through the neural network
  3. Extract the 384-dimensional sentence embedding
  4. Normalize the vector for cosine similarity

Output format

Produces a single numpy array of shape (384,):
query_embedding.shape  # (384,)
This vector encodes the semantic meaning of the query.

Step 2: Search the vector store

Component: RAGRetriever (src/rag/rag_retriever.py)

The query embedding is compared against all stored document embeddings using cosine similarity:
results = vector_store.collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=top_k
)

Cosine similarity explained

Cosine similarity measures the angle between two vectors, ranging from -1 to 1:
  • Score = 1 (0°): Vectors point in exactly the same direction—perfect semantic match
  • Score = 0 (90°): Vectors are orthogonal—unrelated concepts
  • Score = -1 (180°): Vectors point in opposite directions—contradictory meanings
Formula:
cosine_similarity = (A · B) / (||A|| × ||B||)
Where A is the query vector and B is a document vector.
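The formula is easy to verify with NumPy; this small standalone sketch checks the three reference angles above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b: (A . B) / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([2.0, 0.0])))   # 1.0  (same direction)
print(cosine_similarity(a, np.array([0.0, 3.0])))   # 0.0  (orthogonal)
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0 (opposite)
```

Note that magnitude cancels out: `[1, 0]` and `[2, 0]` still score 1.0, which is why normalized embeddings are sufficient for ranking.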

ChromaDB distance to similarity

ChromaDB returns cosine distance (not similarity), so it’s converted:
similarity_score = 1 - distance
Implemented in rag_retriever.py:27
  • Distance = 0.2 → Similarity = 0.8 (80% match)
  • Distance = 0.5 → Similarity = 0.5 (50% match)

Top-K retrieval

By default, the top 5 most similar chunks are retrieved:
retriever.retrieve(query, top_k=5)
This balances context quality with token limits. More results provide broader context but may dilute relevance.

Step 3: Filter by threshold

Component: RAGRetriever (src/rag/rag_retriever.py:29)

Retrieved results can be filtered by minimum similarity score:
if similarity_score >= score_threshold:
    retrieved_docs.append({
        'id': doc_id,
        'content': document,
        'metadata': metadata,
        'similarity_score': similarity_score,
        'distance': distance,
        'rank': i + 1
    })

Default threshold: 0.0

The default threshold of 0.0 accepts all results, relying on top-k ranking instead:
retriever.retrieve(query, top_k=5, score_threshold=0.0)
For stricter filtering, increase the threshold:
  • 0.5: Moderate similarity required
  • 0.7: High similarity required
  • 0.9: Near-identical matches only
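To see how the threshold interacts with top-k, here is a small standalone sketch with made-up similarity scores (the `filter_by_threshold` helper and the score values are illustrative, not part of the codebase):

```python
def filter_by_threshold(results, score_threshold):
    """Keep only results whose similarity meets the threshold (hypothetical helper)."""
    return [r for r in results if r["similarity_score"] >= score_threshold]

# Made-up scores for five retrieved chunks
results = [{"id": f"doc_{i}", "similarity_score": s}
           for i, s in enumerate([0.91, 0.78, 0.64, 0.52, 0.31])]

print(len(filter_by_threshold(results, 0.0)))  # 5 -- default keeps everything
print(len(filter_by_threshold(results, 0.5)))  # 4 -- moderate filtering
print(len(filter_by_threshold(results, 0.7)))  # 2 -- strict filtering
```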

Step 4: Retrieve document chunks

Component: RAGRetriever (src/rag/rag_retriever.py:1-48)

Each retrieved result contains:
Unique identifier in format doc_{uuid}_{index}:
'id': 'doc_a3f2b1c0_42'
The full text of the code chunk:
'content': 'def authenticate(user, password):\n    ...'
Original file information:
'metadata': {
    'path': 'src/auth/login.py',
    'doc_index': 42,
    'content_length': 856,
    'repo': 'owner/repo'
}
Relevance scores:
'similarity_score': 0.8234,  # Cosine similarity
'distance': 0.1766,          # Cosine distance
'rank': 1                     # Position in results

Complete retrieval flow

Implementation in rag_retriever.py:7-47:
def retrieve(self, query, top_k=5, score_threshold=0.0):
    # Step 1: Embed query
    query_embedding = self.embedding_manager.generate_embeddings([query])[0]
    
    # Step 2: Search vector store
    results = self.vector_store.collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    
    # Step 3: Process results
    retrieved_docs = []
    documents = results['documents'][0]
    metadatas = results['metadatas'][0]
    distances = results['distances'][0]
    ids = results['ids'][0]
    
    # Step 4: Build response objects
    for i, (doc_id, document, metadata, distance) in enumerate(
        zip(ids, documents, metadatas, distances)
    ):
        similarity_score = 1 - distance
        
        if similarity_score >= score_threshold:
            retrieved_docs.append({
                'id': doc_id,
                'content': document,
                'metadata': metadata,
                'similarity_score': similarity_score,
                'distance': distance,
                'rank': i + 1
            })
    
    return retrieved_docs

Step 5: Generate LLM answer

Component: GroqLLM (src/rag/groq_llm.py)

Retrieved chunks are combined with the query and sent to an LLM for answer generation:
llm = GroqLLM(model_name="llama-3.3-70b-versatile")
answer = llm.rag(query=query, retriever=rag_retriever)

Context building

Retrieved documents are formatted into context:
context_parts = []
for doc in results:
    meta = doc.get("metadata", {})
    header = f"File: {meta.get('path', 'unknown')}"
    context_parts.append(f"--- {header} ---\n{doc['content']}")

context = "\n\n".join(context_parts)
Implementation: groq_llm.py:36-42

Example context format

--- File: src/auth/login.py ---
def authenticate(user, password):
    hashed = hash_password(password)
    return db.verify(user, hashed)

--- File: src/auth/utils.py ---
def hash_password(password):
    return bcrypt.hashpw(password.encode(), bcrypt.gensalt())

--- File: src/models/user.py ---
class User:
    def __init__(self, username, password_hash):
        self.username = username
        self.password_hash = password_hash

Prompt construction

The final prompt combines context and query:
prompt = f"""
Use the following context to answer the question concisely.

Context:
{context}

Question: {query}

Answer:
"""

response = self.llm.invoke(prompt)
return response.content
Implementation: groq_llm.py:44-56

LLM configuration

The Groq LLM is initialized with specific parameters:
GroqLLM(
    model_name="llama-3.3-70b-versatile",
    temperature=0.1,      # Low temperature for factual accuracy
    max_tokens=1024       # Limit response length
)
Temperature = 0.1: Produces deterministic, focused answers by reducing randomness. Higher values (0.7+) would generate more creative but potentially less accurate responses.

Supported models

Groq supports multiple LLM options:
  • llama-3.3-70b-versatile (default)
  • llama-3.1-70b-versatile
  • mixtral-8x7b-32768
  • gemma-7b-it
See Groq documentation for the full list.

End-to-end query loop

Here’s the end-to-end process as implemented in src/main.py:49-54:
# Interactive query loop
while True:
    # Get user query
    query = input("\nAsk anything ('exit' to quit): ")
    if query.strip().lower() == "exit":
        break
    
    # Run RAG pipeline
    answer = llm.rag(query=query, retriever=rag_retriever)
    
    # Display answer
    print(answer)
The llm.rag() method orchestrates:
  1. Embedding the query
  2. Retrieving relevant chunks
  3. Building context
  4. Generating the answer
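Those four stages can be sketched as one dependency-injected function. This is a minimal illustration of the orchestration, not the actual `llm.rag()` implementation, and the `FakeRetriever`/`FakeLLM` stand-ins are hypothetical:

```python
def rag(query, retriever, llm, top_k=5):
    """Minimal sketch of RAG orchestration: retrieve, build context, generate."""
    # Stages 1-2: embed the query and retrieve chunks (handled inside retrieve)
    results = retriever.retrieve(query, top_k=top_k)
    if not results:
        return "No relevant context found to answer the question."

    # Stage 3: format chunks with their file paths into a context string
    context = "\n\n".join(
        f"--- File: {doc.get('metadata', {}).get('path', 'unknown')} ---\n"
        f"{doc['content']}"
        for doc in results
    )

    # Stage 4: combine context and query, then ask the LLM
    prompt = (
        "Use the following context to answer the question concisely.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:\n"
    )
    return llm.invoke(prompt)

# Tiny stand-ins to exercise the sketch (not the real classes)
class FakeRetriever:
    def retrieve(self, query, top_k=5):
        return [{"id": "doc_0", "content": "def authenticate(user, password): ...",
                 "metadata": {"path": "src/auth/login.py"}}]

class FakeLLM:
    def invoke(self, prompt):
        return "It hashes the password and verifies it against the database."

print(rag("How does authentication work?", FakeRetriever(), FakeLLM()))
```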

Performance characteristics

Query embedding

~10ms on modern CPUs. Single query embedding is nearly instantaneous.

Vector search

~50-200ms for 10k chunks. The HNSW index provides O(log n) search time.

LLM generation

~1-5 seconds. Depends on context length and model choice.

Total latency

~2-6 seconds. From query input to answer display.
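These figures are easy to check for your own hardware by timing each stage with `time.perf_counter`. The sketch below uses placeholder stage bodies; substitute the real calls noted in the comments:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of a pipeline stage, in milliseconds."""
    start = time.perf_counter()
    yield
    timings[label] = (time.perf_counter() - start) * 1000

timings = {}
with timed("embed", timings):
    pass  # e.g. embedding_manager.generate_embeddings([query])
with timed("search", timings):
    pass  # e.g. vector_store.collection.query(...)
with timed("generate", timings):
    pass  # e.g. llm.rag(query=query, retriever=rag_retriever)

for stage, ms in timings.items():
    print(f"{stage}: {ms:.1f} ms")
```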

Handling edge cases

The pipeline gracefully handles various scenarios:

No results found

Returns a clear message when no relevant context exists:
if not results:
    return "No relevant context found to answer the question."
Implementation: groq_llm.py:33-34

Retrieval errors

Catches exceptions and returns empty results:
except Exception as e:
    print(f"Error during retrieval: {e}")
    return []
Implementation: rag_retriever.py:45-47

Ambiguous queries

The LLM is instructed to answer “concisely” and may indicate when context is insufficient to provide a definitive answer.

Optimization strategies

Fewer results (k=3):
  • Faster retrieval
  • More focused context
  • Risk missing relevant information
More results (k=10):
  • Broader context
  • Better recall
  • May exceed token limits
  • Slower LLM processing
Default k=5 balances these trade-offs.
Filter out low-quality matches:
retriever.retrieve(query, top_k=10, score_threshold=0.6)
Returns fewer but higher-quality results.
Low temperature (0.0-0.3):
  • More factual
  • Deterministic
  • Better for code questions
High temperature (0.7-1.0):
  • More creative
  • Varied responses
  • Better for brainstorming

Query examples

Here’s how different queries are processed:
Query: "How does the authenticate function work?"

Retrieval:
- Finds chunks containing "authenticate"
- High similarity to function definitions
- Returns 3-5 relevant code snippets

Answer:
"The authenticate function takes a user and password,
hashes the password using bcrypt, and verifies it
against the database..."

Debugging retrieval

The system prints detailed logs during retrieval:
Retrieving documents for query: 'How does authentication work?'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...
Retrieved 5 documents (after filtering)
You can inspect results before LLM generation:
results = rag_retriever.retrieve(query, top_k=5)
for doc in results:
    print(f"Rank {doc['rank']}: {doc['metadata']['path']}")
    print(f"Similarity: {doc['similarity_score']:.3f}")
    print(f"Preview: {doc['content'][:100]}...\n")

Next steps

How it works

Review the complete two-pipeline architecture

Data ingestion

Learn how the vector database is built

Build docs developers (and LLMs) love