Overview

The RAGEngine class is the central orchestrator of Quest’s on-device RAG system. It combines exact matching, semantic retrieval, and local LLM inference to provide intelligent responses to coding questions.

Core Architecture

The RAG engine operates in three distinct phases:
  1. Exact Matching - Fast hash map lookup for known problem titles
  2. Semantic Retrieval - FAISS-based similarity search for related solutions
  3. LLM Generation - Context-aware response generation using Ollama
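The three phases above form a cascade: each phase runs only if the previous one fails to produce an answer. A minimal sketch of that control flow (function names are illustrative, not the actual Quest API; the real entry point is answer_question, shown later):

```python
# Minimal sketch of the three-phase cascade (illustrative names).
def answer(query, exact_map, retriever, llm, k=5):
    # Phase 1: exact matching via a normalized-title hash map
    key = query.strip().lower()
    if key in exact_map:
        return exact_map[key]
    # Phase 2: semantic retrieval of the k nearest solutions
    context = retriever(query, k)
    # Phase 3: LLM generation conditioned on the retrieved context
    return llm(query, context)
```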

Class Initialization

class RAGEngine:
    def __init__(
        self,
        retriever: LeetCodeRetriever,
        ollama_url: str = "http://localhost:11434/api/generate",
        model_name: str = "qwen2.5-coder:1.5b",
        reasoning_model: str = "deepseek-r1:7b",
        mode: str = "general",
        temperature: float = 0.4,
        top_p: float = 0.9,
        confidence_threshold: float = 0.7,
        repeat_penalty: float = 1.1,
        num_thread: int = 8,
        max_history: int = 3
    )
The RAG engine supports two inference modes: general (fast, using qwen2.5-coder:1.5b) and reasoning (deeper analysis, using deepseek-r1:7b).
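The mode determines which Ollama model serves the request, mirroring the selection logic in call_ollama further down. A minimal sketch (constant names are illustrative):

```python
# Sketch: mode -> model selection (illustrative constant names).
GENERAL_MODEL = "qwen2.5-coder:1.5b"
REASONING_MODEL = "deepseek-r1:7b"

def select_model(mode: str) -> str:
    # "reasoning" routes to the larger model; anything else uses the fast one
    return REASONING_MODEL if mode == "reasoning" else GENERAL_MODEL
```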

Exact Matching System

Quest implements a hash map-based exact matching system for O(1) lookup of known problems.

How It Works

def _build_exact_match_map(self) -> dict:
    """Build a hash map for exact match search."""
    exact_match_map = {}
    for solution in self.retriever.solutions:
        normalized_title = self._normalize_title(solution.title)
        exact_match_map[normalized_title] = solution
    return exact_match_map

def _normalize_title(self, title: str) -> str:
    """Normalize a title for exact match search."""
    return title.strip().lower()
Key Features:
  • Title normalization (lowercase, stripped whitespace)
  • Direct solution retrieval without LLM inference
  • Bypasses retrieval and generation for exact matches
When you know the exact problem title (e.g., “two sum”), Quest returns the solution instantly without using the LLM, saving time and compute resources.
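The exact-match path can be demonstrated end to end; the sketch below assumes a minimal Solution shape with only title and solution fields:

```python
# Runnable sketch of the exact-match path (minimal Solution shape assumed).
from dataclasses import dataclass

@dataclass
class Solution:
    title: str
    solution: str

def normalize_title(title: str) -> str:
    # Same normalization as _normalize_title: strip whitespace, lowercase
    return title.strip().lower()

def build_exact_match_map(solutions):
    # Same shape as _build_exact_match_map: normalized title -> solution
    return {normalize_title(s.title): s for s in solutions}

solutions = [Solution("Two Sum", "use a hash map of value -> index")]
exact = build_exact_match_map(solutions)
print(exact[normalize_title("  Two Sum ")].solution)
# -> use a hash map of value -> index
```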

Retrieval + Generation Pipeline

When exact matching fails, Quest uses a multi-stage retrieval and generation pipeline:

Step 1: Semantic Retrieval

retrieved_solutions = self.retriever.search(
    query, k=k, return_scores=True)
filtered_solutions = [
    sol for sol in retrieved_solutions 
    if hasattr(sol, 'score') and float(sol.score) >= min_confidence
]
The engine retrieves k similar solutions using FAISS and filters them based on a minimum confidence threshold.
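The filtering step can be isolated as a pure function; this sketch assumes retrieved items carry a .score attribute, as in the snippet above:

```python
# Sketch of the confidence filter (retrieved items assumed to carry .score).
from types import SimpleNamespace

def filter_by_confidence(retrieved, min_confidence=0.6):
    return [
        sol for sol in retrieved
        if hasattr(sol, "score") and float(sol.score) >= min_confidence
    ]

hits = [SimpleNamespace(title="Two Sum", score=0.82),
        SimpleNamespace(title="3Sum", score=0.41)]
kept = filter_by_confidence(hits)
print([s.title for s in kept])  # -> ['Two Sum']
```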

Step 2: Confidence-Based Fallback

if not filtered_solutions and k < 5:
    return self.answer_question(
        query, 
        k=k + 2, 
        min_confidence=min_confidence - 0.1
    )
If no solutions meet the confidence threshold, Quest automatically expands the search by:
  • Increasing k by 2 (retrieving more candidates)
  • Reducing min_confidence by 0.1 (relaxing the threshold)
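The expansion step can be written as a small helper (an iterative restatement of the recursive call above; the bound of 5 matches the k < 5 guard in the source):

```python
# Sketch of the fallback parameter expansion (iterative form, illustrative).
def expand_search(k, min_confidence, k_limit=5):
    # Widen the net only while k is below the limit
    if k < k_limit:
        return k + 2, round(min_confidence - 0.1, 2)
    return None  # give up: no further relaxation

print(expand_search(3, 0.6))  # -> (5, 0.5)
```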

Step 3: Prompt Generation

The engine generates a structured prompt incorporating:
  • Conversation history
  • Retrieved solutions with confidence scores
  • Mode-specific instructions (general vs reasoning)
def generate_enhanced_prompt(self, query: str, context: List[Solution]) -> str:
    history_context = self.conversation_history.get_context()
    
    if self.mode == "reasoning":
        base_prompt = PromptTemplates.reasoning_prompt(query, context)
    else:
        base_prompt = PromptTemplates.general_prompt(query, context)
    
    enhanced_prompt = (
        f"Conversation History:\n{history_context}\n\n"
        f"Query: {query}\n\n"
        f"Context: {context}\n\n"
        f"Instruction: {base_prompt}"
    )
    return enhanced_prompt

Step 4: LLM Inference

Quest calls Ollama with the enhanced prompt and streams the response:
def call_ollama(self, prompt: str) -> str:
    model = self.reasoning_model if self.mode == "reasoning" else self.model_name
    
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": self.temperature,
        "top_p": self.top_p,
        "num_thread": self.num_thread,
        "repeat_penalty": self.repeat_penalty,
        "stream": True
    }
    
    response = requests.post(self.ollama_url, json=payload, stream=True)
    # Stream processing...
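Ollama's streaming endpoint emits one JSON object per line, each carrying a response fragment and a done flag. A sketch of the accumulation loop that the elided stream-processing step might perform (the should_stop hook is an assumption, modeled on the stop()/reset() flag described below):

```python
import json

def collect_stream(lines, should_stop=lambda: False):
    # Each streamed line is a JSON object like
    # {"response": "...", "done": false}; concatenate the fragments.
    chunks = []
    for line in lines:
        if should_stop():
            break  # user canceled generation mid-stream
        data = json.loads(line)
        chunks.append(data.get("response", ""))
        if data.get("done"):
            break  # final chunk reached
    return "".join(chunks)

sample = ['{"response": "def two_sum", "done": false}',
          '{"response": "(nums, target): ...", "done": true}']
print(collect_stream(sample))  # -> def two_sum(nums, target): ...
```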

Confidence Thresholds

Quest uses confidence scores to ensure response quality:
| Parameter | Default Value | Purpose |
| --- | --- | --- |
| confidence_threshold | 0.7 | Global threshold for filtering |
| min_confidence | 0.6 | Per-query minimum confidence |
How Confidence Works:
  1. Each retrieved solution carries a similarity score derived from its distance in the embedding space
  2. Solutions below min_confidence are filtered out
  3. If no solutions pass, Quest expands the search automatically
  4. Sorted solutions appear in prompts with confidence scores
for idx, solution in enumerate(sorted_solutions):
    prompt += f"\n[{idx+1}] {solution.title} (Confidence: {solution.score:.2f}):\n{solution_text}\n"
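The loop above interpolates each solution's title, confidence score, and body (solution_text is assumed to hold the solution text) into the prompt. A self-contained version for illustration:

```python
# Runnable version of the prompt listing above (Solution fields assumed).
from types import SimpleNamespace

sorted_solutions = [SimpleNamespace(title="Two Sum", score=0.82,
                                    solution="hash map, one pass")]
prompt = ""
for idx, sol in enumerate(sorted_solutions):
    solution_text = sol.solution
    prompt += f"\n[{idx+1}] {sol.title} (Confidence: {sol.score:.2f}):\n{solution_text}\n"
print(prompt)
```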

Stopping and Resetting Generation

Quest supports canceling ongoing generation:
def stop(self):
    """Stop the ongoing generation process."""
    self.stop_generation = True
    logger.info("Generation process stopped.")

def reset(self):
    """Reset the stop flag to allow new generations."""
    self.stop_generation = False
    logger.info("Generation process reset.")
The full answer_question method ties these pieces together: exact match first, then retrieval with fallback, prompt generation, and inference:
def answer_question(
    self,
    query: str,
    k: int = 5,
    min_confidence: float = 0.6
) -> str:
    """Answer a question using the enhanced RAG engine."""
    try:
        # Reset the stop flag
        self.reset()
        
        # Check for exact match
        normalized_query = self._normalize_title(query)
        if normalized_query in self.exact_match_map:
            exact_match_solution = self.exact_match_map[normalized_query]
            logger.info("Exact match found. Returning solution directly.")
            return f"Exact Match Solution:\n{exact_match_solution.solution}"
        
        # Retrieve relevant context
        retrieved_solutions = self.retriever.search(
            query, k=k, return_scores=True)
        filtered_solutions = [
            sol for sol in retrieved_solutions 
            if hasattr(sol, 'score') and float(sol.score) >= min_confidence
        ]
        
        # Fallback if no solutions meet confidence threshold
        if not filtered_solutions and k < 5:
            return self.answer_question(
                query, k=k + 2, min_confidence=min_confidence - 0.1)
        
        # Generate enhanced prompt
        prompt = self.generate_enhanced_prompt(query, filtered_solutions)
        
        # Get response from Ollama
        response = self.call_ollama(prompt)
        
        # Add to conversation history
        self.conversation_history.add_query(query, response)
        
        # Filter response if in reasoning mode
        if self.mode == "reasoning":
            response = self.filter_reasoning_response(response)
        
        return f"Generated Solution:\n{response}"
    except Exception as e:
        logger.error(f"Failed to answer question: {e}")
        return "An error occurred while generating the response."

Usage Example

from src.DSAAssistant.components.retriever2 import LeetCodeRetriever
from rag_engine import RAGEngine

# Initialize retriever and engine
retriever = LeetCodeRetriever()
rag_engine = RAGEngine(retriever, max_history=3)

# Set mode
rag_engine.set_mode("general")

# Ask a question
answer = rag_engine.answer_question(
    "How do I solve the two sum problem?", 
    k=3
)
print(answer)

Retrieval System

Learn about FAISS indexing and similarity search

Inference Modes

Understand general vs reasoning modes

Memory Buffer

Explore conversation history management