Overview

The RAGEngine class is the central orchestrator of Quest’s on-device RAG system. It combines exact matching, semantic retrieval, and local LLM inference to provide intelligent responses to coding questions.

Core Architecture

The RAG engine operates in three distinct phases:
  1. Exact Matching - Fast hash map lookup for known problem titles
  2. Semantic Retrieval - FAISS-based similarity search for related solutions
  3. LLM Generation - Context-aware response generation using Ollama
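The three phases above form a cascade: each phase runs only if the previous one fails to produce an answer. A minimal sketch of that control flow (function names are illustrative, not the actual Quest API; the real entry point is answer_question, shown later):

```python
# Minimal sketch of the three-phase cascade (illustrative names).
def answer(query, exact_map, retriever, llm, k=5):
    # Phase 1: exact matching via a normalized-title hash map
    key = query.strip().lower()
    if key in exact_map:
        return exact_map[key]
    # Phase 2: semantic retrieval of the k nearest solutions
    context = retriever(query, k)
    # Phase 3: LLM generation conditioned on the retrieved context
    return llm(query, context)
```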

Class Initialization

class RAGEngine:
    def __init__(
        self,
        retriever: LeetCodeRetriever,
        ollama_url: str = "http://localhost:11434/api/generate",
        model_name: str = "qwen2.5-coder:1.5b",
        reasoning_model: str = "deepseek-r1:7b",
        mode: str = "general",
        temperature: float = 0.4,
        top_p: float = 0.9,
        confidence_threshold: float = 0.7,
        repeat_penalty: float = 1.1,
        num_thread: int = 8,
        max_history: int = 3
    )
The RAG engine supports two inference modes: general (fast, using qwen2.5-coder:1.5b) and reasoning (deeper analysis, using deepseek-r1:7b).
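The mode determines which Ollama model serves the request, mirroring the selection logic in call_ollama further down. A minimal sketch (constant names are illustrative):

```python
# Sketch: mode -> model selection (illustrative constant names).
GENERAL_MODEL = "qwen2.5-coder:1.5b"
REASONING_MODEL = "deepseek-r1:7b"

def select_model(mode: str) -> str:
    # "reasoning" routes to the larger model; anything else uses the fast one
    return REASONING_MODEL if mode == "reasoning" else GENERAL_MODEL
```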

Exact Matching System

Quest implements a hash map-based exact matching system for O(1) lookup of known problems.

How It Works

def _build_exact_match_map(self) -> dict:
    """Build a hash map for exact match search."""
    exact_match_map = {}
    for solution in self.retriever.solutions:
        normalized_title = self._normalize_title(solution.title)
        exact_match_map[normalized_title] = solution
    return exact_match_map

def _normalize_title(self, title: str) -> str:
    """Normalize a title for exact match search."""
    return title.strip().lower()
Key Features:
  • Title normalization (lowercase, stripped whitespace)
  • Direct solution retrieval without LLM inference
  • Bypasses retrieval and generation for exact matches
When you know the exact problem title (e.g., “two sum”), Quest returns the solution instantly without using the LLM, saving time and compute resources.
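The exact-match path can be demonstrated end to end; the sketch below assumes a minimal Solution shape with only title and solution fields:

```python
# Runnable sketch of the exact-match path (minimal Solution shape assumed).
from dataclasses import dataclass

@dataclass
class Solution:
    title: str
    solution: str

def normalize_title(title: str) -> str:
    # Same normalization as _normalize_title: strip whitespace, lowercase
    return title.strip().lower()

def build_exact_match_map(solutions):
    # Same shape as _build_exact_match_map: normalized title -> solution
    return {normalize_title(s.title): s for s in solutions}

solutions = [Solution("Two Sum", "use a hash map of value -> index")]
exact = build_exact_match_map(solutions)
print(exact[normalize_title("  Two Sum ")].solution)
# -> use a hash map of value -> index
```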

Retrieval + Generation Pipeline

When exact matching fails, Quest uses a multi-stage retrieval and generation pipeline:

Step 1: Semantic Retrieval

retrieved_solutions = self.retriever.search(
    query, k=k, return_scores=True)
filtered_solutions = [
    sol for sol in retrieved_solutions 
    if hasattr(sol, 'score') and float(sol.score) >= min_confidence
]
The engine retrieves k similar solutions using FAISS and filters them based on a minimum confidence threshold.
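The filtering step can be isolated as a pure function; this sketch assumes retrieved items carry a .score attribute, as in the snippet above:

```python
# Sketch of the confidence filter (retrieved items assumed to carry .score).
from types import SimpleNamespace

def filter_by_confidence(retrieved, min_confidence=0.6):
    return [
        sol for sol in retrieved
        if hasattr(sol, "score") and float(sol.score) >= min_confidence
    ]

hits = [SimpleNamespace(title="Two Sum", score=0.82),
        SimpleNamespace(title="3Sum", score=0.41)]
kept = filter_by_confidence(hits)
print([s.title for s in kept])  # -> ['Two Sum']
```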

Step 2: Confidence-Based Fallback

if not filtered_solutions and k < 5:
    return self.answer_question(
        query, 
        k=k + 2, 
        min_confidence=min_confidence - 0.1
    )
If no solutions meet the confidence threshold, Quest automatically expands the search by:
  • Increasing k by 2 (retrieving more candidates)
  • Reducing min_confidence by 0.1 (relaxing the threshold)
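The expansion step can be written as a small helper (an iterative restatement of the recursive call above; the bound of 5 matches the k < 5 guard in the source):

```python
# Sketch of the fallback parameter expansion (iterative form, illustrative).
def expand_search(k, min_confidence, k_limit=5):
    # Widen the net only while k is below the limit
    if k < k_limit:
        return k + 2, round(min_confidence - 0.1, 2)
    return None  # give up: no further relaxation

print(expand_search(3, 0.6))  # -> (5, 0.5)
```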

Step 3: Prompt Generation

The engine generates a structured prompt incorporating:
  • Conversation history
  • Retrieved solutions with confidence scores
  • Mode-specific instructions (general vs reasoning)
def generate_enhanced_prompt(self, query: str, context: List[Solution]) -> str:
    history_context = self.conversation_history.get_context()
    
    if self.mode == "reasoning":
        base_prompt = PromptTemplates.reasoning_prompt(query, context)
    else:
        base_prompt = PromptTemplates.general_prompt(query, context)
    
    enhanced_prompt = (
        f"Conversation History:\n{history_context}\n\n"
        f"Query: {query}\n\n"
        f"Context: {context}\n\n"
        f"Instruction: {base_prompt}"
    )
    return enhanced_prompt

Step 4: LLM Inference

Quest calls Ollama with the enhanced prompt and streams the response:
def call_ollama(self, prompt: str) -> str:
    model = self.reasoning_model if self.mode == "reasoning" else self.model_name
    
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": self.temperature,
        "top_p": self.top_p,
        "num_thread": self.num_thread,
        "repeat_penalty": self.repeat_penalty,
        "stream": True
    }
    
    response = requests.post(self.ollama_url, json=payload, stream=True)
    # Stream processing...
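Ollama's streaming endpoint emits one JSON object per line, each carrying a response fragment and a done flag. A sketch of the accumulation loop that the elided stream-processing step might perform (the should_stop hook is an assumption, modeled on the stop()/reset() flag described below):

```python
import json

def collect_stream(lines, should_stop=lambda: False):
    # Each streamed line is a JSON object like
    # {"response": "...", "done": false}; concatenate the fragments.
    chunks = []
    for line in lines:
        if should_stop():
            break  # user canceled generation mid-stream
        data = json.loads(line)
        chunks.append(data.get("response", ""))
        if data.get("done"):
            break  # final chunk reached
    return "".join(chunks)

sample = ['{"response": "def two_sum", "done": false}',
          '{"response": "(nums, target): ...", "done": true}']
print(collect_stream(sample))  # -> def two_sum(nums, target): ...
```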

Confidence Thresholds

Quest uses confidence scores to ensure response quality:
| Parameter | Default Value | Purpose |
| --- | --- | --- |
| confidence_threshold | 0.7 | Global threshold for filtering |
| min_confidence | 0.6 | Per-query minimum confidence |
How Confidence Works:
  1. Each retrieved solution carries a similarity score derived from its distance in the embedding space
  2. Solutions below min_confidence are filtered out
  3. If no solutions pass, Quest expands the search automatically
  4. Sorted solutions appear in prompts with confidence scores
for idx, solution in enumerate(sorted_solutions):
    prompt += f"\n[{idx+1}] {solution.title} (Confidence: {solution.score:.2f}):\n{solution_text}\n"
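The loop above interpolates each solution's title, confidence score, and body (solution_text is assumed to hold the solution text) into the prompt. A self-contained version for illustration:

```python
# Runnable version of the prompt listing above (Solution fields assumed).
from types import SimpleNamespace

sorted_solutions = [SimpleNamespace(title="Two Sum", score=0.82,
                                    solution="hash map, one pass")]
prompt = ""
for idx, sol in enumerate(sorted_solutions):
    solution_text = sol.solution
    prompt += f"\n[{idx+1}] {sol.title} (Confidence: {sol.score:.2f}):\n{solution_text}\n"
print(prompt)
```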

Stopping and Resetting Generation

Quest supports canceling ongoing generation:
def stop(self):
    """Stop the ongoing generation process."""
    self.stop_generation = True
    logger.info("Generation process stopped.")

def reset(self):
    """Reset the stop flag to allow new generations."""
    self.stop_generation = False
    logger.info("Generation process reset.")
The full answer_question method ties these pieces together: exact match first, then retrieval with fallback, prompt generation, and inference:
def answer_question(
    self,
    query: str,
    k: int = 5,
    min_confidence: float = 0.6
) -> str:
    """Answer a question using the enhanced RAG engine."""
    try:
        # Reset the stop flag
        self.reset()
        
        # Check for exact match
        normalized_query = self._normalize_title(query)
        if normalized_query in self.exact_match_map:
            exact_match_solution = self.exact_match_map[normalized_query]
            logger.info("Exact match found. Returning solution directly.")
            return f"Exact Match Solution:\n{exact_match_solution.solution}"
        
        # Retrieve relevant context
        retrieved_solutions = self.retriever.search(
            query, k=k, return_scores=True)
        filtered_solutions = [
            sol for sol in retrieved_solutions 
            if hasattr(sol, 'score') and float(sol.score) >= min_confidence
        ]
        
        # Fallback if no solutions meet confidence threshold
        if not filtered_solutions and k < 5:
            return self.answer_question(
                query, k=k + 2, min_confidence=min_confidence - 0.1)
        
        # Generate enhanced prompt
        prompt = self.generate_enhanced_prompt(query, filtered_solutions)
        
        # Get response from Ollama
        response = self.call_ollama(prompt)
        
        # Add to conversation history
        self.conversation_history.add_query(query, response)
        
        # Filter response if in reasoning mode
        if self.mode == "reasoning":
            response = self.filter_reasoning_response(response)
        
        return f"Generated Solution:\n{response}"
    except Exception as e:
        logger.error(f"Failed to answer question: {e}")
        return "An error occurred while generating the response."

Usage Example

from src.DSAAssistant.components.retriever2 import LeetCodeRetriever
from rag_engine import RAGEngine

# Initialize retriever and engine
retriever = LeetCodeRetriever()
rag_engine = RAGEngine(retriever, max_history=3)

# Set mode
rag_engine.set_mode("general")

# Ask a question
answer = rag_engine.answer_question(
    "How do I solve the two sum problem?", 
    k=3
)
print(answer)

Retrieval System

Learn about FAISS indexing and similarity search

Inference Modes

Understand general vs reasoning modes

Memory Buffer

Explore conversation history management