Overview

The RAGEngine class is the core component of Quest that orchestrates the retrieval-augmented generation process. It combines the LeetCodeRetriever for semantic search, a conversation-history buffer for context, and the Ollama API for response generation.

Class Definition

from src.DSAAssistant.components.retriever2 import LeetCodeRetriever
from src.DSAAssistant.components.memory_buffer import ConversationHistory
from src.DSAAssistant.components.prompt_temp import PromptTemplates
from rag_engine import RAGEngine

retriever = LeetCodeRetriever()

rag_engine = RAGEngine(
    retriever=retriever,
    ollama_url="http://localhost:11434/api/generate",
    model_name="qwen2.5-coder:1.5b",
    reasoning_model="deepseek-r1:7b",
    mode="general",
    temperature=0.4,
    top_p=0.9,
    confidence_threshold=0.7,
    repeat_penalty=1.1,
    num_thread=8,
    max_history=3
)

Constructor Parameters

retriever
LeetCodeRetriever
required
Instance of LeetCodeRetriever for semantic search of LeetCode solutions
ollama_url
str
default:"http://localhost:11434/api/generate"
URL endpoint for the Ollama API server
model_name
str
default:"qwen2.5-coder:1.5b"
Default model to use for general mode responses
reasoning_model
str
default:"deepseek-r1:7b"
Model to use for reasoning mode responses with advanced problem-solving
mode
str
default:"general"
Initial mode for the engine. Options: "general" or "reasoning"
temperature
float
default:"0.4"
Controls randomness in response generation. Lower values make output more deterministic
top_p
float
default:"0.9"
Nucleus sampling parameter for response diversity
confidence_threshold
float
default:"0.7"
Minimum confidence score for filtering retrieved solutions
repeat_penalty
float
default:"1.1"
Penalty for repeating tokens in generated responses
num_thread
int
default:"8"
Number of threads for Ollama inference
max_history
int
default:"3"
Maximum number of conversation turns to retain in history

Methods

answer_question

Answer a user query using the enhanced RAG pipeline with retrieval, filtering, and generation.
response = rag_engine.answer_question(
    query="How do I solve the Two Sum problem?",
    k=5,
    min_confidence=0.6
)
query
str
required
The user’s question or problem description
k
int
default:"5"
Number of solutions to retrieve from the vector database
min_confidence
float
default:"0.6"
Minimum confidence score threshold for filtering retrieved solutions. Solutions below this threshold are excluded.
response
str
Generated answer with solution details. Returns either:
  • "Exact Match Solution:\n{solution}" if an exact title match is found
  • "Generated Solution:\n{response}" if generated from retrieved context
  • Error message if generation fails
Behavior:
  • First checks for exact title match using normalized query
  • If no exact match, retrieves top-k similar solutions
  • Filters solutions by confidence threshold
  • If insufficient results, recursively retries with increased k and lowered threshold
  • Adds query and response to conversation history
  • In reasoning mode, filters out <think> blocks from response
Example:
retriever = LeetCodeRetriever()
rag_engine = RAGEngine(retriever, max_history=3)

# Basic query
answer = rag_engine.answer_question(
    "What is the Two Sum problem?"
)
print(answer)
# Output: Generated Solution:
# The Two Sum problem involves finding two numbers...

# Query with custom parameters
answer = rag_engine.answer_question(
    "Dynamic programming coin change",
    k=10,
    min_confidence=0.7
)
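The adaptive retry behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual Quest implementation: the `adaptive_retrieve` helper, the `Solution` stand-in, and the assumption that retrieved objects expose a `confidence` attribute are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical stand-in; the real Solution class lives in the Quest codebase.
Solution = namedtuple("Solution", ["title", "confidence"])

def adaptive_retrieve(retriever, query, k=5, min_confidence=0.6,
                      max_k=20, confidence_floor=0.3):
    """Widen k and relax the threshold until enough results pass the filter."""
    results = [s for s in retriever.search(query, k=k)
               if s.confidence >= min_confidence]
    if not results and k < max_k:
        # Retry with a larger k and a lowered (but floored) threshold,
        # mirroring the recursive retry behavior described above.
        return adaptive_retrieve(retriever, query,
                                 k=min(k * 2, max_k),
                                 min_confidence=max(confidence_floor,
                                                    min_confidence - 0.1),
                                 max_k=max_k,
                                 confidence_floor=confidence_floor)
    return results
```

The floor on the confidence threshold prevents the recursion from accepting arbitrarily weak matches.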

set_mode

Switch between general and reasoning modes. Reasoning mode uses a specialized model for deeper analysis.
rag_engine.set_mode("reasoning")
mode
str
required
Mode to set. Must be either "general" or "reasoning"
Raises:
  • ValueError if mode is not "general" or "reasoning"
Example:
# Switch to reasoning mode for complex problems
rag_engine.set_mode("reasoning")
answer = rag_engine.answer_question("Explain dynamic programming")

# Switch back to general mode
rag_engine.set_mode("general")

stop

Stop the ongoing generation process immediately. The partial response generated so far is returned by answer_question().
rag_engine.stop()
Behavior:
  • Sets internal stop_generation flag to True
  • Current streaming response terminates gracefully
  • Partial response is returned from answer_question()
Example:
import threading

# Run the long-running query in a background thread,
# since answer_question() blocks until generation finishes
result = {}

def run_query():
    result["answer"] = rag_engine.answer_question("Long query...")

worker = threading.Thread(target=run_query)
worker.start()

# From the main thread (e.g., user clicks stop button)
def on_stop_button():
    rag_engine.stop()

reset

Reset the stop flag to allow new generation requests. Automatically called at the start of answer_question().
rag_engine.reset()
Behavior:
  • Sets stop_generation flag to False
  • Enables new generation requests after a previous stop
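The stop/reset pair amounts to a simple flag that the streaming loop checks between chunks. A minimal sketch of that pattern, using threading.Event (the class name `StopFlag` is hypothetical; the real engine may use a plain boolean):

```python
import threading

class StopFlag:
    """Illustrative stop/reset flag built on threading.Event."""

    def __init__(self):
        self._stop = threading.Event()

    def stop(self):
        # Mirrors RAGEngine.stop(): signal the streaming loop to exit
        self._stop.set()

    def reset(self):
        # Mirrors RAGEngine.reset(): allow new generation requests
        self._stop.clear()

    @property
    def stop_generation(self):
        return self._stop.is_set()
```

Using an Event rather than a bare boolean makes the flag safe to set from another thread, which matches the stop-button scenario above.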

generate_enhanced_prompt

Generate a structured prompt incorporating retrieved context and conversation history.
prompt = rag_engine.generate_enhanced_prompt(
    query="How to solve Two Sum?",
    context=retrieved_solutions
)
query
str
required
User’s question or problem description
context
List[Solution]
required
List of Solution objects retrieved from the vector database
prompt
str
Formatted prompt string combining conversation history, query, context, and mode-specific instructions
Behavior:
  • Retrieves conversation history via conversation_history.get_context()
  • Selects prompt template based on current mode (general or reasoning)
  • Combines history, query, context, and instructions into enhanced prompt
Example:
retriever = LeetCodeRetriever()
results = retriever.search("Two Sum", k=3)

prompt = rag_engine.generate_enhanced_prompt(
    query="Explain the Two Sum approach",
    context=results
)
print(prompt)
# Output:
# Conversation History:
# User: Previous query...
# System: Previous response...
#
# Query: Explain the Two Sum approach
# Context: [Solution objects]
# Instruction: [Mode-specific template]
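The prompt assembly described above can be approximated with a small helper. This is a sketch of the structure shown in the example output; the function name `build_enhanced_prompt` and its exact formatting are assumptions, not the engine's actual template.

```python
def build_enhanced_prompt(query, context, history, instruction):
    """Combine history, query, retrieved context, and mode instructions."""
    context_text = "\n".join(str(s) for s in context)
    return (
        f"Conversation History:\n{history}\n\n"
        f"Query: {query}\n"
        f"Context: {context_text}\n"
        f"Instruction: {instruction}"
    )
```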

call_ollama

Send a prompt to the Ollama API with streaming, error handling, and retry logic.
response = rag_engine.call_ollama(
    prompt="Explain dynamic programming"
)
prompt
str
required
Complete prompt to send to the Ollama model
response
str
Generated text response from the model. Returns error messages if API calls fail after retries.
Behavior:
  • Retries up to 3 times on failure
  • Selects model based on current mode (reasoning_model or model_name)
  • Streams response in real-time
  • Respects stop_generation flag for early termination
  • Returns partial response if stopped mid-generation
API Payload:
{
  "model": "qwen2.5-coder:1.5b",
  "prompt": "...",
  "temperature": 0.4,
  "top_p": 0.9,
  "num_thread": 8,
  "repeat_penalty": 1.1,
  "stream": true
}
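Building that payload, including the mode-based model selection, can be sketched as below. The helper name `build_ollama_payload` is hypothetical; the defaults mirror the constructor parameters documented above.

```python
def build_ollama_payload(prompt, mode="general",
                         model_name="qwen2.5-coder:1.5b",
                         reasoning_model="deepseek-r1:7b",
                         temperature=0.4, top_p=0.9,
                         num_thread=8, repeat_penalty=1.1):
    """Assemble the Ollama request body; the model depends on the current mode."""
    model = reasoning_model if mode == "reasoning" else model_name
    return {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "num_thread": num_thread,
        "repeat_penalty": repeat_penalty,
        "stream": True,
    }
```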
Error Handling:
  • Logs warnings for non-200 status codes
  • Retries on exceptions
  • Returns "Error generating response after multiple attempts." after 3 failed attempts

filter_reasoning_response

Filter out the <think> reasoning block from DeepSeek model responses.
filtered = rag_engine.filter_reasoning_response(
    response="<think>reasoning steps</think>Final answer"
)
print(filtered)  # Output: "Final answer"
response
str
required
Raw response from the reasoning model
filtered_response
str
Response with <think> blocks removed. Returns original response if no <think> block found.
Behavior:
  • Detects <think>...</think> blocks in response
  • Extracts and returns content after </think> tag
  • Preserves original response if no thinking block exists
Example:
raw_response = """<think>
Let me analyze this step by step:
1. The problem requires O(n) time
2. We can use a hash map
</think>

Use a hash map to store complements for O(n) lookup."""

filtered = rag_engine.filter_reasoning_response(raw_response)
print(filtered)
# Output: "Use a hash map to store complements for O(n) lookup."
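The filtering behavior can be sketched in a few lines. This is an illustrative stand-in (the name `filter_think_block` is hypothetical), assuming the engine simply returns everything after the closing </think> tag:

```python
def filter_think_block(response: str) -> str:
    """Return text after a closing </think> tag, else the original response."""
    marker = "</think>"
    idx = response.find(marker)
    if idx == -1:
        # No thinking block: preserve the response unchanged
        return response
    return response[idx + len(marker):].strip()
```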

Internal Methods

_build_exact_match_map

Builds a hash map for O(1) exact title lookup. Returns: Dictionary mapping normalized titles to Solution objects

_normalize_title

Normalizes a title for case-insensitive exact matching.
title = rag_engine._normalize_title("Two Sum")
print(title)  # Output: "two sum"
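Together, these two helpers enable the O(1) exact-match lookup used by answer_question(). A minimal sketch, assuming normalization is simply trimming and lowercasing (the actual rules may differ) and using a namedtuple stand-in for the real Solution class:

```python
from collections import namedtuple

# Hypothetical stand-in for the Quest Solution class
Solution = namedtuple("Solution", ["title", "body"])

def normalize_title(title: str) -> str:
    # Lowercase and trim for case-insensitive matching
    return title.strip().lower()

def build_exact_match_map(solutions):
    # Normalized title -> Solution, giving O(1) exact-title lookup
    return {normalize_title(s.title): s for s in solutions}
```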

Attributes

retriever
LeetCodeRetriever
Instance of the retriever for semantic search
conversation_history
ConversationHistory
Manages conversation context with max_history limit
exact_match_map
dict
Hash map for exact title matching (normalized title → Solution)
stop_generation
bool
Flag to control generation termination
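The conversation_history attribute's bounded-buffer behavior can be sketched with a deque. This is an illustrative approximation (the class name `BoundedHistory` and its method names are assumptions, not the actual ConversationHistory API):

```python
from collections import deque

class BoundedHistory:
    """Conversation buffer capped at max_history turns."""

    def __init__(self, max_history=3):
        # deque with maxlen silently drops the oldest turn when full
        self.turns = deque(maxlen=max_history)

    def add(self, query, response):
        self.turns.append((query, response))

    def get_context(self):
        return "\n".join(f"User: {q}\nSystem: {r}" for q, r in self.turns)
```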

Usage Example

from src.DSAAssistant.components.retriever2 import LeetCodeRetriever
from rag_engine import RAGEngine

# Initialize components
retriever = LeetCodeRetriever()
rag_engine = RAGEngine(retriever, max_history=3)

# Set mode
rag_engine.set_mode("general")

# Ask questions
answer = rag_engine.answer_question(
    "How do I solve the Two Sum problem?"
)
print(answer)

# View conversation history
history = rag_engine.conversation_history.get_context()
print(history)

Mode Comparison

General Mode

  • Uses model_name (default: qwen2.5-coder:1.5b)
  • Faster responses
  • Concise answers with code
  • Best for straightforward queries

Reasoning Mode

  • Uses reasoning_model (default: deepseek-r1:7b)
  • Detailed step-by-step analysis
  • Filters <think> blocks
  • Best for complex problem-solving

Error Handling

The engine implements comprehensive error handling:
  • API Failures: Retries up to 3 times with exponential backoff
  • Low Confidence: Automatically adjusts k and min_confidence
  • No Results: Returns helpful error messages
  • Generation Errors: Logs details and returns user-friendly messages
try:
    answer = rag_engine.answer_question("complex query")
except Exception as e:
    print(f"Error: {e}")
    # Engine handles internally and returns error message
