
Overview

Quest’s memory buffer system, implemented through the ConversationHistory class, maintains context across multiple queries. This enables follow-up questions, clarifications, and contextual responses without repeating information.

ConversationHistory Class

Architecture

from typing import Dict, List

class ConversationHistory:
    def __init__(self, max_history: int = 5):
        """
        Initialize the conversation history with a maximum limit.
        :param max_history: Maximum number of queries to retain in history.
        """
        self.max_history = max_history
        self.history: List[Dict[str, str]] = []
Key Features:
  • Bounded memory: Automatically limits history to prevent context overflow
  • Query-response pairs: Stores both user queries and system responses
  • FIFO eviction: Removes oldest entries when limit is reached

Data Structure

Each history entry is a dictionary:
{
    "query": "How do I solve the two sum problem?",
    "response": "Use a hash map to store complements..."
}
The full history is a list:
self.history = [
    {"query": "What is binary search?", "response": "..."},
    {"query": "How do I implement it?", "response": "..."},
    {"query": "What's the time complexity?", "response": "..."}
]

Core Methods

Adding Queries

def add_query(self, query: str, response: str):
    """
    Add a new query and response to the history.
    :param query: The user's query.
    :param response: The system's response.
    """
    self.history.append({"query": query, "response": response})
    if len(self.history) > self.max_history:
        # Remove the oldest query if history exceeds the limit
        self.history.pop(0)
Behavior:
  • Appends new query-response pair to history
  • Automatically removes oldest entry when max_history is exceeded
  • Uses FIFO (First In, First Out) eviction policy
The memory buffer keeps the most recent max_history exchanges available, which helps keep prompts within the model's token limit.
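The eviction behavior can be exercised directly. This sketch reproduces the class from the snippets on this page and shows the oldest entry being dropped once the limit is exceeded:

```python
from typing import Dict, List


class ConversationHistory:
    def __init__(self, max_history: int = 5):
        self.max_history = max_history
        self.history: List[Dict[str, str]] = []

    def add_query(self, query: str, response: str):
        self.history.append({"query": query, "response": response})
        if len(self.history) > self.max_history:
            # FIFO: drop the oldest entry once the limit is exceeded
            self.history.pop(0)


# With max_history=2, adding a third entry evicts the first
memory = ConversationHistory(max_history=2)
memory.add_query("Q1", "R1")
memory.add_query("Q2", "R2")
memory.add_query("Q3", "R3")

print([entry["query"] for entry in memory.history])  # ['Q2', 'Q3']
```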

Retrieving Context

def get_context(self) -> str:
    """
    Generate a context string from the conversation history.
    :return: A formatted context string.
    """
    context = ""
    for entry in self.history:
        context += f"User: {entry['query']}\nSystem: {entry['response']}\n"
    return context.strip()
Output Format:
User: What is the two sum problem?
System: The two sum problem asks you to find two numbers in an array that add up to a target value...
User: How do I optimize it?
System: Use a hash map to achieve O(n) time complexity...
User: What about space complexity?
System: The hash map approach uses O(n) space in the worst case...
This context string is injected into prompts to provide conversation continuity.

Clearing History

def clear(self):
    """Clear the conversation history."""
    self.history = []
Clearing history removes all context, causing the next query to be treated as a fresh conversation.

Integration with RAG Engine

Initialization

The RAG engine creates a conversation history instance:
class RAGEngine:
    def __init__(
        self,
        retriever: LeetCodeRetriever,
        max_history: int = 3,
        # ... other parameters
    ):
        self.conversation_history = ConversationHistory(max_history)

Context Injection in Prompts

Conversation history is included in every prompt:
def generate_enhanced_prompt(self, query: str, context: List[Solution]) -> str:
    """Generate a structured prompt incorporating context and conversation history."""
    # Retrieve the conversation history
    history_context = self.conversation_history.get_context()
    
    # Generate the base prompt based on the mode
    if self.mode == "reasoning":
        base_prompt = PromptTemplates.reasoning_prompt(query, context)
    else:
        base_prompt = PromptTemplates.general_prompt(query, context)
    
    # Enhance the prompt with conversation history
    enhanced_prompt = (
        f"Conversation History:\n{history_context}\n\n"
        f"Query: {query}\n\n"
        f"Context: {context}\n\n"
        f"Instruction: {base_prompt}"
    )
    
    return enhanced_prompt
Prompt Structure:
  1. Conversation History: Previous queries and responses
  2. Current Query: The new question
  3. Retrieved Context: Relevant solutions from FAISS
  4. Instructions: Mode-specific system prompts
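To make the layering concrete, here is a minimal sketch that assembles the four parts from plain strings. The history, context, and base prompt values are illustrative stand-ins; in the real engine the context comes from Solution objects and the base prompt from PromptTemplates:

```python
# Hypothetical stand-ins for the real history, retrieved context, and base prompt
history_context = "User: What is binary search?\nSystem: Binary search halves the search range..."
query = "How do I implement it?"
context = "Problem: Binary Search | Solution: def binary_search(nums, target): ..."
base_prompt = "Answer the query using the retrieved solutions."

# Same string layout as generate_enhanced_prompt above
enhanced_prompt = (
    f"Conversation History:\n{history_context}\n\n"
    f"Query: {query}\n\n"
    f"Context: {context}\n\n"
    f"Instruction: {base_prompt}"
)

print(enhanced_prompt)
```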

Automatic History Updates

After generating a response, the RAG engine updates history:
def answer_question(self, query: str, k: int = 5, min_confidence: float = 0.6) -> str:
    # ... retrieval and generation logic ...
    
    # Get response from Ollama
    response = self.call_ollama(prompt)
    
    # Add the response to the conversation history
    self.conversation_history.add_query(query, response)
    
    # Filter response if in reasoning mode
    if self.mode == "reasoning":
        response = self.filter_reasoning_response(response)
    
    return f"Generated Solution:\n{response}"
History is updated after response generation, ensuring the current query doesn’t appear in its own context. Note that add_query runs before filter_reasoning_response, so in reasoning mode the raw, unfiltered response is what gets stored in history.

Max History Limits

The max_history parameter controls memory usage and context window size:

Choosing max_history

| max_history | Use Case | Context Length | Memory |
| --- | --- | --- | --- |
| 1-2 | Minimal context, fast inference | ~200-400 tokens | Low |
| 3-5 | Balanced (default) | ~600-1000 tokens | Medium |
| 10+ | Long conversations, complex queries | ~2000+ tokens | High |
Quest uses max_history=3 by default, providing enough context for follow-up questions without overwhelming the model.

Token Budget Considerations

Each history entry adds tokens to the prompt:
Query: ~20-50 tokens
Response: ~100-300 tokens
Total per entry: ~120-350 tokens
For max_history=3:
  • Minimum: 360 tokens (3 × 120)
  • Maximum: 1050 tokens (3 × 350)
  • Average: ~600 tokens
If using models with small context windows (e.g., 2048 tokens), limit max_history to avoid exceeding the context limit when combined with retrieved solutions and system prompts.
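A rough budget check can be scripted with the common ~4-characters-per-token heuristic. The helper below is a hypothetical sketch, not part of Quest, and the `reserved` figure is an illustrative allowance for the query, retrieved solutions, and system prompt:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose
    return max(1, len(text) // 4)


def history_token_budget(history, context_window=2048, reserved=1200):
    """Estimate history tokens and whether they fit the remaining budget.

    `reserved` approximates tokens consumed by the current query,
    retrieved solutions, and system instructions (an assumed figure).
    """
    used = sum(
        estimate_tokens(e["query"]) + estimate_tokens(e["response"])
        for e in history
    )
    return used, used <= context_window - reserved


history = [
    {"query": "What is binary search?", "response": "Binary search repeatedly halves..." * 10},
    {"query": "How do I implement it?", "response": "Keep low/high pointers and..." * 10},
]
used, fits = history_token_budget(history)
print(used, fits)
```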

Usage Examples

Basic Conversation Flow

from rag_engine import RAGEngine
from src.DSAAssistant.components.retriever2 import LeetCodeRetriever

# Initialize with max_history=3
retriever = LeetCodeRetriever()
rag_engine = RAGEngine(retriever, max_history=3)
rag_engine.set_mode("general")

# Query 1
answer1 = rag_engine.answer_question("What is the two sum problem?")
print(answer1)

# Query 2 (references previous context)
answer2 = rag_engine.answer_question("How do I optimize it?")
print(answer2)

# Query 3 (builds on previous two)
answer3 = rag_engine.answer_question("What about space complexity?")
print(answer3)

# View conversation history
print("\nConversation History:")
print(rag_engine.conversation_history.get_context())

Checking History Length

# Get current history length
history_len = len(rag_engine.conversation_history.history)
print(f"Current history length: {history_len}")

# Max history setting
max_history = rag_engine.conversation_history.max_history
print(f"Max history: {max_history}")

# Check if history is at capacity
if history_len >= max_history:
    print("History is full - oldest entries will be evicted")

Clearing History Mid-Conversation

# Have a conversation
rag_engine.answer_question("What is binary search?")
rag_engine.answer_question("How do I implement it?")

# Clear history to start fresh
rag_engine.conversation_history.clear()

# Next query has no context
rag_engine.answer_question("What is merge sort?")
# This will be treated as a new conversation

Custom History Limits

# Long conversation mode (10 entries)
rag_engine_long = RAGEngine(retriever, max_history=10)

# Minimal context mode (1 entry)
rag_engine_minimal = RAGEngine(retriever, max_history=1)

# No history (stateless)
rag_engine_stateless = RAGEngine(retriever, max_history=0)

Context Usage Patterns

Follow-up Questions

Query 1:
User: Explain the two sum problem
System: The two sum problem involves finding two numbers in an array that add up to a target...
Query 2 (uses context):
User: Why is a hash map useful for this?
System: [Sees previous query about two sum] A hash map allows O(1) lookups, making the solution efficient...
Query 3 (uses both previous):
User: What if the array has duplicates?
System: [Knows we're discussing two sum with hash map] With duplicates, the hash map approach still works because...

Clarifications

Query 1:
User: How does DFS work?
System: Depth-First Search explores as far as possible along each branch...
Query 2:
User: Can you show me the code?
System: [Knows user asked about DFS] Here's a DFS implementation:
[code]

Multi-turn Problem Solving

Query 1:
User: I need to find the shortest path in a graph
System: You can use BFS for unweighted graphs or Dijkstra's algorithm for weighted graphs...
Query 2:
User: Let's use BFS. How do I implement it?
System: [Remembers shortest path context] For BFS in your shortest path problem:
[code]
Query 3:
User: What's the time complexity?
System: [Knows we're discussing BFS for shortest path] BFS runs in O(V + E) time...

Performance Implications

Memory Usage

# Typical memory per history entry (in characters):
# entry_size = len(query) + len(response) for a real entry
avg_entry = 400  # ~400 characters average

# For max_history=3
total_memory = 3 * avg_entry  # 1200 characters (~1.2 KB)
Conversation history has minimal memory overhead - even with max_history=10, total memory usage is typically under 5 KB.

Inference Impact

Longer context increases inference time:
| max_history | Avg Context Tokens | Inference Time Impact |
| --- | --- | --- |
| 0 | 0 | Baseline |
| 3 | 600 | +10-15% |
| 5 | 1000 | +20-25% |
| 10 | 2000+ | +40-50% |
For latency-sensitive applications, use max_history=3 or lower. For complex problem-solving where context is critical, increase to 5-10.

Edge Cases and Best Practices

Empty History

# On first query, history is empty
context = rag_engine.conversation_history.get_context()
print(repr(context))  # '' (empty string)
The RAG engine handles this gracefully - prompts still work without history.

Very Long Responses

If responses are long (500+ tokens), history fills up quickly:
# Reduce max_history for verbose responses
rag_engine = RAGEngine(retriever, max_history=2)  # Instead of 3
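Another pragmatic mitigation, not part of the class as documented, is truncating verbose responses before they enter history. A hypothetical wrapper might look like:

```python
def add_query_truncated(memory, query: str, response: str, max_chars: int = 600):
    """Store a shortened response so verbose answers don't dominate the context.

    `memory` is any object exposing add_query(query, response), such as
    ConversationHistory. The 600-character cap is an illustrative default.
    """
    if len(response) > max_chars:
        response = response[:max_chars].rstrip() + " [truncated]"
    memory.add_query(query, response)
```

This keeps the eviction policy untouched while bounding the per-entry token cost.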

Stateless Mode

# Set max_history=0 for stateless operation
rag_engine = RAGEngine(retriever, max_history=0)

# Every query is independent
rag_engine.answer_question("Query 1")
rag_engine.answer_question("Query 2")  # No context from Query 1
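With the add_query implementation shown earlier, max_history=0 does yield stateless behavior: each appended entry is immediately evicted because len(history) is 1 > 0. A self-contained check, reproducing the class from this page:

```python
from typing import Dict, List


class ConversationHistory:
    def __init__(self, max_history: int = 5):
        self.max_history = max_history
        self.history: List[Dict[str, str]] = []

    def add_query(self, query: str, response: str):
        self.history.append({"query": query, "response": response})
        if len(self.history) > self.max_history:
            self.history.pop(0)


stateless = ConversationHistory(max_history=0)
stateless.add_query("Query 1", "Response 1")
# The entry was appended, then immediately evicted (len 1 > 0)
print(stateless.history)  # []
```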

Related Pages

  • RAG Engine: how the memory buffer integrates with the RAG pipeline
  • Inference Modes: the memory buffer works with both general and reasoning modes
