Architecture Overview
Flower Engine implements a dual-collection RAG (Retrieval-Augmented Generation) system using ChromaDB for persistent vector storage. The system maintains separate collections for world lore and session memory, enabling context-aware narrative generation with semantic search.
Core Components
RAGManager (engine/rag.py:9-128) orchestrates all vector operations:
- Persistent disk-based ChromaDB client
- SentenceTransformer embeddings (all-MiniLM-L6-v2)
- Separate collections for lore and memory
- HNSW indexing with cosine similarity
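To make the "cosine similarity" part concrete: the HNSW index ranks stored embeddings by how closely their direction matches the query embedding. A minimal pure-Python sketch of the underlying measure (illustrative only — ChromaDB computes this inside its index, and reports it as cosine *distance*, i.e. `1 - similarity`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the measure depends only on direction, not magnitude, it is well suited to comparing sentence embeddings of differing "intensity."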
Initialization
The RAG system initializes on engine startup with automatic directory creation:
```python
from engine.rag import rag_manager

# The module exposes a ready-made singleton, equivalent to:
# rag_manager = RAGManager(db_path="./chroma_db")
```
Embedding Model
Flower uses all-MiniLM-L6-v2 from sentence-transformers:
- 384-dimensional embeddings
- Fast CPU inference (~50ms per query)
- Optimized for semantic similarity tasks
- Installed automatically via requirements.txt:7
```python
self.embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
```
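Chroma treats an embedding function as a callable that maps a list of texts to a list of equal-length vectors. A hypothetical hash-based stub with the same call shape can be useful for offline tests when the real model is unavailable (the values carry no semantic meaning; `stub_embedding_function` is not part of the engine):

```python
import hashlib

def stub_embedding_function(texts: list[str], dim: int = 384) -> list[list[float]]:
    """Deterministic stand-in with the same call shape as a SentenceTransformer
    embedding function: list of strings in, list of 384-dim vectors out."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Repeat the 32-byte digest to fill `dim` slots, scaled to [0, 1]
        vec = [digest[i % len(digest)] / 255.0 for i in range(dim)]
        vectors.append(vec)
    return vectors

embeddings = stub_embedding_function(["The Crimson Peaks"])
print(len(embeddings), len(embeddings[0]))  # 1 384
```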
Collection Architecture
World Lore Collection
Stores static world knowledge chunked for context efficiency:
```python
@property
def collection(self) -> Collection:
    if self._collection is None:
        self._collection = self.client.get_or_create_collection(
            name="world_lore",
            embedding_function=self.embedding_function,
            metadata={"hnsw:space": "cosine"}
        )
    return self._collection
```
Lore Chunking Strategy (engine/main.py:41-59):
- Maximum chunk size: 800 characters
- Line-aware splitting (preserves paragraph integrity)
- Automatic chunking on world asset load
```python
# Lore is split into 800-char chunks during startup
if w.lore:
    chunks = []
    current_chunk = ""
    chunk_size = 800
    for line in w.lore.split('\n'):
        if len(current_chunk) + len(line) > chunk_size and current_chunk:
            chunks.append(current_chunk.strip())
            current_chunk = line + '\n'
        else:
            current_chunk += line + '\n'
    # Flush the final partial chunk so trailing lore is not dropped
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    # Add each chunk to RAG with a unique ID
    for i, chunk in enumerate(chunks):
        rag_manager.add_lore(w.id, f"base_lore_{i}", chunk)
```
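The same strategy can be factored into a standalone helper, which makes the line-aware splitting easy to test in isolation (a sketch; `chunk_lore` is a hypothetical name, not an engine function):

```python
def chunk_lore(text: str, chunk_size: int = 800) -> list[str]:
    """Split lore on line boundaries into chunks of at most ~chunk_size chars.

    A line is never broken in the middle, so a single line longer than
    chunk_size becomes its own oversized chunk.
    """
    chunks = []
    current = ""
    for line in text.split('\n'):
        if len(current) + len(line) > chunk_size and current:
            chunks.append(current.strip())
            current = line + '\n'
        else:
            current += line + '\n'
    if current.strip():  # flush the trailing partial chunk
        chunks.append(current.strip())
    return chunks

lore = "\n".join("x" * 100 for _ in range(20))
print([len(c) for c in chunk_lore(lore)])  # every chunk stays within the 800-char cap
```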
Session Memory Collection
Stores recent conversation exchanges for context continuity:
```python
@property
def memory_collection(self) -> Collection:
    if self._memory_collection is None:
        self._memory_collection = self.client.get_or_create_collection(
            name="session_memory",
            embedding_function=self.embedding_function,
            metadata={"hnsw:space": "cosine"}
        )
    return self._memory_collection
```
Adding Documents
Lore Insertion
World-scoped documents with automatic world_id tagging:
```python
def add_lore(self, world_id: str, lore_id: str, text: str, metadata: Dict[str, Any] = None):
    """Add a document to the lore collection for a specific world."""
    meta = metadata or {}
    meta["world_id"] = world_id  # Ensures world filtering
    self.collection.upsert(
        ids=[f"{world_id}_{lore_id}"],
        documents=[text],
        metadatas=[meta]
    )
```
Usage:
```python
rag_manager.add_lore(
    world_id="crimson_peaks",
    lore_id="mountain_lore_1",
    text="The Crimson Peaks were forged in dragon fire...",
    metadata={"category": "geography"}
)
```
Memory Insertion
Session-scoped conversation pairs stored after each AI response:
```python
def add_memory(self, session_id: str, memory_id: str, text: str):
    """Add a recent exchange to the session memory collection."""
    self.memory_collection.upsert(
        ids=[f"{session_id}_{memory_id}"],
        documents=[text],
        metadatas=[{"session_id": session_id}]
    )
```
Real Implementation (engine/llm.py:244-247):
```python
memory_key = f"{char_id}_{session_id}" if session_id else char_id
rag_manager.add_memory(
    memory_key,
    str(uuid.uuid4()),
    f"User: {prompt}\nAI: {full_content}"
)
```
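Each memory document is a single `User: ...\nAI: ...` string, so an exchange can be split back into its two halves with a simple partition (an illustrative helper, not engine code; it splits on the first `\nAI: ` marker):

```python
def split_exchange(text: str) -> tuple[str, str]:
    """Split a stored 'User: ...\nAI: ...' document into (user, ai) parts."""
    user_part, _, ai_part = text.partition("\nAI: ")
    return user_part.removeprefix("User: "), ai_part

doc = "User: Who rules the peaks?\nAI: The dragon queen."
print(split_exchange(doc))  # ('Who rules the peaks?', 'The dragon queen.')
```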
Querying with Semantic Search
Lore Retrieval
Filtered by world ID with context window protection:
```python
def query_lore(self, world_id: str, query: str, n_results: int = 3, max_chars: int = 1000) -> Tuple[List[str], bool]:
    """Query lore specifically for the given world. Returns (results, context_warning)."""
    try:
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results,
            where={"world_id": world_id}  # World-scoped filter
        )
        if results["documents"] and results["documents"][0]:
            docs = results["documents"][0]
            # Check for context window bloat
            total_chars = sum(len(d) for d in docs)
            context_warning = total_chars > max_chars
            return docs, context_warning
        return [], False
    except Exception as e:
        log.error(f"Error querying lore: {e}")
        return [], False
```
Production Usage (engine/main.py:197-199):
```python
# Retrieve the 2 most relevant lore chunks
lore_list, _ = rag_manager.query_lore(
    state.ACTIVE_WORLD_ID, prompt, n_results=2
)
```
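Note that `context_warning` only reports bloat; trimming is left to the caller. One way a caller could enforce the character budget, assuming Chroma's convention of returning documents best-match first (an illustrative sketch, not engine code):

```python
def trim_to_budget(docs: list[str], max_chars: int = 1000) -> list[str]:
    """Keep the highest-ranked docs (first in the list) until the
    character budget is exhausted; drop the rest."""
    kept, used = [], 0
    for doc in docs:
        if used + len(doc) > max_chars:
            break
        kept.append(doc)
        used += len(doc)
    return kept

docs = ["a" * 600, "b" * 500, "c" * 400]
print([len(d) for d in trim_to_budget(docs, max_chars=1000)])  # [600]
```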
Memory Retrieval
Session-scoped with larger context allowance:
```python
def query_memory(self, session_id: str, query: str, n_results: int = 3, max_chars: int = 1500) -> Tuple[List[str], bool]:
    """Query memory for the given session. Returns (results, context_warning)."""
    try:
        results = self.memory_collection.query(
            query_texts=[query],
            n_results=n_results,
            where={"session_id": session_id}
        )
        if results["documents"] and results["documents"][0]:
            docs = results["documents"][0]
            total_chars = sum(len(d) for d in docs)
            context_warning = total_chars > max_chars
            return docs, context_warning
        return [], False
    except Exception as e:
        log.error(f"Error querying memory: {e}")
        return [], False
```
Production Usage (engine/main.py:200-201):
```python
mem_key = f"{state.ACTIVE_CHARACTER_ID}_{state.ACTIVE_SESSION_ID}"
mem_list, _ = rag_manager.query_memory(mem_key, prompt, n_results=3)
```
Context Integration Pipeline
The RAG system feeds into the LLM prompt construction:
```python
# 1. Query both collections (engine/main.py:197-201)
lore_list, _ = rag_manager.query_lore(state.ACTIVE_WORLD_ID, prompt, n_results=2)
mem_key = f"{state.ACTIVE_CHARACTER_ID}_{state.ACTIVE_SESSION_ID}"
mem_list, _ = rag_manager.query_memory(mem_key, prompt, n_results=3)

# 2. Build context string (engine/main.py:217-219)
full_context = (
    f"--- RECENT MEMORY ---\n{chr(10).join(mem_list)}" if mem_list else ""
)

# 3. Pass to LLM streaming (engine/main.py:223-230)
stream_chat_response(
    websocket,
    prompt,
    full_context,  # Memory injected here
    state.ACTIVE_WORLD_ID,
    state.ACTIVE_CHARACTER_ID,
    state.ACTIVE_SESSION_ID
)
```
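A caller that wanted lore in the prompt as well could assemble both sections in the same labelled style. A minimal sketch (the `--- WORLD LORE ---` header is an assumption for illustration; only `--- RECENT MEMORY ---` appears in the engine code shown here):

```python
def build_context(lore_list: list[str], mem_list: list[str]) -> str:
    """Combine retrieved lore and memory into labelled sections,
    mirroring the '--- RECENT MEMORY ---' format used by the engine."""
    sections = []
    if lore_list:
        # Hypothetical header, not confirmed engine behavior
        sections.append("--- WORLD LORE ---\n" + "\n".join(lore_list))
    if mem_list:
        sections.append("--- RECENT MEMORY ---\n" + "\n".join(mem_list))
    return "\n\n".join(sections)

print(build_context(["The peaks burn red."], ["User: hi\nAI: hello"]))
```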
Memory Management
Session Cleanup
Physical deletion of embeddings when sessions end:
```python
def delete_session_memory(self, session_id: str):
    """Physically delete all vector embeddings for a specific session."""
    try:
        self.memory_collection.delete(where={"session_id": session_id})
    except Exception as e:
        log.error(f"Failed to delete vector memory: {e}")
```
Triggered via the `/session delete <id>` command.
Embedding Speed
- Model Load: ~2 seconds (first query only)
- Query Latency: 30-50ms per search
- Batch Embedding: ~100 docs/second
Storage
- Disk Usage: ~1KB per document + embeddings
- Index Type: HNSW (Hierarchical Navigable Small World)
- Similarity Metric: Cosine distance
Context Limits
- Lore: 1000 chars default (2 chunks × ~500 chars)
- Memory: 1500 chars default (3 chunks × ~500 chars)
- Total RAG Context: ~2500 chars typical
Debugging RAG Queries
Full retrieval logging is enabled in production (engine/main.py:203-214):
```python
if lore_list:
    log.info(f"\n=== RETRIEVED LORE ({len(lore_list)} chunks) ===")
    for i, chunk in enumerate(lore_list):
        log.info(f"[LORE {i+1}]\n{chunk}\n")
    log.info("=== END LORE ===\n")
if mem_list:
    log.info(f"\n=== RETRIEVED MEMORY ({len(mem_list)} chunks) ===")
    for i, chunk in enumerate(mem_list):
        log.info(f"[MEMORY {i+1}]\n{chunk}\n")
    log.info("=== END MEMORY ===\n")
```
Monitor logs to verify semantic matching quality.
Advanced Configuration
Custom Embedding Models
Swap models by modifying engine/rag.py:20:
```python
# Options:
# - "all-MiniLM-L6-v2" (default, 384 dim)
# - "all-mpnet-base-v2" (768 dim, higher quality)
# - "paraphrase-multilingual-MiniLM-L12-v2" (multilingual)
model_name = "all-mpnet-base-v2"
self.embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=model_name
)
```
Changing embedding models requires deleting `chroma_db/` and re-indexing all content, because vectors produced by different models are not comparable.
Database Path
Configure the storage location via config.yaml:6:

```yaml
database_path: "./chroma_db"  # Relative to project root
```

Or pass it directly:

```python
rag_manager = RAGManager(db_path="/custom/path/chroma_db")
```
Collection Inspection
Query collection metadata programmatically:
```python
# Check collection size
count = rag_manager.collection.count()
print(f"Total lore documents: {count}")

# Peek at the first 10 documents (development only)
results = rag_manager.collection.peek(limit=10)
for doc, meta in zip(results["documents"], results["metadatas"]):
    print(f"World: {meta['world_id']}")
    print(f"Content: {doc[:100]}...\n")
```
Best Practices
- Chunk Wisely: 800 chars balances context relevance and granularity
- Filter Aggressively: always scope queries with `where` clauses
- Monitor Context: watch for `context_warning` flags
- Clean Sessions: delete old session memory to reduce index bloat
- Log Retrievals: keep RAG logging enabled during development
Common Issues
"No lore retrieved"
- Verify the world has lore in `assets/worlds/<world>.yaml`
- Check the world ID matches the `world_id` in metadata
- Inspect ChromaDB with `collection.peek()`

"Memory not persisting"
- Ensure `session_id` is consistent across requests
- Memory is added only AFTER the AI response completes
- Check `chroma_db/` directory permissions

"Slow first query"
- SentenceTransformer downloads the model on first use
- Subsequent queries use the cached model
- Pre-warm with a dummy query: `rag_manager.query_lore("test", "test")`