The knowledge base is the foundation of the RAG system. It stores support documentation as semantic embeddings in a Chroma vector database, enabling fast similarity search during answer generation.

Architecture Overview

The DocumentIngestor class orchestrates the offline ingestion pipeline:
  1. Load - Parse documents using Unstructured API
  2. Classify - Assign support category via LLM
  3. Chunk - Split into retrieval-friendly segments
  4. Embed - Generate semantic vectors with OpenAI
  5. Store - Persist in Chroma with metadata
Ingestion is intentionally offline-only and does not run in production request paths.

Initialization

The ingestor requires OpenAI and Unstructured API keys:
from pathlib import Path
from typing import Dict, List

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_unstructured import UnstructuredLoader

# OPENAI_API_KEY and UNSTRUCTURED_API_KEY are loaded from the environment elsewhere.

class DocumentIngestor:
    """
    Offline document ingestion utility for the RAG knowledge base.
    """

    def __init__(
        self,
        collection_name: str = "docs_collection",
        persist_dir: str = "./chroma_db",
        unstructured_api_key: str | None = UNSTRUCTURED_API_KEY,
    ):
        """
        Initialize embeddings and vector store.
        """
        if not unstructured_api_key:
            raise ValueError("UNSTRUCTURED_API_KEY is required")

        if not OPENAI_API_KEY:
            raise ValueError("OPENAI_API_KEY is required")

        self.collection_name = collection_name
        self.persist_dir = persist_dir
        self.unstructured_api_key = unstructured_api_key

        self.embeddings = OpenAIEmbeddings(
            openai_api_key=OPENAI_API_KEY,
            model="text-embedding-3-small",
        )

        self.vectordb = Chroma(
            collection_name=self.collection_name,
            embedding_function=self.embeddings,
            persist_directory=self.persist_dir,
        )
The system uses OpenAI’s text-embedding-3-small model for cost-effective, high-quality embeddings.
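Similarity search ranks stored chunks by how close their embedding vectors are to the query's vector. A minimal pure-Python sketch of cosine similarity, the standard ranking measure (the toy 3-dimensional vectors below are illustrative; real embeddings have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": a chunk pointing the same way as the query scores higher.
query = [1.0, 0.0, 1.0]
chunk_close = [0.9, 0.1, 0.8]
chunk_far = [0.0, 1.0, 0.0]

assert cosine_similarity(query, chunk_close) > cosine_similarity(query, chunk_far)
```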

Phase 1: Document Loading

Documents are parsed using the Unstructured API:
def load_document(self, file_path: str) -> List[Dict[str, str]]:
    """
    Load a document and extract raw elements.
    """
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

    loader = UnstructuredLoader(
        file_path=str(path),
        api_key=self.unstructured_api_key,
        partition_via_api=True,
    )

    docs = loader.load()

    # Predict a single category for the entire document
    full_text = "\n".join(doc.page_content for doc in docs)
    category = predict_document_category(full_text)

    return [
        {
            "element_id": doc.metadata["element_id"],
            "content": doc.page_content,
            "filename": doc.metadata.get("filename", path.name),
            "category": category,
        }
        for doc in docs
    ]
The LLM-based predict_document_category() function assigns a canonical support category to each document.
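The body of predict_document_category() is not shown here. A common pattern is to constrain the LLM's free-text answer to a fixed category set with a safe fallback; the sketch below is a hypothetical illustration of that post-processing step (the category set and normalize_category name are assumptions, not the actual implementation):

```python
# Illustrative category set, not the system's real list.
CANONICAL_CATEGORIES = {"Billing", "Auth", "Shipping", "General"}

def normalize_category(raw_llm_answer: str) -> str:
    """Map a free-text LLM answer onto a canonical category, defaulting to General."""
    cleaned = raw_llm_answer.strip().title()
    return cleaned if cleaned in CANONICAL_CATEGORIES else "General"

# Whitespace and casing from the LLM are tolerated; unknown labels fall back.
assert normalize_category(" billing\n") == "Billing"
assert normalize_category("Refunds") == "General"
```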

Phase 2: Document Chunking

Large documents are split into overlapping chunks for better retrieval:
def chunk_documents(
    self,
    documents: List[Dict[str, str]],
    chunk_size: int = 500,
    chunk_overlap: int = 50,
) -> List[Dict[str, str]]:
    """
    Split documents into overlapping chunks for retrieval.

    Preserves:
        - filename
        - category
        - original element identity
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )

    chunked_docs: List[Dict[str, str]] = []

    for doc in documents:
        for chunk in splitter.split_text(doc["content"]):
            chunked_docs.append(
                {
                    "element_id": doc["element_id"],
                    "content": chunk,
                    "filename": doc["filename"],
                    "category": doc["category"],
                }
            )

    return chunked_docs
Default chunk size is 500 characters with 50-character overlap to preserve context across boundaries.
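The effect of overlap is easiest to see with a simplified sliding-window splitter. This is a stand-in for illustration only; RecursiveCharacterTextSplitter additionally prefers splitting at paragraph and sentence boundaries:

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> list[str]:
    """Naive character-window chunking: each chunk repeats the final
    `chunk_overlap` characters of the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(text)

# Adjacent chunks share a 50-character overlap, so no boundary context is lost.
assert chunks[0][-50:] == chunks[1][:50]
```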

Why Chunking Matters

  • Retrieval Precision: Smaller chunks improve semantic match accuracy
  • Context Control: Overlaps prevent information loss at boundaries
  • LLM Efficiency: Reduces token usage in generation prompts
  • Citation Clarity: Enables precise source attribution

Phase 3: ID Normalization

Element IDs are normalized for cleaner citations:
def normalize_element_ids(
    self, chunked_docs: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """
    Convert original element IDs to sequential, deterministic IDs.

    Why:
    - Cleaner citations
    - Stable references across chunks
    """
    seen: Dict[str, str] = {}
    counter = 1

    for doc in chunked_docs:
        original_id = doc["element_id"]
        if original_id not in seen:
            seen[original_id] = str(counter)
            counter += 1

        doc["element_id"] = seen[original_id]

    return chunked_docs
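Because normalization is pure dictionary bookkeeping, its behavior is easy to demonstrate in isolation. The loop below is the same logic applied to hand-made chunks (the short IDs are illustrative; real Unstructured element IDs are long hex strings):

```python
chunks = [
    {"element_id": "a1b2c3", "content": "first chunk"},
    {"element_id": "a1b2c3", "content": "second chunk, same element"},
    {"element_id": "d4e5f6", "content": "third chunk, new element"},
]

seen: dict[str, str] = {}
counter = 1
for doc in chunks:
    if doc["element_id"] not in seen:
        seen[doc["element_id"]] = str(counter)
        counter += 1
    doc["element_id"] = seen[doc["element_id"]]

# Chunks from the same original element share one sequential ID.
assert [d["element_id"] for d in chunks] == ["1", "1", "2"]
```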

Phase 4: Vector Storage

Chunks are embedded and stored in Chroma:
def store(self, chunked_docs: List[Dict[str, str]]) -> None:
    """
    Persist chunked documents into the vector store.
    """
    texts = [doc["content"] for doc in chunked_docs]
    metadatas = [
        {
            "element_id": doc["element_id"],
            "filename": doc["filename"],
            "category": doc["category"],
        }
        for doc in chunked_docs
    ]

    self.vectordb.add_texts(texts=texts, metadatas=metadatas)
Metadata (filename, category, element_id) is preserved for filtering and citation generation.

Example Usage

Ingesting a folder of markdown documents:
import logging
import os
from glob import glob

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

if __name__ == "__main__":
    ingestor = DocumentIngestor()

    kb_folder = "/path/to/kb_docs"
    md_files = glob(os.path.join(kb_folder, "*.md"))

    total_chunks = 0

    for file_path in md_files:
        logger.info(f"πŸ“„ Ingesting: {file_path}")

        docs = ingestor.load_document(file_path)
        chunked = ingestor.chunk_documents(docs)
        chunked = ingestor.normalize_element_ids(chunked)
        ingestor.store(chunked)

        total_chunks += len(chunked)
        logger.info(f"   β†’ Stored {len(chunked)} chunks")

    logger.info(
        f"\nβœ… Ingested {len(md_files)} documents | {total_chunks} total chunks"
    )

Retrieval at Query Time

The RAG agent performs filtered similarity search:
def retrieve(
    self,
    query: str,
    predicted_category: str,
    k: int = 5,
) -> List[Dict]:
    """
    Retrieve top-K relevant chunks from the vector store.
    """
    filters = {"category": predicted_category}

    results = self.vectordb.similarity_search_with_relevance_scores(
        query,
        k=k,
        filter=filters,
    )

    return [
        {
            "content": doc.page_content,
            "score": score,
            "metadata": doc.metadata,
        }
        for doc, score in results
    ]
Retrieval is filtered by the triage model’s predicted category to ensure domain relevance.
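Because similarity_search_with_relevance_scores returns a score per chunk, callers can also drop weak matches before generation. A hedged sketch over the dictionaries retrieve() returns (the 0.75 threshold and sample results are illustrative assumptions, not tuned values):

```python
def filter_by_score(results: list[dict], min_score: float = 0.75) -> list[dict]:
    """Keep only chunks whose relevance score clears the threshold."""
    return [r for r in results if r["score"] >= min_score]

# Illustrative retrieve() output: one strong match, one weak one.
results = [
    {"content": "How to update a card", "score": 0.91, "metadata": {"category": "Billing"}},
    {"content": "Password reset steps", "score": 0.42, "metadata": {"category": "Auth"}},
]
assert len(filter_by_score(results)) == 1
```

Thresholding like this trades recall for precision; the right cutoff depends on the embedding model and should be tuned empirically.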

Storage Structure

Each stored chunk contains:
  • Text: The actual chunk content
  • Embedding: 1536-dimensional vector from OpenAI
  • Metadata:
    • element_id: Normalized sequential ID
    • filename: Source document name
    • category: Support category (Billing, Auth, etc.)
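Concretely, the metadata attached to one stored chunk is a flat dictionary shaped like this (the values are illustrative):

```python
# Example metadata record for a single chunk, as passed to Chroma.
sample_metadata = {
    "element_id": "3",             # normalized sequential ID
    "filename": "billing_faq.md",  # source document name
    "category": "Billing",         # support category used for filtered retrieval
}
assert set(sample_metadata) == {"element_id", "filename", "category"}
```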

See also:
  • RAG Pipeline: how retrieved chunks are used in generation
  • Structured Outputs: how citations reference these chunks
