## Prerequisites
Before ingesting documents, ensure you have:

- `OPENAI_API_KEY` set in your `.env` file
- `UNSTRUCTURED_API_KEY` set in your `.env` file
- Documents in Markdown (`.md`) format
## CLI Method (Batch Ingestion)
The CLI method processes all Markdown files in the `kb_docs/` folder in a single operation. Run the ingestion command, which will:
- Load each document using the Unstructured API
- Classify the document into a support category
- Split content into chunks (default: 500 characters with 50 character overlap)
- Store embeddings and metadata in Chroma
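The exact command depends on how the project exposes its CLI; the invocation below is a placeholder assumption based on the module path `src/rag/ingest.py` mentioned later in this document:

```shell
# Hypothetical entry point -- substitute the project's actual CLI command
python -m src.rag.ingest
```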
## API Method (Single File)
The API method allows you to ingest one document at a time via an HTTP endpoint. Use curl or any HTTP client to POST to the `/ingest` endpoint.

Note: The filepath must be an absolute path to the document.
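A sketch of such a request, assuming the server runs locally on port 8000 and accepts a JSON body with a `filepath` field (both assumptions; check the project's API for the actual host, port, and field names):

```shell
# Assumed host/port and JSON field name -- adjust to your deployment
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"filepath": "/absolute/path/to/kb_docs/example.md"}'
```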
## How Ingestion Works
The ingestion pipeline follows these steps (see `src/rag/ingest.py:32`):
- Load Document: Uses Unstructured API to parse the Markdown file
- Classify Category: Predicts a support category for the entire document
- Chunk Content: Splits text using `RecursiveCharacterTextSplitter`
- Normalize IDs: Assigns sequential element IDs for stable references
- Store in Chroma: Persists chunks with metadata (filename, category, element_id)
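The chunking, ID-normalization, and metadata steps above can be sketched in pure Python. This is an illustration, not the project's code: the real pipeline uses LangChain's `RecursiveCharacterTextSplitter`, which behaves like this sliding window but prefers to break on natural boundaries such as paragraphs, and the record field names below are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Naive sliding-window splitter: each chunk starts
    (chunk_size - chunk_overlap) characters after the previous one,
    so consecutive chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Assign sequential element IDs and attach metadata, mirroring the
# "Normalize IDs" and "Store in Chroma" steps (field names illustrative)
doc_text = "Refunds are processed within 5 business days. " * 30
records = [
    {
        "element_id": i,
        "text": chunk,
        "metadata": {"filename": "refunds.md", "category": "billing"},
    }
    for i, chunk in enumerate(chunk_text(doc_text))
]
```

Each record carries the chunk text plus the metadata fields listed above, so retrieval results can be traced back to their source file and position.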
## Configuration Options
You can customize chunking behavior by modifying the `chunk_documents` method parameters (see `src/rag/ingest.py:126`):

- `chunk_size`: Maximum characters per chunk (default: 500)
- `chunk_overlap`: Overlapping characters between chunks (default: 50)
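To see how these parameters trade off, here is a back-of-the-envelope chunk count for a naive sliding window (the real splitter may produce fewer chunks because it breaks on natural boundaries):

```python
import math

def num_chunks(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    # Each new chunk advances by (chunk_size - chunk_overlap) characters
    step = chunk_size - chunk_overlap
    return math.ceil(text_len / step)

print(num_chunks(10_000, 500, 50))    # defaults -> 23 chunks
print(num_chunks(10_000, 1000, 100))  # larger chunks -> 12 chunks
```

Larger chunks keep more context together per embedding but reduce retrieval granularity; larger overlap reduces the chance of splitting an answer across a chunk boundary at the cost of some duplicated storage.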
## Storage Location
Ingested documents are stored in the `./chroma_db/` directory with collection name `docs_collection`.
## Troubleshooting
### FileNotFoundError: File not found

Ensure the file path is correct and the file exists. For API ingestion, use absolute paths.
### UNSTRUCTURED_API_KEY is required

Add your Unstructured API key to the `.env` file.

### OPENAI_API_KEY is required

Add your OpenAI API key to the `.env` file; it is used for embeddings.
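A minimal `.env` sketch covering both keys (placeholder values shown; substitute your real keys):

```shell
# .env -- placeholder values only
OPENAI_API_KEY=your-openai-key-here
UNSTRUCTURED_API_KEY=your-unstructured-key-here
```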