Overview
RCLI’s RAG system combines vector search (HNSW) and BM25 full-text search for hybrid retrieval. Ingest documents, build an index, then query with LLM-generated responses grounded in your data.
Key features:
- Hybrid retrieval: Vector embeddings + BM25 keyword search
- Fast indexing: 32-batch embedding with progress callbacks
- Low latency: ~4ms retrieval over 5K+ chunks
- LRU embedding cache: 99.9% hit rate
- Supports: .txt, .md, .pdf, .docx, .html
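To make the LRU embedding cache in the feature list concrete, here is a minimal Python sketch of the general technique. The class name, the `embed_fn` callback, and the default capacity are all hypothetical; this is not RCLI's implementation, only an illustration of how repeated texts can skip the embedding step.

```python
from collections import OrderedDict
from typing import Callable, List


class LRUEmbeddingCache:
    """Hypothetical LRU cache keyed by input text (not RCLI's actual code)."""

    def __init__(self, embed_fn: Callable[[str], List[float]], capacity: int = 1024):
        self.embed_fn = embed_fn      # assumed embedding callback
        self.capacity = capacity      # assumed size limit
        self.entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[float]:
        if text in self.entries:
            self.entries.move_to_end(text)    # mark as most recently used
            self.hits += 1
            return self.entries[text]
        self.misses += 1
        vector = self.embed_fn(text)          # cache miss: compute the embedding
        self.entries[text] = vector
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return vector
```

The hit rate is `hits / (hits + misses)`; how close a real workload gets to the figure quoted above depends on how often the same text is embedded again.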
Workflow
- Ingest: rcli_rag_ingest() - Process documents and build index
- Load: rcli_rag_load_index() - Load existing index at startup
- Query: rcli_rag_query() - Retrieve context + LLM response
- Clear: rcli_rag_clear() - Unload index from memory
rcli_rag_ingest
Ingest documents from a directory and build a RAG index.
Parameters:
- Engine handle (must be initialized)
- Path to directory containing documents. Scans recursively. Supported formats: .txt, .md, .pdf, .docx, .html
Returns:
- 0: Ingestion succeeded
- -1: Failed (missing embedding model, invalid path, etc.)
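As a rough illustration of the recursive scan described above, the following Python sketch collects files with the supported extensions. The function name `find_documents` is hypothetical, and the case-insensitive extension match is an assumption; the source does not say how RCLI treats `.MD` or `.TXT`.

```python
from pathlib import Path
from typing import List

# Extensions listed in this section as supported formats.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".pdf", ".docx", ".html"}


def find_documents(root: str) -> List[Path]:
    """Hypothetical helper: recursively list files an ingest step might pick up."""
    base = Path(root)
    if not base.is_dir():
        raise NotADirectoryError(root)
    return sorted(
        path
        for path in base.rglob("*")  # rglob walks all subdirectories
        # Assumption: extensions are compared case-insensitively.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Running a check like this before ingestion is a cheap way to confirm the directory contains anything the indexer can read.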
Example
Ingestion automatically loads the index for querying after building.
Index Location
By default, the index is saved to:
- macOS: ~/Library/RCLI/index/
- Fallback: /tmp/rcli_index/
The index directory contains:
- vector.usearch - HNSW vector index
- bm25.json - BM25 term frequencies
- chunks.json - Document chunks with metadata
Progress Callback
The implementation reports progress on stderr.
Document Processing
- Chunking: Documents split into 512-token chunks with 50-token overlap
- Embedding: Snowflake Arctic Embed S (384-dim)
- Batch size: 32 chunks per embedding batch
- Metadata: Filename, chunk index, token count preserved
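The chunking and batching numbers above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not RCLI's code: the function names are hypothetical, and the input is an already-tokenized sequence, since the source does not say which tokenizer produces the 512-token count.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split tokens into fixed-size chunks where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts 462 tokens after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


def batches(items: Sequence, batch_size: int = 32) -> Iterator[Sequence]:
    """Yield consecutive groups of at most `batch_size` items for embedding."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With these defaults a 1000-token document yields three chunks (512, 512, and 76 tokens), and the last 50 tokens of each chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable from one chunk.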
rcli_rag_load_index
Load a previously built RAG index for querying.
Parameters:
- Engine handle (must be initialized)
- Path to directory containing the RAG index files (vector.usearch, bm25.json, chunks.json)
Returns:
- 0: Index loaded successfully
- -1: Failed (missing files, corrupted index, etc.)
Example: Startup Loading
Call this once at startup, not per query. The index stays loaded until rcli_rag_clear() or rcli_destroy() is called.
Requirements
The embedding model must be present:
- models/snowflake-arctic-embed-s-q8_0.gguf (34 MB)
- Downloaded via rcli setup or scripts/download_models.sh
rcli_rag_query
Query the RAG system: retrieve relevant chunks and generate an LLM response.
Parameters:
- Engine handle (must have a loaded RAG index)
- User query text
Returns:
- LLM response grounded in retrieved context. Empty string if RAG is not loaded. Do not free the returned string; it is owned by the engine.
Example
How It Works
- Embed query: Convert query text to 384-dim vector (~5ms)
- Hybrid retrieval:
  - Vector search: HNSW nearest neighbors
  - BM25 search: Keyword matching
  - Fusion: Reciprocal Rank Fusion (RRF) to merge results
- Retrieve top-5 chunks (~4ms total)
- Build context: Concatenate retrieved chunks
- LLM generation: Generate response with context in system prompt
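The fusion step in the pipeline above can be sketched as follows. This is a generic Reciprocal Rank Fusion in Python, not RCLI's code: the function name is hypothetical, and the constant `k = 60` is the value commonly used for RRF, assumed here because the source does not state which constant RCLI uses.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 5) -> List[str]:
    """Merge ranked lists of chunk ids; each list contributes 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_k]
```

RRF uses only rank positions, so the cosine similarities from the vector search and the BM25 scores never need to be put on a common scale. A chunk ranked well by both retrievers outranks one that only a single retriever found.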
Performance
| Operation | Latency | Notes |
|---|---|---|
| Query embedding | ~5ms | Cached for repeated queries |
| Retrieval (5 chunks) | ~4ms | HNSW + BM25 + fusion |
| LLM generation | ~500ms | Depends on response length |
| Total | ~510ms | End-to-end RAG query |
Retrieval Parameters
Current implementation:
- Top-K: 5 chunks
- Vector weight: 0.5
- BM25 weight: 0.5
- Max chunk tokens: 512
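The source does not say how the 0.5 / 0.5 weights enter the fusion step. One plausible reading, sketched here in Python purely as an assumption, is a weighted variant of Reciprocal Rank Fusion in which each retriever's contribution is scaled by its weight. `RetrievalParams`, `weighted_rrf`, and `rrf_k = 60` are hypothetical names and values, not part of RCLI.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class RetrievalParams:
    top_k: int = 5              # from the parameter list above
    vector_weight: float = 0.5  # from the parameter list above
    bm25_weight: float = 0.5    # from the parameter list above
    rrf_k: int = 60             # assumption: not stated in the source


def weighted_rrf(vector_ranking: Sequence[str], bm25_ranking: Sequence[str],
                 params: RetrievalParams = RetrievalParams()) -> List[str]:
    """Assumed scheme: each retriever adds weight / (rrf_k + rank) to a chunk's score."""
    scores: Dict[str, float] = {}
    weighted_lists = ((params.vector_weight, vector_ranking), (params.bm25_weight, bm25_ranking))
    for weight, ranking in weighted_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (params.rrf_k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:params.top_k]
```

Under this reading, equal weights give the same ordering as unweighted RRF, while raising one weight lets that retriever dominate. Setting the BM25 weight to 0 reduces the result to the vector ranking, with BM25-only chunks trailing at score 0.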
RAG queries do NOT update conversation history. Use rcli_process_command() for multi-turn conversations.
rcli_rag_clear
Clear the RAG index from memory (unload retriever + embeddings).
Parameters:
- Engine handle
Example
After calling rcli_rag_clear(), queries revert to plain LLM mode. Reload the index with rcli_rag_load_index() to re-enable RAG.
Complete Example: RAG CLI
Advanced: Custom Index Path
Store the index path in the engine during ingestion.
Troubleshooting
"Embedding model not found"
"Failed to load RAG index"
Check that the index directory contains:
- vector.usearch
- bm25.json
- chunks.json
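A quick way to run this check from a script is sketched below in Python. The helper name `missing_index_files` is hypothetical and not part of RCLI; it only tests that the three files exist, not that their contents are valid.

```python
from pathlib import Path
from typing import List

# File names taken from the index layout described in this document.
REQUIRED_INDEX_FILES = ("vector.usearch", "bm25.json", "chunks.json")


def missing_index_files(index_dir: str) -> List[str]:
    """Return the required index files that are absent from `index_dir`."""
    base = Path(index_dir).expanduser()  # allows paths such as ~/Library/RCLI/index/
    return [name for name in REQUIRED_INDEX_FILES if not (base / name).is_file()]
```

An empty list means all three files are present. If the load still fails, the index may be corrupted, and re-running ingestion is the likely fix.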
"No results from RAG query"
- Check if index is loaded: Re-run rcli_rag_load_index()
- Verify documents were ingested: Check index file timestamps
- Try broader query terms
See Also
- Benchmarks - RAG performance testing
- State Management - Check RAG readiness
- RCLI CLI: rcli rag ingest <dir> and rcli rag query <text>