
Overview

RCLI’s RAG (Retrieval-Augmented Generation) system allows you to index local documents and query them using natural language. The system combines vector search with BM25 full-text search using Reciprocal Rank Fusion for optimal retrieval accuracy.

  • Hybrid Search: Vector + BM25 + RRF fusion
  • 4 ms Retrieval: Near-instant search over 5K+ chunks
  • On-Device: 100% local, no external API calls

Supported File Types

RCLI can ingest and index the following document formats:
  • PDF - Portable Document Format
  • DOCX - Microsoft Word documents
  • TXT - Plain text files
  • MD - Markdown files
All document processing happens locally using native parsers. No cloud services or OCR required.

How It Works

Step 1: Document Ingestion

Run rcli rag ingest <directory> to index documents:
rcli rag ingest ~/Documents/notes
RCLI recursively scans the directory, extracts text, and splits it into 512-token chunks with 50-token overlap.
Step 2: Embedding Generation

Each chunk is embedded using Snowflake Arctic Embed S (34 MB, 384 dimensions):
// From src/engines/embedding_engine.cpp
std::vector<float> embed(const std::string& text) {
    // Tokenize text
    auto tokens = tokenize(text);
    // Run through embedding model
    return llama_embed(tokens);
}
Step 3: Index Building

Three indices are built:
  • Vector Index: USearch HNSW for semantic search
  • BM25 Index: Inverted index for keyword search
  • Chunk Store: mmap’d binary file for fast text retrieval
Step 4: Query Processing

When you query, RCLI:
  1. Embeds your query
  2. Searches vector index for top-k semantic matches
  3. Searches BM25 index for top-k keyword matches
  4. Fuses results using Reciprocal Rank Fusion (RRF)
  5. Retrieves full text of top chunks
  6. Passes context to LLM for answer generation
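The first five steps can be sketched as a single toy function. The two callbacks below stand in for the vector and BM25 indices, and every name here is illustrative rather than RCLI's actual API:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using RankedIds = std::vector<uint32_t>;  // chunk IDs in rank order

// Toy end-to-end sketch of steps 1-5 above (illustrative, not RCLI's API).
RankedIds hybrid_query(
    const std::string& query,
    const std::function<RankedIds(const std::string&)>& vector_search,
    const std::function<RankedIds(const std::string&)>& bm25_search,
    size_t top_k, float rrf_k = 60.0f) {
    RankedIds vec_hits = vector_search(query);   // steps 1-2: embed + ANN search
    RankedIds bm25_hits = bm25_search(query);    // step 3: keyword search

    std::unordered_map<uint32_t, float> scores;  // step 4: RRF fusion
    for (size_t i = 0; i < vec_hits.size(); ++i)
        scores[vec_hits[i]] += 1.0f / (rrf_k + i + 1);
    for (size_t i = 0; i < bm25_hits.size(); ++i)
        scores[bm25_hits[i]] += 1.0f / (rrf_k + i + 1);

    RankedIds fused;
    for (const auto& [id, score] : scores) fused.push_back(id);
    std::sort(fused.begin(), fused.end(),
              [&](uint32_t a, uint32_t b) { return scores[a] > scores[b]; });
    if (fused.size() > top_k) fused.resize(top_k);
    return fused;  // step 5: IDs of the chunks whose text to retrieve
}
```

A chunk returned by both searches accumulates two reciprocal-rank terms, which is why it outranks a chunk that tops only one list.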

Hybrid Retrieval

RCLI combines two complementary search methods:

Vector Search (Semantic)

  • Algorithm: HNSW (Hierarchical Navigable Small World)
  • Library: USearch v2.16.5
  • Metric: Cosine similarity
  • Candidates: 10 (configurable)
Strengths: Captures semantic meaning, handles synonyms and paraphrasing
// From src/rag/vector_index.cpp:50-75
std::vector<SearchResult> search(const float* query_vec, int k) {
    auto results = usearch_index_.search(query_vec, k);
    return convert_to_search_results(results);
}

BM25 (Keyword)

  • Algorithm: Best Matching 25 (BM25)
  • Parameters: k1=1.5, b=0.75 (standard values)
  • Candidates: 10 (configurable)
Strengths: Exact keyword matching, acronyms, technical terms
// From src/rag/bm25_index.cpp:100-145
std::vector<SearchResult> search(const std::string& query, int k) {
    auto tokens = tokenize(query);
    std::vector<float> scores(num_docs, 0.0f);
    
    for (const auto& token : tokens) {
        // TF-IDF scoring with BM25 formula
        for (auto [doc_id, tf] : inverted_index_[token]) {
            float idf = log((N - df[token] + 0.5) / (df[token] + 0.5));
            scores[doc_id] += idf * (tf * (k1 + 1)) / 
                             (tf + k1 * (1 - b + b * doc_len[doc_id] / avg_doc_len));
        }
    }
    
    return top_k(scores, k);
}

Reciprocal Rank Fusion (RRF)

Combines vector and BM25 results using rank-based scoring:
// From src/rag/hybrid_retriever.cpp:150-200
std::vector<SearchResult> fuse_results(
    const std::vector<SearchResult>& vector_results,
    const std::vector<SearchResult>& bm25_results,
    float rrf_k = 60.0f
) {
    std::unordered_map<uint32_t, float> scores;
    
    // Score from vector search
    for (size_t i = 0; i < vector_results.size(); i++) {
        scores[vector_results[i].chunk_id] += 1.0f / (rrf_k + i + 1);
    }
    
    // Score from BM25 search
    for (size_t i = 0; i < bm25_results.size(); i++) {
        scores[bm25_results[i].chunk_id] += 1.0f / (rrf_k + i + 1);
    }
    
    // Sort by fused score
    return sort_by_score(scores);
}
RRF gives higher weight to documents that appear in both result sets, improving precision. For example, with rrf_k = 60, a chunk ranked 1st by vector search and 3rd by BM25 scores 1/61 + 1/63 ≈ 0.032, while a chunk ranked 1st in only one list scores 1/61 ≈ 0.016.

Index Structure

The RAG index is stored in ~/Library/RCLI/index/ with the following files:
~/Library/RCLI/index/
├── vectors.usearch       # HNSW vector index (USearch binary)
├── bm25.bin              # BM25 inverted index
├── chunks.bin            # mmap'd chunk text (fast retrieval)
├── chunks_meta.bin       # Chunk metadata (offsets, lengths)
└── embeddings.bin        # Cached embeddings (optional)

Chunk Store (mmap)

Chunk text is stored in a memory-mapped binary file for zero-copy retrieval:
// From src/rag/hybrid_retriever.cpp:26-52
bool load_chunk_store(const std::string& path) {
    chunk_store_fd_ = open(path.c_str(), O_RDONLY);
    if (chunk_store_fd_ < 0) return false;

    struct stat st;
    fstat(chunk_store_fd_, &st);
    chunk_store_size_ = st.st_size;

    chunk_store_ = static_cast<char*>(
        mmap(nullptr, chunk_store_size_, PROT_READ, 
             MAP_PRIVATE, chunk_store_fd_, 0));
    if (chunk_store_ == MAP_FAILED) return false;

    // Advise sequential access for prefetching
    madvise(chunk_store_, chunk_store_size_, MADV_SEQUENTIAL);
    
    return true;
}
Using mmap() allows the OS to page in chunk text on-demand without loading the entire store into RAM.
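As a standalone illustration of the zero-copy idea (not RCLI's code), the sketch below maps a file read-only and returns std::string_view slices into the mapping. In RCLI, the per-chunk offsets and lengths would come from chunks_meta.bin:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only; returns the base pointer, or nullptr on failure.
// The mapping outlives the fd; munmap() cleanup is omitted in this sketch.
const char* map_readonly(const char* path, size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    if (p == MAP_FAILED) return nullptr;
    *size_out = static_cast<size_t>(st.st_size);
    return static_cast<const char*>(p);
}

// Zero-copy chunk lookup: a view into the mapped region, no allocation.
std::string_view chunk_text(const char* store, size_t offset, size_t length) {
    return std::string_view(store + offset, length);
}
```

Because the result is a view, retrieving a chunk never copies its bytes; the OS pages them in the first time they are touched.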

Document Processor

The document processor extracts text from various formats:
For PDFs, it shells out to pdftotext (from poppler-utils) to extract plain text:
// From src/rag/document_processor.cpp:50-75
std::string extract_pdf(const std::string& path) {
    std::string cmd = "pdftotext -enc UTF-8 " + 
                     shell_quote(path) + " -";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return "";
    std::string text = read_all(pipe);
    pclose(pipe);  // reap the child process
    return text;
}

Chunking Strategy

  • Chunk Size: 512 tokens (~2000 characters)
  • Overlap: 50 tokens (~200 characters)
  • Preserves: Sentence boundaries (uses SentenceDetector)
// From src/rag/document_processor.cpp:120-180
std::vector<Chunk> chunk_text(const std::string& text) {
    std::vector<Chunk> chunks;
    size_t start = 0;
    
    while (start < text.size()) {
        size_t end = start + 2000;  // ~512 tokens
        
        // Extend to sentence boundary
        if (end < text.size()) {
            while (end < text.size() && text[end] != '.' && 
                   text[end] != '!' && text[end] != '?') {
                end++;
            }
            end++;  // Include punctuation
        }
        
        chunks.push_back({start, end, text.substr(start, end - start)});
        start = end - 200;  // 50-token overlap
    }
    
    return chunks;
}

Performance

  • Retrieval: 3.82 ms hybrid search
  • Embedding: 12 ms per chunk (384-dim)
  • Indexing: ~200 docs/sec ingestion

Embedding Cache

RCLI caches embeddings with LRU eviction:
// From src/engines/embedding_engine.cpp:100-130
class EmbeddingCache {
    std::unordered_map<std::string, std::vector<float>> cache_;
    size_t max_size_ = 1000;
    
public:
    std::optional<std::vector<float>> get(const std::string& text) {
        auto it = cache_.find(text);
        if (it != cache_.end()) {
            return it->second;  // Cache hit
        }
        return std::nullopt;    // Cache miss
    }
};
Hit Rate: ~99.9% for repeated queries
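The excerpt above shows only the lookup path. For completeness, here is a minimal sketch of LRU eviction on insert, using the classic list-plus-map layout; it is illustrative and not necessarily how RCLI's EmbeddingCache is implemented:

```cpp
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal LRU embedding cache sketch (illustrative only). A doubly linked
// list keeps entries in recency order; the map points at list nodes.
class LruEmbeddingCache {
    using Entry = std::pair<std::string, std::vector<float>>;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    size_t max_size_;

public:
    explicit LruEmbeddingCache(size_t max_size) : max_size_(max_size) {}

    std::optional<std::vector<float>> get(const std::string& text) {
        auto it = index_.find(text);
        if (it == index_.end()) return std::nullopt;        // cache miss
        order_.splice(order_.begin(), order_, it->second);  // mark as MRU
        return it->second->second;                          // cache hit
    }

    void put(const std::string& text, std::vector<float> embedding) {
        auto it = index_.find(text);
        if (it != index_.end()) {                           // refresh existing
            it->second->second = std::move(embedding);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (index_.size() >= max_size_) {                   // evict LRU entry
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(text, std::move(embedding));
        index_[text] = order_.begin();
    }

    size_t size() const { return index_.size(); }
};
```

Both get and put are O(1): the splice moves a node without copying, and eviction pops the list's tail.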

Usage Examples

Index Documents

# Index a directory recursively
rcli rag ingest ~/Documents/notes

# Index specific files
rcli rag ingest ~/Research/paper.pdf

# Check index status
rcli rag status

Query Documents

# Query via CLI
rcli rag query "What were the key decisions in the last meeting?"

# Query with text command (uses active index)
rcli ask --rag ~/Library/RCLI/index "Summarize the project timeline"

Drag-and-Drop Indexing (TUI)

In the TUI, drag a file or folder from Finder into the terminal window. RCLI automatically:
  1. Detects the drop event
  2. Indexes the file/folder
  3. Loads the index for immediate querying
This is the fastest way to query a new document: just drag, drop, and ask.

Configuration

RAG parameters can be tuned via environment variables or config:
export RCLI_RAG_VECTOR_CANDIDATES=10     # Top-k from vector search
export RCLI_RAG_BM25_CANDIDATES=10       # Top-k from BM25
export RCLI_RAG_RRF_K=60.0               # RRF fusion parameter
export RCLI_RAG_CHUNK_SIZE=512           # Tokens per chunk
export RCLI_RAG_CHUNK_OVERLAP=50         # Overlap tokens
  • vector_candidates: Increase for better recall, decrease for speed
  • bm25_candidates: Increase for more keyword matches
  • rrf_k: Higher values (e.g., 100) give more weight to top-ranked results
  • chunk_size: Smaller chunks improve precision, larger improve context
  • chunk_overlap: Higher overlap prevents splitting sentences
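Reading such a tunable in C++ boils down to a getenv-with-default helper; the function below is a hypothetical sketch, not RCLI's actual config code:

```cpp
#include <cstdlib>

// Hypothetical helper: read an integer tunable from the environment,
// falling back to a compiled-in default when the variable is unset.
int env_int(const char* name, int fallback) {
    const char* value = std::getenv(name);
    return value ? std::atoi(value) : fallback;
}
```

For example, env_int("RCLI_RAG_CHUNK_SIZE", 512) would honor the export above and otherwise keep the documented default.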

Benchmarking

Test RAG performance with:
# Benchmark RAG retrieval
rcli bench --suite rag

# Sample output:
# === RAG Benchmark ===
# Documents indexed: 5,432
# Total chunks: 12,876
# Vector search: 2.1 ms (avg over 100 runs)
# BM25 search: 1.5 ms
# RRF fusion: 0.22 ms
# Total retrieval: 3.82 ms

Troubleshooting

PDF extraction fails

Install poppler-utils:
brew install poppler

Out of memory during indexing

Large document collections may exceed available RAM. Try:
  • Index in smaller batches
  • Increase chunk size to reduce total chunks
  • Close other applications

Poor answer quality

Tune retrieval parameters:
  • Increase vector_candidates and bm25_candidates to 20
  • Adjust rrf_k to favor top-ranked results
  • Re-index with smaller chunk size for better precision

Slow index loading

The vector index is mmap’d, but the initial load still reads metadata. For large indices (>100K chunks), expect a 500 ms to 2 s load time.

Next Steps

  • RAG Commands: Complete command reference for ingestion and querying
  • RAG API: Embed RAG in your own applications
  • Configuration: Tune RAG parameters for your use case
  • Performance: Understand retrieval benchmarks
