
Overview

RCLI’s RAG system combines vector search (HNSW) and BM25 full-text search for hybrid retrieval. Ingest documents, build an index, then query with LLM-generated responses grounded in your data. Key features:
  • Hybrid retrieval: Vector embeddings + BM25 keyword search
  • Fast indexing: 32-batch embedding with progress callbacks
  • Low latency: ~4ms retrieval over 5K+ chunks
  • LRU embedding cache: 99.9% hit rate
  • Supports: .txt, .md, .pdf, .docx, .html

Workflow

  1. Ingest: rcli_rag_ingest() - Process documents and build index
  2. Load: rcli_rag_load_index() - Load existing index at startup
  3. Query: rcli_rag_query() - Retrieve context + LLM response
  4. Clear: rcli_rag_clear() - Unload index from memory

rcli_rag_ingest

Ingest documents from a directory and build a RAG index.
int rcli_rag_ingest(RCLIHandle handle, const char* dir_path);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • dir_path (const char*, required): Path to a directory containing documents. Scanned recursively. Supported formats: .txt, .md, .pdf, .docx, .html
  • Returns (int): 0 if ingestion succeeded; -1 on failure (missing embedding model, invalid path, etc.)

Example

// Ingest documents from ~/Documents
if (rcli_rag_ingest(handle, "/Users/me/Documents") == 0) {
    printf("RAG index built successfully\n");
} else {
    fprintf(stderr, "Ingestion failed\n");
}
Ingestion automatically loads the index for querying after building.

Index Location

By default, the index is saved to:
  • macOS: ~/Library/RCLI/index/
  • Fallback: /tmp/rcli_index/
The index contains:
  • vector.usearch - HNSW vector index
  • bm25.json - BM25 term frequencies
  • chunks.json - Document chunks with metadata

Progress Callback

The implementation shows progress via stderr:
██████████···  324/512 chunks (63%)

Document Processing

  • Chunking: Documents split into 512-token chunks with 50-token overlap
  • Embedding: Snowflake Arctic Embed S (384-dim)
  • Batch size: 32 chunks per embedding batch
  • Metadata: Filename, chunk index, token count preserved

rcli_rag_load_index

Load a previously-built RAG index for querying.
int rcli_rag_load_index(RCLIHandle handle, const char* index_path);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • index_path (const char*, required): Path to the directory containing the RAG index files (vector.usearch, bm25.json, chunks.json)
  • Returns (int): 0 if the index loaded successfully; -1 on failure (missing files, corrupted index, etc.)

Example: Startup Loading

RCLIHandle handle = rcli_create(NULL);
rcli_init(handle, "/path/to/models", 99);

// Load RAG index at startup
const char* index = "/Users/me/Library/RCLI/index";
if (rcli_rag_load_index(handle, index) == 0) {
    printf("RAG ready\n");
} else {
    printf("No RAG index found - queries will use plain LLM\n");
}
Call this once at startup, not per query. The index stays loaded until rcli_rag_clear() or rcli_destroy().

Requirements

The embedding model must be present:
  • models/snowflake-arctic-embed-s-q8_0.gguf (34 MB)
  • Downloaded via rcli setup or scripts/download_models.sh

rcli_rag_query

Query the RAG system: retrieve relevant chunks + generate LLM response.
const char* rcli_rag_query(RCLIHandle handle, const char* query);
  • handle (RCLIHandle, required): Engine handle (must have a loaded RAG index)
  • query (const char*, required): User query text
  • Returns (const char*): LLM response grounded in the retrieved context; an empty string if no RAG index is loaded. Do not free the returned pointer; it is owned by the engine.

Example

// Load index first
rcli_rag_load_index(handle, "/Users/me/Library/RCLI/index");

// Query
const char* answer = rcli_rag_query(handle, "What are the key decisions from the Q3 meeting?");
printf("Answer: %s\n", answer);

How It Works

  1. Embed query: Convert query text to 384-dim vector (~5ms)
  2. Hybrid retrieval:
    • Vector search: HNSW nearest neighbors
    • BM25 search: Keyword matching
    • Fusion: Reciprocal Rank Fusion (RRF) to merge results
  3. Retrieve top-5 chunks (~4ms total)
  4. Build context: Concatenate retrieved chunks
  5. LLM generation: Generate response with context in system prompt

Performance

Operation             Latency   Notes
Query embedding       ~5ms      Cached for repeated queries
Retrieval (5 chunks)  ~4ms      HNSW + BM25 + fusion
LLM generation        ~500ms    Depends on response length
Total                 ~510ms    End-to-end RAG query

Retrieval Parameters

Current implementation:
  • Top-K: 5 chunks
  • Vector weight: 0.5
  • BM25 weight: 0.5
  • Max chunk tokens: 512
RAG queries do NOT update conversation history. Use rcli_process_command() for multi-turn conversations.

rcli_rag_clear

Clear the RAG index from memory (unload retriever + embeddings).
void rcli_rag_clear(RCLIHandle handle);
  • handle (RCLIHandle, required): Engine handle

Example

// Load and use RAG
rcli_rag_load_index(handle, "/path/to/index");
rcli_rag_query(handle, "What is...?");

// Free memory when done
rcli_rag_clear(handle);

// Queries now use plain LLM (no retrieval)
const char* response = rcli_process_command(handle, "What is...?");
After calling rcli_rag_clear(), queries revert to plain LLM mode. Reload the index with rcli_rag_load_index() to re-enable RAG.

Complete Example: RAG CLI

#include "api/rcli_api.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv) {
    if (argc < 3) {
        printf("Usage: %s <command> <path>\n", argv[0]);
        printf("Commands:\n");
        printf("  ingest <dir>   - Build RAG index from documents\n");
        printf("  query <text>   - Query the RAG system\n");
        return 1;
    }

    RCLIHandle handle = rcli_create(NULL);
    if (rcli_init(handle, "/path/to/models", 99) != 0) {
        fprintf(stderr, "Initialization failed\n");
        return 1;
    }

    const char* cmd = argv[1];
    const char* arg = argv[2];

    if (strcmp(cmd, "ingest") == 0) {
        printf("Ingesting documents from: %s\n", arg);
        if (rcli_rag_ingest(handle, arg) == 0) {
            printf("\nRAG index built successfully\n");
        } else {
            fprintf(stderr, "Ingestion failed\n");
            rcli_destroy(handle);
            return 1;
        }
    } else if (strcmp(cmd, "query") == 0) {
        // Load index from default location
        const char* index_path = "/Users/me/Library/RCLI/index";
        if (rcli_rag_load_index(handle, index_path) != 0) {
            fprintf(stderr, "Failed to load RAG index\n");
            rcli_destroy(handle);
            return 1;
        }

        printf("Query: %s\n\n", arg);
        const char* answer = rcli_rag_query(handle, arg);
        printf("Answer: %s\n", answer);
    } else {
        fprintf(stderr, "Unknown command: %s\n", cmd);
        rcli_destroy(handle);
        return 1;
    }

    rcli_destroy(handle);
    return 0;
}
Compile and run:
clang -o rag_cli rag_cli.c -L./build -lrcli

# Build index
./rag_cli ingest ~/Documents/project-notes

# Query
./rag_cli query "What are the action items from last week?"

Advanced: Custom Index Path

The engine saves the index to the default location during ingestion; you can load it from that path later:
// The engine remembers the index path from rcli_rag_ingest()
rcli_rag_ingest(handle, "/Users/me/docs");
// Index saved to ~/Library/RCLI/index/

// Later: load from remembered path
rcli_rag_load_index(handle, "/Users/me/Library/RCLI/index");

Troubleshooting

"Embedding model not found"

# Download the embedding model
bash scripts/download_models.sh

# Or manually:
rcli setup

"Failed to load RAG index"

Check that the index directory contains:
  • vector.usearch
  • bm25.json
  • chunks.json

"No results from RAG query"

  • Check if index is loaded: Re-run rcli_rag_load_index()
  • Verify documents were ingested: Check index file timestamps
  • Try broader query terms

See Also

  • Benchmarks - RAG performance testing
  • State Management - Check RAG readiness
  • RCLI CLI: rcli rag ingest <dir> and rcli rag query <text>
