
Overview

RCLI’s RAG system combines vector search (HNSW) and BM25 full-text search for hybrid retrieval. Ingest documents, build an index, then query with LLM-generated responses grounded in your data. Key features:
  • Hybrid retrieval: Vector embeddings + BM25 keyword search
  • Fast indexing: 32-batch embedding with progress callbacks
  • Low latency: ~4ms retrieval over 5K+ chunks
  • LRU embedding cache: 99.9% hit rate
  • Supports: .txt, .md, .pdf, .docx, .html

Workflow

  1. Ingest: rcli_rag_ingest() - Process documents and build index
  2. Load: rcli_rag_load_index() - Load existing index at startup
  3. Query: rcli_rag_query() - Retrieve context + LLM response
  4. Clear: rcli_rag_clear() - Unload index from memory

rcli_rag_ingest

Ingest documents from a directory and build a RAG index.
int rcli_rag_ingest(RCLIHandle handle, const char* dir_path);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • dir_path (const char*, required): Path to a directory containing documents. Scanned recursively. Supported formats: .txt, .md, .pdf, .docx, .html
  • Returns (int): 0 if ingestion succeeded; -1 on failure (missing embedding model, invalid path, etc.)

Example

// Ingest documents from ~/Documents
if (rcli_rag_ingest(handle, "/Users/me/Documents") == 0) {
    printf("RAG index built successfully\n");
} else {
    fprintf(stderr, "Ingestion failed\n");
}
Ingestion automatically loads the index for querying after building.

Index Location

By default, the index is saved to:
  • macOS: ~/Library/RCLI/index/
  • Fallback: /tmp/rcli_index/
The index contains:
  • vector.usearch - HNSW vector index
  • bm25.json - BM25 term frequencies
  • chunks.json - Document chunks with metadata

Progress Callback

The implementation shows progress via stderr:
██████████···  324/512 chunks (63%)

Document Processing

  • Chunking: Documents split into 512-token chunks with 50-token overlap
  • Embedding: Snowflake Arctic Embed S (384-dim)
  • Batch size: 32 chunks per embedding batch
  • Metadata: Filename, chunk index, token count preserved

rcli_rag_load_index

Load a previously-built RAG index for querying.
int rcli_rag_load_index(RCLIHandle handle, const char* index_path);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • index_path (const char*, required): Path to the directory containing the RAG index files (vector.usearch, bm25.json, chunks.json)
  • Returns (int): 0 if the index loaded successfully; -1 on failure (missing files, corrupted index, etc.)

Example: Startup Loading

RCLIHandle handle = rcli_create(NULL);
rcli_init(handle, "/path/to/models", 99);

// Load RAG index at startup
const char* index = "/Users/me/Library/RCLI/index";
if (rcli_rag_load_index(handle, index) == 0) {
    printf("RAG ready\n");
} else {
    printf("No RAG index found - queries will use plain LLM\n");
}
Call this once at startup, not per query. The index stays loaded until rcli_rag_clear() or rcli_destroy().

Requirements

The embedding model must be present:
  • models/snowflake-arctic-embed-s-q8_0.gguf (34 MB)
  • Downloaded via rcli setup or scripts/download_models.sh

rcli_rag_query

Query the RAG system: retrieve relevant chunks + generate LLM response.
const char* rcli_rag_query(RCLIHandle handle, const char* query);
  • handle (RCLIHandle, required): Engine handle (must have a loaded RAG index)
  • query (const char*, required): User query text
  • Returns (const char*): LLM response grounded in the retrieved context; an empty string if no RAG index is loaded. Do not free the returned pointer; it is owned by the engine.

Example

// Load index first
rcli_rag_load_index(handle, "/Users/me/Library/RCLI/index");

// Query
const char* answer = rcli_rag_query(handle, "What are the key decisions from the Q3 meeting?");
printf("Answer: %s\n", answer);

How It Works

  1. Embed query: Convert query text to 384-dim vector (~5ms)
  2. Hybrid retrieval:
    • Vector search: HNSW nearest neighbors
    • BM25 search: Keyword matching
    • Fusion: Reciprocal Rank Fusion (RRF) to merge results
  3. Retrieve top-5 chunks (~4ms total)
  4. Build context: Concatenate retrieved chunks
  5. LLM generation: Generate response with context in system prompt

Performance

Operation             Latency   Notes
Query embedding       ~5ms      Cached for repeated queries
Retrieval (5 chunks)  ~4ms      HNSW + BM25 + fusion
LLM generation        ~500ms    Depends on response length
Total                 ~510ms    End-to-end RAG query

Retrieval Parameters

Current implementation:
  • Top-K: 5 chunks
  • Vector weight: 0.5
  • BM25 weight: 0.5
  • Max chunk tokens: 512
RAG queries do NOT update conversation history. Use rcli_process_command() for multi-turn conversations.

rcli_rag_clear

Clear the RAG index from memory (unload retriever + embeddings).
void rcli_rag_clear(RCLIHandle handle);
  • handle (RCLIHandle, required): Engine handle

Example

// Load and use RAG
rcli_rag_load_index(handle, "/path/to/index");
rcli_rag_query(handle, "What is...?");

// Free memory when done
rcli_rag_clear(handle);

// Queries now use plain LLM (no retrieval)
const char* response = rcli_process_command(handle, "What is...?");
After calling rcli_rag_clear(), queries revert to plain LLM mode. Reload the index with rcli_rag_load_index() to re-enable RAG.

Complete Example: RAG CLI

#include "api/rcli_api.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv) {
    if (argc < 3) {
        printf("Usage: %s <command> <path>\n", argv[0]);
        printf("Commands:\n");
        printf("  ingest <dir>   - Build RAG index from documents\n");
        printf("  query <text>   - Query the RAG system\n");
        return 1;
    }

    RCLIHandle handle = rcli_create(NULL);
    if (rcli_init(handle, "/path/to/models", 99) != 0) {
        fprintf(stderr, "Initialization failed\n");
        return 1;
    }

    const char* cmd = argv[1];
    const char* arg = argv[2];

    if (strcmp(cmd, "ingest") == 0) {
        printf("Ingesting documents from: %s\n", arg);
        if (rcli_rag_ingest(handle, arg) == 0) {
            printf("\nRAG index built successfully\n");
        } else {
            fprintf(stderr, "Ingestion failed\n");
            rcli_destroy(handle);
            return 1;
        }
    } else if (strcmp(cmd, "query") == 0) {
        // Load index from default location
        const char* index_path = "/Users/me/Library/RCLI/index";
        if (rcli_rag_load_index(handle, index_path) != 0) {
            fprintf(stderr, "Failed to load RAG index\n");
            rcli_destroy(handle);
            return 1;
        }

        printf("Query: %s\n\n", arg);
        const char* answer = rcli_rag_query(handle, arg);
        printf("Answer: %s\n", answer);
    } else {
        fprintf(stderr, "Unknown command: %s\n", cmd);
        rcli_destroy(handle);
        return 1;
    }

    rcli_destroy(handle);
    return 0;
}
Compile and run:
clang -o rag_cli rag_cli.c -L./build -lrcli

# Build index
./rag_cli ingest ~/Documents/project-notes

# Query
./rag_cli query "What are the action items from last week?"

Advanced: Custom Index Path

The engine saves the index to the default location during ingestion; you can load it from that path later:
// The engine remembers the index path from rcli_rag_ingest()
rcli_rag_ingest(handle, "/Users/me/docs");
// Index saved to ~/Library/RCLI/index/

// Later: load from remembered path
rcli_rag_load_index(handle, "/Users/me/Library/RCLI/index");

Troubleshooting

"Embedding model not found"

# Download the embedding model
bash scripts/download_models.sh

# Or manually:
rcli setup

"Failed to load RAG index"

Check that the index directory contains:
  • vector.usearch
  • bm25.json
  • chunks.json

"No results from RAG query"

  • Check if index is loaded: Re-run rcli_rag_load_index()
  • Verify documents were ingested: Check index file timestamps
  • Try broader query terms

See Also

  • Benchmarks - RAG performance testing
  • State Management - Check RAG readiness
  • RCLI CLI: rcli rag ingest <dir> and rcli rag query <text>
