## High-Level Architecture
- Engines — ML inference wrappers (STT, LLM, TTS, VAD, embeddings)
- Pipeline — Orchestrator coordinates data flow between engines
- RAG — Hybrid retrieval (vector + BM25) over local documents
- Actions — 43 macOS integrations via AppleScript and shell
- CLI — Interactive TUI and command-line interface
## Directory Structure

### src/ Modules
#### engines/

ML inference wrappers for each modality:

| File | Purpose |
|---|---|
| `stt_engine.cpp/.h` | Speech-to-text via sherpa-onnx (Zipformer, Whisper, Parakeet) |
| `llm_engine.cpp/.h` | LLM inference via llama.cpp with Metal GPU |
| `tts_engine.cpp/.h` | Text-to-speech via sherpa-onnx (Piper, Kokoro, KittenTTS) |
| `vad_engine.cpp/.h` | Voice activity detection (Silero VAD) |
| `embedding_engine.cpp/.h` | Text embeddings for RAG (Snowflake Arctic Embed) |
| `model_profile.cpp/.h` | Model metadata, chat templates, tool call parsing |
- Each engine wraps a C API (`llama.cpp`, `sherpa-onnx`)
- Engines are initialized once and reused across queries
- Metal GPU acceleration for LLM and embeddings
- ONNX Runtime for STT/TTS/VAD
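The wrap-once-reuse pattern can be sketched as a small RAII class. The `fake_engine_*` functions below are stand-ins for the real llama.cpp / sherpa-onnx C calls, not the project's actual API:

```cpp
#include <memory>
#include <string>

// Stand-in C API in the style of llama.cpp / sherpa-onnx (hypothetical names).
struct fake_engine_ctx { int loaded = 1; };
static fake_engine_ctx* fake_engine_init(const char* /*model_path*/) { return new fake_engine_ctx(); }
static void fake_engine_free(fake_engine_ctx* c) { delete c; }
static int fake_engine_run(fake_engine_ctx* c) { return c->loaded; }

// RAII wrapper: the context is created once and reused across queries;
// the custom deleter guarantees the C-side free is always called.
class Engine {
public:
    explicit Engine(const std::string& model_path)
        : ctx_(fake_engine_init(model_path.c_str()), &fake_engine_free) {}
    int run() { return fake_engine_run(ctx_.get()); }
private:
    std::unique_ptr<fake_engine_ctx, void (*)(fake_engine_ctx*)> ctx_;
};
```

The point of the deleter-carrying `unique_ptr` is that the expensive model load happens once at init, and teardown is exception-safe without a hand-written destructor.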
#### pipeline/

Orchestrates data flow between engines:

| File | Purpose |
|---|---|
| `orchestrator.cpp/.h` | Central class that owns all engines and coordinates the pipeline |
| `sentence_detector.cpp/.h` | Accumulates LLM tokens and flushes complete sentences to TTS |
| `text_sanitizer.h` | Removes non-speech text (markdown, XML tags) before TTS |
- Manages pipeline state (IDLE → LISTENING → PROCESSING → SPEAKING)
- Runs the STT/LLM/TTS threads
- Dispatches tool calls to `ActionRegistry`
- Maintains conversation history with token-budget trimming
- System prompt KV caching for fast responses
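The IDLE → LISTENING → PROCESSING → SPEAKING cycle can be sketched as a tiny state machine. This is an illustrative sketch only; the real `Orchestrator` also handles interruptions and error paths:

```cpp
enum class PipelineState { IDLE, LISTENING, PROCESSING, SPEAKING };

// Advance one step through the voice-loop cycle; SPEAKING wraps back
// to IDLE once playback finishes.
PipelineState next_state(PipelineState s) {
    switch (s) {
        case PipelineState::IDLE:       return PipelineState::LISTENING;
        case PipelineState::LISTENING:  return PipelineState::PROCESSING;
        case PipelineState::PROCESSING: return PipelineState::SPEAKING;
        case PipelineState::SPEAKING:   return PipelineState::IDLE;
    }
    return PipelineState::IDLE;  // unreachable; silences compiler warnings
}
```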
#### rag/

Hybrid retrieval system for local documents:

| File | Purpose |
|---|---|
| `vector_index.cpp/.h` | HNSW vector search via USearch |
| `bm25_index.cpp/.h` | Full-text search with BM25 ranking |
| `hybrid_retriever.cpp/.h` | Combines vector + BM25 via Reciprocal Rank Fusion |
| `document_processor.cpp/.h` | Chunks documents (PDF, DOCX, TXT) into 512-token segments |
| `index_builder.cpp/.h` | Builds and persists indices |
- Query is embedded via `embedding_engine`
- Vector search (HNSW) finds nearest chunks
- BM25 search finds keyword-matching chunks
- Results fused via RRF (Reciprocal Rank Fusion)
- Top-k chunks injected into LLM context
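The fusion step can be sketched as follows. This is a generic Reciprocal Rank Fusion implementation, not the project's `hybrid_retriever` code; the constant `k = 60` is the value from the original RRF paper:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
// per document id, so items ranked highly by *both* retrievers win.
std::vector<std::string> rrf_fuse(
        const std::vector<std::vector<std::string>>& ranked_lists,
        double k = 60.0) {
    std::map<std::string, double> score;
    for (const auto& list : ranked_lists)
        for (size_t rank = 0; rank < list.size(); ++rank)
            score[list[rank]] += 1.0 / (k + rank + 1);  // rank is 1-based in the formula

    std::vector<std::string> fused;
    for (const auto& kv : score) fused.push_back(kv.first);
    std::sort(fused.begin(), fused.end(),
              [&](const std::string& a, const std::string& b) {
                  return score[a] > score[b];
              });
    return fused;  // take the top-k entries for LLM context injection
}
```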
#### core/

Core types and utilities:

| File | Purpose |
|---|---|
| `types.h` | Shared types (ToolCall, ToolResult, PipelineState, etc.) |
| `ring_buffer.h` | Lock-free ring buffer for zero-copy audio transfer |
| `memory_pool.h` | Pre-allocated 64 MB arena (no runtime malloc) |
| `hardware_profile.h` | Detects P-cores, E-cores, Metal GPU, RAM |
| `log.h` | Logging macros (LOG_INFO, LOG_ERROR) |
| `base64.h` | Base64 encoding/decoding |
| `string_utils.h` | String manipulation utilities |
| `file_utils.h` | File I/O helpers |
- Lock-free ring buffer — zero-copy audio passing between threads
- Pre-allocated memory pool — 64 MB arena allocated at init
- Hardware profiling — adapts thread count and GPU layers to hardware
#### audio/

CoreAudio microphone and speaker I/O:

| File | Purpose |
|---|---|
| `audio_io.cpp/.h` | CoreAudio input/output streams |
| `mic_permission.h/.mm` | Microphone permission request (Objective-C) |
- 16 kHz mono capture for STT
- 24 kHz mono playback for TTS
- Buffer size: 512 samples (32ms at 16 kHz)
- Minimal latency configuration
#### tools/

Tool calling engine:

| File | Purpose |
|---|---|
| `tool_engine.cpp/.h` | Parses LLM tool calls and dispatches to ActionRegistry |
- The LLM generates a tool call in model-native format (e.g., Qwen3’s `<tool_call>`)
- `ToolEngine` parses it via `ModelProfile::parse_tool_calls()`
- Dispatches to `ActionRegistry::execute()`
- Returns the result to the LLM

Supported formats:

- Qwen3: `<tool_call>{...}</tool_call>`
- LFM2: `<|tool_call_start|>{...}<|tool_call_end|>`
- Generic JSON: `{"name": "...", "arguments": {...}}`
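Extracting the payload between model-specific delimiters can be sketched as below. This is an illustrative helper, not the project's `ModelProfile::parse_tool_calls()`, which also validates the JSON before dispatch:

```cpp
#include <string>
#include <vector>

// Scan `text` for every substring wrapped in `open` ... `close`
// (e.g. "<tool_call>" / "</tool_call>" for Qwen3-style output) and
// return the raw JSON payloads in order of appearance.
std::vector<std::string> extract_tool_calls(const std::string& text,
                                            const std::string& open,
                                            const std::string& close) {
    std::vector<std::string> calls;
    size_t pos = 0;
    while ((pos = text.find(open, pos)) != std::string::npos) {
        size_t start = pos + open.size();
        size_t end = text.find(close, start);
        if (end == std::string::npos) break;  // unterminated call: stop parsing
        calls.push_back(text.substr(start, end - start));
        pos = end + close.size();
    }
    return calls;
}
```

Swapping the delimiter pair is all that is needed to support the LFM2 format from the list above.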
#### bench/

Benchmark harness:

| File | Purpose |
|---|---|
| `benchmark.cpp/.h` | Runs STT, LLM, TTS, E2E, RAG, tools, and memory benchmarks |

Benchmark suites:

- `stt` — Transcription latency and accuracy
- `llm` — Time to first token, throughput (tok/s)
- `tts` — Synthesis latency
- `e2e` — Voice-in to audio-out latency
- `rag` — Retrieval latency (vector + BM25)
- `tools` — Tool calling accuracy and latency
- `memory` — Peak memory usage
- `all` — All suites
#### actions/

macOS action implementations:

| File | Purpose |
|---|---|
| `action_registry.cpp/.h` | Registers actions and dispatches execution |
| `action_helpers.h` | JSON parsing, string escaping utilities |
| `applescript_executor.cpp/.h` | Executes AppleScript and shell commands |
| `register_all.cpp` | Calls all registration functions |
| **Category files:** | |
| `notes_actions.cpp/.h` | Apple Notes integration |
| `reminders_actions.cpp/.h` | Reminders integration |
| `messages_actions.cpp/.h` | Messages/iMessage |
| `app_control_actions.cpp/.h` | Open/quit apps |
| `window_actions.cpp/.h` | Window management |
| `system_actions.cpp/.h` | System settings (volume, dark mode, lock) |
| `media_actions.cpp/.h` | Spotify/Apple Music |
| `web_actions.cpp/.h` | Web search |
| `browser_actions.cpp/.h` | Safari/Chrome control |
| `clipboard_actions.cpp/.h` | Clipboard read/write |
| `files_actions.cpp/.h` | File search |
| `navigation_actions.cpp/.h` | Maps integration |
| `communication_actions.cpp/.h` | FaceTime |
#### api/

Public C API:

| File | Purpose |
|---|---|
| `rcli_api.h` | Public C API header (all engine functionality) |
| `rcli_api.cpp` | API implementation |

- `rcli_init()` — Initialize pipeline
- `rcli_query()` — One-shot text query
- `rcli_start_listen()` — Start continuous voice mode
- `rcli_stop_listen()` — Stop listening
- `rcli_cleanup()` — Shutdown pipeline
#### cli/

CLI and TUI:

| File | Purpose |
|---|---|
| `main.cpp` | Entry point, argument parsing, command dispatch |
| `tui_dashboard.h` | Interactive TUI dashboard (FTXUI) |
| `tui_app.h` | TUI event loop |
| `actions_cli.h` | Actions panel (browse, enable/disable, execute) |
| `model_pickers.h` | Model management (LLM, STT, TTS) |
| `help.h` | CLI help text |
| `setup_cmds.h` | `rcli setup` and `rcli cleanup` commands |
| `visualizer.h` | Waveform visualizer |
| `cli_common.h` | Shared CLI utilities |
- Push-to-talk (SPACE bar)
- Models panel (M) — browse, download, hot-swap
- Actions panel (A) — enable/disable actions
- Benchmarks panel (B) — run performance tests
- RAG panel (R) — ingest documents
- Cleanup panel (D) — remove unused models
- Tool call trace (T) — debug LLM tool calls
#### models/

Model registries:

| File | Purpose |
|---|---|
| `model_registry.h` | LLM model definitions (id, URL, size, speed, tool calling) |
| `tts_model_registry.h` | TTS voice definitions |
| `stt_model_registry.h` | STT model definitions |
- Download URL (Hugging Face)
- Size (MB)
- Speed estimate (tokens/sec)
- Tool calling capability
- Default/recommended flags
#### test/

Test harness:

| File | Purpose |
|---|---|
| `test_pipeline.cpp` | Pipeline integration tests |

Test flags:

- `--actions-only` — Fast, no models needed
- `--llm-only` — LLM inference tests
- `--stt-only` — STT transcription tests
- `--tts-only` — TTS synthesis tests
- `--api-only` — C API tests
## Key Design Patterns

### Orchestrator Pattern

The `Orchestrator` class (`src/pipeline/orchestrator.h`) owns all engines and coordinates data flow:
- Single source of truth for pipeline state
- Simplified thread coordination
- Easy to add new engines
### Lock-Free Ring Buffer

Zero-copy audio transfer between threads (`src/core/ring_buffer.h`):
- No mutex contention
- Zero-copy (pointers only)
- Fixed allocation (no runtime malloc)
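The core idea can be sketched as a minimal single-producer/single-consumer ring. This is not the project's `ring_buffer.h`, just an illustration of why no mutex is needed when exactly one thread pushes and one pops:

```cpp
#include <atomic>
#include <cstddef>

// SPSC lock-free ring: the producer only writes head_, the consumer only
// writes tail_, and acquire/release ordering makes the element visible
// before the index update. Capacity must be a power of two so the
// wrap-around is a single AND with the mask.
template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;           // full
        buf_[head & (N - 1)] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;               // empty
        out = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};
```

In the audio path the element type would be a pointer or small frame descriptor, which is what keeps the transfer zero-copy.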
### Pre-Allocated Memory Pool

A 64 MB arena allocated at init (`src/core/memory_pool.h`):
- No runtime malloc during inference
- Predictable latency
- Cache-friendly (contiguous memory)
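A bump-pointer arena is the simplest form of this pattern. The sketch below illustrates the idea only and is not the project's `memory_pool.h`:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump allocator over one up-front buffer: each alloc is just an aligned
// pointer increment, and everything is released at once via reset() —
// no malloc/free calls during inference.
class Arena {
public:
    explicit Arena(size_t bytes) : buf_(bytes), off_(0) {}

    // `align` must be a power of two.
    void* alloc(size_t bytes, size_t align = alignof(std::max_align_t)) {
        size_t p = (off_ + align - 1) & ~(align - 1);  // round offset up
        if (p + bytes > buf_.size()) return nullptr;    // arena exhausted
        off_ = p + bytes;
        return buf_.data() + p;
    }

    void reset() { off_ = 0; }        // frees every allocation at once
    size_t used() const { return off_; }

private:
    std::vector<uint8_t> buf_;
    size_t off_;
};
```

Because allocation never touches the system heap, latency stays predictable even under pressure, which is the property the pipeline relies on.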
### System Prompt KV Caching

Reuses the llama.cpp KV cache across queries (`src/engines/llm_engine.cpp`):
- Avoids reprocessing system prompt (saves ~20-30ms)
- Lower latency for multi-turn conversations
### Sentence-Level TTS Scheduling

TTS synthesizes complete sentences, not token-by-token (`src/pipeline/sentence_detector.cpp`):
- Natural prosody (TTS sees full sentences)
- Double-buffered playback (next sentence synthesizes while current plays)
- Lower latency than waiting for full LLM response
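The accumulate-and-flush idea can be sketched as below. This is a simplified illustration, not the project's `SentenceDetector`, which also has to handle abbreviations, decimals, and minimum-length rules:

```cpp
#include <string>
#include <vector>

// Buffer streamed LLM tokens; whenever a sentence terminator appears,
// emit the completed sentence(s) so TTS can start synthesizing while
// the LLM keeps generating.
class SentenceBuffer {
public:
    std::vector<std::string> feed(const std::string& token) {
        buf_ += token;
        std::vector<std::string> out;
        size_t pos;
        while ((pos = buf_.find_first_of(".!?")) != std::string::npos) {
            out.push_back(buf_.substr(0, pos + 1));  // include the terminator
            buf_.erase(0, pos + 1);                  // keep the partial tail
        }
        return out;
    }
private:
    std::string buf_;
};
```

Flushing on sentence boundaries is what gives the TTS engine full-sentence context for prosody while still overlapping synthesis with generation.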
## Threading Model

Three threads run concurrently in live mode.

### STT Thread
- Captures mic audio via CoreAudio
- Runs Silero VAD to filter silence
- Detects speech endpoints
- Transcribes via Zipformer (streaming) or Whisper (batch)
- Signals LLM thread when transcription is ready
### LLM Thread

- Waits for STT output (`std::condition_variable`)
- Generates tokens via llama.cpp with Metal GPU
- Parses tool calls and dispatches to `ActionRegistry`
- Feeds sentences to TTS via `SentenceDetector`
- Maintains conversation history with token-budget trimming

### TTS Thread

- Waits for complete sentences from `SentenceDetector`
- Synthesizes speech via sherpa-onnx
- Plays 24 kHz audio via CoreAudio (double-buffered)
Synchronization:

- `std::condition_variable` for thread wakeup
- `std::atomic<PipelineState>` for state transitions
- Lock-free ring buffers for audio transfer
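The STT → LLM hand-off can be sketched with a condition variable. The names and the transcript string below are illustrative, not the project's actual code:

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Producer/consumer hand-off in the style of the STT -> LLM wakeup:
// the "STT thread" publishes a transcript under the mutex and notifies;
// the caller (the "LLM thread") sleeps on the condition variable until
// the predicate holds, which is safe even if notify fires first.
std::string run_handoff() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    std::string transcript;

    std::thread stt([&] {
        std::lock_guard<std::mutex> lk(m);
        transcript = "turn on dark mode";   // hypothetical transcription
        ready = true;
        cv.notify_one();                    // wake the waiting thread
    });

    std::string got;
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ready; }); // blocks; no busy-waiting
        got = transcript;
    }
    stt.join();
    return got;
}
```

Using a predicate with `cv.wait` guards against both spurious wakeups and the notify-before-wait race.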
## Dependencies

### Vendored (deps/)

Cloned by `scripts/setup.sh`:
- llama.cpp — LLM + embedding inference with Metal GPU
- sherpa-onnx — STT/TTS/VAD via ONNX Runtime
### Fetched by CMake

Automatic via `FetchContent`:
- USearch v2.16.5 — HNSW vector index (header-only)
- FTXUI v5.0.0 — Terminal UI library
### macOS System Frameworks
- CoreAudio, AudioToolbox, AudioUnit
- Foundation, AVFoundation
- IOKit (hardware monitoring)
- Metal, MetalKit (GPU acceleration)
## Build Outputs

## Configuration Files

Runtime configuration is stored in `~/Library/RCLI/`.
## Next Steps
- **Building from Source** — Build and install RCLI locally
- **Adding Actions** — Extend RCLI with custom macOS actions
- **Contributing** — Submit changes and improvements