RCLI is organized into distinct modules for voice processing, RAG, actions, and the CLI. This guide explains the directory structure and how components interact.

High-Level Architecture

Mic → VAD → STT (Zipformer) → [RAG Retrieval] → LLM (Qwen3) → TTS (Piper) → Speaker
                                                    │
                                                    └→ Tool Calling → macOS Actions
Core components:
  • Engines — ML inference wrappers (STT, LLM, TTS, VAD, embeddings)
  • Pipeline — Orchestrator coordinates data flow between engines
  • RAG — Hybrid retrieval (vector + BM25) over local documents
  • Actions — 43 macOS integrations via AppleScript and shell
  • CLI — Interactive TUI and command-line interface

Directory Structure

RCLI/
├── src/                    # C++ source code
│   ├── engines/            # ML engine wrappers
│   ├── pipeline/           # Orchestrator and sentence detection
│   ├── rag/                # RAG retrieval system
│   ├── core/               # Core types and utilities
│   ├── audio/              # CoreAudio I/O
│   ├── tools/              # Tool calling engine
│   ├── bench/              # Benchmark harness
│   ├── actions/            # macOS action implementations
│   ├── api/                # Public C API
│   ├── cli/                # TUI and CLI commands
│   ├── models/             # Model registries
│   └── test/               # Test harness
├── deps/                   # Dependencies (gitignored)
│   ├── llama.cpp/          # Cloned by scripts/setup.sh
│   └── sherpa-onnx/        # Cloned by scripts/setup.sh
├── scripts/                # Build and setup scripts
├── Formula/                # Homebrew formula
├── CMakeLists.txt          # CMake build configuration
└── README.md

src/ Modules

engines/

ML inference wrappers for each modality:
File                      Purpose
stt_engine.cpp/.h         Speech-to-text via sherpa-onnx (Zipformer, Whisper, Parakeet)
llm_engine.cpp/.h         LLM inference via llama.cpp with Metal GPU
tts_engine.cpp/.h         Text-to-speech via sherpa-onnx (Piper, Kokoro, KittenTTS)
vad_engine.cpp/.h         Voice activity detection (Silero VAD)
embedding_engine.cpp/.h   Text embeddings for RAG (Snowflake Arctic Embed)
model_profile.cpp/.h      Model metadata, chat templates, tool call parsing
Design:
  • Each engine wraps a C API (llama.cpp, sherpa-onnx)
  • Engines are initialized once and reused across queries
  • Metal GPU acceleration for LLM and embeddings
  • ONNX Runtime for STT/TTS/VAD

pipeline/

Orchestrates data flow between engines:
File                       Purpose
orchestrator.cpp/.h        Central class that owns all engines and coordinates the pipeline
sentence_detector.cpp/.h   Accumulates LLM tokens and flushes complete sentences to TTS
text_sanitizer.h           Removes non-speech text (markdown, XML tags) before TTS
Orchestrator responsibilities:
  • Manages pipeline state (IDLE → LISTENING → PROCESSING → SPEAKING)
  • Runs STT/LLM/TTS threads
  • Dispatches tool calls to ActionRegistry
  • Maintains conversation history with token-budget trimming
  • System prompt KV caching for fast response
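The token-budget trimming mentioned above can be sketched as follows. `Message`, the pinned-system-prompt convention, and the budget value are illustrative assumptions, not RCLI's actual types:

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical message record; RCLI's real types live in src/core/types.h.
struct Message {
    std::string role;    // "system", "user", or "assistant"
    std::string text;
    size_t      tokens;  // token count of this turn
};

// Drop the oldest non-system turns until the history fits the token budget.
// Returns the total token count after trimming.
inline size_t trim_history(std::deque<Message>& history, size_t budget) {
    size_t total = 0;
    for (const auto& m : history) total += m.tokens;
    while (total > budget && history.size() > 1) {
        // Index 0 is assumed to hold the system prompt; keep it pinned.
        total -= history[1].tokens;
        history.erase(history.begin() + 1);
    }
    return total;
}
```

Trimming from the front keeps the most recent turns, which matter most for multi-turn coherence, while the system prompt survives so its KV cache stays valid.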

rag/

Hybrid retrieval system for local documents:
File                        Purpose
vector_index.cpp/.h         HNSW vector search via USearch
bm25_index.cpp/.h           Full-text search with BM25 ranking
hybrid_retriever.cpp/.h     Combines vector + BM25 via Reciprocal Rank Fusion
document_processor.cpp/.h   Chunks documents (PDF, DOCX, TXT) into 512-token segments
index_builder.cpp/.h        Builds and persists indices
Retrieval flow:
  1. Query is embedded via embedding_engine
  2. Vector search (HNSW) finds nearest chunks
  3. BM25 search finds keyword-matching chunks
  4. Results fused via RRF (Reciprocal Rank Fusion)
  5. Top-k chunks injected into LLM context
Performance: ~4ms retrieval over 5K+ chunks (M3 Max)
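The RRF fusion in step 4 can be sketched as below. The constant k = 60 is the value from the original RRF paper; RCLI's actual constant and result types may differ:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank_d),
// where rank_d is d's 1-based rank in that list. Documents appearing near the
// top of *either* the vector or BM25 list accumulate the highest scores.
inline std::vector<std::pair<std::string, double>> rrf_fuse(
        const std::vector<std::vector<std::string>>& ranked_lists,
        double k = 60.0) {
    std::map<std::string, double> scores;
    for (const auto& list : ranked_lists)
        for (size_t rank = 0; rank < list.size(); ++rank)
            scores[list[rank]] += 1.0 / (k + rank + 1);  // ranks are 1-based
    std::vector<std::pair<std::string, double>> fused(scores.begin(), scores.end());
    std::sort(fused.begin(), fused.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return fused;
}
```

Because RRF only looks at ranks, it needs no score normalization between the cosine-similarity scale of the vector search and the unbounded BM25 scale.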

core/

Core types and utilities:
File                 Purpose
types.h              Shared types (ToolCall, ToolResult, PipelineState, etc.)
ring_buffer.h        Lock-free ring buffer for zero-copy audio transfer
memory_pool.h        Pre-allocated 64 MB arena (no runtime malloc)
hardware_profile.h   Detects P-cores, E-cores, Metal GPU, RAM
log.h                Logging macros (LOG_INFO, LOG_ERROR)
base64.h             Base64 encoding/decoding
string_utils.h       String manipulation utilities
file_utils.h         File I/O helpers
Key design patterns:
  • Lock-free ring buffer — zero-copy audio passing between threads
  • Pre-allocated memory pool — 64 MB arena allocated at init
  • Hardware profiling — adapts thread count and GPU layers to hardware

audio/

CoreAudio microphone and speaker I/O:
File                   Purpose
audio_io.cpp/.h        CoreAudio input/output streams
mic_permission.h/.mm   Microphone permission request (Objective-C)
Features:
  • 16 kHz mono capture for STT
  • 24 kHz mono playback for TTS
  • Buffer size: 512 samples (32ms at 16 kHz)
  • Minimal latency configuration

tools/

Tool calling engine:
File                 Purpose
tool_engine.cpp/.h   Parses LLM tool calls and dispatches to ActionRegistry
Tool calling flow:
  1. LLM generates tool call in model-native format (e.g., Qwen3’s <tool_call>)
  2. ToolEngine parses via ModelProfile::parse_tool_calls()
  3. Dispatches to ActionRegistry::execute()
  4. Returns result to LLM
Supported formats:
  • Qwen3: <tool_call>{...}</tool_call>
  • LFM2: <|tool_call_start|>{...}<|tool_call_end|>
  • Generic JSON: {"name": "...", "arguments": {...}}
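Extracting the payloads between those delimiters might look like the sketch below; the function name is illustrative, and the real parsing lives in ModelProfile::parse_tool_calls():

```cpp
#include <cassert>
#include <string>
#include <vector>

// Extract the raw JSON payloads between a model's tool-call delimiters.
// The tag pair is a parameter so one routine covers both the Qwen3 format
// (<tool_call>...</tool_call>) and LFM2's <|tool_call_start|>...<|tool_call_end|>.
inline std::vector<std::string> extract_tool_payloads(
        const std::string& output,
        const std::string& open_tag  = "<tool_call>",
        const std::string& close_tag = "</tool_call>") {
    std::vector<std::string> payloads;
    size_t pos = 0;
    while ((pos = output.find(open_tag, pos)) != std::string::npos) {
        size_t start = pos + open_tag.size();
        size_t end   = output.find(close_tag, start);
        if (end == std::string::npos) break;  // unterminated call: stop parsing
        payloads.push_back(output.substr(start, end - start));
        pos = end + close_tag.size();
    }
    return payloads;
}
```

Each returned payload is then parsed as JSON ({"name": ..., "arguments": ...}) before being handed to ActionRegistry::execute().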

bench/

Benchmark harness:
File               Purpose
benchmark.cpp/.h   Runs STT, LLM, TTS, E2E, RAG, tools, memory benchmarks
Suites:
  • stt — Transcription latency and accuracy
  • llm — Time to first token, throughput (tok/s)
  • tts — Synthesis latency
  • e2e — Voice-in to audio-out latency
  • rag — Retrieval latency (vector + BM25)
  • tools — Tool calling accuracy and latency
  • memory — Peak memory usage
  • all — All suites
Usage:
rcli bench --suite llm
rcli bench --all-llm --suite llm    # Compare all LLMs
rcli bench --output results.json

actions/

macOS action implementations:
File                          Purpose
action_registry.cpp/.h        Registers actions and dispatches execution
action_helpers.h              JSON parsing, string escaping utilities
applescript_executor.cpp/.h   Executes AppleScript and shell commands
register_all.cpp              Calls all registration functions

Category files:
notes_actions.cpp/.h           Apple Notes integration
reminders_actions.cpp/.h       Reminders integration
messages_actions.cpp/.h        Messages/iMessage
app_control_actions.cpp/.h     Open/quit apps
window_actions.cpp/.h          Window management
system_actions.cpp/.h          System settings (volume, dark mode, lock)
media_actions.cpp/.h           Spotify/Apple Music
web_actions.cpp/.h             Web search
browser_actions.cpp/.h         Safari/Chrome control
clipboard_actions.cpp/.h       Clipboard read/write
files_actions.cpp/.h           File search
navigation_actions.cpp/.h      Maps integration
communication_actions.cpp/.h   FaceTime
43 actions total — see Adding Actions

api/

Public C API:
File           Purpose
rcli_api.h     Public C API header (all engine functionality)
rcli_api.cpp   API implementation
Exported functions:
  • rcli_init() — Initialize pipeline
  • rcli_query() — One-shot text query
  • rcli_start_listen() — Start continuous voice mode
  • rcli_stop_listen() — Stop listening
  • rcli_cleanup() — Shutdown pipeline
Use case: Embed RCLI in other applications

cli/

CLI and TUI:
File              Purpose
main.cpp          Entry point, argument parsing, command dispatch
tui_dashboard.h   Interactive TUI dashboard (FTXUI)
tui_app.h         TUI event loop
actions_cli.h     Actions panel (browse, enable/disable, execute)
model_pickers.h   Model management (LLM, STT, TTS)
help.h            CLI help text
setup_cmds.h      rcli setup, rcli cleanup commands
visualizer.h      Waveform visualizer
cli_common.h      Shared CLI utilities
TUI features:
  • Push-to-talk (SPACE bar)
  • Models panel (M) — browse, download, hot-swap
  • Actions panel (A) — enable/disable actions
  • Benchmarks panel (B) — run performance tests
  • RAG panel (R) — ingest documents
  • Cleanup panel (D) — remove unused models
  • Tool call trace (T) — debug LLM tool calls

models/

Model registries:
File                   Purpose
model_registry.h       LLM model definitions (id, URL, size, speed, tool calling)
tts_model_registry.h   TTS voice definitions
stt_model_registry.h   STT model definitions
Model metadata:
  • Download URL (Hugging Face)
  • Size (MB)
  • Speed estimate (tokens/sec)
  • Tool calling capability
  • Default/recommended flags
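That metadata might be carried by a plain struct like the one below. The field and function names here are illustrative assumptions, not the actual model_registry.h definitions:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative registry entry mirroring the metadata listed above;
// the real definitions live in src/models/model_registry.h.
struct LLMModelInfo {
    std::string id;            // short identifier shown in the picker
    std::string url;           // Hugging Face download URL
    int         size_mb;       // download size in MB
    int         tok_per_sec;   // rough speed estimate
    bool        tool_calling;  // supports tool-call output
    bool        recommended;   // default/recommended flag
};

// A registry is then just a static table the pickers iterate over.
inline const std::vector<LLMModelInfo>& llm_registry() {
    static const std::vector<LLMModelInfo> models = {
        {"example-model", "https://huggingface.co/...", 2500, 40, true, true},
    };
    return models;
}
```

Keeping the table in a header as static data means adding a model is a one-line change with no runtime configuration to load.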
Usage:
rcli models              # Interactive picker
rcli upgrade-llm         # Guided LLM upgrade
rcli voices              # TTS voice picker

test/

Test harness:
File                Purpose
test_pipeline.cpp   Pipeline integration tests
Test modes:
  • --actions-only — Fast, no models needed
  • --llm-only — LLM inference tests
  • --stt-only — STT transcription tests
  • --tts-only — TTS synthesis tests
  • --api-only — C API tests
Usage:
./rcli_test ~/Library/RCLI/models
./rcli_test ~/Library/RCLI/models --actions-only

Key Design Patterns

Orchestrator Pattern

The Orchestrator class owns all engines and coordinates data flow:
src/pipeline/orchestrator.h
class Orchestrator {
    STTEngine       stt_;
    LLMEngine       llm_;
    TTSEngine       tts_;
    VADEngine       vad_;
    EmbeddingEngine embedding_;
    ToolEngine      tool_engine_;
    ActionRegistry  action_registry_;
    HybridRetriever rag_retriever_;

    std::atomic<PipelineState> state_;
    // ...
};
Benefits:
  • Single source of truth for pipeline state
  • Simplified thread coordination
  • Easy to add new engines

Lock-Free Ring Buffer

Zero-copy audio transfer between threads:
src/core/ring_buffer.h
template<typename T>
class RingBuffer {
    std::atomic<size_t> read_pos_;
    std::atomic<size_t> write_pos_;
    std::vector<T> buffer_;
    // ...
};
Benefits:
  • No mutex contention
  • Zero-copy (pointers only)
  • Fixed allocation (no runtime malloc)
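The skeleton above omits the push/pop logic. A minimal single-producer/single-consumer version might look like this; it is a sketch of the technique, not RCLI's implementation:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal SPSC ring buffer: one producer thread calls push(), one consumer
// thread calls pop(); coordination uses only two atomics, no mutex.
template <typename T>
class SpscRing {
public:
    // One slot is sacrificed to distinguish "full" from "empty".
    explicit SpscRing(size_t capacity) : buffer_(capacity + 1) {}

    bool push(const T& value) {  // producer side
        size_t w    = write_pos_.load(std::memory_order_relaxed);
        size_t next = (w + 1) % buffer_.size();
        if (next == read_pos_.load(std::memory_order_acquire))
            return false;  // full: caller drops or retries, never blocks
        buffer_[w] = value;
        write_pos_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {  // consumer side
        size_t r = read_pos_.load(std::memory_order_relaxed);
        if (r == write_pos_.load(std::memory_order_acquire))
            return false;  // empty
        out = buffer_[r];
        read_pos_.store((r + 1) % buffer_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T>      buffer_;
    std::atomic<size_t> read_pos_{0};
    std::atomic<size_t> write_pos_{0};
};
```

The acquire/release pairing guarantees the consumer never reads a slot before the producer's write to it is visible, which is what makes the design safe without locks.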

Pre-Allocated Memory Pool

64 MB arena allocated at init:
src/core/memory_pool.h
class MemoryPool {
    std::vector<uint8_t> pool_;   // 64 MB, reserved once at init
    size_t offset_ = 0;

    void* allocate(size_t size) {
        if (offset_ + size > pool_.size())
            return nullptr;       // pool exhausted; caller must handle
        void* ptr = &pool_[offset_];
        offset_ += size;
        return ptr;
    }
};
Benefits:
  • No runtime malloc during inference
  • Predictable latency
  • Cache-friendly (contiguous memory)

System Prompt KV Caching

Reuses llama.cpp KV cache across queries:
src/engines/llm_engine.cpp
void LLMEngine::generate(const std::string& prompt) {
    // First query: process system prompt + user input
    if (!kv_cache_initialized_) {
        eval_tokens(system_prompt_tokens_);
        save_kv_cache();
        kv_cache_initialized_ = true;
    } else {
        // Subsequent queries: restore system prompt cache
        restore_kv_cache();
    }
    eval_tokens(user_input_tokens_);
    // ...
}
Benefits:
  • Avoids reprocessing system prompt (saves ~20-30ms)
  • Lower latency for multi-turn conversations

Sentence-Level TTS Scheduling

TTS synthesizes complete sentences, not token-by-token:
src/pipeline/sentence_detector.cpp
void SentenceDetector::add_token(const std::string& token) {
    buffer_ += token;
    if (is_sentence_boundary(buffer_)) {
        flush_sentence(buffer_);
        buffer_.clear();
    }
}
Benefits:
  • Natural prosody (TTS sees full sentences)
  • Double-buffered playback (next sentence synthesizes while current plays)
  • Lower latency than waiting for full LLM response

Threading Model

Three threads run concurrently in live mode:
1. STT Thread

  • Captures mic audio via CoreAudio
  • Runs Silero VAD to filter silence
  • Detects speech endpoints
  • Transcribes via Zipformer (streaming) or Whisper (batch)
  • Signals LLM thread when transcription is ready
2. LLM Thread

  • Waits for STT output (std::condition_variable)
  • Generates tokens via llama.cpp with Metal GPU
  • Parses tool calls and dispatches to ActionRegistry
  • Feeds sentences to TTS via SentenceDetector
  • Maintains conversation history with token-budget trimming
3. TTS Thread

  • Queues sentences from LLM
  • Synthesizes audio via sherpa-onnx (Piper/Kokoro)
  • Double-buffered playback (synthesizes next while playing current)
  • Outputs to CoreAudio speaker
Synchronization:
  • std::condition_variable for thread wakeup
  • std::atomic<PipelineState> for state transitions
  • Lock-free ring buffers for audio transfer
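The STT→LLM wakeup can be sketched with a condition variable as below; the slot type and function names are illustrative, not RCLI's actual code:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Minimal STT -> LLM handoff: the "STT thread" publishes a transcript,
// and the "LLM thread" sleeps on a condition variable until one arrives.
struct TranscriptSlot {
    std::mutex              mtx;
    std::condition_variable cv;
    std::string             transcript;
    bool                    ready = false;
};

void publish(TranscriptSlot& slot, std::string text) {  // STT side
    {
        std::lock_guard<std::mutex> lock(slot.mtx);
        slot.transcript = std::move(text);
        slot.ready = true;
    }
    slot.cv.notify_one();  // wake the waiting LLM thread
}

std::string await_transcript(TranscriptSlot& slot) {  // LLM side
    std::unique_lock<std::mutex> lock(slot.mtx);
    // The predicate guards against spurious wakeups.
    slot.cv.wait(lock, [&] { return slot.ready; });
    slot.ready = false;  // consume the transcript
    return slot.transcript;
}
```

The predicate form of wait() is what makes the handoff robust: the LLM thread only proceeds once a transcript is actually ready, regardless of stray wakeups.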

Dependencies

Vendored (deps/)

Cloned by scripts/setup.sh:
  • llama.cpp — LLM + embedding inference with Metal GPU
  • sherpa-onnx — STT/TTS/VAD via ONNX Runtime

Fetched by CMake

Automatic via FetchContent:
  • USearch v2.16.5 — HNSW vector index (header-only)
  • FTXUI v5.0.0 — Terminal UI library

macOS System Frameworks

  • CoreAudio, AudioToolbox, AudioUnit
  • Foundation, AVFoundation
  • IOKit (hardware monitoring)
  • Metal, MetalKit (GPU acceleration)

Build Outputs

build/
├── rcli                # Main CLI executable
├── rcli_test           # Test executable
├── librcli.a           # Static library (engine + actions)
└── lib/
    ├── libllama.dylib
    ├── libggml.dylib
    └── libsherpa-onnx-c-api.dylib

Configuration Files

Runtime configuration stored in ~/Library/RCLI/:
~/Library/RCLI/
├── models/             # Downloaded models
│   ├── llm/
│   ├── stt/
│   ├── tts/
│   ├── vad/
│   └── embeddings/
├── index/              # RAG indices
│   ├── vector.index
│   ├── bm25.index
│   └── metadata.json
└── config/
    ├── actions.json    # Enabled/disabled actions
    ├── active_models.json  # Active model selection
    └── settings.json   # User preferences

Next Steps

  • Building from Source — Build and install RCLI locally
  • Adding Actions — Extend RCLI with custom macOS actions
  • Contributing — Submit changes and improvements
