## High-Level Architecture
- Engines — ML inference wrappers (STT, LLM, TTS, VAD, embeddings)
- Pipeline — Orchestrator coordinates data flow between engines
- RAG — Hybrid retrieval (vector + BM25) over local documents
- Actions — 43 macOS integrations via AppleScript and shell
- CLI — Interactive TUI and command-line interface
## Directory Structure

### src/ Modules
#### engines/

ML inference wrappers for each modality:

| File | Purpose |
|---|---|
| `stt_engine.cpp/.h` | Speech-to-text via sherpa-onnx (Zipformer, Whisper, Parakeet) |
| `llm_engine.cpp/.h` | LLM inference via llama.cpp with Metal GPU |
| `tts_engine.cpp/.h` | Text-to-speech via sherpa-onnx (Piper, Kokoro, KittenTTS) |
| `vad_engine.cpp/.h` | Voice activity detection (Silero VAD) |
| `embedding_engine.cpp/.h` | Text embeddings for RAG (Snowflake Arctic Embed) |
| `model_profile.cpp/.h` | Model metadata, chat templates, tool call parsing |
- Each engine wraps a C API (`llama.cpp`, `sherpa-onnx`)
- Engines are initialized once and reused across queries
- Metal GPU acceleration for LLM and embeddings
- ONNX Runtime for STT/TTS/VAD
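The wrap-once-reuse pattern can be sketched as a small RAII class. The `fake_engine_*` functions below are stand-ins for the real llama.cpp / sherpa-onnx C calls, not the project's actual API:

```cpp
#include <memory>
#include <string>

// Stand-in C API in the style of llama.cpp / sherpa-onnx (hypothetical names).
struct fake_engine_ctx { int loaded = 1; };
static fake_engine_ctx* fake_engine_init(const char* /*model_path*/) { return new fake_engine_ctx(); }
static void fake_engine_free(fake_engine_ctx* c) { delete c; }
static int fake_engine_run(fake_engine_ctx* c) { return c->loaded; }

// RAII wrapper: the context is created once and reused across queries;
// the custom deleter guarantees the C-side free is always called.
class Engine {
public:
    explicit Engine(const std::string& model_path)
        : ctx_(fake_engine_init(model_path.c_str()), &fake_engine_free) {}
    int run() { return fake_engine_run(ctx_.get()); }
private:
    std::unique_ptr<fake_engine_ctx, void (*)(fake_engine_ctx*)> ctx_;
};
```

The point of the deleter-carrying `unique_ptr` is that the expensive model load happens once at init, and teardown is exception-safe without a hand-written destructor.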
#### pipeline/

Orchestrates data flow between engines:

| File | Purpose |
|---|---|
| `orchestrator.cpp/.h` | Central class that owns all engines and coordinates the pipeline |
| `sentence_detector.cpp/.h` | Accumulates LLM tokens and flushes complete sentences to TTS |
| `text_sanitizer.h` | Removes non-speech text (markdown, XML tags) before TTS |
- Manages pipeline state (IDLE → LISTENING → PROCESSING → SPEAKING)
- Runs the STT/LLM/TTS threads
- Dispatches tool calls to `ActionRegistry`
- Maintains conversation history with token-budget trimming
- System prompt KV caching for fast responses
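The IDLE → LISTENING → PROCESSING → SPEAKING cycle can be sketched as a tiny state machine. This is an illustrative sketch only; the real `Orchestrator` also handles interruptions and error paths:

```cpp
enum class PipelineState { IDLE, LISTENING, PROCESSING, SPEAKING };

// Advance one step through the voice-loop cycle; SPEAKING wraps back
// to IDLE once playback finishes.
PipelineState next_state(PipelineState s) {
    switch (s) {
        case PipelineState::IDLE:       return PipelineState::LISTENING;
        case PipelineState::LISTENING:  return PipelineState::PROCESSING;
        case PipelineState::PROCESSING: return PipelineState::SPEAKING;
        case PipelineState::SPEAKING:   return PipelineState::IDLE;
    }
    return PipelineState::IDLE;  // unreachable; silences compiler warnings
}
```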
#### rag/

Hybrid retrieval system for local documents:

| File | Purpose |
|---|---|
| `vector_index.cpp/.h` | HNSW vector search via USearch |
| `bm25_index.cpp/.h` | Full-text search with BM25 ranking |
| `hybrid_retriever.cpp/.h` | Combines vector + BM25 via Reciprocal Rank Fusion |
| `document_processor.cpp/.h` | Chunks documents (PDF, DOCX, TXT) into 512-token segments |
| `index_builder.cpp/.h` | Builds and persists indices |
- Query is embedded via `embedding_engine`
- Vector search (HNSW) finds nearest chunks
- BM25 search finds keyword-matching chunks
- Results fused via RRF (Reciprocal Rank Fusion)
- Top-k chunks injected into LLM context
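The fusion step can be sketched as follows. This is a generic Reciprocal Rank Fusion implementation, not the project's `hybrid_retriever` code; the constant `k = 60` is the value from the original RRF paper:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
// per document id, so items ranked highly by *both* retrievers win.
std::vector<std::string> rrf_fuse(
        const std::vector<std::vector<std::string>>& ranked_lists,
        double k = 60.0) {
    std::map<std::string, double> score;
    for (const auto& list : ranked_lists)
        for (size_t rank = 0; rank < list.size(); ++rank)
            score[list[rank]] += 1.0 / (k + rank + 1);  // rank is 1-based in the formula

    std::vector<std::string> fused;
    for (const auto& kv : score) fused.push_back(kv.first);
    std::sort(fused.begin(), fused.end(),
              [&](const std::string& a, const std::string& b) {
                  return score[a] > score[b];
              });
    return fused;  // take the top-k entries for LLM context injection
}
```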
#### core/

Core types and utilities:

| File | Purpose |
|---|---|
| `types.h` | Shared types (ToolCall, ToolResult, PipelineState, etc.) |
| `ring_buffer.h` | Lock-free ring buffer for zero-copy audio transfer |
| `memory_pool.h` | Pre-allocated 64 MB arena (no runtime malloc) |
| `hardware_profile.h` | Detects P-cores, E-cores, Metal GPU, RAM |
| `log.h` | Logging macros (LOG_INFO, LOG_ERROR) |
| `base64.h` | Base64 encoding/decoding |
| `string_utils.h` | String manipulation utilities |
| `file_utils.h` | File I/O helpers |
- Lock-free ring buffer — zero-copy audio passing between threads
- Pre-allocated memory pool — 64 MB arena allocated at init
- Hardware profiling — adapts thread count and GPU layers to hardware
#### audio/

CoreAudio microphone and speaker I/O:

| File | Purpose |
|---|---|
| `audio_io.cpp/.h` | CoreAudio input/output streams |
| `mic_permission.h/.mm` | Microphone permission request (Objective-C) |
- 16 kHz mono capture for STT
- 24 kHz mono playback for TTS
- Buffer size: 512 samples (32ms at 16 kHz)
- Minimal latency configuration
#### tools/

Tool calling engine:

| File | Purpose |
|---|---|
| `tool_engine.cpp/.h` | Parses LLM tool calls and dispatches to ActionRegistry |
- The LLM generates a tool call in model-native format (e.g., Qwen3’s `<tool_call>`)
- `ToolEngine` parses it via `ModelProfile::parse_tool_calls()`
- Dispatches to `ActionRegistry::execute()`
- Returns the result to the LLM

Supported formats:

- Qwen3: `<tool_call>{...}</tool_call>`
- LFM2: `<|tool_call_start|>{...}<|tool_call_end|>`
- Generic JSON: `{"name": "...", "arguments": {...}}`
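Extracting the payload between model-specific delimiters can be sketched as below. This is an illustrative helper, not the project's `ModelProfile::parse_tool_calls()`, which also validates the JSON before dispatch:

```cpp
#include <string>
#include <vector>

// Scan `text` for every substring wrapped in `open` ... `close`
// (e.g. "<tool_call>" / "</tool_call>" for Qwen3-style output) and
// return the raw JSON payloads in order of appearance.
std::vector<std::string> extract_tool_calls(const std::string& text,
                                            const std::string& open,
                                            const std::string& close) {
    std::vector<std::string> calls;
    size_t pos = 0;
    while ((pos = text.find(open, pos)) != std::string::npos) {
        size_t start = pos + open.size();
        size_t end = text.find(close, start);
        if (end == std::string::npos) break;  // unterminated call: stop parsing
        calls.push_back(text.substr(start, end - start));
        pos = end + close.size();
    }
    return calls;
}
```

Swapping the delimiter pair is all that is needed to support the LFM2 format from the list above.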
#### bench/

Benchmark harness:

| File | Purpose |
|---|---|
| `benchmark.cpp/.h` | Runs STT, LLM, TTS, E2E, RAG, tools, and memory benchmarks |

Benchmark suites:

- `stt` — Transcription latency and accuracy
- `llm` — Time to first token, throughput (tok/s)
- `tts` — Synthesis latency
- `e2e` — Voice-in to audio-out latency
- `rag` — Retrieval latency (vector + BM25)
- `tools` — Tool calling accuracy and latency
- `memory` — Peak memory usage
- `all` — All suites
#### actions/

macOS action implementations:

| File | Purpose |
|---|---|
| `action_registry.cpp/.h` | Registers actions and dispatches execution |
| `action_helpers.h` | JSON parsing, string escaping utilities |
| `applescript_executor.cpp/.h` | Executes AppleScript and shell commands |
| `register_all.cpp` | Calls all registration functions |
| **Category files:** | |
| `notes_actions.cpp/.h` | Apple Notes integration |
| `reminders_actions.cpp/.h` | Reminders integration |
| `messages_actions.cpp/.h` | Messages/iMessage |
| `app_control_actions.cpp/.h` | Open/quit apps |
| `window_actions.cpp/.h` | Window management |
| `system_actions.cpp/.h` | System settings (volume, dark mode, lock) |
| `media_actions.cpp/.h` | Spotify/Apple Music |
| `web_actions.cpp/.h` | Web search |
| `browser_actions.cpp/.h` | Safari/Chrome control |
| `clipboard_actions.cpp/.h` | Clipboard read/write |
| `files_actions.cpp/.h` | File search |
| `navigation_actions.cpp/.h` | Maps integration |
| `communication_actions.cpp/.h` | FaceTime |
#### api/

Public C API:

| File | Purpose |
|---|---|
| `rcli_api.h` | Public C API header (all engine functionality) |
| `rcli_api.cpp` | API implementation |

- `rcli_init()` — Initialize pipeline
- `rcli_query()` — One-shot text query
- `rcli_start_listen()` — Start continuous voice mode
- `rcli_stop_listen()` — Stop listening
- `rcli_cleanup()` — Shutdown pipeline
#### cli/

CLI and TUI:

| File | Purpose |
|---|---|
| `main.cpp` | Entry point, argument parsing, command dispatch |
| `tui_dashboard.h` | Interactive TUI dashboard (FTXUI) |
| `tui_app.h` | TUI event loop |
| `actions_cli.h` | Actions panel (browse, enable/disable, execute) |
| `model_pickers.h` | Model management (LLM, STT, TTS) |
| `help.h` | CLI help text |
| `setup_cmds.h` | `rcli setup` and `rcli cleanup` commands |
| `visualizer.h` | Waveform visualizer |
| `cli_common.h` | Shared CLI utilities |
- Push-to-talk (SPACE bar)
- Models panel (M) — browse, download, hot-swap
- Actions panel (A) — enable/disable actions
- Benchmarks panel (B) — run performance tests
- RAG panel (R) — ingest documents
- Cleanup panel (D) — remove unused models
- Tool call trace (T) — debug LLM tool calls
#### models/

Model registries:

| File | Purpose |
|---|---|
| `model_registry.h` | LLM model definitions (id, URL, size, speed, tool calling) |
| `tts_model_registry.h` | TTS voice definitions |
| `stt_model_registry.h` | STT model definitions |
- Download URL (Hugging Face)
- Size (MB)
- Speed estimate (tokens/sec)
- Tool calling capability
- Default/recommended flags
#### test/

Test harness:

| File | Purpose |
|---|---|
| `test_pipeline.cpp` | Pipeline integration tests |

Test flags:

- `--actions-only` — Fast, no models needed
- `--llm-only` — LLM inference tests
- `--stt-only` — STT transcription tests
- `--tts-only` — TTS synthesis tests
- `--api-only` — C API tests
## Key Design Patterns

### Orchestrator Pattern

The `Orchestrator` class (`src/pipeline/orchestrator.h`) owns all engines and coordinates data flow:
- Single source of truth for pipeline state
- Simplified thread coordination
- Easy to add new engines
### Lock-Free Ring Buffer

Zero-copy audio transfer between threads (`src/core/ring_buffer.h`):
- No mutex contention
- Zero-copy (pointers only)
- Fixed allocation (no runtime malloc)
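The core idea can be sketched as a minimal single-producer/single-consumer ring. This is not the project's `ring_buffer.h`, just an illustration of why no mutex is needed when exactly one thread pushes and one pops:

```cpp
#include <atomic>
#include <cstddef>

// SPSC lock-free ring: the producer only writes head_, the consumer only
// writes tail_, and acquire/release ordering makes the element visible
// before the index update. Capacity must be a power of two so the
// wrap-around is a single AND with the mask.
template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;           // full
        buf_[head & (N - 1)] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;               // empty
        out = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};
```

In the audio path the element type would be a pointer or small frame descriptor, which is what keeps the transfer zero-copy.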
### Pre-Allocated Memory Pool

A 64 MB arena allocated at init (`src/core/memory_pool.h`):
- No runtime malloc during inference
- Predictable latency
- Cache-friendly (contiguous memory)
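A bump-pointer arena is the simplest form of this pattern. The sketch below illustrates the idea only and is not the project's `memory_pool.h`:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump allocator over one up-front buffer: each alloc is just an aligned
// pointer increment, and everything is released at once via reset() —
// no malloc/free calls during inference.
class Arena {
public:
    explicit Arena(size_t bytes) : buf_(bytes), off_(0) {}

    // `align` must be a power of two.
    void* alloc(size_t bytes, size_t align = alignof(std::max_align_t)) {
        size_t p = (off_ + align - 1) & ~(align - 1);  // round offset up
        if (p + bytes > buf_.size()) return nullptr;    // arena exhausted
        off_ = p + bytes;
        return buf_.data() + p;
    }

    void reset() { off_ = 0; }        // frees every allocation at once
    size_t used() const { return off_; }

private:
    std::vector<uint8_t> buf_;
    size_t off_;
};
```

Because allocation never touches the system heap, latency stays predictable even under pressure, which is the property the pipeline relies on.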
### System Prompt KV Caching

Reuses the llama.cpp KV cache across queries (`src/engines/llm_engine.cpp`):
- Avoids reprocessing system prompt (saves ~20-30ms)
- Lower latency for multi-turn conversations
### Sentence-Level TTS Scheduling

TTS synthesizes complete sentences, not token-by-token (`src/pipeline/sentence_detector.cpp`):
- Natural prosody (TTS sees full sentences)
- Double-buffered playback (next sentence synthesizes while current plays)
- Lower latency than waiting for full LLM response
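The accumulate-and-flush idea can be sketched as below. This is a simplified illustration, not the project's `SentenceDetector`, which also has to handle abbreviations, decimals, and minimum-length rules:

```cpp
#include <string>
#include <vector>

// Buffer streamed LLM tokens; whenever a sentence terminator appears,
// emit the completed sentence(s) so TTS can start synthesizing while
// the LLM keeps generating.
class SentenceBuffer {
public:
    std::vector<std::string> feed(const std::string& token) {
        buf_ += token;
        std::vector<std::string> out;
        size_t pos;
        while ((pos = buf_.find_first_of(".!?")) != std::string::npos) {
            out.push_back(buf_.substr(0, pos + 1));  // include the terminator
            buf_.erase(0, pos + 1);                  // keep the partial tail
        }
        return out;
    }
private:
    std::string buf_;
};
```

Flushing on sentence boundaries is what gives the TTS engine full-sentence context for prosody while still overlapping synthesis with generation.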
## Threading Model

Three threads run concurrently in live mode.

### STT Thread
- Captures mic audio via CoreAudio
- Runs Silero VAD to filter silence
- Detects speech endpoints
- Transcribes via Zipformer (streaming) or Whisper (batch)
- Signals LLM thread when transcription is ready
### LLM Thread

- Waits for STT output (`std::condition_variable`)
- Generates tokens via llama.cpp with Metal GPU
- Parses tool calls and dispatches to `ActionRegistry`
- Feeds sentences to TTS via `SentenceDetector`
- Maintains conversation history with token-budget trimming

### TTS Thread

- Waits for complete sentences from `SentenceDetector`
- Synthesizes speech via sherpa-onnx
- Plays 24 kHz audio via CoreAudio (double-buffered)
Synchronization:

- `std::condition_variable` for thread wakeup
- `std::atomic<PipelineState>` for state transitions
- Lock-free ring buffers for audio transfer
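The STT → LLM hand-off can be sketched with a condition variable. The names and the transcript string below are illustrative, not the project's actual code:

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Producer/consumer hand-off in the style of the STT -> LLM wakeup:
// the "STT thread" publishes a transcript under the mutex and notifies;
// the caller (the "LLM thread") sleeps on the condition variable until
// the predicate holds, which is safe even if notify fires first.
std::string run_handoff() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    std::string transcript;

    std::thread stt([&] {
        std::lock_guard<std::mutex> lk(m);
        transcript = "turn on dark mode";   // hypothetical transcription
        ready = true;
        cv.notify_one();                    // wake the waiting thread
    });

    std::string got;
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ready; }); // blocks; no busy-waiting
        got = transcript;
    }
    stt.join();
    return got;
}
```

Using a predicate with `cv.wait` guards against both spurious wakeups and the notify-before-wait race.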
## Dependencies

### Vendored (deps/)

Cloned by `scripts/setup.sh`:
- llama.cpp — LLM + embedding inference with Metal GPU
- sherpa-onnx — STT/TTS/VAD via ONNX Runtime
### Fetched by CMake

Automatic via `FetchContent`:
- USearch v2.16.5 — HNSW vector index (header-only)
- FTXUI v5.0.0 — Terminal UI library
### macOS System Frameworks
- CoreAudio, AudioToolbox, AudioUnit
- Foundation, AVFoundation
- IOKit (hardware monitoring)
- Metal, MetalKit (GPU acceleration)
## Build Outputs

## Configuration Files

Runtime configuration is stored in `~/Library/RCLI/`.
## Next Steps
- **Building from Source** — Build and install RCLI locally
- **Adding Actions** — Extend RCLI with custom macOS actions
- **Contributing** — Submit changes and improvements