Pipeline Overview
RCLI implements a complete voice AI pipeline running on Apple Silicon with Metal GPU acceleration. The architecture prioritizes minimal latency through pre-allocated memory, zero-copy audio transfer, and intelligent caching.
Pipeline States
The orchestrator maintains an atomic state machine with five states, stored as a std::atomic<PipelineState>. See src/core/types.h:17-23.
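A minimal sketch of what such an atomic state machine looks like. The state names here are hypothetical (the real enum is in src/core/types.h:17-23); the point is the lock-free read/write pattern:

```cpp
#include <atomic>

// Hypothetical state names -- the real five-state enum lives in src/core/types.h.
enum class PipelineState { Idle, Listening, Transcribing, Generating, Speaking };

class StateMachine {
public:
    // Lock-free read: any thread (e.g. the TUI) can poll without blocking.
    PipelineState get() const { return state_.load(std::memory_order_acquire); }

    // Writers publish transitions with release semantics so readers observe
    // a consistent state.
    void set(PipelineState s) { state_.store(s, std::memory_order_release); }

private:
    std::atomic<PipelineState> state_{PipelineState::Idle};
};
```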
Threading Model
RCLI uses three dedicated threads in live mode, synchronized via condition variables.
STT Thread
Role: Audio capture, VAD filtering, speech detection
- Reads 100ms chunks (1600 samples @ 16kHz) from a lock-free ring buffer
- Computes RMS energy to filter noise (threshold: 0.005 RMS)
- Feeds speech segments to Zipformer streaming STT
- Emits final transcripts to the LLM thread via condition variable
- Thread name: rastack.stt
See src/pipeline/orchestrator.cpp:772-841 for implementation.
LLM Thread
Role: Token generation, tool calling, conversation history management
- Waits on text_cv_ for transcribed text from the STT thread
- Spawns an async TTS worker thread for parallel synthesis
- Speculative tool detection: buffers the first 15 tokens to detect tool calls before streaming
- Adaptive buffering extends the window if a partial tool tag is detected (e.g., <tool)
- Trims conversation history to fit the context window (token-budget trimming)
- Thread name: rastack.llm
See src/pipeline/orchestrator.cpp:843-1021.
TTS Thread
Role: Sentence-level synthesis, double-buffered playback
- Consumes sentences from a queue populated by SentenceDetector
- Synthesizes audio while the previous sentence plays (double-buffering)
- Writes directly to the playback ring buffer (zero-copy)
- Thread name: rastack.tts.live
SentenceDetector accumulates LLM tokens and flushes complete sentences based on:
- Primary breaks (., !, ?, \n) with a minimum of 3 words
- Secondary breaks (;, :) with a minimum of 25 words (prevents long waits)
- Word-level flush at 7 words if no punctuation appears (early TTS start)
See src/pipeline/sentence_detector.cpp:6-84.
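The flush rules above can be sketched roughly as follows. This is a simplified illustration, not the actual implementation in src/pipeline/sentence_detector.cpp; in particular the real detector has to cope with tokens that split words:

```cpp
#include <sstream>
#include <string>

// Simplified sketch of the sentence-flush heuristic. Thresholds mirror the
// rules listed above.
class SentenceDetector {
public:
    // Feed one token; returns a complete sentence when a flush triggers,
    // otherwise an empty string.
    std::string feed(const std::string& token) {
        buffer_ += token;
        int words = count_words(buffer_);
        char last = buffer_.empty() ? '\0' : buffer_.back();
        bool primary   = (last == '.' || last == '!' || last == '?' || last == '\n');
        bool secondary = (last == ';' || last == ':');
        if ((primary && words >= 3) ||
            (secondary && words >= 25) ||
            (words >= 7))              // early flush so TTS can start
            return flush();
        return {};
    }

private:
    std::string flush() { std::string out; out.swap(buffer_); return out; }

    static int count_words(const std::string& s) {
        std::istringstream iss(s);
        std::string w; int n = 0;
        while (iss >> w) ++n;
        return n;
    }

    std::string buffer_;
};
```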
Synchronization Primitives
Lock-Free Ring Buffers
Single-Producer Single-Consumer (SPSC) ring buffers with atomic head/tail pointers:
- Capture buffer: Mic → STT (16384 samples, ~1 sec @ 16kHz)
- Playback buffer: TTS → Speaker (44032 samples, ~2 sec @ 22kHz)
- Zero-copy: direct memcpy from producer to consumer
- Power-of-2 sizing enables fast modulo via bitwise AND
- Cache-line aligned to prevent false sharing
See src/core/ring_buffer.h:23-146.
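A minimal SPSC ring buffer illustrating the points above (power-of-2 masking, acquire/release atomics, cache-line-aligned indices). The real buffer in src/core/ring_buffer.h uses bulk memcpy rather than the per-sample copy shown here:

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer sketch. Capacity must be a power of two so that
// `index & (N - 1)` replaces the modulo operation.
template <size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
public:
    // Producer side: returns the number of samples actually written.
    size_t write(const float* src, size_t n) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        size_t free_slots = N - (head - tail);
        if (n > free_slots) n = free_slots;
        for (size_t i = 0; i < n; ++i)
            buf_[(head + i) & (N - 1)] = src[i];
        head_.store(head + n, std::memory_order_release);
        return n;
    }

    // Consumer side: returns the number of samples actually read.
    size_t read(float* dst, size_t n) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        size_t avail = head - tail;
        if (n > avail) n = avail;
        for (size_t i = 0; i < n; ++i)
            dst[i] = buf_[(tail + i) & (N - 1)];
        tail_.store(tail + n, std::memory_order_release);
        return n;
    }

private:
    float buf_[N];
    // Separate cache lines for producer and consumer indices prevent
    // false sharing between the two threads.
    alignas(64) std::atomic<size_t> head_{0};
    alignas(64) std::atomic<size_t> tail_{0};
};
```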
Condition Variables
Used for STT → LLM communication: the STT thread signals when a final transcript is ready, and the LLM thread blocks on the condition variable until then.
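A sketch of that handoff. The member name text_cv_ mirrors the source; the surrounding channel struct is illustrative:

```cpp
#include <condition_variable>
#include <mutex>
#include <string>

// Illustrative STT -> LLM transcript channel.
struct TranscriptChannel {
    std::mutex mtx_;
    std::condition_variable text_cv_;
    std::string transcript_;
    bool ready_ = false;

    // STT thread: publish a final transcript and wake the LLM thread.
    void signal(std::string text) {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            transcript_ = std::move(text);
            ready_ = true;
        }
        text_cv_.notify_one();
    }

    // LLM thread: block until a transcript arrives, then consume it.
    std::string wait() {
        std::unique_lock<std::mutex> lk(mtx_);
        text_cv_.wait(lk, [this] { return ready_; });
        ready_ = false;
        return std::move(transcript_);
    }
};
```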
Atomic State
Pipeline state is atomic for lock-free reads, allowing the TUI to poll state without blocking the worker threads.
Memory Management
Pre-Allocated Memory Pool
RCLI allocates a fixed-size memory pool at startup (64-256 MB depending on available RAM). Zero runtime malloc during inference.
See src/core/memory_pool.h:26-179.
Allocation Strategy
- Huge pages (2MB superpages) for pools ≥4MB
  - macOS: vm_allocate() with VM_FLAGS_SUPERPAGE_SIZE_2MB
  - Linux: mmap() + madvise(MADV_HUGEPAGE)
  - Reduces TLB misses by 10-15%
- Zero-fill at init to pre-fault pages (avoids runtime page faults)
- Cache-line alignment (64 bytes) prevents false sharing
- Scratch regions for temporary allocations
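The allocation strategy above can be sketched as follows. This is a hedged outline, not the code in src/core/memory_pool.h; the superpage request falls back to normal pages on failure:

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#ifdef __APPLE__
#include <mach/mach.h>
#include <mach/vm_statistics.h>
#endif

// Sketch of the pool allocation strategy: huge pages where available,
// then zero-fill to pre-fault every page up front.
inline void* alloc_pool(size_t bytes) {
#ifdef __APPLE__
    vm_address_t addr = 0;
    // Request 2MB superpages; fall back to normal pages on failure.
    if (vm_allocate(mach_task_self(), &addr, bytes,
                    VM_FLAGS_ANYWHERE | VM_FLAGS_SUPERPAGE_SIZE_2MB) != KERN_SUCCESS &&
        vm_allocate(mach_task_self(), &addr, bytes,
                    VM_FLAGS_ANYWHERE) != KERN_SUCCESS)
        return nullptr;
    void* p = reinterpret_cast<void*>(addr);
#else
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);  // ask Linux for transparent huge pages
#endif
    std::memset(p, 0, bytes);  // zero-fill now so no page faults hit inference
    return p;
}
```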
Pool Sizing by Hardware
| RAM Tier | Pool Size | LLM Batch | Use Case |
|---|---|---|---|
| 64+ GB | 256 MB | 4096 | Mac Studio / Pro |
| 32-48 GB | 128 MB | 2048 | M3 Max |
| 16-24 GB | 64 MB | 1024 | M3 / M2 / M1 |
| <16 GB | 64 MB | 512 | iOS / Android |
See src/core/hardware_profile.h:131-146 for RAM-based pool sizing.
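The sizing table transcribes directly into a small lookup function. This helper is illustrative, not the actual code at src/core/hardware_profile.h:131-146:

```cpp
#include <cstddef>

// Direct transcription of the pool-sizing table above.
struct PoolConfig { size_t pool_mb; int llm_batch; };

inline PoolConfig pool_for_ram(size_t ram_gb) {
    if (ram_gb >= 64) return {256, 4096};  // Mac Studio / Pro
    if (ram_gb >= 32) return {128, 2048};  // M3 Max
    if (ram_gb >= 16) return {64, 1024};   // M3 / M2 / M1
    return {64, 512};                      // iOS / Android
}
```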
Design Patterns
Orchestrator Pattern
Central class owns all engines and coordinates data flow, providing a single point of initialization and state management.
See src/pipeline/orchestrator.cpp:12-88 for initialization.
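In outline, the pattern looks like this. The engine types here are stand-ins, not the real classes:

```cpp
#include <memory>

// Illustrative skeleton of the orchestrator pattern: one class owns every
// engine, so there is exactly one place where startup and teardown happen.
struct SttEngine { /* ... */ };
struct LlmEngine { /* ... */ };
struct TtsEngine { /* ... */ };

class Orchestrator {
public:
    // Single point of initialization: engines come up in a fixed order.
    bool init() {
        stt_ = std::make_unique<SttEngine>();
        llm_ = std::make_unique<LlmEngine>();
        tts_ = std::make_unique<TtsEngine>();
        return stt_ && llm_ && tts_;
    }

private:
    std::unique_ptr<SttEngine> stt_;
    std::unique_ptr<LlmEngine> llm_;
    std::unique_ptr<TtsEngine> tts_;
};
```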
System Prompt KV Caching
The llama.cpp KV cache is reused across queries. The system prompt (including tool definitions) is cached once at init.
See src/pipeline/orchestrator.cpp:75-79 and src/pipeline/orchestrator.cpp:232-239.
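Conceptually, prefix caching means: evaluate the system prompt once, remember how many cache positions it occupies, and rewind to that point before each new query. The sketch below is deliberately independent of the llama.cpp API (whose calls would replace the comments):

```cpp
#include <cstdint>
#include <vector>

// Conceptual prefix-cache bookkeeping; a real implementation would call into
// llama.cpp where the comments indicate.
struct KvCache {
    int32_t n_past = 0;                  // positions currently held
    void truncate(int32_t n) { n_past = n; }
};

struct CachedSession {
    KvCache cache;
    int32_t prompt_len = 0;              // positions covered by the system prompt

    void cache_system_prompt(const std::vector<int32_t>& tokens) {
        // (decode the system prompt here, once, at init)
        cache.n_past = static_cast<int32_t>(tokens.size());
        prompt_len = cache.n_past;
    }

    void begin_query() {
        // Drop everything after the system prompt; its KV entries survive,
        // so the prompt is never re-evaluated.
        cache.truncate(prompt_len);
    }
};
```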
Double-Buffered TTS
SentenceDetector queues complete sentences. The TTS worker synthesizes the next sentence while the current one plays.
See src/pipeline/orchestrator.cpp:177-196 for implementation.
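The buffer-swap at the heart of double-buffering can be reduced to a few lines. A toy illustration, not the orchestrator's code:

```cpp
#include <vector>

// Toy double buffer: while front() plays, the TTS worker writes the next
// sentence into back(); swap() flips the roles when playback finishes.
class DoubleBuffer {
public:
    std::vector<float>& back()  { return bufs_[1 - front_]; }  // synthesis target
    std::vector<float>& front() { return bufs_[front_]; }      // playing now
    void swap() { front_ = 1 - front_; }                       // sentence done

private:
    std::vector<float> bufs_[2];
    int front_ = 0;
};
```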
Hardware Profiling at Startup
Detects CPU topology (P/E cores), RAM, and the Metal GPU at runtime, then configures optimal llama.cpp parameters.
See src/core/hardware_profile.h:79-309 for detection logic.
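On Apple Silicon, the P-core count is exposed via sysctl. A hedged sketch of the detection step (the sysctl key is macOS-specific; the portable fallback is the total thread count):

```cpp
#include <thread>
#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

// Sketch of startup CPU detection: P-core count on macOS, total
// hardware concurrency elsewhere.
inline unsigned performance_cores() {
#ifdef __APPLE__
    int ncpu = 0;
    size_t len = sizeof(ncpu);
    // perflevel0 is the performance-core cluster on Apple Silicon.
    if (sysctlbyname("hw.perflevel0.physicalcpu", &ncpu, &len, nullptr, 0) == 0)
        return static_cast<unsigned>(ncpu);
#endif
    return std::thread::hardware_concurrency();
}
```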
Hot-Swappable Components
LLM Model Swap
Switch the LLM at runtime without restarting the pipeline.
See src/pipeline/orchestrator.cpp:1030-1051.
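One common way to make such a swap safe, shown here as an assumption rather than the project's actual mechanism: load the new model outside the lock, then swap the pointer under a mutex so token generation is blocked only for the exchange:

```cpp
#include <memory>
#include <mutex>
#include <string>

// Illustrative hot-swap slot; Llm is a stand-in for the real engine type.
struct Llm { std::string path; };

class LlmSlot {
public:
    bool swap_model(const std::string& path) {
        // Load outside the lock so generation pauses only for the pointer swap.
        auto fresh = std::make_unique<Llm>(Llm{path});
        std::lock_guard<std::mutex> lk(mtx_);
        llm_ = std::move(fresh);
        return true;
    }

    std::string current_path() {
        std::lock_guard<std::mutex> lk(mtx_);
        return llm_ ? llm_->path : "";
    }

private:
    std::mutex mtx_;
    std::unique_ptr<Llm> llm_;
};
```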
Tool Calling Architecture
Hybrid Two-Tier System
Tier 1: Keyword match (fast path, <1ms)
- Matches the user query against action keywords
- Scores by relevance, returns top-k actions
- Filters the tool definitions sent to the LLM (reduces context)
Tier 2: LLM tool call, parsed in model-specific formats
- Qwen3: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
- LFM2: <|tool_call_start|>{...}<|tool_call_end|>
- Model-specific parsing via ModelProfile
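A hypothetical sketch of the Tier-1 matcher: score each action by how many of its keywords appear in the query, then keep the top-k. Names and scoring here are assumptions, not the project's implementation:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical Tier-1 keyword matcher.
struct Action { std::string name; std::vector<std::string> keywords; };

inline std::vector<std::string> top_k_actions(
        const std::string& query, const std::vector<Action>& actions, size_t k) {
    std::vector<std::pair<int, std::string>> scored;
    for (const auto& a : actions) {
        int score = 0;
        for (const auto& kw : a.keywords)
            if (query.find(kw) != std::string::npos) ++score;   // substring hit
        if (score > 0) scored.emplace_back(score, a.name);
    }
    // Highest-scoring actions first.
    std::sort(scored.begin(), scored.end(),
              [](const auto& l, const auto& r) { return l.first > r.first; });
    std::vector<std::string> out;
    for (size_t i = 0; i < scored.size() && i < k; ++i)
        out.push_back(scored[i].second);
    return out;
}
```

Only the winning actions' tool definitions are then included in the LLM context, which is what keeps the prompt small.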
Speculative Tool Detection
Buffers the first 15 tokens to detect tool calls before streaming to TTS.
See src/pipeline/orchestrator.cpp:215-311.
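A simplified sketch of the idea, including the adaptive-window extension for a partial tag like <tool. The tag string and helper are illustrative; the real logic also handles model-specific tags via ModelProfile:

```cpp
#include <string>

// Sketch of speculative tool detection: hold back the first 15 tokens; if a
// tool tag (or a partial one like "<tool") appears, keep buffering instead of
// streaming to TTS.
class SpeculativeBuffer {
public:
    // Feed one token. Returns text safe to hand to TTS; empty while undecided.
    std::string feed(const std::string& token) {
        if (decided_) return token;           // already streaming normally
        buffer_ += token;
        if (++count_ < kWindow || ends_with_partial_tag(buffer_))
            return {};                        // still deciding (window extended)
        if (buffer_.find("<tool") != std::string::npos)
            return {};                        // tool call detected: never speak it
        decided_ = true;                      // plain text: flush and stream
        std::string out;
        out.swap(buffer_);
        return out;
    }

private:
    // True if the buffer ends with any prefix of "<tool" (e.g. "<", "<to").
    static bool ends_with_partial_tag(const std::string& s) {
        static const std::string tag = "<tool";
        for (size_t n = tag.size(); n > 0; --n)
            if (s.size() >= n && s.compare(s.size() - n, n, tag, 0, n) == 0)
                return true;
        return false;
    }

    static constexpr int kWindow = 15;
    std::string buffer_;
    int count_ = 0;
    bool decided_ = false;
};
```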
Project Structure
Next Steps
- Performance: benchmark results and optimization techniques
- Configuration: config files, environment variables, tuning