Overview
RCLI’s voice pipeline is a complete on-device AI system that processes voice input through Speech-to-Text, generates responses with a Large Language Model, and outputs natural speech via Text-to-Speech, all running locally on Apple Silicon with Metal GPU acceleration.
- STT: Zipformer streaming + Whisper/Parakeet offline
- LLM: Qwen3 / LFM2 with tool calling
- TTS: Piper / Kokoro with double-buffering
Pipeline Architecture
Three-Thread Model
The pipeline runs three dedicated threads synchronized via condition variables:

| Thread | Responsibility | Synchronization |
|---|---|---|
| STT | Captures mic audio, runs VAD, detects speech endpoints | Pushes text to LLM queue |
| LLM | Receives transcribed text, generates tokens, dispatches tool calls | Pushes sentences to TTS queue |
| TTS | Queues sentences from LLM, double-buffered playback | Signals completion to orchestrator |
From src/pipeline/orchestrator.cpp:18-89 - The orchestrator initializes all engines with a pre-allocated 64MB memory pool and lock-free ring buffers for zero-copy audio transfer.
Speech-to-Text (STT)
RCLI uses a dual-STT architecture for optimal latency and accuracy.
Streaming STT (Zipformer)
- Model: Zipformer transducer from k2-fsa/sherpa-onnx
- Size: 50 MB
- Latency: ~44ms (0.022x real-time factor)
- Use Case: Live transcription during rcli listen mode
- Architecture: Streaming transducer with online chunking
Offline STT (Whisper / Parakeet)
- Models: Whisper base.en / Parakeet TDT 0.6B
- Size: 140 MB
- WER: ~5% on LibriSpeech
- Languages: English only
- Use Case: Push-to-talk in TUI, batch processing
Voice Activity Detection (VAD)
- Model: Silero VAD v4 (0.6 MB)
- Window: 512 samples (32ms at 16kHz)
- Thresholds: Speech start 0.5, end 0.35
- Purpose: Filters silence, detects speech endpoints
Large Language Model (LLM)
RCLI supports 9 LLM models with hot-swappable runtime switching.
Default: LFM2 1.2B Tool
- Size: 731 MB (Q4_K_M quantization)
- Speed: ~180 tokens/sec on M3 Max
- Context: 128K tokens
- Features: Native tool calling with the <|tool_call_start|> format
Optimizations
System Prompt KV Caching
The system prompt (including tool definitions) is cached in the LLM’s KV cache on startup. Subsequent queries reuse this cached state, reducing time-to-first-token by 50-70%.
Flash Attention
Enabled by default in llama.cpp for 15-25% faster inference. Uses fused kernels for attention computation with O(N) memory instead of O(N²).
Metal GPU Offload
All 99 layers offloaded to Metal GPU by default. Achieves 3-5x speedup over CPU-only inference on Apple Silicon.
Token Budget Trimming
Conversation history is trimmed when context usage exceeds 75% of the context window. Oldest messages are evicted first, preserving system prompt and recent context.
Tool Calling
RCLI implements LLM-native tool calling with model-specific formats.
Two-Tier Detection
- Tier 1: Keyword pattern matching (e.g., “open”, “play”, “create”)
- Tier 2: LLM extracts structured tool call from generated text
From src/tools/tool_engine.cpp:85-150 - The tool engine maintains a filtered set of top-k relevant tools based on the user query to reduce context overhead for small LLMs.
Text-to-Speech (TTS)
RCLI uses sherpa-onnx for TTS with multiple voice options.
Default: Piper Lessac
- Size: 60 MB
- Quality: Good (single speaker)
- Latency: ~150ms on M3 Max
- Architecture: VITS (end-to-end, as used by Piper voices)
Sentence-Level Streaming
The LLM output is split into sentences on-the-fly and synthesized incrementally.
Double-Buffered Playback
While the current sentence plays, the next sentence is synthesized in parallel.
Memory Management
Pre-Allocated Pool
RCLI pre-allocates a 64MB memory pool at startup using mmap() with MAP_ANONYMOUS | MAP_PRIVATE:
- Zero malloc() calls during inference
- Reduced memory fragmentation
- Predictable memory layout for cache efficiency
Lock-Free Ring Buffers
Audio data flows through lock-free ring buffers between threads:
- Capture Buffer: Mic → STT (1.5 seconds, 24K samples)
- Playback Buffer: TTS → Speaker (5 seconds, 80K samples)
Performance Benchmarks
| Metric | Result |
|---|---|
| STT Latency | 43.7 ms avg (0.022x RTF) |
| LLM TTFT | 22.5 ms time-to-first-token |
| LLM Throughput | 159.6 tok/s on M3 Max |
| TTS Latency | 150.6 ms synthesis time |
| E2E Latency | 131 ms voice-in to audio-out |
| RAG Retrieval | 3.82 ms hybrid search |
Benchmarked on Apple M3 Max (14-core CPU, 30-core GPU, 36 GB RAM). Run rcli bench to measure performance on your system.
Pipeline States
The orchestrator maintains an atomic pipeline state.
Next Steps
- macOS Actions: Learn about the 43 macOS actions triggered by tool calling
- RAG System: Understand the hybrid retrieval system for document queries
- Architecture: Deep dive into the threading model and design patterns
- Performance: Optimization techniques and benchmarking