
Overview

RCLI’s voice pipeline is a complete on-device AI system that processes voice input through Speech-to-Text, generates responses with a Large Language Model, and outputs natural speech via Text-to-Speech—all running locally on Apple Silicon with Metal GPU acceleration.

  • STT: Zipformer streaming + Whisper/Parakeet offline
  • LLM: Qwen3 / LFM2 with tool calling
  • TTS: Piper / Kokoro with double-buffering

Pipeline Architecture

Mic → VAD → STT → [RAG] → LLM → TTS → Speaker
                            |
                     Tool Calling → macOS Actions

Three-Thread Model

The pipeline runs three dedicated threads synchronized via condition variables:
Thread   Responsibility                                                        Synchronization
STT      Captures mic audio, runs VAD, detects speech endpoints                Pushes text to LLM queue
LLM      Receives transcribed text, generates tokens, dispatches tool calls    Pushes sentences to TTS queue
TTS      Queues sentences from LLM, double-buffered playback                   Signals completion to orchestrator
From src/pipeline/orchestrator.cpp:18-89 - The orchestrator initializes all engines with a pre-allocated 64MB memory pool and lock-free ring buffers for zero-copy audio transfer.
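The condition-variable hand-off between threads can be sketched as a small blocking queue. This is a minimal illustration of the pattern, not the actual orchestrator code; the class and member names are hypothetical:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Minimal blocking queue, as used conceptually between the STT, LLM,
// and TTS threads. Names are illustrative, not from the RCLI source.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(item));
        }
        cv_.notify_one();  // wake the consumer thread
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};
```

The STT thread would `push` transcripts into the LLM thread's queue, which blocks in `pop` until work arrives; the same shape repeats between LLM and TTS.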

Speech-to-Text (STT)

RCLI uses a dual-STT architecture to balance latency and accuracy:

Streaming STT (Zipformer)

  • Model: Zipformer transducer from k2-fsa/sherpa-onnx
  • Size: 50 MB
  • Latency: ~44ms (0.022x real-time factor)
  • Use Case: Live transcription during rcli listen mode
  • Architecture: Streaming transducer with online chunking

Offline STT (Whisper / Parakeet)

  • Size: 140 MB
  • WER: ~5% on LibriSpeech
  • Languages: English only
  • Use Case: Push-to-talk in TUI, batch processing

Voice Activity Detection (VAD)

  • Model: Silero VAD v4 (0.6 MB)
  • Window: 512 samples (32ms at 16kHz)
  • Thresholds: Speech start 0.5, end 0.35
  • Purpose: Filters silence, detects speech endpoints
// From src/engines/vad_engine.cpp
// Silero VAD with hysteresis: speech starts above 0.5 and only
// ends once the probability drops below 0.35.
float prob = vad_->process(audio_chunk, 512);  // 32 ms window at 16 kHz
if (!speech_detected && prob > 0.5f) {
    speech_detected = true;                    // speech onset
} else if (speech_detected && prob < 0.35f) {
    speech_detected = false;                   // speech endpoint
}

Large Language Model (LLM)

RCLI supports 9 LLM models that can be hot-swapped at runtime:

Default: LFM2 1.2B Tool

  • Size: 731 MB (Q4_K_M quantization)
  • Speed: ~180 tokens/sec on M3 Max
  • Context: 128K tokens
  • Features: Native tool calling with <|tool_call_start|> format

Optimizations

System Prompt Caching

The system prompt (including tool definitions) is cached in the LLM's KV cache on startup. Subsequent queries reuse this cached state, reducing time-to-first-token by 50-70%.
// From src/engines/llm_engine.cpp
llm_.cache_system_prompt(tool_system_prompt);
// Reused for every user query

Flash Attention

Enabled by default in llama.cpp for 15-25% faster inference. Uses fused kernels for attention computation with O(N) memory instead of O(N²).

Metal GPU Offload

All 99 layers are offloaded to the Metal GPU by default, achieving a 3-5x speedup over CPU-only inference on Apple Silicon.

Context Trimming

Conversation history is trimmed when context usage exceeds 75% of the context window. The oldest messages are evicted first, preserving the system prompt and recent context.
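The context-trimming policy can be sketched as follows. This is a simplified illustration of the eviction rule, not the actual engine code; the `Message` struct and function name are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

struct Message {
    std::string role;    // "system", "user", or "assistant"
    std::size_t tokens;  // token count of this message
};

// Evict the oldest non-system messages until total usage drops to
// 75% of the context window. The system prompt is always preserved.
inline void trim_history(std::deque<Message>& history,
                         std::size_t context_window) {
    const std::size_t budget = context_window * 3 / 4;
    auto total = [&] {
        std::size_t n = 0;
        for (const auto& m : history) n += m.tokens;
        return n;
    };
    while (total() > budget) {
        // Skip the system prompt at the front; evict the next oldest.
        auto it = history.begin();
        if (it != history.end() && it->role == "system") ++it;
        if (it == history.end()) break;  // nothing left to evict
        history.erase(it);
    }
}
```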

Tool Calling

RCLI implements LLM-native tool calling with model-specific formats:
<tool_call>
{"name": "open_app", "arguments": {"app_name": "Safari"}}
</tool_call>
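Pulling the JSON payload out of the generated text can be sketched like this. It is a simplified illustration for the `<tool_call>` format shown above; the real parser also has to handle model-specific markers such as LFM2's `<|tool_call_start|>`, and the function name is hypothetical:

```cpp
#include <cassert>
#include <string>

// Extract the JSON payload between <tool_call> ... </tool_call> tags
// from LLM output. Returns an empty string if no call is present.
inline std::string extract_tool_call(const std::string& text) {
    const std::string open = "<tool_call>";
    const std::string close = "</tool_call>";
    auto start = text.find(open);
    if (start == std::string::npos) return "";
    start += open.size();
    auto end = text.find(close, start);
    if (end == std::string::npos) return "";
    // Trim whitespace/newlines around the JSON body.
    auto first = text.find_first_not_of(" \n\r\t", start);
    auto last = text.find_last_not_of(" \n\r\t", end - 1);
    if (first == std::string::npos || first > last) return "";
    return text.substr(first, last - first + 1);
}
```

The extracted string would then be handed to a JSON parser and matched against the registered tool schemas.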

Two-Tier Detection

  1. Tier 1: Keyword pattern matching (e.g., “open”, “play”, “create”)
  2. Tier 2: LLM extracts structured tool call from generated text
From src/tools/tool_engine.cpp:85-150 - The tool engine maintains a filtered set of top-k relevant tools based on the user query to reduce context overhead for small LLMs.
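Tier 1 can be illustrated as a cheap keyword scan over the transcript, run before the LLM is invoked at all. This is a sketch of the idea only; the keyword list and function name are hypothetical, and the real engine is more involved:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Tier-1 detection: decide whether the query is likely to need a
// tool using simple keyword patterns. Illustrative names only.
inline bool maybe_tool_query(std::string query) {
    std::transform(query.begin(), query.end(), query.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    static const std::vector<std::string> triggers = {
        "open", "play", "create", "close", "search"};
    for (const auto& kw : triggers) {
        if (query.find(kw) != std::string::npos) return true;
    }
    return false;
}
```

Only when Tier 1 fires does Tier 2 ask the LLM to emit a structured tool call, which keeps the common no-tool path fast.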

Text-to-Speech (TTS)

RCLI uses sherpa-onnx for TTS with multiple voice options:

Default: Piper Lessac

  • Size: 60 MB
  • Quality: Good (single speaker)
  • Latency: ~150ms on M3 Max
  • Architecture: VITS (end-to-end text-to-speech)

Sentence-Level Streaming

The LLM output is split into sentences on-the-fly and synthesized incrementally:
// From src/pipeline/sentence_detector.cpp
while (llm_generating) {
    sentence = detector.get_next_sentence();
    if (sentence.complete) {
        tts_queue.push(sentence.text);
    }
}
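A minimal detector behind `get_next_sentence()` might look like the following. This is a sketch of the boundary-detection idea, not the actual src/pipeline/sentence_detector.cpp; a production version would also handle abbreviations, decimals, and ellipses:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Accumulates streamed LLM tokens and emits complete sentences as
// soon as a terminator is seen, so TTS can start before generation
// finishes. Simplified sketch; names are illustrative.
class SentenceDetector {
public:
    // Feed a token; returns any sentences completed by it.
    std::vector<std::string> feed(const std::string& token) {
        std::vector<std::string> out;
        for (char c : token) {
            buffer_ += c;
            if (c == '.' || c == '!' || c == '?') {
                out.push_back(buffer_);
                buffer_.clear();
            }
        }
        return out;
    }

private:
    std::string buffer_;
};
```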

Double-Buffered Playback

While the current sentence plays, the next sentence is synthesized in parallel:
Sentence 1: [========== Synthesizing ==========]
Sentence 2:                                      [Waiting]

Sentence 1:                                      [Playing]
Sentence 2: [========== Synthesizing ==========]
This overlap reduces perceived latency and creates a more natural conversational flow.
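The overlap amounts to a ping-pong over two buffers: while one plays, synthesis fills the other, and the roles swap at each sentence boundary. A conceptual sketch (the struct is hypothetical, not the RCLI implementation):

```cpp
#include <cassert>
#include <vector>

// Ping-pong buffer selection for double-buffered playback: the
// synthesis thread always writes into the buffer that is not
// currently playing. Conceptual sketch only.
struct DoubleBuffer {
    std::vector<float> buffers[2];
    int playing = 0;  // index of the buffer being played

    std::vector<float>& synth_target() {  // where TTS writes next
        return buffers[1 - playing];
    }
    void swap() { playing = 1 - playing; }  // sentence finished
};
```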

Memory Management

Pre-Allocated Pool

RCLI pre-allocates a 64MB memory pool at startup using mmap() with MAP_ANONYMOUS | MAP_PRIVATE:
// From src/core/memory_pool.h
pool_ = mmap(nullptr, 64 * 1024 * 1024, 
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
Benefits:
  • Zero malloc() calls during inference
  • Reduced memory fragmentation
  • Predictable memory layout for cache efficiency
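On top of the mmap'd region, allocations can be served by a simple bump allocator: each request just advances an offset, so no malloc() happens on the hot path. This is a sketch of the idea under that assumption; the actual pool in src/core/memory_pool.h may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bump allocation out of a pre-allocated pool. Illustrative only.
class PoolAllocator {
public:
    PoolAllocator(void* base, std::size_t size)
        : base_(static_cast<std::uint8_t*>(base)), size_(size) {}

    void* allocate(std::size_t n, std::size_t align = 16) {
        std::size_t off = (offset_ + align - 1) & ~(align - 1);
        if (off + n > size_) return nullptr;  // pool exhausted
        offset_ = off + n;
        return base_ + off;
    }

    void reset() { offset_ = 0; }  // reclaim everything at once

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::size_t offset_ = 0;
};
```

Freeing is wholesale via `reset()` rather than per-allocation, which is what makes the layout predictable and fragmentation-free.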

Lock-Free Ring Buffers

Audio data flows through lock-free ring buffers between threads:
  • Capture Buffer: Mic → STT (1.5 seconds, 24K samples)
  • Playback Buffer: TTS → Speaker (5 seconds, 80K samples)
// From src/core/ring_buffer.h
template<typename T>
class RingBuffer {
    std::atomic<size_t> write_pos_;
    std::atomic<size_t> read_pos_;
    // Zero-copy push/pop with CAS operations
};
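For the common single-producer/single-consumer case (one audio thread writing, one engine thread reading), the push/pop logic can be completed with acquire/release atomics alone; CAS loops, as the header's comment mentions, are only needed if multiple producers or consumers share a buffer. A sketch of the SPSC variant, not the actual src/core/ring_buffer.h:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer lock-free ring buffer: only
// atomic loads/stores synchronize the two threads. Sketch only.
template <typename T, std::size_t N>
class SpscRing {
public:
    bool push(const T& item) {
        std::size_t w = write_pos_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % N;
        if (next == read_pos_.load(std::memory_order_acquire))
            return false;  // buffer full: drop or retry
        buf_[w] = item;
        write_pos_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        std::size_t r = read_pos_.load(std::memory_order_relaxed);
        if (r == write_pos_.load(std::memory_order_acquire))
            return false;  // buffer empty
        out = buf_[r];
        read_pos_.store((r + 1) % N, std::memory_order_release);
        return true;
    }

private:
    T buf_[N];
    std::atomic<std::size_t> write_pos_{0};
    std::atomic<std::size_t> read_pos_{0};
};
```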

Performance Benchmarks

Metric           Result
STT Latency      43.7 ms avg (0.022x RTF)
LLM TTFT         22.5 ms time-to-first-token
LLM Throughput   159.6 tok/s on M3 Max
TTS Latency      150.6 ms synthesis time
E2E Latency      131 ms voice-in to audio-out
RAG Retrieval    3.82 ms hybrid search
Benchmarked on Apple M3 Max (14-core CPU, 30-core GPU, 36 GB RAM). Run rcli bench to measure performance on your system.

Pipeline States

The orchestrator maintains an atomic pipeline state:
enum class PipelineState {
    IDLE,         // Ready for input
    LISTENING,    // STT capturing audio
    PROCESSING,   // LLM generating response
    SPEAKING,     // TTS playing audio
    INTERRUPTED   // User stopped processing
};
Transitions are thread-safe and trigger state callbacks for UI updates.
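A thread-safe transition such as user interruption can be modeled with compare-exchange on the atomic state, so a racing completion in another thread is never overwritten. A sketch under that assumption (the helper function is hypothetical; callback wiring is omitted):

```cpp
#include <atomic>
#include <cassert>
#include <initializer_list>

enum class PipelineState { IDLE, LISTENING, PROCESSING, SPEAKING, INTERRUPTED };

// Only flip to INTERRUPTED if the pipeline is actually busy; if it
// already returned to IDLE, the interrupt is a no-op.
inline bool try_interrupt(std::atomic<PipelineState>& state) {
    for (PipelineState s : {PipelineState::PROCESSING, PipelineState::SPEAKING}) {
        PipelineState expected = s;
        if (state.compare_exchange_strong(expected, PipelineState::INTERRUPTED))
            return true;
    }
    return false;  // idle or listening: nothing to stop
}
```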

Next Steps

macOS Actions

Learn about the 43 macOS actions triggered by tool calling

RAG System

Understand the hybrid retrieval system for document queries

Architecture

Deep dive into threading model and design patterns

Performance

Optimization techniques and benchmarking
