
Overview

RCLI’s voice pipeline is a complete on-device AI system that processes voice input through Speech-to-Text, generates responses with a Large Language Model, and outputs natural speech via Text-to-Speech—all running locally on Apple Silicon with Metal GPU acceleration.

  • STT: Zipformer streaming + Whisper/Parakeet offline
  • LLM: Qwen3 / LFM2 with tool calling
  • TTS: Piper / Kokoro with double-buffering

Pipeline Architecture

Mic → VAD → STT → [RAG] → LLM → TTS → Speaker
                            |
                     Tool Calling → macOS Actions

Three-Thread Model

The pipeline runs three dedicated threads synchronized via condition variables:
Thread   Responsibility                                                        Synchronization
STT      Captures mic audio, runs VAD, detects speech endpoints                Pushes text to LLM queue
LLM      Receives transcribed text, generates tokens, dispatches tool calls    Pushes sentences to TTS queue
TTS      Queues sentences from LLM, double-buffered playback                   Signals completion to orchestrator
From src/pipeline/orchestrator.cpp:18-89 - The orchestrator initializes all engines with a pre-allocated 64MB memory pool and lock-free ring buffers for zero-copy audio transfer.
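The condition-variable hand-off between threads can be sketched as a small blocking queue. This is a minimal illustration of the pattern, not the actual orchestrator code; the class and member names are hypothetical:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Minimal blocking queue, as used conceptually between the STT, LLM,
// and TTS threads. Names are illustrative, not from the RCLI source.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(item));
        }
        cv_.notify_one();  // wake the consumer thread
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};
```

The STT thread would `push` transcripts into the LLM thread's queue, which blocks in `pop` until work arrives; the same shape repeats between LLM and TTS.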

Speech-to-Text (STT)

RCLI uses a dual-STT architecture to balance latency and accuracy:

Streaming STT (Zipformer)

  • Model: Zipformer transducer from k2-fsa/sherpa-onnx
  • Size: 50 MB
  • Latency: ~44ms (0.022x real-time factor)
  • Use Case: Live transcription during rcli listen mode
  • Architecture: Streaming transducer with online chunking

Offline STT (Whisper / Parakeet)

  • Size: 140 MB
  • WER: ~5% on LibriSpeech
  • Languages: English only
  • Use Case: Push-to-talk in TUI, batch processing

Voice Activity Detection (VAD)

  • Model: Silero VAD v4 (0.6 MB)
  • Window: 512 samples (32ms at 16kHz)
  • Thresholds: Speech start 0.5, end 0.35
  • Purpose: Filters silence, detects speech endpoints
// From src/engines/vad_engine.cpp
// Silero VAD with hysteresis: speech starts above 0.5 and only
// ends once the probability drops below 0.35.
float prob = vad_->process(audio_chunk, 512);  // 32 ms window at 16 kHz
if (!speech_detected && prob > 0.5f) {
    speech_detected = true;                    // speech onset
} else if (speech_detected && prob < 0.35f) {
    speech_detected = false;                   // speech endpoint
}

Large Language Model (LLM)

RCLI supports 9 LLM models that can be hot-swapped at runtime:

Default: LFM2 1.2B Tool

  • Size: 731 MB (Q4_K_M quantization)
  • Speed: ~180 tokens/sec on M3 Max
  • Context: 128K tokens
  • Features: Native tool calling with <|tool_call_start|> format

Optimizations

System Prompt Caching

The system prompt (including tool definitions) is cached in the LLM's KV cache on startup. Subsequent queries reuse this cached state, reducing time-to-first-token by 50-70%.
// From src/engines/llm_engine.cpp
llm_.cache_system_prompt(tool_system_prompt);
// Reused for every user query

Flash Attention

Enabled by default in llama.cpp for 15-25% faster inference. Uses fused kernels for attention computation with O(N) memory instead of O(N²).

Metal GPU Offload

All 99 layers are offloaded to the Metal GPU by default, achieving a 3-5x speedup over CPU-only inference on Apple Silicon.

Context Trimming

Conversation history is trimmed when context usage exceeds 75% of the context window. The oldest messages are evicted first, preserving the system prompt and recent context.
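The context-trimming policy can be sketched as follows. This is a simplified illustration of the eviction rule, not the actual engine code; the `Message` struct and function name are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

struct Message {
    std::string role;    // "system", "user", or "assistant"
    std::size_t tokens;  // token count of this message
};

// Evict the oldest non-system messages until total usage drops to
// 75% of the context window. The system prompt is always preserved.
inline void trim_history(std::deque<Message>& history,
                         std::size_t context_window) {
    const std::size_t budget = context_window * 3 / 4;
    auto total = [&] {
        std::size_t n = 0;
        for (const auto& m : history) n += m.tokens;
        return n;
    };
    while (total() > budget) {
        // Skip the system prompt at the front; evict the next oldest.
        auto it = history.begin();
        if (it != history.end() && it->role == "system") ++it;
        if (it == history.end()) break;  // nothing left to evict
        history.erase(it);
    }
}
```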

Tool Calling

RCLI implements LLM-native tool calling with model-specific formats:
<tool_call>
{"name": "open_app", "arguments": {"app_name": "Safari"}}
</tool_call>
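Pulling the JSON payload out of the generated text can be sketched like this. It is a simplified illustration for the `<tool_call>` format shown above; the real parser also has to handle model-specific markers such as LFM2's `<|tool_call_start|>`, and the function name is hypothetical:

```cpp
#include <cassert>
#include <string>

// Extract the JSON payload between <tool_call> ... </tool_call> tags
// from LLM output. Returns an empty string if no call is present.
inline std::string extract_tool_call(const std::string& text) {
    const std::string open = "<tool_call>";
    const std::string close = "</tool_call>";
    auto start = text.find(open);
    if (start == std::string::npos) return "";
    start += open.size();
    auto end = text.find(close, start);
    if (end == std::string::npos) return "";
    // Trim whitespace/newlines around the JSON body.
    auto first = text.find_first_not_of(" \n\r\t", start);
    auto last = text.find_last_not_of(" \n\r\t", end - 1);
    if (first == std::string::npos || first > last) return "";
    return text.substr(first, last - first + 1);
}
```

The extracted string would then be handed to a JSON parser and matched against the registered tool schemas.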

Two-Tier Detection

  1. Tier 1: Keyword pattern matching (e.g., “open”, “play”, “create”)
  2. Tier 2: LLM extracts structured tool call from generated text
From src/tools/tool_engine.cpp:85-150 - The tool engine maintains a filtered set of top-k relevant tools based on the user query to reduce context overhead for small LLMs.
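Tier 1 can be illustrated as a cheap keyword scan over the transcript, run before the LLM is invoked at all. This is a sketch of the idea only; the keyword list and function name are hypothetical, and the real engine is more involved:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Tier-1 detection: decide whether the query is likely to need a
// tool using simple keyword patterns. Illustrative names only.
inline bool maybe_tool_query(std::string query) {
    std::transform(query.begin(), query.end(), query.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    static const std::vector<std::string> triggers = {
        "open", "play", "create", "close", "search"};
    for (const auto& kw : triggers) {
        if (query.find(kw) != std::string::npos) return true;
    }
    return false;
}
```

Only when Tier 1 fires does Tier 2 ask the LLM to emit a structured tool call, which keeps the common no-tool path fast.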

Text-to-Speech (TTS)

RCLI uses sherpa-onnx for TTS with multiple voice options:

Default: Piper Lessac

  • Size: 60 MB
  • Quality: Good (single speaker)
  • Latency: ~150ms on M3 Max
  • Architecture: VITS (end-to-end text-to-speech)

Sentence-Level Streaming

The LLM output is split into sentences on-the-fly and synthesized incrementally:
// From src/pipeline/sentence_detector.cpp
while (llm_generating) {
    sentence = detector.get_next_sentence();
    if (sentence.complete) {
        tts_queue.push(sentence.text);
    }
}
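A minimal detector behind `get_next_sentence()` might look like the following. This is a sketch of the boundary-detection idea, not the actual src/pipeline/sentence_detector.cpp; a production version would also handle abbreviations, decimals, and ellipses:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Accumulates streamed LLM tokens and emits complete sentences as
// soon as a terminator is seen, so TTS can start before generation
// finishes. Simplified sketch; names are illustrative.
class SentenceDetector {
public:
    // Feed a token; returns any sentences completed by it.
    std::vector<std::string> feed(const std::string& token) {
        std::vector<std::string> out;
        for (char c : token) {
            buffer_ += c;
            if (c == '.' || c == '!' || c == '?') {
                out.push_back(buffer_);
                buffer_.clear();
            }
        }
        return out;
    }

private:
    std::string buffer_;
};
```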

Double-Buffered Playback

While the current sentence plays, the next sentence is synthesized in parallel:
Sentence 1: [========== Synthesizing ==========]
Sentence 2:                                      [Waiting]

Sentence 1:                                      [Playing]
Sentence 2: [========== Synthesizing ==========]
This overlap reduces perceived latency and creates a more natural conversational flow.
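The overlap amounts to a ping-pong over two buffers: while one plays, synthesis fills the other, and the roles swap at each sentence boundary. A conceptual sketch (the struct is hypothetical, not the RCLI implementation):

```cpp
#include <cassert>
#include <vector>

// Ping-pong buffer selection for double-buffered playback: the
// synthesis thread always writes into the buffer that is not
// currently playing. Conceptual sketch only.
struct DoubleBuffer {
    std::vector<float> buffers[2];
    int playing = 0;  // index of the buffer being played

    std::vector<float>& synth_target() {  // where TTS writes next
        return buffers[1 - playing];
    }
    void swap() { playing = 1 - playing; }  // sentence finished
};
```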

Memory Management

Pre-Allocated Pool

RCLI pre-allocates a 64MB memory pool at startup using mmap() with MAP_ANONYMOUS | MAP_PRIVATE:
// From src/core/memory_pool.h
pool_ = mmap(nullptr, 64 * 1024 * 1024, 
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
Benefits:
  • Zero malloc() calls during inference
  • Reduced memory fragmentation
  • Predictable memory layout for cache efficiency
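On top of the mmap'd region, allocations can be served by a simple bump allocator: each request just advances an offset, so no malloc() happens on the hot path. This is a sketch of the idea under that assumption; the actual pool in src/core/memory_pool.h may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bump allocation out of a pre-allocated pool. Illustrative only.
class PoolAllocator {
public:
    PoolAllocator(void* base, std::size_t size)
        : base_(static_cast<std::uint8_t*>(base)), size_(size) {}

    void* allocate(std::size_t n, std::size_t align = 16) {
        std::size_t off = (offset_ + align - 1) & ~(align - 1);
        if (off + n > size_) return nullptr;  // pool exhausted
        offset_ = off + n;
        return base_ + off;
    }

    void reset() { offset_ = 0; }  // reclaim everything at once

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::size_t offset_ = 0;
};
```

Freeing is wholesale via `reset()` rather than per-allocation, which is what makes the layout predictable and fragmentation-free.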

Lock-Free Ring Buffers

Audio data flows through lock-free ring buffers between threads:
  • Capture Buffer: Mic → STT (1.5 seconds, 24K samples)
  • Playback Buffer: TTS → Speaker (5 seconds, 80K samples)
// From src/core/ring_buffer.h
template<typename T>
class RingBuffer {
    std::atomic<size_t> write_pos_;
    std::atomic<size_t> read_pos_;
    // Zero-copy push/pop with CAS operations
};
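For the common single-producer/single-consumer case (one audio thread writing, one engine thread reading), the push/pop logic can be completed with acquire/release atomics alone; CAS loops, as the header's comment mentions, are only needed if multiple producers or consumers share a buffer. A sketch of the SPSC variant, not the actual src/core/ring_buffer.h:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer lock-free ring buffer: only
// atomic loads/stores synchronize the two threads. Sketch only.
template <typename T, std::size_t N>
class SpscRing {
public:
    bool push(const T& item) {
        std::size_t w = write_pos_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % N;
        if (next == read_pos_.load(std::memory_order_acquire))
            return false;  // buffer full: drop or retry
        buf_[w] = item;
        write_pos_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        std::size_t r = read_pos_.load(std::memory_order_relaxed);
        if (r == write_pos_.load(std::memory_order_acquire))
            return false;  // buffer empty
        out = buf_[r];
        read_pos_.store((r + 1) % N, std::memory_order_release);
        return true;
    }

private:
    T buf_[N];
    std::atomic<std::size_t> write_pos_{0};
    std::atomic<std::size_t> read_pos_{0};
};
```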

Performance Benchmarks

Metric           Result
STT Latency      43.7 ms avg (0.022x RTF)
LLM TTFT         22.5 ms time-to-first-token
LLM Throughput   159.6 tok/s on M3 Max
TTS Latency      150.6 ms synthesis time
E2E Latency      131 ms voice-in to audio-out
RAG Retrieval    3.82 ms hybrid search
Benchmarked on Apple M3 Max (14-core CPU, 30-core GPU, 36 GB RAM). Run rcli bench to measure performance on your system.

Pipeline States

The orchestrator maintains an atomic pipeline state:
enum class PipelineState {
    IDLE,         // Ready for input
    LISTENING,    // STT capturing audio
    PROCESSING,   // LLM generating response
    SPEAKING,     // TTS playing audio
    INTERRUPTED   // User stopped processing
};
Transitions are thread-safe and trigger state callbacks for UI updates.
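A thread-safe transition such as user interruption can be modeled with compare-exchange on the atomic state, so a racing completion in another thread is never overwritten. A sketch under that assumption (the helper function is hypothetical; callback wiring is omitted):

```cpp
#include <atomic>
#include <cassert>
#include <initializer_list>

enum class PipelineState { IDLE, LISTENING, PROCESSING, SPEAKING, INTERRUPTED };

// Only flip to INTERRUPTED if the pipeline is actually busy; if it
// already returned to IDLE, the interrupt is a no-op.
inline bool try_interrupt(std::atomic<PipelineState>& state) {
    for (PipelineState s : {PipelineState::PROCESSING, PipelineState::SPEAKING}) {
        PipelineState expected = s;
        if (state.compare_exchange_strong(expected, PipelineState::INTERRUPTED))
            return true;
    }
    return false;  // idle or listening: nothing to stop
}
```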

Next Steps

macOS Actions

Learn about the 43 macOS actions triggered by tool calling

RAG System

Understand the hybrid retrieval system for document queries

Architecture

Deep dive into threading model and design patterns

Performance

Optimization techniques and benchmarking
