Benchmark Results
All measurements were taken on an Apple M3 Max (14-core CPU, 30-core GPU, 36 GB unified memory) running macOS.
End-to-End Latency
| Component | Metric | Value | Notes |
|-----------|--------|-------|-------|
| STT | Avg latency | 43.7 ms | Whisper base.en, 3.5 s audio |
| STT | Real-time factor | 0.022x | 22 ms processing per 1 s of audio |
| LLM | Time to first token | 22.5 ms | Qwen3 0.6B with KV cache |
| LLM | Generation throughput | 159.6 tok/s | Metal GPU, Flash Attention |
| TTS | Avg latency | 150.6 ms | Piper Lessac, 15-word sentence |
| RAG | Hybrid retrieval | 3.82 ms | 5K chunks, vector + BM25 + RRF |
| E2E | Voice-in to audio-out | 131 ms | Full pipeline with KV cache |
E2E latency = STT + LLM first token + TTS first sentence. Measured from end of user speech to first audio output.
LLM Models

Tested with a 512-token prompt and 128-token generation:
| Model | Size | First Token (ms) | Throughput (tok/s) | Memory (MB) |
|-------|------|------------------|--------------------|-------------|
| LFM2 350M | 219 MB | 18.2 | 351.4 | 450 |
| Qwen3 0.6B | 456 MB | 22.5 | 249.8 | 680 |
| LFM2 1.2B Tool | 731 MB | 26.1 | 182.7 | 950 |
| Qwen3.5 2B | 1.2 GB | 38.9 | 145.3 | 1450 |
| Qwen3.5 4B | 2.7 GB | 61.2 | 78.6 | 3100 |
Run `rcli bench --all-llm --suite llm` to compare all installed LLM models on your hardware.
TTS Voices

Tested with a 50-word sentence:
| Voice | Size | Latency (ms) | Real-time Factor | Quality |
|-------|------|--------------|------------------|---------|
| Piper Lessac | 60 MB | 150.6 | 0.18x | Good |
| Piper Amy | 60 MB | 148.3 | 0.17x | Good |
| KittenTTS Nano | 90 MB | 201.4 | 0.24x | Good |
| Kokoro English v0.19 | 310 MB | 287.5 | 0.34x | Excellent |
Tool Calling

Tested across 43 macOS actions with 100 queries:
| Model | Accuracy | Avg Latency (ms) | False Positives |
|-------|----------|------------------|-----------------|
| LFM2 1.2B Tool | 94.3% | 156 | 2.1% |
| Qwen3.5 2B | 91.7% | 189 | 3.8% |
| Qwen3 0.6B | 87.2% | 142 | 5.4% |
Tool-calling accuracy = correct actions / total queries. A query counts as correct only if parsing, argument extraction, and execution all succeed.
RAG Retrieval

Tested with 5,000 document chunks (Snowflake Arctic Embed S embeddings):
| Metric | Value | Configuration |
|--------|-------|---------------|
| Vector search | 1.2 ms | HNSW (ef=50, M=16) |
| BM25 search | 0.8 ms | In-memory inverted index |
| Embedding generation | 1.5 ms | Cached (99.9% hit rate) |
| RRF fusion | 0.3 ms | k=60, top-20 candidates |
| Total retrieval | 3.8 ms | Hybrid: vector + BM25 + RRF |
| Embedding cache hit rate | 99.9% | LRU cache, 256 MB |
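The RRF step merges the vector and BM25 rankings by summing reciprocal ranks. A minimal sketch of reciprocal rank fusion with k = 60 (the function name and types here are illustrative, not RCLI's actual API):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// with 1-based ranks. Documents appearing high in either ranking float up.
std::vector<std::string> rrf_fuse(
    const std::vector<std::vector<std::string>>& rankings,  // e.g. {vector_hits, bm25_hits}
    int k = 60, std::size_t top_n = 20) {
    std::unordered_map<std::string, double> score;
    for (const auto& ranking : rankings)
        for (std::size_t rank = 0; rank < ranking.size(); ++rank)
            score[ranking[rank]] += 1.0 / (k + rank + 1);  // rank is 0-based here
    std::vector<std::string> fused;
    fused.reserve(score.size());
    for (const auto& kv : score) fused.push_back(kv.first);
    std::sort(fused.begin(), fused.end(),
              [&score](const std::string& a, const std::string& b) {
                  return score.at(a) > score.at(b);   // highest fused score first
              });
    if (fused.size() > top_n) fused.resize(top_n);     // keep top-k candidates
    return fused;
}
```

With k = 60, a document ranked 1st in one list and 2nd in the other scores 1/61 + 1/62, beating a document ranked 1st in only one list.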
Memory Usage
Typical memory footprint for default configuration:
| Component | Memory | Notes |
|-----------|--------|-------|
| LLM model | 456 MB | Qwen3 0.6B Q4_K_M |
| LLM KV cache | 128 MB | 4096 context |
| STT model | 140 MB | Whisper base.en |
| TTS model | 60 MB | Piper Lessac |
| Embedding model | 34 MB | Arctic Embed S Q8_0 |
| Memory pool | 64 MB | Pre-allocated arena |
| Ring buffers | 2.5 MB | Capture + playback |
| RAG index | 85 MB | 5K chunks (HNSW + BM25) |
| Total | ~970 MB | At runtime |
Optimization Techniques
1. KV Cache Reuse
Impact: 50-70% reduction in time-to-first-token
```
// Without KV cache (cold start)
Time to first token: 45.3 ms

// With KV cache (system prompt cached)
Time to first token: 22.5 ms
```
The system prompt (including tool definitions) is cached in llama.cpp’s KV cache at initialization. Subsequent queries only process the user turn:
```cpp
// Init: cache the system prompt once
llm_.cache_system_prompt(tool_system);

// Query: only send the user turn
std::string prompt = llm_.profile().build_user_turn(user_text);
llm_.generate_with_cached_prompt(prompt, callback);
```
See src/pipeline/orchestrator.cpp:75-79.
2. Flash Attention
Impact: 15-25% faster attention, lower memory
Enabled automatically on Metal GPU via llama.cpp:
```cpp
llama_context_params ctx_params = llama_context_default_params();
ctx_params.flash_attn = true;  // O(n) memory instead of O(n²)
```
Flash Attention reduces KV cache memory and speeds up self-attention by ~20% on Apple Silicon.
See src/core/hardware_profile.h:113 for auto-detection.
3. Metal GPU Offload

Impact: 3-5x faster inference vs CPU-only
```cpp
// Auto-configured based on hardware
hw.llm_gpu_layers = 99;  // All layers to GPU
hw.llm_n_threads  = 1;   // GPU-bound: 1 thread optimal
```
Metal GPU offload is enabled by default on macOS. For LLMs <4B params, all layers fit on GPU.
Benchmark (Qwen3 0.6B, M3 Max):
```
CPU-only (8 threads):    62.3 tok/s
Metal GPU (all layers): 249.8 tok/s (4x faster)
```
4. Huge Pages (2MB Superpages)
Impact: 10-15% reduction in TLB misses
For memory pools ≥4MB, RCLI uses 2MB superpages instead of 4KB pages:
```cpp
// macOS: vm_allocate with the superpage flag
vm_allocate(mach_task_self(), &addr, size,
            VM_FLAGS_ANYWHERE | VM_FLAGS_SUPERPAGE_SIZE_2MB);

// Linux: mmap + madvise
mmap(...) + madvise(addr, size, MADV_HUGEPAGE);
```
Reduces TLB pressure for large audio buffers and ring buffers.
See src/core/memory_pool.h:36-68.
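The platform split above could be wrapped in a single helper that falls back to regular 4 KB pages when superpages are unavailable (superpage allocation can fail at runtime). A sketch under assumptions: the helper name and the 4 MB threshold constant are illustrative; RCLI's real implementation lives in src/core/memory_pool.h.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(__APPLE__)
#include <mach/mach.h>
#include <mach/vm_statistics.h>
#elif defined(__linux__)
#include <sys/mman.h>
#endif

constexpr std::size_t kSuperpageMin = 4u * 1024 * 1024;  // assumed threshold

// Hypothetical helper: request 2 MB superpages for pools >= 4 MB,
// falling back to regular pages if the kernel refuses.
void* alloc_pool(std::size_t size) {
#if defined(__APPLE__)
    vm_address_t addr = 0;
    if (size >= kSuperpageMin &&
        vm_allocate(mach_task_self(), &addr, size,
                    VM_FLAGS_ANYWHERE | VM_FLAGS_SUPERPAGE_SIZE_2MB) == KERN_SUCCESS)
        return reinterpret_cast<void*>(addr);
    addr = 0;  // superpages unavailable: fall back to regular 4 KB pages
    if (vm_allocate(mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE) == KERN_SUCCESS)
        return reinterpret_cast<void*>(addr);
    return nullptr;
#elif defined(__linux__)
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    if (size >= kSuperpageMin)
        madvise(p, size, MADV_HUGEPAGE);  // hint: back with 2 MB pages (THP)
    return p;
#else
    return std::malloc(size);  // portable fallback, no huge-page hint
#endif
}
```

On Linux, `MADV_HUGEPAGE` is only a hint; the kernel honors it when transparent huge pages are enabled, so the helper works either way.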
5. Lock-Free Ring Buffers
Impact: Zero contention, <10ns overhead
Single-Producer Single-Consumer (SPSC) ring buffers with atomic head/tail pointers. No locks, no syscalls.
```cpp
size_t write(const T* src, size_t count) {
    const size_t w = write_pos_.load(std::memory_order_relaxed);
    const size_t r = read_pos_.load(std::memory_order_acquire);
    const size_t available = capacity_ - (w - r);
    if (count > available) count = available;  // clamp to free space
    // Zero-copy memcpy; the write may wrap, so copy up to two segments
    const size_t first = std::min(count, capacity_ - (w & mask_));
    std::memcpy(data_ + (w & mask_), src, first * sizeof(T));
    std::memcpy(data_, src + first, (count - first) * sizeof(T));
    write_pos_.store(w + count, std::memory_order_release);
    return count;
}
```
See src/core/ring_buffer.h:48-66.
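For reference, the consumer side mirrors the producer with the acquire/release roles swapped. A complete, self-contained SPSC ring in the same style (a simplified sketch, not RCLI's actual ring_buffer.h; capacity must be a power of two and T trivially copyable):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Simplified SPSC ring: one producer thread calls write(), one consumer
// thread calls read(). No locks, no syscalls; transfers are raw memcpys.
template <typename T>
class SpscRing {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy transfer");
public:
    explicit SpscRing(std::size_t capacity)  // capacity must be a power of two
        : capacity_(capacity), mask_(capacity - 1), data_(capacity) {}

    // Producer side: returns how many elements were actually written.
    std::size_t write(const T* src, std::size_t count) {
        const std::size_t w = write_pos_.load(std::memory_order_relaxed);
        const std::size_t r = read_pos_.load(std::memory_order_acquire);
        count = std::min(count, capacity_ - (w - r));             // clamp to free space
        const std::size_t first = std::min(count, capacity_ - (w & mask_));
        std::memcpy(&data_[w & mask_], src, first * sizeof(T));   // contiguous part
        std::memcpy(&data_[0], src + first, (count - first) * sizeof(T));  // wrapped part
        write_pos_.store(w + count, std::memory_order_release);   // publish to consumer
        return count;
    }

    // Consumer side: returns how many elements were actually read.
    std::size_t read(T* dst, std::size_t count) {
        const std::size_t r = read_pos_.load(std::memory_order_relaxed);
        const std::size_t w = write_pos_.load(std::memory_order_acquire);
        count = std::min(count, w - r);                           // clamp to readable
        const std::size_t first = std::min(count, capacity_ - (r & mask_));
        std::memcpy(dst, &data_[r & mask_], first * sizeof(T));
        std::memcpy(dst + first, &data_[0], (count - first) * sizeof(T));
        read_pos_.store(r + count, std::memory_order_release);    // free space for producer
        return count;
    }

private:
    const std::size_t capacity_, mask_;
    std::vector<T> data_;
    std::atomic<std::size_t> write_pos_{0};
    std::atomic<std::size_t> read_pos_{0};
};
```

The positions are monotonically increasing counters; `w - r` gives the fill level even after wraparound, which is why the capacity must be a power of two for the `& mask_` indexing to be valid.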
6. Double-Buffered TTS
Impact: Overlaps synthesis with playback (perceived latency ↓30%)
Timeline:
```
[LLM generates]    "Hello there."    "How are you?"
                         ↓                 ↓
[TTS synthesizes]   S1 ────────       S2 ────────
[Audio plays]           ▶ S1 ──────       ▶ S2 ──────
```
The next sentence is synthesized while the current one plays, so the user hears audio sooner.
See src/pipeline/orchestrator.cpp:177-210.
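The overlap in the timeline above can be sketched with a one-slot lookahead: synthesis of sentence i+1 is launched before sentence i starts playing. The function and callback names below are illustrative, not RCLI's API:

```cpp
#include <functional>
#include <future>
#include <string>
#include <vector>

using Audio = std::vector<float>;

// Hypothetical driver loop: while sentence i plays, sentence i+1 is already
// being synthesized on a worker thread. `synth` and `play` stand in for the
// TTS engine and the audio-output path.
void speak_sentences(const std::vector<std::string>& sentences,
                     const std::function<Audio(const std::string&)>& synth,
                     const std::function<void(const Audio&)>& play) {
    if (sentences.empty()) return;
    std::future<Audio> next = std::async(std::launch::async, synth, sentences[0]);
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        Audio current = next.get();             // wait for sentence i's audio
        if (i + 1 < sentences.size())           // kick off i+1 before playing i
            next = std::async(std::launch::async, synth, sentences[i + 1]);
        play(current);                          // playback overlaps synthesis of i+1
    }
}
```

If synthesis is faster than playback (RTF < 1, as in the TTS table above), every sentence after the first is ready before the previous one finishes playing, so the only synthesis latency the user perceives is for sentence one.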
7. Sentence-Level Streaming
Impact: Reduces time to first audio by 200-400ms
SentenceDetector emits each sentence as soon as a boundary (., !, ?) is detected, instead of waiting for the full LLM completion:
```cpp
SentenceDetector detector(queue_sentence,
                          /*min_words=*/3,       // min 3 words for a primary break
                          /*max_words_sec=*/25,  // secondary break (; :) at 25 words
                          /*word_flush=*/7);     // flush at 7 words if no punctuation
```
See src/pipeline/sentence_detector.cpp:6-84.
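The primary-break rule can be illustrated with a small stand-alone splitter: break on ., !, or ? only once at least `min_words` words have accumulated. This is a sketch of the rule, not RCLI's actual SentenceDetector:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Emit a sentence at . ! ? once at least `min_words` words have accumulated.
// A short opener like "Dr." avoids a false break only because it falls under
// the word minimum; the real detector has more rules than this sketch.
std::vector<std::string> split_sentences(const std::string& text, int min_words = 3) {
    std::vector<std::string> out;
    std::string cur;
    int words = 0;
    bool in_word = false;
    for (char c : text) {
        cur += c;
        if (std::isspace(static_cast<unsigned char>(c))) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            ++words;                           // new word started
        }
        if ((c == '.' || c == '!' || c == '?') && words >= min_words) {
            out.push_back(cur);                // primary boundary reached
            cur.clear();
            words = 0;
            in_word = false;
        }
    }
    if (!cur.empty()) out.push_back(cur);      // trailing partial sentence
    return out;
}
```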
8. Tool Call Detection

Impact: Prevents tool-call markup from being spoken aloud by TTS

The first 15 tokens are buffered to detect tool-call tags before streaming:
```cpp
if (tokens_buffered <= 15) {
    if (token_buffer.find("<tool_call>") != std::string::npos) {
        detected_tool_call = true;  // stop streaming to TTS
    }
} else {
    detector.feed(tok.text);  // stream to TTS
}
```
Adaptive buffering extends the window if a partial tag is detected (e.g., the buffer ends with `<tool`).
See src/pipeline/orchestrator.cpp:215-271.
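The adaptive-window check reduces to asking whether the buffered text ends with a proper prefix of the tag. A stand-alone sketch (the function name is an assumption; a complete tag is already caught by the `find()` shown above):

```cpp
#include <cstddef>
#include <string>

// Sketch: does `buffer` end with a proper prefix of `tag` (e.g. "...<tool"
// for "<tool_call>")? If so, the tag may still be arriving, so keep buffering.
bool ends_with_partial_tag(const std::string& buffer, const std::string& tag) {
    if (tag.empty()) return false;
    for (std::size_t len = tag.size() - 1; len >= 1; --len) {  // longest prefix first
        if (buffer.size() >= len &&
            buffer.compare(buffer.size() - len, len, tag, 0, len) == 0)
            return true;
    }
    return false;
}
```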
9. Tool Hint Pre-Filtering

Impact: 30-50% reduction in prompt tokens for tool calls

Instead of sending all 43 action definitions with every prompt, keyword matching scores each tool's relevance and sends only the top-k:
```cpp
std::string hint = tools_.build_tool_hint(user_text);
// Returns: "Relevant tools: open_app, quit_app"
```
This cuts the tool-definition portion of the prompt from ~2,500 tokens (all 43 tools) to ~300 (top-5), improving first-token latency.
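A keyword-scoring pre-filter of this kind could look as follows; the `ToolDef` struct, keyword lists, and function name are illustrative, not RCLI's internals:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

struct ToolDef {
    std::string name;
    std::vector<std::string> keywords;  // lowercase trigger words
};

// Hypothetical scorer: count how many of a tool's keywords appear in the
// lowercased query, then keep the top-k scoring tools for the prompt hint.
std::vector<std::string> top_k_tools(const std::vector<ToolDef>& tools,
                                     std::string query, std::size_t k = 5) {
    std::transform(query.begin(), query.end(), query.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::vector<std::pair<int, std::string>> scored;
    for (const auto& t : tools) {
        int score = 0;
        for (const auto& kw : t.keywords)
            if (query.find(kw) != std::string::npos) ++score;
        if (score > 0) scored.emplace_back(score, t.name);  // drop irrelevant tools
    }
    std::sort(scored.begin(), scored.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    std::vector<std::string> hint;
    for (std::size_t i = 0; i < scored.size() && i < k; ++i)
        hint.push_back(scored[i].second);
    return hint;
}
```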
10. Conversation History Trimming
Impact: Fits 10+ turns in 4096 context window
```cpp
int ctx_size = llm_.context_size();
int budget = ctx_size - system_tokens - user_tokens - 512;  // reserve 512 for generation
int total = 0;
std::vector<std::string> trimmed;
for (int i = static_cast<int>(history.size()) - 1; i >= 0; --i) {  // newest first
    int entry_tokens = llm_.count_tokens(history[i]);
    if (total + entry_tokens > budget) break;
    total += entry_tokens;
    trimmed.insert(trimmed.begin(), history[i]);  // keep chronological order
}
```
Evicts oldest turns first to stay within context limit.
See src/pipeline/orchestrator.cpp:906-921.
Hardware Profiling
CPU Topology Detection
RCLI detects P-cores (performance) and E-cores (efficiency) on Apple Silicon:
```cpp
// macOS sysctl API
sysctlbyname("hw.perflevel0.physicalcpu", &p_cores, ...);
sysctlbyname("hw.perflevel1.physicalcpu", &e_cores, ...);

// Example output (M3 Max):
// p_cores = 10   (high-performance cores)
// e_cores = 4    (efficiency cores)
```
LLM decode uses 1 thread (GPU-bound). Prompt eval uses all P-cores (CPU-bound matrix ops).
See src/core/hardware_profile.h:94-104.
RAM-Based Configuration
Pool size and batch size scale with available RAM:
```cpp
if (ram_total_mb >= 64 * 1024) {          // Mac Studio / Mac Pro
    pool_bytes  = 256 * 1024 * 1024;      // 256 MB pool
    llm_n_batch = 4096;                   // 4K batch
} else if (ram_total_mb >= 32 * 1024) {   // M3 Max 36/48 GB
    pool_bytes  = 128 * 1024 * 1024;      // 128 MB pool
    llm_n_batch = 2048;                   // 2K batch
} else {                                  // M3 / M2 / M1, 16-24 GB
    pool_bytes  = 64 * 1024 * 1024;       // 64 MB pool
    llm_n_batch = 1024;                   // 1K batch
}
```
See src/core/hardware_profile.h:131-146.
GPU Detection

Metal GPU offload is enabled automatically on macOS:
```cpp
#if defined(__APPLE__)
p.has_metal           = true;
p.llm_gpu_layers      = 99;            // All layers to GPU
p.llm_flash_attn      = true;          // Flash Attention enabled
p.llm_n_threads       = 1;             // GPU-bound decode
p.llm_n_threads_batch = p.perf_cores;  // Prompt eval on CPU
#endif
```
See src/core/hardware_profile.h:111-117.
Performance Tuning

GPU Layers
Control how many LLM layers run on GPU:
```sh
rcli --gpu-layers 0   # CPU-only (slower, lower memory)
rcli --gpu-layers 20  # Hybrid: 20 layers on GPU
rcli --gpu-layers 99  # All layers on GPU (default, fastest)
```
For models >4B params on devices with <32GB RAM, reduce GPU layers to avoid OOM.
Context Size
Larger context = more conversation history, but slower and more memory:
```sh
rcli --ctx-size 2048  # 2K context (faster, less memory)
rcli --ctx-size 4096  # 4K context (default)
rcli --ctx-size 8192  # 8K context (slower, more memory)
```
Qwen3 supports up to 32K context. LFM2 supports up to 128K context.
Thread Count
Override auto-detected thread count:
```sh
export RCLI_LLM_THREADS=4        # Decode threads
export RCLI_LLM_THREADS_BATCH=8  # Prompt eval threads
```
Default: 1 decode thread (GPU-bound), all P-cores for prompt eval.
Batch Size
Larger batches = faster prompt processing, but more memory:
```sh
export RCLI_LLM_BATCH=512   # Low memory
export RCLI_LLM_BATCH=2048  # Default (high RAM)
export RCLI_LLM_BATCH=4096  # Mac Studio (64+ GB)
```
Benchmarking
Run All Benchmarks
```sh
rcli bench  # STT, LLM, TTS, E2E, RAG, tools, memory
```
Individual Suites
```sh
rcli bench --suite stt     # STT latency + RTF
rcli bench --suite llm     # LLM first token + throughput
rcli bench --suite tts     # TTS latency + RTF
rcli bench --suite e2e     # End-to-end voice pipeline
rcli bench --suite tools   # Tool calling accuracy + latency
rcli bench --suite rag     # Retrieval latency + accuracy
rcli bench --suite memory  # Memory pool usage
```
Compare All Models
```sh
rcli bench --all-llm --suite llm  # Compare all installed LLMs
rcli bench --all-tts --suite tts  # Compare all installed TTS voices
```
Export Results
```sh
rcli bench --output results.json  # Export to JSON
```
JSON schema:
```json
{
  "timestamp": "2026-03-07T12:34:56Z",
  "hardware": {
    "platform": "macos",
    "cpu_cores": 14,
    "ram_gb": 36,
    "gpu": "M3 Max 30-core"
  },
  "stt": {
    "model": "whisper-base.en",
    "avg_latency_ms": 43.7,
    "rtf": 0.022
  },
  "llm": {
    "model": "qwen3-0.6b",
    "first_token_ms": 22.5,
    "throughput_tps": 159.6
  },
  "tts": {
    "model": "piper-lessac",
    "avg_latency_ms": 150.6,
    "rtf": 0.18
  },
  "e2e_latency_ms": 131,
  "rag_latency_ms": 3.82
}
```
Profiling
Memory Pool Stats
```sh
rcli info  # Shows pool usage + high-water mark
```
Output:
```
Memory pool: 64 MB allocated, 58.2 MB used (90.9%)
High-water mark: 59.1 MB
```
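The "used" and "high-water mark" figures above are the kind of statistics a bump-pointer arena can track with two counters. A minimal illustrative sketch (not RCLI's pool, which lives in src/core/memory_pool.h):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative bump-pointer arena with a high-water mark. Allocation is a
// pointer bump; everything is freed at once with reset(), and the peak usage
// survives resets so it can be reported later.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buf_(bytes) {}

    void* alloc(std::size_t bytes, std::size_t align = 16) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + bytes > buf_.size()) return nullptr;         // pool exhausted
        used_ = p + bytes;
        high_water_ = std::max(high_water_, used_);          // track peak usage
        return buf_.data() + p;
    }

    void reset() { used_ = 0; }                              // free everything at once
    std::size_t used() const { return used_; }
    std::size_t high_water() const { return high_water_; }

private:
    std::vector<unsigned char> buf_;
    std::size_t used_ = 0;
    std::size_t high_water_ = 0;
};
```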
Real-Time Metrics

Press M in the TUI to see real-time metrics:

- LLM tokens/sec
- Memory pool utilization
- Ring buffer fill %
- CPU/GPU usage (via IOKit)
Next Steps
- Configuration: config files, environment variables, and tuning
- Troubleshooting: common issues, debugging, and logs