Benchmark Results
All measurements were taken on an Apple M3 Max (14-core CPU, 30-core GPU, 36 GB unified memory) running macOS.
End-to-End Latency
| Component | Metric | Value | Notes |
|-----------|--------|-------|-------|
| STT | Avg latency | 43.7 ms | Whisper base.en, 3.5 s audio |
| STT | Real-time factor | 0.022x | 22 ms processing per 1 s of audio |
| LLM | Time to first token | 22.5 ms | Qwen3 0.6B with KV cache |
| LLM | Generation throughput | 159.6 tok/s | Metal GPU, Flash Attention |
| TTS | Avg latency | 150.6 ms | Piper Lessac, 15-word sentence |
| RAG | Hybrid retrieval | 3.82 ms | 5K chunks, vector + BM25 + RRF |
| E2E | Voice-in to audio-out | 131 ms | Full pipeline with KV cache |
E2E latency = STT + LLM first token + TTS first sentence. Measured from end of user speech to first audio output.
LLM Models

Tested with a 512-token prompt and 128-token generation:
| Model | Size | First Token (ms) | Throughput (tok/s) | Memory (MB) |
|-------|------|------------------|--------------------|-------------|
| LFM2 350M | 219 MB | 18.2 | 351.4 | 450 |
| Qwen3 0.6B | 456 MB | 22.5 | 249.8 | 680 |
| LFM2 1.2B Tool | 731 MB | 26.1 | 182.7 | 950 |
| Qwen3.5 2B | 1.2 GB | 38.9 | 145.3 | 1450 |
| Qwen3.5 4B | 2.7 GB | 61.2 | 78.6 | 3100 |
Run `rcli bench --all-llm --suite llm` to compare all installed LLM models on your hardware.
TTS Voices

Tested with a 50-word sentence:
| Voice | Size | Latency (ms) | Real-time Factor | Quality |
|-------|------|--------------|------------------|---------|
| Piper Lessac | 60 MB | 150.6 | 0.18x | Good |
| Piper Amy | 60 MB | 148.3 | 0.17x | Good |
| KittenTTS Nano | 90 MB | 201.4 | 0.24x | Good |
| Kokoro English v0.19 | 310 MB | 287.5 | 0.34x | Excellent |
Tool Calling

Tested across 43 macOS actions with 100 queries:
| Model | Accuracy | Avg Latency (ms) | False Positives |
|-------|----------|------------------|-----------------|
| LFM2 1.2B Tool | 94.3% | 156 | 2.1% |
| Qwen3.5 2B | 91.7% | 189 | 3.8% |
| Qwen3 0.6B | 87.2% | 142 | 5.4% |
Tool-calling accuracy = correct actions / total queries. A query counts as correct only if parsing, argument extraction, and execution all succeed.
RAG Retrieval

Tested with 5,000 document chunks (Snowflake Arctic Embed S embeddings):
| Metric | Value | Configuration |
|--------|-------|---------------|
| Vector search | 1.2 ms | HNSW (ef=50, M=16) |
| BM25 search | 0.8 ms | In-memory inverted index |
| Embedding generation | 1.5 ms | Cached (99.9% hit rate) |
| RRF fusion | 0.3 ms | k=60, top-20 candidates |
| Total retrieval | 3.8 ms | Hybrid: vector + BM25 + RRF |
| Embedding cache hit rate | 99.9% | LRU cache, 256 MB |
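The RRF step merges the vector and BM25 rankings by summing reciprocal ranks. A minimal sketch of reciprocal rank fusion with k = 60 (the function name and types here are illustrative, not RCLI's actual API):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// with 1-based ranks. Documents appearing high in either ranking float up.
std::vector<std::string> rrf_fuse(
    const std::vector<std::vector<std::string>>& rankings,  // e.g. {vector_hits, bm25_hits}
    int k = 60, std::size_t top_n = 20) {
    std::unordered_map<std::string, double> score;
    for (const auto& ranking : rankings)
        for (std::size_t rank = 0; rank < ranking.size(); ++rank)
            score[ranking[rank]] += 1.0 / (k + rank + 1);  // rank is 0-based here
    std::vector<std::string> fused;
    fused.reserve(score.size());
    for (const auto& kv : score) fused.push_back(kv.first);
    std::sort(fused.begin(), fused.end(),
              [&score](const std::string& a, const std::string& b) {
                  return score.at(a) > score.at(b);   // highest fused score first
              });
    if (fused.size() > top_n) fused.resize(top_n);     // keep top-k candidates
    return fused;
}
```

With k = 60, a document ranked 1st in one list and 2nd in the other scores 1/61 + 1/62, beating a document ranked 1st in only one list.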
Memory Usage
Typical memory footprint for default configuration:
| Component | Memory | Notes |
|-----------|--------|-------|
| LLM model | 456 MB | Qwen3 0.6B Q4_K_M |
| LLM KV cache | 128 MB | 4096 context |
| STT model | 140 MB | Whisper base.en |
| TTS model | 60 MB | Piper Lessac |
| Embedding model | 34 MB | Arctic Embed S Q8_0 |
| Memory pool | 64 MB | Pre-allocated arena |
| Ring buffers | 2.5 MB | Capture + playback |
| RAG index | 85 MB | 5K chunks (HNSW + BM25) |
| Total | ~970 MB | At runtime |
Optimization Techniques
1. KV Cache Reuse
Impact: 50-70% reduction in time-to-first-token
```
// Without KV cache (cold start)
Time to first token: 45.3 ms

// With KV cache (system prompt cached)
Time to first token: 22.5 ms
```
The system prompt (including tool definitions) is cached in llama.cpp’s KV cache at initialization. Subsequent queries only process the user turn:
```cpp
// Init: cache the system prompt once
llm_.cache_system_prompt(tool_system);

// Query: only send the user turn
std::string prompt = llm_.profile().build_user_turn(user_text);
llm_.generate_with_cached_prompt(prompt, callback);
```
See src/pipeline/orchestrator.cpp:75-79.
2. Flash Attention
Impact: 15-25% faster attention, lower memory
Enabled automatically on Metal GPU via llama.cpp:
```cpp
llama_context_params ctx_params = llama_context_default_params();
ctx_params.flash_attn = true;  // O(n) memory instead of O(n²)
```
Flash Attention reduces KV cache memory and speeds up self-attention by ~20% on Apple Silicon.
See src/core/hardware_profile.h:113 for auto-detection.
3. Metal GPU Offload

Impact: 3-5x faster inference vs CPU-only
```cpp
// Auto-configured based on hardware
hw.llm_gpu_layers = 99;  // All layers to GPU
hw.llm_n_threads  = 1;   // GPU-bound: 1 thread optimal
```
Metal GPU offload is enabled by default on macOS. For LLMs <4B params, all layers fit on GPU.
Benchmark (Qwen3 0.6B, M3 Max):
```
CPU-only (8 threads):    62.3 tok/s
Metal GPU (all layers): 249.8 tok/s (4x faster)
```
4. Huge Pages (2MB Superpages)
Impact: 10-15% reduction in TLB misses
For memory pools ≥4MB, RCLI uses 2MB superpages instead of 4KB pages:
```cpp
// macOS: vm_allocate with the superpage flag
vm_allocate(mach_task_self(), &addr, size,
            VM_FLAGS_ANYWHERE | VM_FLAGS_SUPERPAGE_SIZE_2MB);

// Linux: mmap + madvise
mmap(...) + madvise(addr, size, MADV_HUGEPAGE);
```
Reduces TLB pressure for large audio buffers and ring buffers.
See src/core/memory_pool.h:36-68.
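The platform split above could be wrapped in a single helper that falls back to regular 4 KB pages when superpages are unavailable (superpage allocation can fail at runtime). A sketch under assumptions: the helper name and the 4 MB threshold constant are illustrative; RCLI's real implementation lives in src/core/memory_pool.h.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(__APPLE__)
#include <mach/mach.h>
#include <mach/vm_statistics.h>
#elif defined(__linux__)
#include <sys/mman.h>
#endif

constexpr std::size_t kSuperpageMin = 4u * 1024 * 1024;  // assumed threshold

// Hypothetical helper: request 2 MB superpages for pools >= 4 MB,
// falling back to regular pages if the kernel refuses.
void* alloc_pool(std::size_t size) {
#if defined(__APPLE__)
    vm_address_t addr = 0;
    if (size >= kSuperpageMin &&
        vm_allocate(mach_task_self(), &addr, size,
                    VM_FLAGS_ANYWHERE | VM_FLAGS_SUPERPAGE_SIZE_2MB) == KERN_SUCCESS)
        return reinterpret_cast<void*>(addr);
    addr = 0;  // superpages unavailable: fall back to regular 4 KB pages
    if (vm_allocate(mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE) == KERN_SUCCESS)
        return reinterpret_cast<void*>(addr);
    return nullptr;
#elif defined(__linux__)
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    if (size >= kSuperpageMin)
        madvise(p, size, MADV_HUGEPAGE);  // hint: back with 2 MB pages (THP)
    return p;
#else
    return std::malloc(size);  // portable fallback, no huge-page hint
#endif
}
```

On Linux, `MADV_HUGEPAGE` is only a hint; the kernel honors it when transparent huge pages are enabled, so the helper works either way.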
5. Lock-Free Ring Buffers
Impact: Zero contention, <10ns overhead
Single-Producer Single-Consumer (SPSC) ring buffers with atomic head/tail pointers. No locks, no syscalls.
```cpp
size_t write(const T* src, size_t count) {
    const size_t w = write_pos_.load(std::memory_order_relaxed);
    const size_t r = read_pos_.load(std::memory_order_acquire);
    const size_t available = capacity_ - (w - r);
    if (count > available) count = available;  // clamp to free space
    // Zero-copy memcpy; the write may wrap, so copy up to two segments
    const size_t first = std::min(count, capacity_ - (w & mask_));
    std::memcpy(data_ + (w & mask_), src, first * sizeof(T));
    std::memcpy(data_, src + first, (count - first) * sizeof(T));
    write_pos_.store(w + count, std::memory_order_release);
    return count;
}
```
See src/core/ring_buffer.h:48-66.
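For reference, the consumer side mirrors the producer with the acquire/release roles swapped. A complete, self-contained SPSC ring in the same style (a simplified sketch, not RCLI's actual ring_buffer.h; capacity must be a power of two and T trivially copyable):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Simplified SPSC ring: one producer thread calls write(), one consumer
// thread calls read(). No locks, no syscalls; transfers are raw memcpys.
template <typename T>
class SpscRing {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy transfer");
public:
    explicit SpscRing(std::size_t capacity)  // capacity must be a power of two
        : capacity_(capacity), mask_(capacity - 1), data_(capacity) {}

    // Producer side: returns how many elements were actually written.
    std::size_t write(const T* src, std::size_t count) {
        const std::size_t w = write_pos_.load(std::memory_order_relaxed);
        const std::size_t r = read_pos_.load(std::memory_order_acquire);
        count = std::min(count, capacity_ - (w - r));             // clamp to free space
        const std::size_t first = std::min(count, capacity_ - (w & mask_));
        std::memcpy(&data_[w & mask_], src, first * sizeof(T));   // contiguous part
        std::memcpy(&data_[0], src + first, (count - first) * sizeof(T));  // wrapped part
        write_pos_.store(w + count, std::memory_order_release);   // publish to consumer
        return count;
    }

    // Consumer side: returns how many elements were actually read.
    std::size_t read(T* dst, std::size_t count) {
        const std::size_t r = read_pos_.load(std::memory_order_relaxed);
        const std::size_t w = write_pos_.load(std::memory_order_acquire);
        count = std::min(count, w - r);                           // clamp to readable
        const std::size_t first = std::min(count, capacity_ - (r & mask_));
        std::memcpy(dst, &data_[r & mask_], first * sizeof(T));
        std::memcpy(dst + first, &data_[0], (count - first) * sizeof(T));
        read_pos_.store(r + count, std::memory_order_release);    // free space for producer
        return count;
    }

private:
    const std::size_t capacity_, mask_;
    std::vector<T> data_;
    std::atomic<std::size_t> write_pos_{0};
    std::atomic<std::size_t> read_pos_{0};
};
```

The positions are monotonically increasing counters; `w - r` gives the fill level even after wraparound, which is why the capacity must be a power of two for the `& mask_` indexing to be valid.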
6. Double-Buffered TTS
Impact: Overlaps synthesis with playback (perceived latency ↓30%)
Timeline:
```
[LLM generates]    "Hello there."    "How are you?"
                         ↓                 ↓
[TTS synthesizes]   S1 ────────       S2 ────────
[Audio plays]           ▶ S1 ──────       ▶ S2 ──────
```
The next sentence is synthesized while the current one plays, so the user hears audio sooner.
See src/pipeline/orchestrator.cpp:177-210.
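The overlap in the timeline above can be sketched with a one-slot lookahead: synthesis of sentence i+1 is launched before sentence i starts playing. The function and callback names below are illustrative, not RCLI's API:

```cpp
#include <functional>
#include <future>
#include <string>
#include <vector>

using Audio = std::vector<float>;

// Hypothetical driver loop: while sentence i plays, sentence i+1 is already
// being synthesized on a worker thread. `synth` and `play` stand in for the
// TTS engine and the audio-output path.
void speak_sentences(const std::vector<std::string>& sentences,
                     const std::function<Audio(const std::string&)>& synth,
                     const std::function<void(const Audio&)>& play) {
    if (sentences.empty()) return;
    std::future<Audio> next = std::async(std::launch::async, synth, sentences[0]);
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        Audio current = next.get();             // wait for sentence i's audio
        if (i + 1 < sentences.size())           // kick off i+1 before playing i
            next = std::async(std::launch::async, synth, sentences[i + 1]);
        play(current);                          // playback overlaps synthesis of i+1
    }
}
```

If synthesis is faster than playback (RTF < 1, as in the TTS table above), every sentence after the first is ready before the previous one finishes playing, so the only synthesis latency the user perceives is for sentence one.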
7. Sentence-Level Streaming
Impact: Reduces time to first audio by 200-400ms
SentenceDetector emits each sentence as soon as a boundary (., !, ?) is detected, instead of waiting for the full LLM completion:
```cpp
SentenceDetector detector(queue_sentence,
                          /*min_words=*/3,       // min 3 words for a primary break
                          /*max_words_sec=*/25,  // secondary break (; :) at 25 words
                          /*word_flush=*/7);     // flush at 7 words if no punctuation
```
See src/pipeline/sentence_detector.cpp:6-84.
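The primary-break rule can be illustrated with a small stand-alone splitter: break on ., !, or ? only once at least `min_words` words have accumulated. This is a sketch of the rule, not RCLI's actual SentenceDetector:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Emit a sentence at . ! ? once at least `min_words` words have accumulated.
// A short opener like "Dr." avoids a false break only because it falls under
// the word minimum; the real detector has more rules than this sketch.
std::vector<std::string> split_sentences(const std::string& text, int min_words = 3) {
    std::vector<std::string> out;
    std::string cur;
    int words = 0;
    bool in_word = false;
    for (char c : text) {
        cur += c;
        if (std::isspace(static_cast<unsigned char>(c))) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            ++words;                           // new word started
        }
        if ((c == '.' || c == '!' || c == '?') && words >= min_words) {
            out.push_back(cur);                // primary boundary reached
            cur.clear();
            words = 0;
            in_word = false;
        }
    }
    if (!cur.empty()) out.push_back(cur);      // trailing partial sentence
    return out;
}
```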
8. Tool Call Detection

Impact: Prevents tool-call markup from being spoken aloud by TTS

The first 15 tokens are buffered to detect tool-call tags before streaming:
```cpp
if (tokens_buffered <= 15) {
    if (token_buffer.find("<tool_call>") != std::string::npos) {
        detected_tool_call = true;  // stop streaming to TTS
    }
} else {
    detector.feed(tok.text);  // stream to TTS
}
```
Adaptive buffering extends the window if a partial tag is detected (e.g., the buffer ends with `<tool`).
See src/pipeline/orchestrator.cpp:215-271.
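The adaptive-window check reduces to asking whether the buffered text ends with a proper prefix of the tag. A stand-alone sketch (the function name is an assumption; a complete tag is already caught by the `find()` shown above):

```cpp
#include <cstddef>
#include <string>

// Sketch: does `buffer` end with a proper prefix of `tag` (e.g. "...<tool"
// for "<tool_call>")? If so, the tag may still be arriving, so keep buffering.
bool ends_with_partial_tag(const std::string& buffer, const std::string& tag) {
    if (tag.empty()) return false;
    for (std::size_t len = tag.size() - 1; len >= 1; --len) {  // longest prefix first
        if (buffer.size() >= len &&
            buffer.compare(buffer.size() - len, len, tag, 0, len) == 0)
            return true;
    }
    return false;
}
```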
9. Tool Hint Pre-Filtering

Impact: 30-50% reduction in prompt tokens for tool calls

Instead of sending all 43 action definitions with every prompt, keyword matching scores each tool's relevance and sends only the top-k:
```cpp
std::string hint = tools_.build_tool_hint(user_text);
// Returns: "Relevant tools: open_app, quit_app"
```
This cuts the tool-definition portion of the prompt from ~2,500 tokens (all 43 tools) to ~300 (top-5), improving first-token latency.
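A keyword-scoring pre-filter of this kind could look as follows; the `ToolDef` struct, keyword lists, and function name are illustrative, not RCLI's internals:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

struct ToolDef {
    std::string name;
    std::vector<std::string> keywords;  // lowercase trigger words
};

// Hypothetical scorer: count how many of a tool's keywords appear in the
// lowercased query, then keep the top-k scoring tools for the prompt hint.
std::vector<std::string> top_k_tools(const std::vector<ToolDef>& tools,
                                     std::string query, std::size_t k = 5) {
    std::transform(query.begin(), query.end(), query.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::vector<std::pair<int, std::string>> scored;
    for (const auto& t : tools) {
        int score = 0;
        for (const auto& kw : t.keywords)
            if (query.find(kw) != std::string::npos) ++score;
        if (score > 0) scored.emplace_back(score, t.name);  // drop irrelevant tools
    }
    std::sort(scored.begin(), scored.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    std::vector<std::string> hint;
    for (std::size_t i = 0; i < scored.size() && i < k; ++i)
        hint.push_back(scored[i].second);
    return hint;
}
```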
10. Conversation History Trimming
Impact: Fits 10+ turns in 4096 context window
```cpp
int ctx_size = llm_.context_size();
int budget = ctx_size - system_tokens - user_tokens - 512;  // reserve 512 for generation
int total = 0;
std::vector<std::string> trimmed;
for (int i = static_cast<int>(history.size()) - 1; i >= 0; --i) {  // newest first
    int entry_tokens = llm_.count_tokens(history[i]);
    if (total + entry_tokens > budget) break;
    total += entry_tokens;
    trimmed.insert(trimmed.begin(), history[i]);  // keep chronological order
}
```
Evicts oldest turns first to stay within context limit.
See src/pipeline/orchestrator.cpp:906-921.
Hardware Profiling
CPU Topology Detection
RCLI detects P-cores (performance) and E-cores (efficiency) on Apple Silicon:
```cpp
// macOS sysctl API
sysctlbyname("hw.perflevel0.physicalcpu", &p_cores, ...);
sysctlbyname("hw.perflevel1.physicalcpu", &e_cores, ...);

// Example output (M3 Max):
// p_cores = 10   (high-performance cores)
// e_cores = 4    (efficiency cores)
```
LLM decode uses 1 thread (GPU-bound). Prompt eval uses all P-cores (CPU-bound matrix ops).
See src/core/hardware_profile.h:94-104.
RAM-Based Configuration
Pool size and batch size scale with available RAM:
```cpp
if (ram_total_mb >= 64 * 1024) {          // Mac Studio / Mac Pro
    pool_bytes  = 256 * 1024 * 1024;      // 256 MB pool
    llm_n_batch = 4096;                   // 4K batch
} else if (ram_total_mb >= 32 * 1024) {   // M3 Max 36/48 GB
    pool_bytes  = 128 * 1024 * 1024;      // 128 MB pool
    llm_n_batch = 2048;                   // 2K batch
} else {                                  // M3 / M2 / M1, 16-24 GB
    pool_bytes  = 64 * 1024 * 1024;       // 64 MB pool
    llm_n_batch = 1024;                   // 1K batch
}
```
See src/core/hardware_profile.h:131-146.
GPU Detection

Metal GPU offload is enabled automatically on macOS:
```cpp
#if defined(__APPLE__)
p.has_metal           = true;
p.llm_gpu_layers      = 99;            // All layers to GPU
p.llm_flash_attn      = true;          // Flash Attention enabled
p.llm_n_threads       = 1;             // GPU-bound decode
p.llm_n_threads_batch = p.perf_cores;  // Prompt eval on CPU
#endif
```
See src/core/hardware_profile.h:111-117.
Performance Tuning

GPU Layers
Control how many LLM layers run on GPU:
```sh
rcli --gpu-layers 0   # CPU-only (slower, lower memory)
rcli --gpu-layers 20  # Hybrid: 20 layers on GPU
rcli --gpu-layers 99  # All layers on GPU (default, fastest)
```
For models >4B params on devices with <32GB RAM, reduce GPU layers to avoid OOM.
Context Size
Larger context = more conversation history, but slower and more memory:
```sh
rcli --ctx-size 2048  # 2K context (faster, less memory)
rcli --ctx-size 4096  # 4K context (default)
rcli --ctx-size 8192  # 8K context (slower, more memory)
```
Qwen3 supports up to 32K context. LFM2 supports up to 128K context.
Thread Count
Override auto-detected thread count:
```sh
export RCLI_LLM_THREADS=4        # Decode threads
export RCLI_LLM_THREADS_BATCH=8  # Prompt eval threads
```
Default: 1 decode thread (GPU-bound), all P-cores for prompt eval.
Batch Size
Larger batches = faster prompt processing, but more memory:
```sh
export RCLI_LLM_BATCH=512   # Low memory
export RCLI_LLM_BATCH=2048  # Default (high RAM)
export RCLI_LLM_BATCH=4096  # Mac Studio (64+ GB)
```
Benchmarking
Run All Benchmarks
```sh
rcli bench  # STT, LLM, TTS, E2E, RAG, tools, memory
```
Individual Suites
```sh
rcli bench --suite stt     # STT latency + RTF
rcli bench --suite llm     # LLM first token + throughput
rcli bench --suite tts     # TTS latency + RTF
rcli bench --suite e2e     # End-to-end voice pipeline
rcli bench --suite tools   # Tool calling accuracy + latency
rcli bench --suite rag     # Retrieval latency + accuracy
rcli bench --suite memory  # Memory pool usage
```
Compare All Models
```sh
rcli bench --all-llm --suite llm  # Compare all installed LLMs
rcli bench --all-tts --suite tts  # Compare all installed TTS voices
```
Export Results
```sh
rcli bench --output results.json  # Export to JSON
```
JSON schema:
```json
{
  "timestamp": "2026-03-07T12:34:56Z",
  "hardware": {
    "platform": "macos",
    "cpu_cores": 14,
    "ram_gb": 36,
    "gpu": "M3 Max 30-core"
  },
  "stt": {
    "model": "whisper-base.en",
    "avg_latency_ms": 43.7,
    "rtf": 0.022
  },
  "llm": {
    "model": "qwen3-0.6b",
    "first_token_ms": 22.5,
    "throughput_tps": 159.6
  },
  "tts": {
    "model": "piper-lessac",
    "avg_latency_ms": 150.6,
    "rtf": 0.18
  },
  "e2e_latency_ms": 131,
  "rag_latency_ms": 3.82
}
```
Profiling
Memory Pool Stats
```sh
rcli info  # Shows pool usage + high-water mark
```
Output:
```
Memory pool: 64 MB allocated, 58.2 MB used (90.9%)
High-water mark: 59.1 MB
```
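The "used" and "high-water mark" figures above are the kind of statistics a bump-pointer arena can track with two counters. A minimal illustrative sketch (not RCLI's pool, which lives in src/core/memory_pool.h):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative bump-pointer arena with a high-water mark. Allocation is a
// pointer bump; everything is freed at once with reset(), and the peak usage
// survives resets so it can be reported later.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buf_(bytes) {}

    void* alloc(std::size_t bytes, std::size_t align = 16) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + bytes > buf_.size()) return nullptr;         // pool exhausted
        used_ = p + bytes;
        high_water_ = std::max(high_water_, used_);          // track peak usage
        return buf_.data() + p;
    }

    void reset() { used_ = 0; }                              // free everything at once
    std::size_t used() const { return used_; }
    std::size_t high_water() const { return high_water_; }

private:
    std::vector<unsigned char> buf_;
    std::size_t used_ = 0;
    std::size_t high_water_ = 0;
};
```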
Real-Time Metrics

Press M in the TUI to see real-time metrics:

- LLM tokens/sec
- Memory pool utilization
- Ring buffer fill %
- CPU/GPU usage (via IOKit)
Next Steps
- Configuration: config files, environment variables, and tuning
- Troubleshooting: common issues, debugging, and logs