Benchmark Results

All measurements on Apple M3 Max (14-core CPU, 30-core GPU, 36 GB unified memory) running macOS.

End-to-End Latency

| Component | Metric | Value | Notes |
|---|---|---|---|
| STT | Avg latency | 43.7 ms | Whisper base.en, 3.5 sec audio |
| STT | Real-time factor | 0.022x | 22 ms processing per 1 sec audio |
| LLM | Time to first token | 22.5 ms | Qwen3 0.6B with KV cache |
| LLM | Generation throughput | 159.6 tok/s | Metal GPU, Flash Attention |
| TTS | Avg latency | 150.6 ms | Piper Lessac, 15-word sentence |
| RAG | Hybrid retrieval | 3.82 ms | 5K chunks, vector + BM25 + RRF |
| E2E | Voice-in to audio-out | 131 ms | Full pipeline with KV cache |
E2E latency = STT + LLM first token + TTS first sentence. Measured from end of user speech to first audio output.

LLM Performance by Model

Tested with a 512-token prompt and 128-token generation:

| Model | Size | First Token (ms) | Throughput (tok/s) | Memory (MB) |
|---|---|---|---|---|
| LFM2 350M | 219 MB | 18.2 | 351.4 | 450 |
| Qwen3 0.6B | 456 MB | 22.5 | 249.8 | 680 |
| LFM2 1.2B Tool | 731 MB | 26.1 | 182.7 | 950 |
| Qwen3.5 2B | 1.2 GB | 38.9 | 145.3 | 1450 |
| Qwen3.5 4B | 2.7 GB | 61.2 | 78.6 | 3100 |
Run rcli bench --all-llm --suite llm to compare all installed LLM models on your hardware.

TTS Performance by Voice

Tested with a 50-word sentence:

| Voice | Size | Latency (ms) | Real-time Factor | Quality |
|---|---|---|---|---|
| Piper Lessac | 60 MB | 150.6 | 0.18x | Good |
| Piper Amy | 60 MB | 148.3 | 0.17x | Good |
| KittenTTS Nano | 90 MB | 201.4 | 0.24x | Good |
| Kokoro English v0.19 | 310 MB | 287.5 | 0.34x | Excellent |

Tool Calling Performance

43 macOS actions tested with 100 queries:
| Model | Accuracy | Avg Latency (ms) | False Positives |
|---|---|---|---|
| LFM2 1.2B Tool | 94.3% | 156 | 2.1% |
| Qwen3.5 2B | 91.7% | 189 | 3.8% |
| Qwen3 0.6B | 87.2% | 142 | 5.4% |
Tool calling accuracy = (correct actions / total queries). Includes parsing, argument extraction, and execution success.

RAG Performance

Tested with 5,000 document chunks (Snowflake Arctic Embed S):
| Metric | Value | Configuration |
|---|---|---|
| Vector search | 1.2 ms | HNSW (ef=50, M=16) |
| BM25 search | 0.8 ms | In-memory inverted index |
| Embedding generation | 1.5 ms | Cached (99.9% hit rate) |
| RRF fusion | 0.3 ms | k=60, top-20 candidates |
| Total retrieval | 3.8 ms | Hybrid: vector + BM25 + RRF |
| Embedding cache hit rate | 99.9% | LRU cache, 256 MB |
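
RRF fusion merges the vector and BM25 rankings by summing reciprocal ranks: score(d) = Σ 1/(k + rank(d)) over the lists that contain d. A minimal sketch of the idea, assuming 1-based ranks and the k=60 constant from the table (`rrf_fuse` is a hypothetical helper, not RCLI's implementation):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).
// Hypothetical helper illustrating the technique, not the RCLI code.
std::vector<std::pair<std::string, double>> rrf_fuse(
    const std::vector<std::vector<std::string>>& ranked_lists, int k = 60) {
    std::map<std::string, double> scores;
    for (const auto& list : ranked_lists)
        for (size_t rank = 0; rank < list.size(); ++rank)
            scores[list[rank]] += 1.0 / (k + rank + 1);  // 1-based rank
    // Sort documents by fused score, highest first.
    std::vector<std::pair<std::string, double>> fused(scores.begin(), scores.end());
    std::sort(fused.begin(), fused.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return fused;
}
```

With k=60, a document's exact rank matters less than appearing in both lists, which makes the hybrid ranking robust to either ranker being noisy on a given query.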

Memory Usage

Typical memory footprint for default configuration:
| Component | Memory | Notes |
|---|---|---|
| LLM model | 456 MB | Qwen3 0.6B Q4_K_M |
| LLM KV cache | 128 MB | 4096 context |
| STT model | 140 MB | Whisper base.en |
| TTS model | 60 MB | Piper Lessac |
| Embedding model | 34 MB | Arctic Embed S Q8_0 |
| Memory pool | 64 MB | Pre-allocated arena |
| Ring buffers | 2.5 MB | Capture + playback |
| RAG index | 85 MB | 5K chunks (HNSW + BM25) |
| Total | ~970 MB | At runtime |

Optimization Techniques

1. KV Cache Reuse

Impact: 50-70% reduction in time-to-first-token
// Without KV cache (cold start)
Time to first token: 45.3 ms

// With KV cache (system prompt cached)
Time to first token: 22.5 ms
The system prompt (including tool definitions) is cached in llama.cpp’s KV cache at initialization. Subsequent queries only process the user turn:
// Init: cache system prompt once
llm_.cache_system_prompt(tool_system);

// Query: only send user turn
std::string prompt = llm_.profile().build_user_turn(user_text);
llm_.generate_with_cached_prompt(prompt, callback);
See src/pipeline/orchestrator.cpp:75-79.

2. Flash Attention

Impact: 15-25% faster attention, lower memory

Enabled automatically on Metal GPU via llama.cpp:
llama_context_params ctx_params = llama_context_default_params();
ctx_params.flash_attn = true;  // O(n) memory instead of O(n²)
Flash Attention reduces KV cache memory and speeds up self-attention by ~20% on Apple Silicon. See src/core/hardware_profile.h:113 for auto-detection.

3. Metal GPU Offload

Impact: 3-5x faster inference vs CPU-only
// Auto-configured based on hardware
hw.llm_gpu_layers = 99;  // All layers to GPU
hw.llm_n_threads = 1;    // GPU-bound: 1 thread optimal
Metal GPU offload is enabled by default on macOS. For LLMs <4B params, all layers fit on GPU. Benchmark (Qwen3 0.6B, M3 Max):
  • CPU-only (8 threads): 62.3 tok/s
  • Metal GPU (all layers): 249.8 tok/s (4x faster)

4. Huge Pages (2MB Superpages)

Impact: 10-15% reduction in TLB misses

For memory pools ≥4 MB, RCLI uses 2 MB superpages instead of 4 KB pages:
// macOS: vm_allocate with superpage flag
vm_allocate(mach_task_self(), &addr, size, 
    VM_FLAGS_ANYWHERE | VM_FLAGS_SUPERPAGE_SIZE_2MB);

// Linux: mmap + madvise
mmap(...) + madvise(addr, size, MADV_HUGEPAGE);
Reduces TLB pressure for large audio buffers and ring buffers. See src/core/memory_pool.h:36-68.

5. Lock-Free Ring Buffers

Impact: Zero contention, <10 ns overhead

Single-Producer Single-Consumer (SPSC) ring buffers with atomic head/tail pointers. No locks, no syscalls.
size_t write(const T* src, size_t count) {
    const size_t w = write_pos_.load(std::memory_order_relaxed);
    const size_t r = read_pos_.load(std::memory_order_acquire);
    const size_t available = capacity_ - (w - r);
    count = std::min(count, available);  // never overwrite unread data

    // Zero-copy memcpy (wraparound split elided for brevity)
    std::memcpy(data_ + (w & mask_), src, count * sizeof(T));
    write_pos_.store(w + count, std::memory_order_release);
    return count;  // number of elements actually written
}
See src/core/ring_buffer.h:48-66.

6. Double-Buffered TTS

Impact: Overlaps synthesis with playback (perceived latency ↓30%)
Timeline:
[LLM generates]  "Hello there."  "How are you?"
                      ↓                  ↓
[TTS synthesizes]    S1 ────────        S2 ────────
[Audio plays]            ▶ S1 ──────        ▶ S2 ──────
Next sentence synthesizes while current one plays. User hears audio sooner. See src/pipeline/orchestrator.cpp:177-210.
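
The overlap above can be sketched with one in-flight synthesis future. Here `synthesize` and `play` are self-contained stubs standing in for the real Piper synthesis and audio-output calls; the types and sleep timings are illustrative assumptions, not the orchestrator's actual interface:

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

static int sentences_played = 0;  // for observing the sketch's behavior

// Stub synthesis: pretends to take 20 ms and returns dummy PCM samples.
std::vector<float> synthesize(const std::string& s) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return std::vector<float>(s.size(), 0.0f);
}

// Stub playback: pretends the audio lasts 30 ms.
void play(const std::vector<float>&) {
    std::this_thread::sleep_for(std::chrono::milliseconds(30));
    ++sentences_played;
}

// Double buffering: sentence i+1 is synthesized while sentence i plays.
void speak(const std::vector<std::string>& sentences) {
    if (sentences.empty()) return;
    auto pending = std::async(std::launch::async, synthesize, sentences[0]);
    for (size_t i = 0; i < sentences.size(); ++i) {
        std::vector<float> pcm = pending.get();
        if (i + 1 < sentences.size())
            pending = std::async(std::launch::async, synthesize, sentences[i + 1]);
        play(pcm);  // next sentence's synthesis overlaps this playback
    }
}
```

With these stub timings, each later synthesis (20 ms) completes entirely inside the previous playback (30 ms), so only the first sentence's synthesis latency is perceived.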

7. Sentence-Level Streaming

Impact: Reduces time to first audio by 200-400 ms

SentenceDetector emits sentences as soon as boundaries are detected (., !, ?) instead of waiting for full LLM completion:
SentenceDetector detector(queue_sentence,
    /*min_words=*/3,      // Min 3 words for primary break
    /*max_words_sec=*/25, // Secondary break (;,:) at 25 words
    /*word_flush=*/7);    // Flush at 7 words if no punctuation
See src/pipeline/sentence_detector.cpp:6-84.
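
The core boundary rule can be sketched in a few lines. This `split_stream` is a hypothetical batch illustration assuming only the primary breaks (., !, ?) and the 3-word minimum; the real SentenceDetector also handles secondary breaks and word-count flushes, and carries state across streamed chunks:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Emit a chunk whenever a primary break (. ! ?) follows at least
// min_words words. Hypothetical sketch, not RCLI's SentenceDetector.
std::vector<std::string> split_stream(const std::string& text, int min_words = 3) {
    std::vector<std::string> out;
    std::string cur;
    int words = 0;
    bool in_word = false;
    for (char c : text) {
        cur += c;
        if (std::isspace(static_cast<unsigned char>(c))) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            ++words;  // count the start of each word
        }
        if ((c == '.' || c == '!' || c == '?') && words >= min_words) {
            out.push_back(cur);  // sentence boundary: emit immediately
            cur.clear();
            words = 0;
            in_word = false;
        }
    }
    if (!cur.empty()) out.push_back(cur);  // flush any trailing partial text
    return out;
}
```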

8. Speculative Tool Detection

Impact: Prevents false TTS playback for tool calls

Buffers the first 15 tokens to detect tool call tags before streaming:
if (tokens_buffered <= 15) {
    if (token_buffer.find("<tool_call>") != std::string::npos) {
        detected_tool_call = true;  // Stop streaming to TTS
    }
} else {
    detector.feed(tok.text);  // Stream to TTS
}
Adaptive buffering extends window if partial tag detected (e.g., buffer ends with <tool). See src/pipeline/orchestrator.cpp:215-271.

9. Tool Definition Filtering

Impact: 30-50% reduction in prompt tokens for tool calls

Instead of sending all 43 action definitions, keyword matching scores relevance and sends only the top-k:
std::string hint = tools_.build_tool_hint(user_text);
// Returns: "Relevant tools: open_app, quit_app"
Reduces prompt tokens from ~2500 (all tools) to ~300 (top-5), improving first-token latency.
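
The filtering idea can be sketched as keyword-overlap scoring. `Tool` and `top_k_tools` below are hypothetical stand-ins for the real `build_tool_hint` logic, which may score differently:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

struct Tool {
    std::string name;
    std::vector<std::string> keywords;
};

// Score each tool by keyword overlap with the query; keep the top-k names.
// Hypothetical sketch of the filtering idea, not RCLI's implementation.
std::vector<std::string> top_k_tools(const std::vector<Tool>& tools,
                                     const std::string& query, size_t k = 5) {
    std::string lower = query;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    std::vector<std::pair<int, std::string>> scored;
    for (const auto& t : tools) {
        int score = 0;
        for (const auto& kw : t.keywords)
            if (lower.find(kw) != std::string::npos) ++score;
        if (score > 0) scored.emplace_back(score, t.name);  // drop irrelevant tools
    }
    std::sort(scored.begin(), scored.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    std::vector<std::string> out;
    for (size_t i = 0; i < scored.size() && i < k; ++i)
        out.push_back(scored[i].second);
    return out;
}
```

The returned names would then be formatted into the "Relevant tools: ..." hint shown above, so the prompt carries only the definitions the query is likely to need.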

10. Conversation History Trimming

Impact: Fits 10+ turns in 4096 context window
int ctx_size = llm_.context_size();
int history_budget = ctx_size - system_tokens - user_tokens - 512;

int total = 0;
std::vector<std::string> trimmed;
for (int i = history.size() - 1; i >= 0; i--) {
    int entry_tokens = llm_.count_tokens(history[i]);
    if (total + entry_tokens > history_budget) break;
    total += entry_tokens;
    trimmed.insert(trimmed.begin(), history[i]);
}
Evicts oldest turns first to stay within context limit. See src/pipeline/orchestrator.cpp:906-921.

Hardware Profiling

CPU Topology Detection

RCLI detects P-cores (performance) and E-cores (efficiency) on Apple Silicon:
// macOS sysctl API
sysctlbyname("hw.perflevel0.physicalcpu", &p_cores, ...);
sysctlbyname("hw.perflevel1.physicalcpu", &e_cores, ...);

// Example output (M3 Max):
p_cores = 10  // High-performance cores
e_cores = 4   // Efficiency cores
LLM decode uses 1 thread (GPU-bound). Prompt eval uses all P-cores (CPU-bound matrix ops). See src/core/hardware_profile.h:94-104.

RAM-Based Configuration

Pool size and batch size scale with available RAM:
if (ram_total_mb >= 64 * 1024) {        // Mac Studio / Mac Pro
    pool_bytes  = 256 * 1024 * 1024;    // 256 MB pool
    llm_n_batch = 4096;                 // 4K batch
} else if (ram_total_mb >= 32 * 1024) { // M3 Max 36/48 GB
    pool_bytes  = 128 * 1024 * 1024;    // 128 MB pool
    llm_n_batch = 2048;                 // 2K batch
} else {                                 // M3 / M2 / M1 16-24 GB
    pool_bytes  = 64  * 1024 * 1024;    // 64 MB pool
    llm_n_batch = 1024;                 // 1K batch
}
See src/core/hardware_profile.h:131-146.

Metal GPU Detection

Automatically enables Metal GPU offload on macOS:
#if defined(__APPLE__)
    p.has_metal          = true;
    p.llm_gpu_layers     = 99;     // All layers to GPU
    p.llm_flash_attn     = true;   // Flash Attention enabled
    p.llm_n_threads      = 1;      // GPU-bound decode
    p.llm_n_threads_batch = p.perf_cores;  // Prompt eval on CPU
#endif
See src/core/hardware_profile.h:111-117.

Performance Tuning

GPU Layers

Control how many LLM layers run on GPU:
rcli --gpu-layers 0    # CPU-only (slower, lower memory)
rcli --gpu-layers 20   # Hybrid: 20 layers on GPU
rcli --gpu-layers 99   # All layers on GPU (default, fastest)
For models >4B params on devices with <32GB RAM, reduce GPU layers to avoid OOM.

Context Size

Larger context = more conversation history, but slower and more memory:
rcli --ctx-size 2048   # 2K context (faster, less memory)
rcli --ctx-size 4096   # 4K context (default)
rcli --ctx-size 8192   # 8K context (slower, more memory)
Qwen3 supports up to 32K context. LFM2 supports up to 128K context.

Thread Count

Override auto-detected thread count:
export RCLI_LLM_THREADS=4          # Decode threads
export RCLI_LLM_THREADS_BATCH=8    # Prompt eval threads
Default: 1 decode thread (GPU-bound), all P-cores for prompt eval.

Batch Size

Larger batches = faster prompt processing, but more memory:
export RCLI_LLM_BATCH=512   # Low memory
export RCLI_LLM_BATCH=2048  # Default (high RAM)
export RCLI_LLM_BATCH=4096  # Mac Studio (64+ GB)

Benchmarking

Run All Benchmarks

rcli bench  # STT, LLM, TTS, E2E, RAG, tools, memory

Individual Suites

rcli bench --suite stt      # STT latency + RTF
rcli bench --suite llm      # LLM first token + throughput
rcli bench --suite tts      # TTS latency + RTF
rcli bench --suite e2e      # End-to-end voice pipeline
rcli bench --suite tools    # Tool calling accuracy + latency
rcli bench --suite rag      # Retrieval latency + accuracy
rcli bench --suite memory   # Memory pool usage

Compare All Models

rcli bench --all-llm --suite llm    # Compare all installed LLMs
rcli bench --all-tts --suite tts    # Compare all installed TTS voices

Export Results

rcli bench --output results.json  # Export to JSON
JSON schema:
{
  "timestamp": "2026-03-07T12:34:56Z",
  "hardware": {
    "platform": "macos",
    "cpu_cores": 14,
    "ram_gb": 36,
    "gpu": "M3 Max 30-core"
  },
  "stt": {
    "model": "whisper-base.en",
    "avg_latency_ms": 43.7,
    "rtf": 0.022
  },
  "llm": {
    "model": "qwen3-0.6b",
    "first_token_ms": 22.5,
    "throughput_tps": 159.6
  },
  "tts": {
    "model": "piper-lessac",
    "avg_latency_ms": 150.6,
    "rtf": 0.18
  },
  "e2e_latency_ms": 131,
  "rag_latency_ms": 3.82
}

Profiling

Memory Pool Stats

rcli info  # Shows pool usage + high-water mark
Output:
Memory pool: 64 MB allocated, 58.2 MB used (90.9%)
High-water mark: 59.1 MB

Live Performance Metrics

Press M in TUI to see real-time metrics:
  • LLM tokens/sec
  • Memory pool utilization
  • Ring buffer fill %
  • CPU/GPU usage (via IOKit)

Next Steps

Configuration

Config files, environment variables, and tuning

Troubleshooting

Common issues, debugging, and logs
