RCLI includes a comprehensive benchmark suite for measuring latency, throughput, and accuracy across all AI subsystems.
## Usage

```sh
rcli bench                        # Run all benchmarks
rcli bench --suite llm            # LLM only
rcli bench --suite stt,tts        # Multiple suites
rcli bench --output results.json  # Export to JSON
rcli bench --all-llm --suite llm  # Compare all installed LLMs
```
## Benchmark Suites

| Suite | What it measures |
|---|---|
| `stt` | Speech-to-text latency and accuracy (WER) |
| `llm` | Time-to-first-token, throughput, prompt caching |
| `tts` | Text-to-speech synthesis latency and real-time factor |
| `e2e` | Full pipeline: STT → LLM → TTS with parallel sentence streaming |
| `tools` | Tool calling accuracy and latency |
| `rag` | Embedding, vector search, BM25, hybrid retrieval |
| `memory` | Process RSS and memory pool usage |
## Options

- `--suite` — Comma-separated list of suites to run
- `--output` — Export results to a JSON file
- `--rag` — Load a RAG index for RAG benchmarks
- `--all-llm` — Benchmark all installed LLM models
- `--all-tts` — Benchmark all installed TTS voices
- Runs per test — Number of measured runs for each test
## STT Benchmark

Measures offline STT (Whisper/Parakeet) transcription latency:

```sh
rcli bench --suite stt
```

Output:

```
--- STT Benchmark ---
Run 1: 43.2ms - "open Safari and go to GitHub"
Run 2: 41.8ms - "open Safari and go to GitHub"
Run 3: 44.1ms - "open Safari and go to GitHub"
STT Latency (avg): 43.0ms
STT Audio Duration: 1920ms
STT Real-Time Factor: 0.022x
```
### Metrics
- Latency — Time to transcribe audio
- Audio Duration — Length of test audio
- Real-Time Factor (RTF) — latency / audio_duration (lower is better)
RTF < 0.1x means 10x faster than real-time (excellent for on-device)
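As a worked check, RTF for the sample run above is just the ratio of the two reported numbers:

```python
# Real-time factor = transcription latency / audio duration.
# Values copied from the sample STT run above.
latency_ms = 43.0   # average transcription latency
audio_ms = 1920.0   # duration of the test clip

rtf = latency_ms / audio_ms
print(f"RTF: {rtf:.3f}x")  # RTF: 0.022x (~45x faster than real-time)
```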
## LLM Benchmark

Measures LLM generation speed with various prompt lengths:

```sh
rcli bench --suite llm
```

Output:

```
--- LLM Benchmark ---
Prompt: "What is the capital of France?"
Response: "The capital of France is Paris."
First token: 22.5ms, 159.6 tok/s, 12 tokens
Prompt: "Explain quantum computing in one sentence."
Response: "Quantum computing uses qubits..."
First token: 18.3ms, 187.2 tok/s, 34 tokens
LLM First Token (avg): 20.4ms
LLM Throughput (avg): 173.4 tok/s
LLM Prompt Eval: 2847 tok/s
```
### Metrics
- TTFT (Time to First Token) — Latency before first token
- Throughput — Tokens generated per second
- Prompt Eval — Prompt processing speed
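The summary lines are plain averages over the per-prompt runs; a quick sketch with the sample numbers:

```python
# Per-prompt measurements from the sample LLM run above.
ttft_ms = [22.5, 18.3]       # time to first token per prompt
tok_per_s = [159.6, 187.2]   # generation throughput per prompt

avg_ttft = sum(ttft_ms) / len(ttft_ms)
avg_tps = sum(tok_per_s) / len(tok_per_s)
print(f"LLM First Token (avg): {avg_ttft:.1f}ms")    # 20.4ms
print(f"LLM Throughput (avg): {avg_tps:.1f} tok/s")  # 173.4 tok/s
```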
### Prompt Caching Benchmark

Tests system prompt KV caching:

```
--- LLM Cached Prompt Benchmark ---
Prompt: "What is the capital of France?"
First token: 15.2ms, 189.3 tok/s, 12 tokens
LLM Cached TTFT (avg): 15.2ms
```
Cached prompts are ~30% faster than uncached.
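The ~30% figure follows from comparing the cached TTFT against the uncached TTFT for the same prompt in the LLM benchmark above:

```python
uncached_ttft_ms = 22.5  # "What is the capital of France?" without KV cache
cached_ttft_ms = 15.2    # same prompt with the system prompt cached

speedup = (uncached_ttft_ms - cached_ttft_ms) / uncached_ttft_ms
print(f"{speedup:.0%} faster")  # 32% faster
```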
### By-Length Benchmark

Tests LLM performance across response lengths:

```
--- LLM By Length Benchmark ---
[Short]
"What is the capital of France?"
TTFT: 18ms, 203 tok/s, 12 tokens
[Medium]
"Explain how a car engine works."
TTFT: 20ms, 187 tok/s, 87 tokens
[Long]
"Explain 5 important laws of physics."
TTFT: 22ms, 172 tok/s, 234 tokens
```
## TTS Benchmark

Measures TTS synthesis latency:

```sh
rcli bench --suite tts
```

Output:

```
--- TTS Benchmark ---
"Hello, I am your AI assistant."
Latency: 142.3ms, Audio: 1680ms, RTF: 0.85
"The weather today is partly cloudy."
Latency: 156.7ms, Audio: 2100ms, RTF: 0.75
TTS Latency (avg): 149.5ms
TTS Real-Time Factor (avg): 0.80x
```
### Metrics
- Latency — Time to synthesize audio
- Audio Duration — Length of synthesized audio
- RTF — latency / audio_duration (< 1.0x is real-time capable)
## E2E Benchmark

Measures full pipeline latency (STT → LLM → TTS):

```sh
rcli bench --suite e2e
```

Output:

```
--- E2E Pipeline Benchmark ---
Input: test_audio.wav
Transcript: "open Safari and go to GitHub"
Response: "Opening Safari and navigating to GitHub."
E2E STT: 43.2ms
E2E LLM First Token: 18.7ms
E2E LLM Total: 187.3ms
E2E TTS First Sentence: 142.1ms
E2E Latency (STT+LLM_FT+TTS): 204.0ms
E2E Total: 372.6ms
```
### Metrics
- E2E Latency (TTFA) — Time to first audio output (STT + LLM first token + TTS first sentence)
- E2E Total — Complete pipeline (including full LLM + TTS)
TTFA < 200ms is the target for responsive voice AI
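Using the stage timings from the sample run, both summary lines decompose as simple sums:

```python
# Stage latencies from the sample E2E run above (ms).
stt = 43.2
llm_first_token = 18.7
llm_total = 187.3
tts_first_sentence = 142.1

ttfa = stt + llm_first_token + tts_first_sentence  # time to first audio
total = stt + llm_total + tts_first_sentence       # full LLM + first-sentence TTS
print(f"E2E Latency: {ttfa:.1f}ms")  # 204.0ms
print(f"E2E Total: {total:.1f}ms")   # 372.6ms
```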
### Long-Form Benchmark

Tests parallel LLM + TTS streaming:

```
--- E2E Long-Form Benchmark ---
Prompt: "Explain 5 important laws of physics."
LLM tokens: 234 at 172 tok/s
LLM first token: 22ms
LLM total: 1358ms
TTS first sentence ready: 189ms
E2E latency (first audio): 211ms
Total (LLM+TTS complete): 1847ms
```
Parallel TTS ensures audio starts playing while LLM is still generating.
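A minimal sketch of the producer/consumer shape behind parallel sentence streaming; `generate_tokens`, `synthesize`, and `play` are hypothetical stand-ins for the real LLM and TTS calls:

```python
import queue
import threading

def stream_pipeline(generate_tokens, synthesize, play):
    """Synthesize and play sentences while the LLM is still generating."""
    sentences: queue.Queue = queue.Queue()

    def producer():
        buf = ""
        for tok in generate_tokens():                   # streaming LLM tokens
            buf += tok
            if buf.rstrip().endswith((".", "!", "?")):  # naive sentence boundary
                sentences.put(buf.strip())
                buf = ""
        if buf.strip():
            sentences.put(buf.strip())
        sentences.put(None)                             # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (sentence := sentences.get()) is not None:
        play(synthesize(sentence))  # audio starts before the LLM finishes
```

The first sentence reaches TTS as soon as its terminator arrives, which is why first-audio latency (211ms above) tracks the first sentence rather than the full 1358ms generation.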
## Tool Calling Benchmark

Measures tool calling accuracy and latency:

```sh
rcli bench --suite tools
```

Output:

```
--- Tool Calling Benchmark ---
Query: "What time is it right now?"
First pass: 187.3ms, detected=yes (tool=get_current_time)
Tool exec: 12.4ms, result={"time": "2:30 PM"}
Second pass: 156.8ms
Response: "It's 2:30 PM."
Total: 356.5ms
Query: "Calculate 42 plus 17"
First pass: 174.2ms, detected=yes (tool=calculate)
Tool exec: 8.1ms, result={"result": 59}
Second pass: 142.3ms
Response: "42 plus 17 equals 59."
Total: 324.6ms
Tool Success Rate: 100%
```
### Metrics
- First pass — LLM generates tool call
- Tool exec — Action execution time
- Second pass — LLM generates natural language response
- Total — End-to-end tool calling latency
- Success Rate — Percentage of correct tool detections
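Each per-query total is simply the sum of the three phases; here with the first query above:

```python
# Phase timings for "What time is it right now?" (ms).
phases_ms = {"first_pass": 187.3, "tool_exec": 12.4, "second_pass": 156.8}

total = sum(phases_ms.values())
print(f"Total: {total:.1f}ms")  # Total: 356.5ms
```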
## RAG Benchmark

Measures retrieval performance:

```sh
rcli bench --suite rag --rag ~/Library/RCLI/index
```

Output:

```
--- RAG Benchmark ---
Embedding: 7.8ms
Vector search: 2.1ms (5000 chunks)
BM25 search: 0.9ms
RRF fusion: 0.4ms
Total retrieval: 3.8ms
Full RAG query: 234.7ms
  • Retrieval: 3.8ms
  • LLM: 230.9ms (87 tok, 189 tok/s)
```
### Metrics
- Embedding — Query embedding time
- Vector search — HNSW index search
- BM25 search — Full-text search
- RRF fusion — Reciprocal rank fusion
- Total retrieval — Combined retrieval latency
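Reciprocal rank fusion merges the vector-search and BM25 rankings by summed reciprocal ranks; a minimal sketch with the conventional k=60 constant and illustrative doc IDs:

```python
def rrf(rankings, k=60):
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # vector-search order
bm25_hits = ["d1", "d5", "d3"]    # BM25 full-text order
print(rrf([vector_hits, bm25_hits]))  # ['d1', 'd3', 'd5', 'd7']
```

A document ranked well by both searches (`d1`, `d3`) outscores one ranked highly by only a single list, which is the point of hybrid retrieval.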
## Memory Benchmark

Reports process memory usage:

```
--- Memory Benchmark ---
Pool allocated: 64 MB
Process RSS: 1.2 GB
```
## Comparing Models

### All LLMs

```sh
rcli bench --all-llm --suite llm
```

Output:

```
--- LLM Benchmark (All Models) ---
[LFM2 350M]
TTFT: 12ms, 350 tok/s
[Qwen3 0.6B]
TTFT: 18ms, 250 tok/s
[LFM2 1.2B Tool]
TTFT: 22ms, 180 tok/s
[Qwen3.5 2B]
TTFT: 25ms, 150 tok/s
[Qwen3.5 4B]
TTFT: 42ms, 75 tok/s
```
### All TTS

```sh
rcli bench --all-tts --suite tts
```

Output:

```
--- TTS Benchmark (All Voices) ---
[Piper Lessac]
Latency: 142ms, RTF: 0.8x
[Piper Amy]
Latency: 138ms, RTF: 0.7x
[Kokoro English]
Latency: 189ms, RTF: 1.1x
```
## JSON Export

```sh
rcli bench --output results.json
```

`results.json`:

```json
{
  "results": [
    {"name": "STT Latency (avg)", "category": "stt", "value": 43.0, "unit": "ms"},
    {"name": "LLM First Token (avg)", "category": "llm", "value": 20.4, "unit": "ms"},
    {"name": "LLM Throughput (avg)", "category": "llm", "value": 173.4, "unit": "tok/s"},
    {"name": "TTS Latency (avg)", "category": "tts", "value": 149.5, "unit": "ms"},
    {"name": "E2E Latency", "category": "e2e", "value": 204.0, "unit": "ms"}
  ]
}
```
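The flat `results` array is easy to post-process; for example, grouping metrics by category (the embedded sample mirrors a subset of the export above):

```python
import json

exported = json.loads("""{
  "results": [
    {"name": "LLM First Token (avg)", "category": "llm", "value": 20.4, "unit": "ms"},
    {"name": "LLM Throughput (avg)", "category": "llm", "value": 173.4, "unit": "tok/s"},
    {"name": "TTS Latency (avg)", "category": "tts", "value": 149.5, "unit": "ms"}
  ]
}""")

def by_category(doc):
    """Group exported metrics by category for quick scanning or diffing."""
    grouped = {}
    for r in doc["results"]:
        grouped.setdefault(r["category"], []).append(
            f'{r["name"]}: {r["value"]} {r["unit"]}')
    return grouped

print(by_category(exported)["llm"])
# ['LLM First Token (avg): 20.4 ms', 'LLM Throughput (avg): 173.4 tok/s']
```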
## Example Output (All Suites)

```sh
rcli bench
```

```
╔══════════════════════════════════════════╗
║ RCLI Benchmark Suite ║
║ Suite: all Runs: 3 ║
╚══════════════════════════════════════════╝
Initializing engine...
Benchmarking:
  LLM = Qwen3.5 2B
  STT = Whisper base.en
  TTS = Piper Lessac

--- Memory Benchmark ---
Pool allocated: 64 MB

--- STT Benchmark ---
Run 1: 43.2ms - "open Safari"
Run 2: 41.8ms - "open Safari"
Run 3: 44.1ms - "open Safari"

--- LLM Benchmark ---
Prompt: "What is the capital of France?"
TTFT: 22.5ms, 159.6 tok/s, 12 tokens

--- TTS Benchmark ---
"Hello, I am your AI assistant."
Latency: 142.3ms, RTF: 0.85

--- E2E Pipeline Benchmark ---
E2E Latency: 204.0ms

╔════════════════════════════════════════════════════════╗
║ BENCHMARK RESULTS ║
╠════════════════════════════════════════════════════════╣
║ [ memory ] ║
║ Pool Size 64.0 MB ║
║ [ stt ] ║
║ STT Latency (avg) 43.0 ms ║
║ STT Real-Time Factor 0.0 x ║
║ [ llm ] ║
║ LLM First Token (avg) 20.4 ms ║
║ LLM Throughput (avg) 173.4 tok/s ║
║ [ tts ] ║
║ TTS Latency (avg) 149.5 ms ║
║ TTS Real-Time Factor (avg) 0.8 x ║
║ [ e2e ] ║
║ E2E Latency (STT+LLM_FT+TTS) 204.0 ms ║
╚════════════════════════════════════════════════════════╝
RCLI — Benchmark complete.
```
## Performance Targets

On Apple M3 Max (14-core CPU, 30-core GPU, 36GB RAM):

| Metric | Target | Actual |
|---|---|---|
| STT latency | < 50ms | 43.7ms |
| LLM TTFT | < 25ms | 22.5ms |
| LLM throughput | > 150 tok/s | 159.6 tok/s |
| TTS latency | < 200ms | 150.6ms |
| E2E latency (TTFA) | < 200ms | 131ms |
| RAG retrieval | < 5ms | 3.82ms |

Performance scales with chip generation (M1 < M2 < M3 < M4).
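With `--output`, the table above can drive an automated regression gate; a sketch where the thresholds are copied from the table and the metric names match the JSON export:

```python
# metric name -> (limit, direction); "max" = must stay under, "min" = must exceed
TARGETS = {
    "STT Latency (avg)": (50.0, "max"),
    "LLM First Token (avg)": (25.0, "max"),
    "LLM Throughput (avg)": (150.0, "min"),
    "TTS Latency (avg)": (200.0, "max"),
}

def check_targets(results):
    """Return the names of exported metrics that miss their target."""
    failures = []
    for r in results:
        target = TARGETS.get(r["name"])
        if target is None:
            continue  # no target defined for this metric
        limit, direction = target
        ok = r["value"] <= limit if direction == "max" else r["value"] >= limit
        if not ok:
            failures.append(r["name"])
    return failures

sample = [{"name": "LLM First Token (avg)", "category": "llm", "value": 20.4, "unit": "ms"}]
print(check_targets(sample))  # [] -> every checked metric is within target
```

Feeding the `results` array from an exported `results.json` into `check_targets` makes it easy to fail a CI job when a model or engine change regresses a metric.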