RCLI includes a comprehensive benchmark suite for measuring latency, throughput, and accuracy across all AI subsystems.

Usage

rcli bench                          # Run all benchmarks
rcli bench --suite llm              # LLM only
rcli bench --suite stt,tts          # Multiple suites
rcli bench --output results.json    # Export to JSON
rcli bench --all-llm --suite llm    # Compare all installed LLMs

Benchmark Suites

  all       All benchmarks (default)
  stt       Speech-to-text latency and accuracy (WER)
  llm       Time-to-first-token, throughput, prompt caching
  tts       Text-to-speech synthesis latency and real-time factor
  e2e       Full pipeline: STT → LLM → TTS with parallel sentence streaming
  tools     Tool calling accuracy and latency
  rag       Embedding, vector search, BM25, hybrid retrieval
  memory    Process RSS and memory pool usage

Options

  --suite    string   default: "all"   Comma-separated list of suites to run
  --runs     number   default: 3       Number of measured runs per test
  --output   string                    Export results to a JSON file
  --rag      string                    Load a RAG index for the RAG benchmarks
  --all-llm  boolean                   Benchmark all installed LLM models
  --all-tts  boolean                   Benchmark all installed TTS voices

STT Benchmark

Measures offline STT (Whisper/Parakeet) transcription latency:
rcli bench --suite stt

# Output:
--- STT Benchmark ---
  Run 1: 43.2ms - "open Safari and go to GitHub"
  Run 2: 41.8ms - "open Safari and go to GitHub"
  Run 3: 44.1ms - "open Safari and go to GitHub"

STT Latency (avg): 43.0ms
STT Audio Duration: 1920ms
STT Real-Time Factor: 0.022x

Metrics

  • Latency — Time to transcribe audio
  • Audio Duration — Length of test audio
  • Real-Time Factor (RTF) — latency / audio_duration (lower is better)
An RTF below 0.1x means transcription runs more than 10x faster than real time, which is excellent for on-device use.
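
The RTF arithmetic is straightforward; a small illustrative helper (not part of rcli):

```python
def real_time_factor(latency_ms: float, audio_duration_ms: float) -> float:
    """RTF = transcription latency / audio duration; lower is better."""
    return latency_ms / audio_duration_ms

# Matches the sample run above: 43.0ms to transcribe 1920ms of audio.
rtf = real_time_factor(43.0, 1920.0)
print(f"RTF: {rtf:.3f}x")  # 0.022x, i.e. ~45x faster than real time
```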

LLM Benchmark

Measures LLM generation speed with various prompt lengths:
rcli bench --suite llm

# Output:
--- LLM Benchmark ---
  Prompt: "What is the capital of France?"
  Response: "The capital of France is Paris."
  First token: 22.5ms, 159.6 tok/s, 12 tokens

  Prompt: "Explain quantum computing in one sentence."
  Response: "Quantum computing uses qubits..."
  First token: 18.3ms, 187.2 tok/s, 34 tokens

LLM First Token (avg): 20.4ms
LLM Throughput (avg): 173.4 tok/s
LLM Prompt Eval: 2847 tok/s

Metrics

  • TTFT (Time to First Token) — Latency before first token
  • Throughput — Tokens generated per second
  • Prompt Eval — Prompt processing speed
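
TTFT and throughput can be measured from any token stream; a minimal sketch, assuming `token_stream` is an iterable of generated tokens (prompt-eval timing omitted):

```python
import time

def measure_generation(token_stream):
    """Return (TTFT in ms, tokens/sec, token count) for a token iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # latency to first token
        count += 1
    total = time.perf_counter() - start
    return ttft * 1000.0, count / total, count
```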

Prompt Caching Benchmark

Tests system prompt KV caching:
--- LLM Cached Prompt Benchmark ---
  Prompt: "What is the capital of France?"
  First token: 15.2ms, 189.3 tok/s, 12 tokens

LLM Cached TTFT (avg): 15.2ms
Cached prompts are ~30% faster than uncached.

By-Length Benchmark

Tests LLM performance across response lengths:
--- LLM By Length Benchmark ---

  [Short]
    "What is the capital of France?"
    TTFT: 18ms, 203 tok/s, 12 tokens

  [Medium]
    "Explain how a car engine works."
    TTFT: 20ms, 187 tok/s, 87 tokens

  [Long]
    "Explain 5 important laws of physics."
    TTFT: 22ms, 172 tok/s, 234 tokens

TTS Benchmark

Measures TTS synthesis latency:
rcli bench --suite tts

# Output:
--- TTS Benchmark ---
  "Hello, I am your AI assistant."
  Latency: 142.3ms, Audio: 1680ms, RTF: 0.85

  "The weather today is partly cloudy."
  Latency: 156.7ms, Audio: 2100ms, RTF: 0.75

TTS Latency (avg): 149.5ms
TTS Real-Time Factor (avg): 0.80x

Metrics

  • Latency — Time to synthesize audio
  • Audio Duration — Length of synthesized audio
  • RTF — latency / audio_duration (< 1.0x is real-time capable)

E2E Benchmark

Measures full pipeline latency (STT → LLM → TTS):
rcli bench --suite e2e

# Output:
--- E2E Pipeline Benchmark ---
  Input: test_audio.wav
  Transcript: "open Safari and go to GitHub"
  Response: "Opening Safari and navigating to GitHub."

E2E STT: 43.2ms
E2E LLM First Token: 18.7ms
E2E LLM Total: 187.3ms
E2E TTS First Sentence: 142.1ms
E2E Latency (STT+LLM_FT+TTS): 204.0ms
E2E Total: 372.6ms

Metrics

  • E2E Latency (TTFA) — Time to first audio output (STT + LLM first token + TTS first sentence)
  • E2E Total — Complete pipeline (including full LLM + TTS)
TTFA < 200ms is the target for responsive voice AI
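
The TTFA figure is the plain sum of the three serial stages:

```python
def ttfa_ms(stt_ms: float, llm_first_token_ms: float,
            tts_first_sentence_ms: float) -> float:
    """Time to first audio: the user hears nothing until all three finish."""
    return stt_ms + llm_first_token_ms + tts_first_sentence_ms

# The sample run above: 43.2 + 18.7 + 142.1
print(round(ttfa_ms(43.2, 18.7, 142.1), 1))  # 204.0
```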

Long-Form E2E

Tests parallel LLM + TTS streaming:
--- E2E Long-Form Benchmark ---
  Prompt: "Explain 5 important laws of physics."
  LLM tokens: 234 at 172 tok/s
  LLM first token: 22ms
  LLM total: 1358ms
  TTS first sentence ready: 189ms
  E2E latency (first audio): 211ms
  Total (LLM+TTS complete): 1847ms
Parallel TTS ensures audio starts playing while LLM is still generating.
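
That producer/consumer overlap can be sketched with a queue and a worker thread; a hedged illustration with stubbed LLM and TTS (the sentence splitter is deliberately naive):

```python
import queue
import threading

def stream_pipeline(llm_tokens, synthesize):
    """Feed completed sentences to TTS while the LLM keeps generating.
    `llm_tokens` yields text fragments; `synthesize` stands in for TTS."""
    sentences: queue.Queue = queue.Queue()
    audio = []

    def tts_worker():
        # Synthesize each sentence as it arrives; None signals shutdown.
        while (s := sentences.get()) is not None:
            audio.append(synthesize(s))

    worker = threading.Thread(target=tts_worker)
    worker.start()

    buf = ""
    for tok in llm_tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):   # naive sentence boundary
            sentences.put(buf.strip())
            buf = ""
    if buf.strip():
        sentences.put(buf.strip())                   # flush trailing fragment
    sentences.put(None)                              # sentinel: no more work
    worker.join()
    return audio
```

The first sentence reaches TTS as soon as its final token arrives, so audio playback overlaps the remainder of LLM generation.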

Tools Benchmark

Measures tool calling accuracy and latency:
rcli bench --suite tools

# Output:
--- Tool Calling Benchmark ---

  Query: "What time is it right now?"
    First pass: 187.3ms, detected=yes (tool=get_current_time)
    Tool exec: 12.4ms, result={"time": "2:30 PM"}
    Second pass: 156.8ms
    Response: "It's 2:30 PM."
    Total: 356.5ms

  Query: "Calculate 42 plus 17"
    First pass: 174.2ms, detected=yes (tool=calculate)
    Tool exec: 8.1ms, result={"result": 59}
    Second pass: 142.3ms
    Response: "42 plus 17 equals 59."
    Total: 324.6ms

Tool Success Rate: 100%

Metrics

  • First pass — LLM generates tool call
  • Tool exec — Action execution time
  • Second pass — LLM generates natural language response
  • Total — End-to-end tool calling latency
  • Success Rate — Percentage of correct tool detections
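
The two-pass flow being timed can be sketched as follows (stand-in `llm` and `tools` callables, not rcli's real interfaces):

```python
import json
import time

def two_pass_tool_call(llm, tools, query):
    """Pass 1 emits a tool call; pass 2 turns the tool result into prose."""
    t0 = time.perf_counter()
    call = llm(query)                                 # first pass
    t1 = time.perf_counter()
    result = tools[call["tool"]](**call["args"])      # tool exec
    t2 = time.perf_counter()
    answer = llm(f"{query}\nTool result: {json.dumps(result)}")  # second pass
    t3 = time.perf_counter()
    return answer, {"first_pass": t1 - t0,
                    "tool_exec": t2 - t1,
                    "second_pass": t3 - t2}

# Stubbed demo of the flow timed above:
def fake_llm(prompt):
    if "Tool result" in prompt:
        return "It's 2:30 PM."                        # second pass output
    return {"tool": "get_current_time", "args": {}}   # first pass output

answer, timings = two_pass_tool_call(
    fake_llm, {"get_current_time": lambda: {"time": "2:30 PM"}},
    "What time is it right now?")
```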

RAG Benchmark

Measures retrieval performance:
rcli bench --suite rag --rag ~/Library/RCLI/index

# Output:
--- RAG Benchmark ---
  Embedding: 7.8ms
  Vector search: 2.1ms (5000 chunks)
  BM25 search: 0.9ms
  RRF fusion: 0.4ms
  Total retrieval: 3.8ms

  Full RAG query: 234.7ms
    Retrieval: 3.8ms
    LLM: 230.9ms (87 tok, 189 tok/s)

Metrics

  • Embedding — Query embedding time
  • Vector search — HNSW index search
  • BM25 search — Full-text search
  • RRF fusion — Reciprocal rank fusion
  • Total retrieval — Combined retrieval latency
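
Of these stages, RRF fusion is simple enough to sketch exactly; the function below is an illustrative implementation, not rcli's, using the common RRF constant k=60:

```python
def rrf_fuse(rankings, k: int = 60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1/(k + rank).
    `rankings` is a list of ranked doc-id lists (e.g. vector + BM25)."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "a" tops both lists; "c" appears in both, so it beats "b" (one list only).
fused = rrf_fuse([["a", "b", "c"], ["a", "c", "d"]])
```

Fusing ranks rather than raw scores sidesteps the problem that cosine similarities and BM25 scores live on incompatible scales.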

Memory Benchmark

Reports process memory usage:
--- Memory Benchmark ---
  Pool allocated: 64 MB
  Process RSS: 1.2 GB

Comparing Models

All LLMs

rcli bench --all-llm --suite llm

# Output:
--- LLM Benchmark (All Models) ---

  [LFM2 350M]
    TTFT: 12ms, 350 tok/s

  [Qwen3 0.6B]
    TTFT: 18ms, 250 tok/s

  [LFM2 1.2B Tool]
    TTFT: 22ms, 180 tok/s

  [Qwen3.5 2B]
    TTFT: 25ms, 150 tok/s

  [Qwen3.5 4B]
    TTFT: 42ms, 75 tok/s

All TTS

rcli bench --all-tts --suite tts

# Output:
--- TTS Benchmark (All Voices) ---

  [Piper Lessac]
    Latency: 142ms, RTF: 0.8x

  [Piper Amy]
    Latency: 138ms, RTF: 0.7x

  [Kokoro English]
    Latency: 189ms, RTF: 1.1x

JSON Export

rcli bench --output results.json

# results.json:
{
  "results": [
    {"name": "STT Latency (avg)", "category": "stt", "value": 43.0, "unit": "ms"},
    {"name": "LLM First Token (avg)", "category": "llm", "value": 20.4, "unit": "ms"},
    {"name": "LLM Throughput (avg)", "category": "llm", "value": 173.4, "unit": "tok/s"},
    {"name": "TTS Latency (avg)", "category": "tts", "value": 149.5, "unit": "ms"},
    {"name": "E2E Latency", "category": "e2e", "value": 204.0, "unit": "ms"}
  ]
}
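
One way to consume the export, for example in CI, is to diff values against latency budgets; a hypothetical script (the thresholds here are illustrative, and the field names match the sample above):

```python
# Budgets to enforce, in milliseconds (illustrative values).
targets_ms = {"E2E Latency": 200.0, "STT Latency (avg)": 50.0}

def regressions(report):
    """Names of 'ms' results that exceed their budget."""
    return [r["name"] for r in report["results"]
            if r["unit"] == "ms"
            and r["value"] > targets_ms.get(r["name"], float("inf"))]

# In practice `report` would come from json.load(open("results.json")).
report = {"results": [
    {"name": "STT Latency (avg)", "category": "stt", "value": 43.0, "unit": "ms"},
    {"name": "E2E Latency", "category": "e2e", "value": 204.0, "unit": "ms"},
]}
print(regressions(report))  # ['E2E Latency']
```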

Example Output (All Suites)

rcli bench

╔══════════════════════════════════════════╗
        RCLI Benchmark Suite
  Suite: all  Runs: 3
╚══════════════════════════════════════════╝

Initializing engine...

Benchmarking:
  LLM = Qwen3.5 2B
  STT = Whisper base.en
  TTS = Piper Lessac

--- Memory Benchmark ---
  Pool allocated: 64 MB

--- STT Benchmark ---
  Run 1: 43.2ms - "open Safari"
  Run 2: 41.8ms - "open Safari"
  Run 3: 44.1ms - "open Safari"

--- LLM Benchmark ---
  Prompt: "What is the capital of France?"
  TTFT: 22.5ms, 159.6 tok/s, 12 tokens

--- TTS Benchmark ---
  "Hello, I am your AI assistant."
  Latency: 142.3ms, RTF: 0.85

--- E2E Pipeline Benchmark ---
  E2E Latency: 204.0ms

╔════════════════════════════════════════════════════════╗
                 BENCHMARK RESULTS
╠════════════════════════════════════════════════════════╣
 [ memory ]
   Pool Size                             64.0 MB
 [ stt ]
   STT Latency (avg)                     43.0 ms
   STT Real-Time Factor                   0.0 x
 [ llm ]
   LLM First Token (avg)                 20.4 ms
   LLM Throughput (avg)                 173.4 tok/s
 [ tts ]
   TTS Latency (avg)                    149.5 ms
   TTS Real-Time Factor (avg)             0.8 x
 [ e2e ]
   E2E Latency (STT+LLM_FT+TTS)         204.0 ms
╚════════════════════════════════════════════════════════╝

  RCLI Benchmark complete.

Performance Targets

On Apple M3 Max (14-core CPU, 30-core GPU, 36GB RAM):
  Metric               Target        Actual
  STT latency          < 50ms        43.7ms
  LLM TTFT             < 25ms        22.5ms
  LLM throughput       > 150 tok/s   159.6 tok/s
  TTS latency          < 200ms       150.6ms
  E2E latency (TTFA)   < 200ms       131ms
  RAG retrieval        < 5ms         3.82ms
Performance scales with chip generation (M1 < M2 < M3 < M4)
