RCLI includes a comprehensive benchmark suite for measuring latency, throughput, and accuracy across all AI subsystems.
## Usage

```sh
rcli bench                        # Run all benchmarks
rcli bench --suite llm            # LLM only
rcli bench --suite stt,tts        # Multiple suites
rcli bench --output results.json  # Export to JSON
rcli bench --all-llm --suite llm  # Compare all installed LLMs
```
## Benchmark Suites

| Suite | What it measures |
|---|---|
| `stt` | Speech-to-text latency and accuracy (WER) |
| `llm` | Time-to-first-token, throughput, prompt caching |
| `tts` | Text-to-speech synthesis latency and real-time factor |
| `e2e` | Full pipeline: STT → LLM → TTS with parallel sentence streaming |
| `tools` | Tool calling accuracy and latency |
| `rag` | Embedding, vector search, BM25, hybrid retrieval |
| `memory` | Process RSS and memory pool usage |
## Options

- `--suite` — Comma-separated list of suites to run
- `--output` — Export results to a JSON file
- `--rag` — Load a RAG index for RAG benchmarks
- `--all-llm` — Benchmark all installed LLM models
- `--all-tts` — Benchmark all installed TTS voices
- Runs per test — Number of measured runs for each test
## STT Benchmark

Measures offline STT (Whisper/Parakeet) transcription latency:

```sh
rcli bench --suite stt
```

Output:

```
--- STT Benchmark ---
Run 1: 43.2ms - "open Safari and go to GitHub"
Run 2: 41.8ms - "open Safari and go to GitHub"
Run 3: 44.1ms - "open Safari and go to GitHub"
STT Latency (avg): 43.0ms
STT Audio Duration: 1920ms
STT Real-Time Factor: 0.022x
```
### Metrics
- Latency — Time to transcribe audio
- Audio Duration — Length of test audio
- Real-Time Factor (RTF) — latency / audio_duration (lower is better)
RTF < 0.1x means 10x faster than real-time (excellent for on-device)
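As a worked check, RTF for the sample run above is just the ratio of the two reported numbers:

```python
# Real-time factor = transcription latency / audio duration.
# Values copied from the sample STT run above.
latency_ms = 43.0   # average transcription latency
audio_ms = 1920.0   # duration of the test clip

rtf = latency_ms / audio_ms
print(f"RTF: {rtf:.3f}x")  # RTF: 0.022x (~45x faster than real-time)
```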
## LLM Benchmark

Measures LLM generation speed with various prompt lengths:

```sh
rcli bench --suite llm
```

Output:

```
--- LLM Benchmark ---
Prompt: "What is the capital of France?"
Response: "The capital of France is Paris."
First token: 22.5ms, 159.6 tok/s, 12 tokens
Prompt: "Explain quantum computing in one sentence."
Response: "Quantum computing uses qubits..."
First token: 18.3ms, 187.2 tok/s, 34 tokens
LLM First Token (avg): 20.4ms
LLM Throughput (avg): 173.4 tok/s
LLM Prompt Eval: 2847 tok/s
```
### Metrics
- TTFT (Time to First Token) — Latency before first token
- Throughput — Tokens generated per second
- Prompt Eval — Prompt processing speed
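The summary lines are plain averages over the per-prompt runs; a quick sketch with the sample numbers:

```python
# Per-prompt measurements from the sample LLM run above.
ttft_ms = [22.5, 18.3]       # time to first token per prompt
tok_per_s = [159.6, 187.2]   # generation throughput per prompt

avg_ttft = sum(ttft_ms) / len(ttft_ms)
avg_tps = sum(tok_per_s) / len(tok_per_s)
print(f"LLM First Token (avg): {avg_ttft:.1f}ms")    # 20.4ms
print(f"LLM Throughput (avg): {avg_tps:.1f} tok/s")  # 173.4 tok/s
```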
### Prompt Caching Benchmark

Tests system prompt KV caching:

```
--- LLM Cached Prompt Benchmark ---
Prompt: "What is the capital of France?"
First token: 15.2ms, 189.3 tok/s, 12 tokens
LLM Cached TTFT (avg): 15.2ms
```
Cached prompts are ~30% faster than uncached.
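The ~30% figure follows from comparing the cached TTFT against the uncached TTFT for the same prompt in the LLM benchmark above:

```python
uncached_ttft_ms = 22.5  # "What is the capital of France?" without KV cache
cached_ttft_ms = 15.2    # same prompt with the system prompt cached

speedup = (uncached_ttft_ms - cached_ttft_ms) / uncached_ttft_ms
print(f"{speedup:.0%} faster")  # 32% faster
```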
### By-Length Benchmark

Tests LLM performance across response lengths:

```
--- LLM By Length Benchmark ---
[Short]
"What is the capital of France?"
TTFT: 18ms, 203 tok/s, 12 tokens
[Medium]
"Explain how a car engine works."
TTFT: 20ms, 187 tok/s, 87 tokens
[Long]
"Explain 5 important laws of physics."
TTFT: 22ms, 172 tok/s, 234 tokens
```
## TTS Benchmark

Measures TTS synthesis latency:

```sh
rcli bench --suite tts
```

Output:

```
--- TTS Benchmark ---
"Hello, I am your AI assistant."
Latency: 142.3ms, Audio: 1680ms, RTF: 0.85
"The weather today is partly cloudy."
Latency: 156.7ms, Audio: 2100ms, RTF: 0.75
TTS Latency (avg): 149.5ms
TTS Real-Time Factor (avg): 0.80x
```
### Metrics
- Latency — Time to synthesize audio
- Audio Duration — Length of synthesized audio
- RTF — latency / audio_duration (< 1.0x is real-time capable)
## E2E Benchmark

Measures full pipeline latency (STT → LLM → TTS):

```sh
rcli bench --suite e2e
```

Output:

```
--- E2E Pipeline Benchmark ---
Input: test_audio.wav
Transcript: "open Safari and go to GitHub"
Response: "Opening Safari and navigating to GitHub."
E2E STT: 43.2ms
E2E LLM First Token: 18.7ms
E2E LLM Total: 187.3ms
E2E TTS First Sentence: 142.1ms
E2E Latency (STT+LLM_FT+TTS): 204.0ms
E2E Total: 372.6ms
```
### Metrics
- E2E Latency (TTFA) — Time to first audio output (STT + LLM first token + TTS first sentence)
- E2E Total — Complete pipeline (including full LLM + TTS)
TTFA < 200ms is the target for responsive voice AI
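Using the stage timings from the sample run, both summary lines decompose as simple sums:

```python
# Stage latencies from the sample E2E run above (ms).
stt = 43.2
llm_first_token = 18.7
llm_total = 187.3
tts_first_sentence = 142.1

ttfa = stt + llm_first_token + tts_first_sentence  # time to first audio
total = stt + llm_total + tts_first_sentence       # full LLM + first-sentence TTS
print(f"E2E Latency: {ttfa:.1f}ms")  # 204.0ms
print(f"E2E Total: {total:.1f}ms")   # 372.6ms
```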
### Long-Form Benchmark

Tests parallel LLM + TTS streaming:

```
--- E2E Long-Form Benchmark ---
Prompt: "Explain 5 important laws of physics."
LLM tokens: 234 at 172 tok/s
LLM first token: 22ms
LLM total: 1358ms
TTS first sentence ready: 189ms
E2E latency (first audio): 211ms
Total (LLM+TTS complete): 1847ms
```
Parallel TTS ensures audio starts playing while LLM is still generating.
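A minimal sketch of the producer/consumer shape behind parallel sentence streaming; `generate_tokens`, `synthesize`, and `play` are hypothetical stand-ins for the real LLM and TTS calls:

```python
import queue
import threading

def stream_pipeline(generate_tokens, synthesize, play):
    """Synthesize and play sentences while the LLM is still generating."""
    sentences: queue.Queue = queue.Queue()

    def producer():
        buf = ""
        for tok in generate_tokens():                   # streaming LLM tokens
            buf += tok
            if buf.rstrip().endswith((".", "!", "?")):  # naive sentence boundary
                sentences.put(buf.strip())
                buf = ""
        if buf.strip():
            sentences.put(buf.strip())
        sentences.put(None)                             # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (sentence := sentences.get()) is not None:
        play(synthesize(sentence))  # audio starts before the LLM finishes
```

The first sentence reaches TTS as soon as its terminator arrives, which is why first-audio latency (211ms above) tracks the first sentence rather than the full 1358ms generation.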
## Tool Calling Benchmark

Measures tool calling accuracy and latency:

```sh
rcli bench --suite tools
```

Output:

```
--- Tool Calling Benchmark ---
Query: "What time is it right now?"
First pass: 187.3ms, detected=yes (tool=get_current_time)
Tool exec: 12.4ms, result={"time": "2:30 PM"}
Second pass: 156.8ms
Response: "It's 2:30 PM."
Total: 356.5ms
Query: "Calculate 42 plus 17"
First pass: 174.2ms, detected=yes (tool=calculate)
Tool exec: 8.1ms, result={"result": 59}
Second pass: 142.3ms
Response: "42 plus 17 equals 59."
Total: 324.6ms
Tool Success Rate: 100%
```
### Metrics
- First pass — LLM generates tool call
- Tool exec — Action execution time
- Second pass — LLM generates natural language response
- Total — End-to-end tool calling latency
- Success Rate — Percentage of correct tool detections
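Each per-query total is simply the sum of the three phases; here with the first query above:

```python
# Phase timings for "What time is it right now?" (ms).
phases_ms = {"first_pass": 187.3, "tool_exec": 12.4, "second_pass": 156.8}

total = sum(phases_ms.values())
print(f"Total: {total:.1f}ms")  # Total: 356.5ms
```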
## RAG Benchmark

Measures retrieval performance:

```sh
rcli bench --suite rag --rag ~/Library/RCLI/index
```

Output:

```
--- RAG Benchmark ---
Embedding: 7.8ms
Vector search: 2.1ms (5000 chunks)
BM25 search: 0.9ms
RRF fusion: 0.4ms
Total retrieval: 3.8ms
Full RAG query: 234.7ms
  • Retrieval: 3.8ms
  • LLM: 230.9ms (87 tok, 189 tok/s)
```
### Metrics
- Embedding — Query embedding time
- Vector search — HNSW index search
- BM25 search — Full-text search
- RRF fusion — Reciprocal rank fusion
- Total retrieval — Combined retrieval latency
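Reciprocal rank fusion merges the vector-search and BM25 rankings by summed reciprocal ranks; a minimal sketch with the conventional k=60 constant and illustrative doc IDs:

```python
def rrf(rankings, k=60):
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # vector-search order
bm25_hits = ["d1", "d5", "d3"]    # BM25 full-text order
print(rrf([vector_hits, bm25_hits]))  # ['d1', 'd3', 'd5', 'd7']
```

A document ranked well by both searches (`d1`, `d3`) outscores one ranked highly by only a single list, which is the point of hybrid retrieval.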
## Memory Benchmark

Reports process memory usage:

```
--- Memory Benchmark ---
Pool allocated: 64 MB
Process RSS: 1.2 GB
```
## Comparing Models

### All LLMs

```sh
rcli bench --all-llm --suite llm
```

Output:

```
--- LLM Benchmark (All Models) ---
[LFM2 350M]
TTFT: 12ms, 350 tok/s
[Qwen3 0.6B]
TTFT: 18ms, 250 tok/s
[LFM2 1.2B Tool]
TTFT: 22ms, 180 tok/s
[Qwen3.5 2B]
TTFT: 25ms, 150 tok/s
[Qwen3.5 4B]
TTFT: 42ms, 75 tok/s
```
### All TTS

```sh
rcli bench --all-tts --suite tts
```

Output:

```
--- TTS Benchmark (All Voices) ---
[Piper Lessac]
Latency: 142ms, RTF: 0.8x
[Piper Amy]
Latency: 138ms, RTF: 0.7x
[Kokoro English]
Latency: 189ms, RTF: 1.1x
```
## JSON Export

```sh
rcli bench --output results.json
```

`results.json`:

```json
{
  "results": [
    {"name": "STT Latency (avg)", "category": "stt", "value": 43.0, "unit": "ms"},
    {"name": "LLM First Token (avg)", "category": "llm", "value": 20.4, "unit": "ms"},
    {"name": "LLM Throughput (avg)", "category": "llm", "value": 173.4, "unit": "tok/s"},
    {"name": "TTS Latency (avg)", "category": "tts", "value": 149.5, "unit": "ms"},
    {"name": "E2E Latency", "category": "e2e", "value": 204.0, "unit": "ms"}
  ]
}
```
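The flat `results` array is easy to post-process; for example, grouping metrics by category (the embedded sample mirrors a subset of the export above):

```python
import json

exported = json.loads("""{
  "results": [
    {"name": "LLM First Token (avg)", "category": "llm", "value": 20.4, "unit": "ms"},
    {"name": "LLM Throughput (avg)", "category": "llm", "value": 173.4, "unit": "tok/s"},
    {"name": "TTS Latency (avg)", "category": "tts", "value": 149.5, "unit": "ms"}
  ]
}""")

def by_category(doc):
    """Group exported metrics by category for quick scanning or diffing."""
    grouped = {}
    for r in doc["results"]:
        grouped.setdefault(r["category"], []).append(
            f'{r["name"]}: {r["value"]} {r["unit"]}')
    return grouped

print(by_category(exported)["llm"])
# ['LLM First Token (avg): 20.4 ms', 'LLM Throughput (avg): 173.4 tok/s']
```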
## Example Output (All Suites)

```sh
rcli bench
```

```
╔══════════════════════════════════════════╗
║ RCLI Benchmark Suite ║
║ Suite: all Runs: 3 ║
╚══════════════════════════════════════════╝
Initializing engine...
Benchmarking:
  LLM = Qwen3.5 2B
  STT = Whisper base.en
  TTS = Piper Lessac

--- Memory Benchmark ---
Pool allocated: 64 MB

--- STT Benchmark ---
Run 1: 43.2ms - "open Safari"
Run 2: 41.8ms - "open Safari"
Run 3: 44.1ms - "open Safari"

--- LLM Benchmark ---
Prompt: "What is the capital of France?"
TTFT: 22.5ms, 159.6 tok/s, 12 tokens

--- TTS Benchmark ---
"Hello, I am your AI assistant."
Latency: 142.3ms, RTF: 0.85

--- E2E Pipeline Benchmark ---
E2E Latency: 204.0ms

╔════════════════════════════════════════════════════════╗
║ BENCHMARK RESULTS ║
╠════════════════════════════════════════════════════════╣
║ [ memory ] ║
║ Pool Size 64.0 MB ║
║ [ stt ] ║
║ STT Latency (avg) 43.0 ms ║
║ STT Real-Time Factor 0.0 x ║
║ [ llm ] ║
║ LLM First Token (avg) 20.4 ms ║
║ LLM Throughput (avg) 173.4 tok/s ║
║ [ tts ] ║
║ TTS Latency (avg) 149.5 ms ║
║ TTS Real-Time Factor (avg) 0.8 x ║
║ [ e2e ] ║
║ E2E Latency (STT+LLM_FT+TTS) 204.0 ms ║
╚════════════════════════════════════════════════════════╝
RCLI — Benchmark complete.
```
## Performance Targets

On Apple M3 Max (14-core CPU, 30-core GPU, 36GB RAM):

| Metric | Target | Actual |
|---|---|---|
| STT latency | < 50ms | 43.7ms |
| LLM TTFT | < 25ms | 22.5ms |
| LLM throughput | > 150 tok/s | 159.6 tok/s |
| TTS latency | < 200ms | 150.6ms |
| E2E latency (TTFA) | < 200ms | 131ms |
| RAG retrieval | < 5ms | 3.82ms |

Performance scales with chip generation (M1 < M2 < M3 < M4).
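With `--output`, the table above can drive an automated regression gate; a sketch where the thresholds are copied from the table and the metric names match the JSON export:

```python
# metric name -> (limit, direction); "max" = must stay under, "min" = must exceed
TARGETS = {
    "STT Latency (avg)": (50.0, "max"),
    "LLM First Token (avg)": (25.0, "max"),
    "LLM Throughput (avg)": (150.0, "min"),
    "TTS Latency (avg)": (200.0, "max"),
}

def check_targets(results):
    """Return the names of exported metrics that miss their target."""
    failures = []
    for r in results:
        target = TARGETS.get(r["name"])
        if target is None:
            continue  # no target defined for this metric
        limit, direction = target
        ok = r["value"] <= limit if direction == "max" else r["value"] >= limit
        if not ok:
            failures.append(r["name"])
    return failures

sample = [{"name": "LLM First Token (avg)", "category": "llm", "value": 20.4, "unit": "ms"}]
print(check_targets(sample))  # [] -> every checked metric is within target
```

Feeding the `results` array from an exported `results.json` into `check_targets` makes it easy to fail a CI job when a model or engine change regresses a metric.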