Continuous voice mode listens for speech, transcribes, processes commands, and speaks responses — all without manual intervention. Ideal for hands-free workflows.

Usage

rcli listen

# With custom options
rcli listen --no-speak --gpu-layers 50

# With RAG index loaded
rcli listen --rag ~/Library/RCLI/index

How It Works

  1. Always Listening — VAD (Silero) detects speech activity
  2. Streaming STT — Zipformer transcribes live audio
  3. Endpoint Detection — Detects silence, finalizes transcript
  4. LLM Processing — Classifies intent, executes actions, generates response
  5. TTS Playback — Speaks the result via Piper/Kokoro
  6. Loop — Returns to listening state
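The loop above can be sketched as a single function, with stub engines standing in for Silero VAD, Zipformer, the LLM, and Piper/Kokoro. All function names here are hypothetical illustrations, not RCLI's actual API:

```python
# Sketch of the listen loop: gate on VAD, finalize the transcript at the
# silence endpoint, run the LLM, speak, and return to listening.
def listen_loop(frames, is_speech, transcribe, respond, speak):
    buffer, in_speech, replies = [], False, []
    for frame in frames:
        if is_speech(frame):
            in_speech = True
            buffer.append(frame)           # streaming STT would run here
        elif in_speech:
            text = transcribe(buffer)      # endpoint: finalize transcript
            reply = respond(text)          # LLM: intent + response
            speak(reply)                   # TTS playback
            replies.append(reply)
            buffer, in_speech = [], False  # loop back to listening
    return replies

# Toy run: "frames" are words, "" stands for silence.
out = listen_loop(
    frames=["open", "Safari", "", "hi", ""],
    is_speech=lambda f: f != "",
    transcribe=lambda buf: " ".join(buf),
    respond=lambda t: f"OK: {t}",
    speak=lambda r: None,
)
# out == ["OK: open Safari", "OK: hi"]
```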

Voice Pipeline States

The terminal displays an animated dog indicating the current state:
  /^ ^\
 ( o.o )
  > ^ <
   |_|

Listening for speech...

Supported Commands

macOS Actions

"open Safari"
"set volume to 50"
"create a note called Meeting Notes"
"play some jazz on Spotify"
"take a screenshot"
"lock screen"

Conversational Queries

"what is the weather today?"
"explain quantum computing"
"what's on my calendar?"

RAG Queries (with --rag)

"what were the key decisions from the meeting?"
"summarize the project plan"

Options

  --models (string, default: ~/Library/RCLI/models) — Models directory path
  --gpu-layers (number, default: 99) — GPU layers for LLM (99 = all, 0 = CPU only)
  --no-speak (boolean) — Disable TTS audio output (text only)
  --rag (string) — Load RAG index for document-grounded answers
  --verbose (boolean) — Show debug logs from engines

Voice Activity Detection (VAD)

Listen mode uses Silero VAD to:
  • Filter background noise
  • Detect speech start/end
  • Trigger transcription only when speaking
This reduces false positives and improves transcription accuracy.
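Silero VAD is a learned model, but the gating step it performs can be illustrated with a plain energy threshold. The 0.003 RMS figure is taken from the Troubleshooting section of this page; everything else is an assumption:

```python
import math

# Energy-based speech gate: a frame counts as speech when its RMS level
# crosses the threshold. This only illustrates gating; Silero VAD itself
# is a neural model, not an energy detector.
RMS_THRESHOLD = 0.003

def rms(frame):
    """Root-mean-square level of a frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=RMS_THRESHOLD):
    return rms(frame) >= threshold

quiet = [0.0005] * 160    # ~10 ms of near-silence at 16 kHz
loud  = [0.05, -0.05] * 80
# is_speech(quiet) is False; is_speech(loud) is True
```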

STT Models

Listen mode uses two STT models in parallel:

Zipformer (Streaming)

  • Purpose — Real-time transcription during speech
  • Speed — ~50ms latency
  • Accuracy — Good for live feedback
  • Size — ~50 MB

Whisper/Parakeet (Offline)

  • Purpose — Final accurate transcription after speech ends
  • Speed — ~40ms for Whisper base.en, ~60ms for Parakeet
  • Accuracy — Higher (Whisper ~5% WER, Parakeet ~1.9% WER)
  • Size — 140 MB (Whisper), 640 MB (Parakeet)
The offline model’s transcript is used for LLM processing.
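The two-pass flow can be sketched as follows: the streaming model emits partials for live feedback while audio accumulates, and the offline model re-transcribes the full segment to produce the transcript handed to the LLM. Both models are stub functions here, not the real engines:

```python
# Two-pass STT sketch: fast streaming partials for the UI, one accurate
# offline pass over the whole segment for the LLM.
def two_pass_transcribe(chunks, streaming_model, offline_model, on_partial):
    audio = []
    for chunk in chunks:
        audio.append(chunk)
        on_partial(streaming_model(audio))  # live, lower-accuracy partials
    return offline_model(audio)             # final transcript used by the LLM

partials = []
final = two_pass_transcribe(
    chunks=["open", "safari"],
    streaming_model=lambda a: " ".join(a),
    offline_model=lambda a: " ".join(a).title(),  # stands in for Whisper/Parakeet
    on_partial=partials.append,
)
# partials == ["open", "open safari"]; final == "Open Safari"
```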

Performance Metrics

After each interaction, listen mode prints:
STT: 43.2ms
LLM: 87 tok  189 tok/s  TTFT 18ms
TTS: 142ms  0.8x RT
  • STT — Transcription time
  • LLM — Tokens, throughput, time-to-first-token
  • TTS — Synthesis time, real-time factor
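The reported numbers relate directly: throughput is tokens divided by decode time, and (assuming the RT factor is synthesis time over the duration of the audio produced, which this page does not spell out) a factor below 1.0 means faster-than-real-time synthesis. A small check using values close to the sample output:

```python
# How the printed metrics fit together.
def tokens_per_sec(tokens, decode_seconds):
    return tokens / decode_seconds

def rt_factor(synth_seconds, audio_seconds):
    """Synthesis time over audio duration; < 1.0 is faster than real time."""
    return synth_seconds / audio_seconds

# 87 tokens in ~0.46 s of decoding ~= the sample's 189 tok/s.
print(round(tokens_per_sec(87, 0.46)))    # 189
# 142 ms to synthesize ~177.5 ms of audio ~= the sample's 0.8x RT.
print(rt_factor(0.142, 0.1775))           # 0.8
```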

Stopping Listen Mode

Press Ctrl+C to gracefully stop:
RCLI See you next time!

Example Session

$ rcli listen

  RCLI Continuous Voice Mode
  Speak naturally. RCLI listens, acts, and responds.
  Press Ctrl+C to stop.

  Ready.

  /^ ^\
 ( o.o )
  > ^ <
   |_|

Listening for speech...

# User speaks: "open Safari"

  You: open Safari

  /^ ^\
 ( -.- )
  > ^ <
   |_|

Thinking...

  RCLI: Done! Safari is now open.

STT: 39ms
LLM: 12 tok  203 tok/s  TTFT 15ms
TTS: 98ms  0.6x RT

  /^ ^\
 ( o.o )
  > ^ <
   |_|

Listening for speech...

# User speaks: "what's the weather?"

  You: what's the weather?

  RCLI: I don't have access to real-time weather data, 
  but you can ask Siri or check a weather app.

STT: 41ms
LLM: 31 tok  178 tok/s  TTFT 19ms
TTS: 187ms  0.9x RT

^C
  RCLI See you next time!

Troubleshooting

No Speech Detected

If RCLI doesn’t respond:
  1. Check microphone permission — System Settings > Privacy & Security > Microphone
  2. Test mic levels: rcli mic-test
  3. Speak louder or closer — VAD threshold is 0.003 RMS

Slow Response

If TTFT > 100ms:
  1. Use a smaller LLM: rcli models → select Qwen3 0.6B or LFM2 350M
  2. Increase GPU layers: --gpu-layers 99 (default)
  3. Check system load — Close other GPU-heavy apps

Incorrect Transcription

If STT accuracy is low:
  1. Upgrade the STT model: rcli upgrade-stt (Parakeet TDT, ~1.9% WER)
  2. Speak clearly — Avoid background noise
  3. Check mic quality — Built-in MacBook mic is sufficient

Advanced Usage

Custom System Prompt

Modify ~/Library/RCLI/config/system_prompt.txt to change LLM behavior:
You are a helpful voice assistant. Keep responses brief and natural.

Action Filtering

Disable noisy actions:
rcli actions
# Use TUI to toggle actions on/off
# Or edit ~/.rcli/actions.json

Multi-Turn Conversations

Listen mode maintains conversation history, trimming the oldest turns once a sliding-window token budget is exceeded:
You: "what's 2 plus 2?"
RCLI: "Four."
You: "multiply that by 10"
RCLI: "Forty."
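A minimal sketch of this kind of token-budget trimming, assuming whitespace tokenization (the real trimmer would count tokens with the model's tokenizer, and the budget value here is made up):

```python
# Sliding-window trim: keep the most recent turns that fit the budget,
# dropping the oldest first.
def trim_history(turns, budget):
    kept, used = [], 0
    for turn in reversed(turns):
        n = len(turn.split())      # naive whitespace "tokenizer"
        if used + n > budget:
            break
        kept.append(turn)
        used += n
    return list(reversed(kept))

history = [
    "You: what's 2 plus 2?",
    "RCLI: Four.",
    "You: multiply that by 10",
    "RCLI: Forty.",
]
# With a tight budget, only the newest turns survive:
trimmed = trim_history(history, budget=7)
# trimmed == ["You: multiply that by 10", "RCLI: Forty."]
```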
