Continuous voice mode listens for speech, transcribes, processes commands, and speaks responses — all without manual intervention. Ideal for hands-free workflows.

Usage

rcli listen

# With custom options
rcli listen --no-speak --gpu-layers 50

# With RAG index loaded
rcli listen --rag ~/Library/RCLI/index

How It Works

  1. Always Listening — VAD (Silero) detects speech activity
  2. Streaming STT — Zipformer transcribes live audio
  3. Endpoint Detection — Detects silence, finalizes transcript
  4. LLM Processing — Classifies intent, executes actions, generates response
  5. TTS Playback — Speaks the result via Piper/Kokoro
  6. Loop — Returns to listening state
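The loop above can be sketched as a single function, with stub engines standing in for Silero VAD, Zipformer, the LLM, and Piper/Kokoro. All function names here are hypothetical illustrations, not RCLI's actual API:

```python
# Sketch of the listen loop: gate on VAD, finalize the transcript at the
# silence endpoint, run the LLM, speak, and return to listening.
def listen_loop(frames, is_speech, transcribe, respond, speak):
    buffer, in_speech, replies = [], False, []
    for frame in frames:
        if is_speech(frame):
            in_speech = True
            buffer.append(frame)           # streaming STT would run here
        elif in_speech:
            text = transcribe(buffer)      # endpoint: finalize transcript
            reply = respond(text)          # LLM: intent + response
            speak(reply)                   # TTS playback
            replies.append(reply)
            buffer, in_speech = [], False  # loop back to listening
    return replies

# Toy run: "frames" are words, "" stands for silence.
out = listen_loop(
    frames=["open", "Safari", "", "hi", ""],
    is_speech=lambda f: f != "",
    transcribe=lambda buf: " ".join(buf),
    respond=lambda t: f"OK: {t}",
    speak=lambda r: None,
)
# out == ["OK: open Safari", "OK: hi"]
```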

Voice Pipeline States

The terminal displays an animated dog indicating the current state:
  /^ ^\
 ( o.o )
  > ^ <
   |_|

Listening for speech...

Supported Commands

macOS Actions

"open Safari"
"set volume to 50"
"create a note called Meeting Notes"
"play some jazz on Spotify"
"take a screenshot"
"lock screen"

Conversational Queries

"what is the weather today?"
"explain quantum computing"
"what's on my calendar?"

RAG Queries (with --rag)

"what were the key decisions from the meeting?"
"summarize the project plan"

Options

  --models (string, default: ~/Library/RCLI/models) — Models directory path
  --gpu-layers (number, default: 99) — GPU layers for LLM (99 = all, 0 = CPU only)
  --no-speak (boolean) — Disable TTS audio output (text only)
  --rag (string) — Load RAG index for document-grounded answers
  --verbose (boolean) — Show debug logs from engines

Voice Activity Detection (VAD)

Listen mode uses Silero VAD to:
  • Filter background noise
  • Detect speech start/end
  • Trigger transcription only when speaking
This reduces false positives and improves transcription accuracy.
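Silero VAD is a learned model, but the gating step it performs can be illustrated with a plain energy threshold. The 0.003 RMS figure is taken from the Troubleshooting section of this page; everything else is an assumption:

```python
import math

# Energy-based speech gate: a frame counts as speech when its RMS level
# crosses the threshold. This only illustrates gating; Silero VAD itself
# is a neural model, not an energy detector.
RMS_THRESHOLD = 0.003

def rms(frame):
    """Root-mean-square level of a frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=RMS_THRESHOLD):
    return rms(frame) >= threshold

quiet = [0.0005] * 160    # ~10 ms of near-silence at 16 kHz
loud  = [0.05, -0.05] * 80
# is_speech(quiet) is False; is_speech(loud) is True
```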

STT Models

Listen mode uses two STT models in parallel:

Zipformer (Streaming)

  • Purpose — Real-time transcription during speech
  • Speed — ~50ms latency
  • Accuracy — Good for live feedback
  • Size — ~50 MB

Whisper/Parakeet (Offline)

  • Purpose — Final accurate transcription after speech ends
  • Speed — ~40ms for Whisper base.en, ~60ms for Parakeet
  • Accuracy — Higher (Whisper ~5% WER, Parakeet ~1.9% WER)
  • Size — 140 MB (Whisper), 640 MB (Parakeet)
The offline model’s transcript is used for LLM processing.
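The two-pass flow can be sketched as follows: the streaming model emits partials for live feedback while audio accumulates, and the offline model re-transcribes the full segment to produce the transcript handed to the LLM. Both models are stub functions here, not the real engines:

```python
# Two-pass STT sketch: fast streaming partials for the UI, one accurate
# offline pass over the whole segment for the LLM.
def two_pass_transcribe(chunks, streaming_model, offline_model, on_partial):
    audio = []
    for chunk in chunks:
        audio.append(chunk)
        on_partial(streaming_model(audio))  # live, lower-accuracy partials
    return offline_model(audio)             # final transcript used by the LLM

partials = []
final = two_pass_transcribe(
    chunks=["open", "safari"],
    streaming_model=lambda a: " ".join(a),
    offline_model=lambda a: " ".join(a).title(),  # stands in for Whisper/Parakeet
    on_partial=partials.append,
)
# partials == ["open", "open safari"]; final == "Open Safari"
```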

Performance Metrics

After each interaction, listen mode prints:
STT: 43.2ms
LLM: 87 tok  189 tok/s  TTFT 18ms
TTS: 142ms  0.8x RT
  • STT — Transcription time
  • LLM — Tokens, throughput, time-to-first-token
  • TTS — Synthesis time, real-time factor
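The reported numbers relate directly: throughput is tokens divided by decode time, and (assuming the RT factor is synthesis time over the duration of the audio produced, which this page does not spell out) a factor below 1.0 means faster-than-real-time synthesis. A small check using values close to the sample output:

```python
# How the printed metrics fit together.
def tokens_per_sec(tokens, decode_seconds):
    return tokens / decode_seconds

def rt_factor(synth_seconds, audio_seconds):
    """Synthesis time over audio duration; < 1.0 is faster than real time."""
    return synth_seconds / audio_seconds

# 87 tokens in ~0.46 s of decoding ~= the sample's 189 tok/s.
print(round(tokens_per_sec(87, 0.46)))    # 189
# 142 ms to synthesize ~177.5 ms of audio ~= the sample's 0.8x RT.
print(rt_factor(0.142, 0.1775))           # 0.8
```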

Stopping Listen Mode

Press Ctrl+C to gracefully stop:
RCLI See you next time!

Example Session

$ rcli listen

  RCLI Continuous Voice Mode
  Speak naturally. RCLI listens, acts, and responds.
  Press Ctrl+C to stop.

  Ready.

  /^ ^\
 ( o.o )
  > ^ <
   |_|

Listening for speech...

# User speaks: "open Safari"

  You: open Safari

  /^ ^\
 ( -.- )
  > ^ <
   |_|

Thinking...

  RCLI: Done! Safari is now open.

STT: 39ms
LLM: 12 tok  203 tok/s  TTFT 15ms
TTS: 98ms  0.6x RT

  /^ ^\
 ( o.o )
  > ^ <
   |_|

Listening for speech...

# User speaks: "what's the weather?"

  You: what's the weather?

  RCLI: I don't have access to real-time weather data, 
  but you can ask Siri or check a weather app.

STT: 41ms
LLM: 31 tok  178 tok/s  TTFT 19ms
TTS: 187ms  0.9x RT

^C
  RCLI See you next time!

Troubleshooting

No Speech Detected

If RCLI doesn’t respond:
  1. Check microphone permission — System Settings > Privacy & Security > Microphone
  2. Test mic levels: rcli mic-test
  3. Speak louder or closer — VAD threshold is 0.003 RMS

Slow Response

If TTFT > 100ms:
  1. Use a smaller LLM: rcli models → select Qwen3 0.6B or LFM2 350M
  2. Increase GPU layers: --gpu-layers 99 (default)
  3. Check system load — Close other GPU-heavy apps

Incorrect Transcription

If STT accuracy is low:
  1. Upgrade the STT model: rcli upgrade-stt (Parakeet TDT, ~1.9% WER)
  2. Speak clearly — Avoid background noise
  3. Check mic quality — Built-in MacBook mic is sufficient

Advanced Usage

Custom System Prompt

Modify ~/Library/RCLI/config/system_prompt.txt to change LLM behavior:
You are a helpful voice assistant. Keep responses brief and natural.

Action Filtering

Disable noisy actions:
rcli actions
# Use TUI to toggle actions on/off
# Or edit ~/.rcli/actions.json

Multi-Turn Conversations

Listen mode maintains conversation history, trimming the oldest turns once a sliding-window token budget is exceeded:
You: "what's 2 plus 2?"
RCLI: "Four."
You: "multiply that by 10"
RCLI: "Forty."
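A minimal sketch of this kind of token-budget trimming, assuming whitespace tokenization (the real trimmer would count tokens with the model's tokenizer, and the budget value here is made up):

```python
# Sliding-window trim: keep the most recent turns that fit the budget,
# dropping the oldest first.
def trim_history(turns, budget):
    kept, used = [], 0
    for turn in reversed(turns):
        n = len(turn.split())      # naive whitespace "tokenizer"
        if used + n > budget:
            break
        kept.append(turn)
        used += n
    return list(reversed(kept))

history = [
    "You: what's 2 plus 2?",
    "RCLI: Four.",
    "You: multiply that by 10",
    "RCLI: Forty.",
]
# With a tight budget, only the newest turns survive:
trimmed = trim_history(history, budget=7)
# trimmed == ["You: multiply that by 10", "RCLI: Forty."]
```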
