Configuration Files
RCLI stores configuration in ~/Library/RCLI/config/. All settings persist across launches.
Directory Structure
~/Library/RCLI/
├── config/
│   ├── active_models.json   # Current LLM/STT/TTS selection
│   ├── action_states.json   # Enabled/disabled actions
│   └── user_config.json     # User preferences
├── models/                  # Downloaded AI models
│   ├── llm/
│   ├── stt/
│   ├── tts/
│   ├── vad/
│   └── embedding/
└── index/                   # RAG vector index
    ├── chunks.bin
    ├── metadata.json
    └── usearch.index
active_models.json
Stores currently active model selections:
{
  "llm": {
    "model_id": "qwen3-0.6b",
    "path": "~/Library/RCLI/models/llm/qwen3-0.6b-q4_k_m.gguf",
    "gpu_layers": 99,
    "ctx_size": 4096,
    "flash_attn": true
  },
  "stt": {
    "streaming_model": "zipformer",
    "offline_model": "whisper-base.en"
  },
  "tts": {
    "model_id": "piper-lessac",
    "voice_id": "en_US-lessac-medium",
    "speaker_id": 0
  }
}
This file is auto-generated. To change models, use rcli models or press M in the TUI.
action_states.json
Tracks which of the 43 macOS actions are enabled:
{
  "enabled_actions": [
    "open_app",
    "quit_app",
    "create_note",
    "play_on_spotify",
    "set_volume",
    "search_web"
  ],
  "disabled_actions": [
    "lock_screen",
    "send_message"
  ]
}
Actions can be toggled in the TUI (A → Select action → Space to toggle) or via CLI:
rcli actions enable open_app
rcli actions disable send_message
user_config.json
User preferences (planned for future release):
{
  "system_prompt": "You are a helpful voice assistant.",
  "conversation_history_turns": 10,
  "tts_speed": 1.0,
  "tool_call_trace_enabled": true,
  "mic_silence_threshold_ms": 800,
  "theme": "dark"
}
Environment Variables
Model Paths
export RCLI_MODELS_DIR="~/Library/RCLI/models"  # Default model directory
LLM Configuration
export RCLI_LLM_GPU_LAYERS=99     # GPU layers (0-99)
export RCLI_LLM_CTX_SIZE=4096     # Context window size
export RCLI_LLM_THREADS=1         # Decode threads
export RCLI_LLM_THREADS_BATCH=10  # Prompt eval threads
export RCLI_LLM_BATCH=2048        # Batch size
export RCLI_LLM_UBATCH=1024       # Micro-batch size
export RCLI_LLM_FLASH_ATTN=1      # Enable Flash Attention (1=on, 0=off)
export RCLI_LLM_USE_MLOCK=1       # Pin weights in RAM (1=on, 0=off)
STT Configuration
export RCLI_STT_SAMPLE_RATE=16000  # Audio sample rate (Hz)
export RCLI_STT_SILENCE_MS=800     # Silence detection threshold
TTS Configuration
export RCLI_TTS_SAMPLE_RATE=22050  # Audio sample rate (Hz)
export RCLI_TTS_SPEED=1.0          # Playback speed (0.5-2.0)
export RCLI_TTS_SPEAKER_ID=0       # Multi-speaker voice ID
RAG Configuration
export RCLI_RAG_INDEX_PATH="~/Library/RCLI/index"  # Index directory
export RCLI_RAG_TOP_K=5               # Number of results to retrieve
export RCLI_RAG_VECTOR_CANDIDATES=20  # Vector search candidates
export RCLI_RAG_BM25_CANDIDATES=20    # BM25 search candidates
export RCLI_RAG_RRF_K=60.0            # RRF fusion parameter
export RCLI_RAG_CACHE_SIZE_MB=256     # Embedding cache size
Memory Pool
export RCLI_POOL_SIZE_MB=64  # Pre-allocated pool size
Logging
export RCLI_LOG_LEVEL=INFO  # DEBUG, INFO, WARN, ERROR
export RCLI_VERBOSE=1       # Enable verbose logging
GPU Layers
Control how many LLM layers run on Metal GPU vs CPU.
Auto-Detection (Default)
RCLI auto-detects optimal GPU layers based on model size and available RAM:
// For models that fit in VRAM
if (model_size_mb < vram_budget_mb) {
    gpu_layers = 99;  // All layers to GPU
} else {
    gpu_layers = estimate_max_layers(model_size_mb, vram_budget_mb);
}
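The estimate_max_layers() helper above is not shown. A hedged sketch of what such a helper could look like; the total_layers parameter and the proportional split are assumptions for illustration, not RCLI's actual logic:

```cpp
#include <algorithm>

// Hypothetical sketch: offload layers in proportion to the VRAM budget.
// Assumes transformer layers are roughly uniform in size.
int estimate_max_layers(int model_size_mb, int vram_budget_mb, int total_layers) {
    if (model_size_mb <= 0) return 0;
    int layers = static_cast<int>(
        static_cast<long long>(total_layers) * vram_budget_mb / model_size_mb);
    return std::clamp(layers, 0, total_layers);
}
```

For example, a 28-layer model of 8000 MB with a 4000 MB budget would offload 14 layers.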
Manual Override
# CPU-only (slowest, lowest memory)
rcli --gpu-layers 0
# Hybrid: 20 layers on GPU, rest on CPU
rcli --gpu-layers 20
# All layers on GPU (fastest, default)
rcli --gpu-layers 99
Configuration        | First Token (ms) | Throughput (tok/s) | Memory (MB)
---------------------|------------------|--------------------|------------
CPU-only (0 layers)  | 145.2            | 62.3               | 680
Hybrid (20 layers)   | 38.6             | 187.4              | 1250
GPU-only (99 layers) | 22.5             | 249.8              | 1450
For large models (>4B params) on devices with <32GB RAM, reduce GPU layers to avoid out-of-memory errors.
Context Size
Context window determines how much conversation history and RAG context fits in the LLM.
Supported Context Sizes
Model        | Max Context | Recommended
-------------|-------------|------------
Qwen3 0.6B   | 32,768      | 4,096
Qwen3.5 0.8B | 32,768      | 4,096
Qwen3.5 2B   | 32,768      | 8,192
Qwen3.5 4B   | 262,144     | 16,384
LFM2 350M    | 131,072     | 4,096
LFM2 1.2B    | 131,072     | 8,192
LFM2 2.6B    | 131,072     | 16,384
Configuration
# Via CLI flag
rcli --ctx-size 8192
# Via environment variable
export RCLI_LLM_CTX_SIZE=8192
rcli
Memory Usage by Context Size
Context | KV Cache Memory | Total Memory (Qwen3 0.6B)
--------|-----------------|---------------------------
2,048   | 64 MB           | 520 MB
4,096   | 128 MB          | 584 MB
8,192   | 256 MB          | 712 MB
16,384  | 512 MB          | 968 MB
Use smaller context sizes (2048-4096) for faster inference. Increase to 8192+ only if you need long conversation history or large RAG context.
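KV cache size scales linearly with context length, following the standard transformer accounting (one K and one V tensor per layer per token). A generic calculator; the dimensions used in any example call are illustrative, not RCLI's actual model shapes:

```cpp
#include <cstddef>

// KV cache bytes = 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes/elem.
// bytes_per_elem is 2 for fp16, 4 for fp32.
std::size_t kv_cache_bytes(std::size_t n_layers, std::size_t n_ctx,
                           std::size_t n_kv_heads, std::size_t head_dim,
                           std::size_t bytes_per_elem) {
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem;
}
```

Doubling n_ctx doubles the result, which matches the scaling in the table above.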
Thread Configuration
Decode Threads
Used during token generation (autoregressive decode). Default: 1 thread (GPU-bound).
# Override (rarely needed)
export RCLI_LLM_THREADS=4
When GPU offload is enabled, LLM decode is GPU-bound. Using >1 thread can actually slow down inference due to synchronization overhead.
Prompt Eval Threads
Used during prompt processing (parallel matrix ops). Default: all P-cores.
# M3 Max (10 P-cores): auto-detected
export RCLI_LLM_THREADS_BATCH=10
# Override to use fewer cores
export RCLI_LLM_THREADS_BATCH=4
Batch Configuration
Batch Size
Number of tokens processed in parallel during prompt evaluation.
export RCLI_LLM_BATCH=2048  # Default (high RAM)
Larger batches = faster prompt processing, but more memory.
Micro-Batch Size
Internal subdivision for memory efficiency.
export RCLI_LLM_UBATCH=1024  # Half of batch size
Typically set to batch_size / 2.
Recommended Batch Sizes
RAM Tier | Batch | Ubatch | Use Case
---------|-------|--------|------------------
64+ GB   | 4096  | 2048   | Mac Studio / Pro
32-48 GB | 2048  | 1024   | M3 Max / M2 Ultra
16-24 GB | 1024  | 512    | M3 / M2 / M1
<16 GB   | 512   | 256    | M1 base
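The tiers map directly to a small lookup. The thresholds below are inferred from the ranges in the table and are not RCLI's actual detection code:

```cpp
#include <utility>

// Map total RAM (GB) to (batch, ubatch) per the tier table.
// Ubatch is always batch / 2, matching the defaults above.
std::pair<int, int> batch_for_ram(int ram_gb) {
    int batch;
    if (ram_gb >= 64)      batch = 4096;  // Mac Studio / Pro
    else if (ram_gb >= 32) batch = 2048;  // M3 Max / M2 Ultra
    else if (ram_gb >= 16) batch = 1024;  // M3 / M2 / M1
    else                   batch = 512;   // M1 base
    return {batch, batch / 2};
}
```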
Flash Attention
Flash Attention avoids materializing the O(n²) attention matrix, reducing attention memory to O(n) and speeding up self-attention by ~20%.
Enable/Disable
# Enable (default on Metal)
export RCLI_LLM_FLASH_ATTN=1
# Disable
export RCLI_LLM_FLASH_ATTN=0
Configuration      | KV Memory (8K ctx) | Attention Speed
-------------------|--------------------|----------------
Standard Attention | 512 MB             | 1.0x
Flash Attention    | 256 MB             | 1.2x
Flash Attention is automatically enabled on macOS with Metal GPU. Disable only if debugging attention issues.
Memory Locking (mlock)
Pins model weights in RAM to prevent swapping to disk.
Configuration
# Enable (default on systems with >=16GB RAM)
export RCLI_LLM_USE_MLOCK=1
# Disable
export RCLI_LLM_USE_MLOCK=0
Impact
Pros: Eliminates disk I/O latency from swap
Cons: Reduces available RAM for other apps
On systems with <16GB RAM, mlock can cause OOM errors. RCLI auto-disables mlock on low-RAM systems.
RAG Configuration
Vector Search Parameters
export RCLI_RAG_VECTOR_CANDIDATES=20  # HNSW candidates
export RCLI_RAG_HNSW_EF=50            # HNSW search effort
export RCLI_RAG_HNSW_M=16             # HNSW graph connectivity
Higher values = better accuracy, slower search.
BM25 Parameters
export RCLI_RAG_BM25_CANDIDATES=20  # BM25 top-k
export RCLI_RAG_BM25_K1=1.2         # Term saturation
export RCLI_RAG_BM25_B=0.75         # Length normalization
Hybrid Fusion
export RCLI_RAG_RRF_K=60.0  # Reciprocal Rank Fusion parameter
RRF formula:
score(doc) = Σ 1 / (k + rank_source(doc))
Lower k = stronger fusion. Typical range: 10-100.
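As a minimal sketch of this fusion (illustrative, not RCLI's implementation), with 1-based ranks and one ranking per source:

```cpp
#include <map>
#include <string>
#include <vector>

// Reciprocal Rank Fusion: score(doc) = sum over sources of 1 / (k + rank).
// Each input vector is one source's ranking, best document first.
std::map<std::string, double> rrf_fuse(
        const std::vector<std::vector<std::string>>& rankings, double k = 60.0) {
    std::map<std::string, double> scores;
    for (const auto& ranking : rankings) {
        for (std::size_t i = 0; i < ranking.size(); ++i) {
            scores[ranking[i]] += 1.0 / (k + static_cast<double>(i + 1));
        }
    }
    return scores;
}
```

A document ranked 1st by vector search and 2nd by BM25 scores 1/61 + 1/62 at the default k = 60.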
Embedding Cache
export RCLI_RAG_CACHE_SIZE_MB=256  # LRU cache size
Caches query embeddings to avoid recomputation. At 384-dim float32 (1.5 KB per embedding), 256 MB holds roughly 175K embeddings.
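The capacity estimate is straightforward arithmetic; a small helper, ignoring per-entry bookkeeping overhead:

```cpp
#include <cstddef>

// Number of embeddings that fit in a cache of cache_mb mebibytes.
std::size_t cache_capacity(std::size_t cache_mb, std::size_t dim,
                           std::size_t bytes_per_elem) {
    return (cache_mb * 1024 * 1024) / (dim * bytes_per_elem);
}
```

256 MiB / (384 × 4 B) ≈ 174,762 entries.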
Audio Configuration
Sample Rates
export RCLI_STT_SAMPLE_RATE=16000  # STT (16 kHz standard)
export RCLI_TTS_SAMPLE_RATE=22050  # TTS (22.05 kHz Piper default)
Buffer Sizes
export RCLI_AUDIO_CAPTURE_BUFFER=16384   # Capture ring buffer (samples)
export RCLI_AUDIO_PLAYBACK_BUFFER=44032  # Playback ring buffer (samples)
Larger buffers = more latency tolerance, higher memory.
Silence Detection
export RCLI_STT_SILENCE_MS=800  # End-of-speech silence threshold
Shorter values give faster responses but risk cutting speech off mid-sentence.
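Conceptually, end-of-speech detection accumulates consecutive silent milliseconds and fires once the threshold is reached. A hypothetical sketch, not RCLI's actual detector:

```cpp
#include <cmath>
#include <cstddef>

// End-of-speech sketch: fires once frame RMS stays below a floor for silence_ms.
class SilenceDetector {
public:
    SilenceDetector(float rms_floor, int silence_ms)
        : rms_floor_(rms_floor), silence_ms_(silence_ms) {}

    // Feed one audio frame; returns true when end-of-speech is detected.
    bool feed(const float* samples, std::size_t n, int frame_ms) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += samples[i] * samples[i];
        float rms = n ? static_cast<float>(std::sqrt(sum / n)) : 0.0f;
        silent_ms_ = (rms < rms_floor_) ? silent_ms_ + frame_ms : 0;  // reset on speech
        return silent_ms_ >= silence_ms_;
    }

private:
    float rms_floor_;
    int silence_ms_;
    int silent_ms_ = 0;
};
```

With RCLI_STT_SILENCE_MS=800 and 10 ms frames, 80 consecutive silent frames trigger end-of-speech.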
System Prompt
Customize the LLM’s system prompt:
export RCLI_SYSTEM_PROMPT="You are a helpful voice assistant specialized in macOS automation."
Or edit ~/Library/RCLI/config/user_config.json:
{
  "system_prompt": "You are a concise assistant. Respond in 1-2 sentences."
}
Logging
Log Levels
export RCLI_LOG_LEVEL=DEBUG  # DEBUG, INFO, WARN, ERROR
Verbose Mode
rcli -v # Enable verbose logging
rcli --verbose # Same
Outputs:
STT transcription results
LLM token generation stats
TTS synthesis latency
Tool call traces
Memory pool utilization
Log Files
Logs are written to stderr. Redirect to file:
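For example, with a POSIX shell (the destination path is arbitrary):

```shell
# Append RCLI's stderr log stream to a file; stdout stays on the terminal
rcli 2>> ~/Library/Logs/rcli.log

# Watch logs live while also saving them
rcli 2>&1 | tee rcli.log
```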
Advanced Tuning
Sentence Detection
Customize sentence boundary detection:
SentenceDetector detector(
    callback,
    /*min_words=*/3,       // Min words for primary break (. ! ?)
    /*max_words_sec=*/25,  // Max words before secondary break (; :)
    /*word_flush=*/7       // Flush at N words if no punctuation
);
See src/pipeline/sentence_detector.cpp:6.
VAD Threshold
Adjust voice activity detection sensitivity:
export RCLI_VAD_THRESHOLD=0.5  # 0.0-1.0 (higher = more strict)
Energy Floor
Minimum RMS energy to feed audio to STT:
constexpr float ENERGY_FLOOR = 0.005f;  // RMS threshold
See src/pipeline/orchestrator.cpp:778.
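The gate itself is a plain RMS comparison; a minimal sketch consistent with the constant above (the function name is illustrative):

```cpp
#include <cmath>
#include <cstddef>

constexpr float ENERGY_FLOOR = 0.005f;  // RMS threshold

// True when the frame is loud enough to forward to STT.
bool above_energy_floor(const float* samples, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += samples[i] * samples[i];
    return n && std::sqrt(sum / n) >= ENERGY_FLOOR;
}
```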
KV Cache Reuse
Disable KV cache reuse (forces full prompt re-eval):
export RCLI_LLM_NO_CACHE=1
Disabling KV cache increases time-to-first-token by 2-3x. Only disable for debugging.
Configuration Precedence
CLI flags (highest priority)
rcli --gpu-layers 20 --ctx-size 8192
Environment variables
export RCLI_LLM_GPU_LAYERS = 20
Config files (~/Library/RCLI/config/*.json)
Hardware auto-detection (lowest priority)
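The precedence chain can be sketched as a fall-through resolver (hypothetical helper, shown here for gpu_layers):

```cpp
#include <cstdlib>
#include <optional>

// Resolve a setting: CLI flag > environment variable > config file > auto-detect.
// cli_flag and config_value are std::optional to model "not provided".
int resolve_gpu_layers(std::optional<int> cli_flag,
                       const char* env_name,
                       std::optional<int> config_value,
                       int auto_detected) {
    if (cli_flag) return *cli_flag;                                 // 1. CLI flag
    if (const char* env = std::getenv(env_name)) return std::atoi(env);  // 2. Env var
    if (config_value) return *config_value;                         // 3. Config file
    return auto_detected;                                           // 4. Auto-detect
}
```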
Next Steps
Performance: Benchmark results and optimization techniques
Troubleshooting: Common issues, debugging, and logs