
Configuration Files

RCLI stores configuration in ~/Library/RCLI/config/. All settings persist across launches.

Directory Structure

~/Library/RCLI/
├── config/
│   ├── active_models.json       # Current LLM/STT/TTS selection
│   ├── action_states.json       # Enabled/disabled actions
│   └── user_config.json         # User preferences
├── models/                      # Downloaded AI models
│   ├── llm/
│   ├── stt/
│   ├── tts/
│   ├── vad/
│   └── embedding/
└── index/                       # RAG vector index
    ├── chunks.bin
    ├── metadata.json
    └── usearch.index

active_models.json

Stores currently active model selections:
{
  "llm": {
    "model_id": "qwen3-0.6b",
    "path": "~/Library/RCLI/models/llm/qwen3-0.6b-q4_k_m.gguf",
    "gpu_layers": 99,
    "ctx_size": 4096,
    "flash_attn": true
  },
  "stt": {
    "streaming_model": "zipformer",
    "offline_model": "whisper-base.en"
  },
  "tts": {
    "model_id": "piper-lessac",
    "voice_id": "en_US-lessac-medium",
    "speaker_id": 0
  }
}
This file is auto-generated. To change models, use rcli models or press M in the TUI.

action_states.json

Tracks which of the 43 macOS actions are enabled:
{
  "enabled_actions": [
    "open_app",
    "quit_app",
    "create_note",
    "play_on_spotify",
    "set_volume",
    "search_web"
  ],
  "disabled_actions": [
    "lock_screen",
    "send_message"
  ]
}
Actions can be toggled in the TUI (A → Select action → Space to toggle) or via CLI:
rcli actions enable open_app
rcli actions disable send_message

user_config.json

User preferences (planned for future release):
{
  "system_prompt": "You are a helpful voice assistant.",
  "conversation_history_turns": 10,
  "tts_speed": 1.0,
  "tool_call_trace_enabled": true,
  "mic_silence_threshold_ms": 800,
  "theme": "dark"
}

Environment Variables

Model Paths

export RCLI_MODELS_DIR="~/Library/RCLI/models"  # Default model directory

LLM Configuration

export RCLI_LLM_GPU_LAYERS=99       # GPU layers (0-99)
export RCLI_LLM_CTX_SIZE=4096       # Context window size
export RCLI_LLM_THREADS=1           # Decode threads
export RCLI_LLM_THREADS_BATCH=10    # Prompt eval threads
export RCLI_LLM_BATCH=2048          # Batch size
export RCLI_LLM_UBATCH=1024         # Micro-batch size
export RCLI_LLM_FLASH_ATTN=1        # Enable Flash Attention (1=on, 0=off)
export RCLI_LLM_USE_MLOCK=1         # Pin weights in RAM (1=on, 0=off)

STT Configuration

export RCLI_STT_SAMPLE_RATE=16000   # Audio sample rate (Hz)
export RCLI_STT_SILENCE_MS=800      # Silence detection threshold

TTS Configuration

export RCLI_TTS_SAMPLE_RATE=22050   # Audio sample rate (Hz)
export RCLI_TTS_SPEED=1.0           # Playback speed (0.5-2.0)
export RCLI_TTS_SPEAKER_ID=0        # Multi-speaker voice ID

RAG Configuration

export RCLI_RAG_INDEX_PATH="~/Library/RCLI/index"  # Index directory
export RCLI_RAG_TOP_K=5             # Number of results to retrieve
export RCLI_RAG_VECTOR_CANDIDATES=20 # Vector search candidates
export RCLI_RAG_BM25_CANDIDATES=20  # BM25 search candidates
export RCLI_RAG_RRF_K=60.0          # RRF fusion parameter
export RCLI_RAG_CACHE_SIZE_MB=256   # Embedding cache size

Memory Pool

export RCLI_POOL_SIZE_MB=64         # Pre-allocated pool size

Logging

export RCLI_LOG_LEVEL=INFO          # DEBUG, INFO, WARN, ERROR
export RCLI_VERBOSE=1               # Enable verbose logging

GPU Layers

Control how many LLM layers run on Metal GPU vs CPU.

Auto-Detection (Default)

RCLI auto-detects optimal GPU layers based on model size and available RAM:
// For models that fit in VRAM
if (model_size_mb < vram_budget_mb) {
    gpu_layers = 99;  // All layers to GPU
} else {
    gpu_layers = estimate_max_layers(model_size_mb, vram_budget_mb);
}
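When the model does not fit, a proportional estimate is a reasonable approximation. The sketch below is illustrative only: the name `estimate_max_layers` comes from the snippet above, but this body is an assumption about how such a function could work, not RCLI's actual implementation.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch: offload layers in proportion to the VRAM budget.
// Assumes layers are roughly uniform in size, which holds for most GGUF models.
int estimate_max_layers(int64_t model_size_mb, int64_t vram_budget_mb,
                        int total_layers = 99) {
    if (model_size_mb <= 0) return 0;
    // Fraction of the model that fits, scaled to the layer count.
    int64_t layers = total_layers * vram_budget_mb / model_size_mb;
    return static_cast<int>(std::clamp<int64_t>(layers, 0, total_layers));
}
```

An 8 GB model with a 4 GB VRAM budget would offload roughly half its layers; anything that fits entirely gets all 99.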

Manual Override

# CPU-only (slowest, lowest memory)
rcli --gpu-layers 0

# Hybrid: 20 layers on GPU, rest on CPU
rcli --gpu-layers 20

# All layers on GPU (fastest, default)
rcli --gpu-layers 99

Performance Impact

Configuration         First Token (ms)  Throughput (tok/s)  Memory (MB)
CPU-only (0 layers)   145.2             62.3                680
Hybrid (20 layers)    38.6              187.4               1250
GPU-only (99 layers)  22.5              249.8               1450
For large models (>4B params) on devices with <32GB RAM, reduce GPU layers to avoid out-of-memory errors.

Context Size

Context window determines how much conversation history and RAG context fits in the LLM.

Supported Context Sizes

Model         Max Context  Recommended
Qwen3 0.6B    32,768       4,096
Qwen3.5 0.8B  32,768       4,096
Qwen3.5 2B    32,768       8,192
Qwen3.5 4B    262,144      16,384
LFM2 350M     131,072      4,096
LFM2 1.2B     131,072      8,192
LFM2 2.6B     131,072      16,384

Configuration

# Via CLI flag
rcli --ctx-size 8192

# Via environment variable
export RCLI_LLM_CTX_SIZE=8192
rcli

Memory Usage by Context Size

Context  KV Cache Memory  Total Memory (Qwen3 0.6B)
2,048    64 MB            520 MB
4,096    128 MB           584 MB
8,192    256 MB           712 MB
16,384   512 MB           968 MB
Use smaller context sizes (2048-4096) for faster inference. Increase to 8192+ only if you need long conversation history or large RAG context.
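KV cache size follows the standard formula: two tensors (K and V) per layer, one vector per token per KV head. The sketch below shows the arithmetic; the layer and head counts in the usage note are hypothetical values chosen to reproduce the table, not Qwen3 0.6B's actual architecture.

```cpp
#include <cstdint>

// kv_bytes = 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes_per_elem
int64_t kv_cache_bytes(int n_layers, int n_ctx, int n_kv_heads,
                       int head_dim, int bytes_per_elem /* 2 for f16 */) {
    return 2LL * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem;
}
```

For example, 16 layers with 4 KV heads of dimension 128 in f16 cost 64 MB at a 2,048-token context, doubling with each doubling of context, which matches the growth pattern in the table.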

Thread Configuration

Decode Threads

Used during token generation (autoregressive decode). Default: 1 thread (GPU-bound).
# Override (rarely needed)
export RCLI_LLM_THREADS=4
When GPU offload is enabled, LLM decode is GPU-bound. Using >1 thread can actually slow down inference due to synchronization overhead.

Prompt Eval Threads

Used during prompt processing (parallel matrix ops). Default: all P-cores.
# M3 Max (10 P-cores): auto-detected
export RCLI_LLM_THREADS_BATCH=10

# Override to use fewer cores
export RCLI_LLM_THREADS_BATCH=4

Batch Configuration

Batch Size

Number of tokens processed in parallel during prompt evaluation.
export RCLI_LLM_BATCH=2048  # Default (high RAM)
Larger batches = faster prompt processing, but more memory.

Micro-Batch Size

Internal subdivision for memory efficiency.
export RCLI_LLM_UBATCH=1024  # Half of batch size
Typically set to batch_size / 2.
RAM Tier  Batch  Ubatch  Use Case
64+ GB    4096   2048    Mac Studio / Pro
32-48 GB  2048   1024    M3 Max / M2 Ultra
16-24 GB  1024   512     M3 / M2 / M1
<16 GB    512    256     M1 base
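The tiers above amount to a simple lookup by system RAM. This helper mirrors the table; the function and its name are illustrative, not RCLI's actual code.

```cpp
#include <utility>

// Pick (batch, ubatch) from system RAM, per the tier table above.
std::pair<int, int> batch_for_ram(int ram_gb) {
    if (ram_gb >= 64) return {4096, 2048};
    if (ram_gb >= 32) return {2048, 1024};
    if (ram_gb >= 16) return {1024, 512};
    return {512, 256};
}
```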

Flash Attention

Flash Attention avoids materializing the full O(n²) attention-score matrix, reducing attention memory to O(n) and speeding up self-attention by ~20%.

Enable/Disable

# Enable (default on Metal)
export RCLI_LLM_FLASH_ATTN=1

# Disable
export RCLI_LLM_FLASH_ATTN=0

Performance Impact

Configuration       KV Memory (8K ctx)  Attention Speed
Standard Attention  512 MB              1.0x
Flash Attention     256 MB              1.2x
Flash Attention is automatically enabled on macOS with Metal GPU. Disable only if debugging attention issues.

Memory Locking (mlock)

Pins model weights in RAM to prevent swapping to disk.

Configuration

# Enable (default on systems with >=16GB RAM)
export RCLI_LLM_USE_MLOCK=1

# Disable
export RCLI_LLM_USE_MLOCK=0

Impact

  • Pros: Eliminates disk I/O latency from swap
  • Cons: Reduces available RAM for other apps
On systems with <16GB RAM, mlock can cause OOM errors. RCLI auto-disables mlock on low-RAM systems.

RAG Configuration

Vector Search Parameters

export RCLI_RAG_VECTOR_CANDIDATES=20  # HNSW candidates
export RCLI_RAG_HNSW_EF=50            # HNSW search effort
export RCLI_RAG_HNSW_M=16             # HNSW graph connectivity
Higher values = better accuracy, slower search.

BM25 Parameters

export RCLI_RAG_BM25_CANDIDATES=20  # BM25 top-k
export RCLI_RAG_BM25_K1=1.2         # Term saturation
export RCLI_RAG_BM25_B=0.75         # Length normalization

Hybrid Fusion

export RCLI_RAG_RRF_K=60.0  # Reciprocal Rank Fusion parameter
RRF formula:
score(doc) = Σ 1 / (k + rank_source(doc))
Lower k weights top-ranked results more heavily. Typical range: 10-100.
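As a worked illustration of the formula (a standalone sketch, not RCLI's implementation), here is RRF fusing two ranked lists, such as a vector ranking and a BM25 ranking:

```cpp
#include <map>
#include <string>
#include <vector>

// Reciprocal Rank Fusion: score(doc) = sum over sources of 1 / (k + rank).
// Ranks are 1-based; a doc missing from a source contributes nothing there.
std::map<std::string, double> rrf_fuse(
        const std::vector<std::vector<std::string>>& rankings,
        double k = 60.0) {
    std::map<std::string, double> scores;
    for (const auto& ranking : rankings) {
        for (size_t i = 0; i < ranking.size(); ++i) {
            scores[ranking[i]] += 1.0 / (k + static_cast<double>(i + 1));
        }
    }
    return scores;
}
```

A document ranked first by both sources scores 2/(60+1) ≈ 0.033, beating any document ranked first by only one source.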

Embedding Cache

export RCLI_RAG_CACHE_SIZE_MB=256  # LRU cache size
Caches query embeddings to avoid recomputation. 256 MB stores ~175K embeddings (384-dim float32, 1,536 bytes each).
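Capacity is just cache bytes divided by embedding size; a quick sketch of the arithmetic (illustrative helper, ignoring key and bookkeeping overhead):

```cpp
#include <cstdint>

// Number of float32 embeddings an LRU cache of the given size can hold.
int64_t cache_capacity(int64_t cache_mb, int dim = 384) {
    const int64_t bytes_per_embedding =
        static_cast<int64_t>(dim) * sizeof(float);  // 384 * 4 = 1,536 bytes
    return cache_mb * 1024 * 1024 / bytes_per_embedding;
}
```

256 MB ÷ 1,536 bytes per embedding comes out to roughly 175K entries.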

Audio Configuration

Sample Rates

export RCLI_STT_SAMPLE_RATE=16000   # STT (16 kHz standard)
export RCLI_TTS_SAMPLE_RATE=22050   # TTS (22.05 kHz Piper default)

Buffer Sizes

export RCLI_AUDIO_CAPTURE_BUFFER=16384   # Capture ring buffer (samples)
export RCLI_AUDIO_PLAYBACK_BUFFER=44032  # Playback ring buffer (samples)
Larger buffers = more latency tolerance, higher memory.
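Buffer headroom in seconds is simply samples divided by sample rate. A tiny sketch (the function name is illustrative):

```cpp
// Seconds of audio a ring buffer holds: samples / sample_rate.
double buffer_seconds(int samples, int sample_rate_hz) {
    return static_cast<double>(samples) / sample_rate_hz;
}
```

At the defaults, the capture buffer holds 16384 / 16000 ≈ 1.02 s of audio and the playback buffer 44032 / 22050 ≈ 2.0 s.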

Silence Detection

export RCLI_STT_SILENCE_MS=800  # End-of-speech silence threshold
Shorter values = faster response, more false positives.

System Prompt

Customize the LLM’s system prompt:
export RCLI_SYSTEM_PROMPT="You are a helpful voice assistant specialized in macOS automation."
Or edit ~/Library/RCLI/config/user_config.json:
{
  "system_prompt": "You are a concise assistant. Respond in 1-2 sentences."
}

Logging

Log Levels

export RCLI_LOG_LEVEL=DEBUG  # DEBUG, INFO, WARN, ERROR

Verbose Mode

rcli -v                # Enable verbose logging
rcli --verbose         # Same
Outputs:
  • STT transcription results
  • LLM token generation stats
  • TTS synthesis latency
  • Tool call traces
  • Memory pool utilization

Log Files

Logs are written to stderr. Redirect to file:
rcli 2> rcli.log

Advanced Tuning

Sentence Detection

Customize sentence boundary detection:
SentenceDetector detector(
    callback,
    /*min_words=*/3,       // Min words for primary break (. ! ?)
    /*max_words_sec=*/25,  // Max words before secondary break (; :)
    /*word_flush=*/7       // Flush at N words if no punctuation
);
See src/pipeline/sentence_detector.cpp:6.

VAD Threshold

Adjust voice activity detection sensitivity:
export RCLI_VAD_THRESHOLD=0.5  # 0.0-1.0 (higher = more strict)

Energy Floor

Minimum RMS energy to feed audio to STT:
constexpr float ENERGY_FLOOR = 0.005f;  // RMS threshold
See src/pipeline/orchestrator.cpp:778.
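The gate computes RMS energy over each audio frame and drops frames below the floor. A self-contained sketch of that check (illustrative, not the orchestrator's actual code):

```cpp
#include <cmath>
#include <cstddef>

constexpr float ENERGY_FLOOR = 0.005f;  // RMS threshold, as above

// True if the frame is loud enough to forward to STT.
bool passes_energy_floor(const float* samples, std::size_t n) {
    if (n == 0) return false;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum_sq += static_cast<double>(samples[i]) * samples[i];
    }
    const double rms = std::sqrt(sum_sq / n);
    return rms >= ENERGY_FLOOR;
}
```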

KV Cache Reuse

Disable KV cache reuse (forces full prompt re-eval):
export RCLI_LLM_NO_CACHE=1
Disabling KV cache increases time-to-first-token by 2-3x. Only disable for debugging.

Configuration Precedence

  1. CLI flags (highest priority)
    rcli --gpu-layers 20 --ctx-size 8192
    
  2. Environment variables
    export RCLI_LLM_GPU_LAYERS=20
    
  3. Config files (~/Library/RCLI/config/*.json)
  4. Hardware auto-detection (lowest priority)
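Resolution can be pictured as a first-set-wins chain over the four layers. A hedged sketch of that logic (not RCLI's actual code):

```cpp
#include <optional>

// First-set-wins: CLI flag > env var > config file > auto-detected default.
int resolve_setting(std::optional<int> cli_flag,
                    std::optional<int> env_var,
                    std::optional<int> config_file,
                    int auto_detected) {
    if (cli_flag)    return *cli_flag;
    if (env_var)     return *env_var;
    if (config_file) return *config_file;
    return auto_detected;
}
```

For example, `rcli --gpu-layers 20` wins over `RCLI_LLM_GPU_LAYERS=40` even if both are set.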

Next Steps

  • Performance: Benchmark results and optimization techniques
  • Troubleshooting: Common issues, debugging, and logs
