Configuration Files
RCLI stores configuration in ~/Library/RCLI/config/. All settings persist across launches.
Directory Structure
~/Library/RCLI/
├── config/
│   ├── active_models.json   # Current LLM/STT/TTS selection
│   ├── action_states.json   # Enabled/disabled actions
│   └── user_config.json     # User preferences
├── models/                  # Downloaded AI models
│   ├── llm/
│   ├── stt/
│   ├── tts/
│   ├── vad/
│   └── embedding/
└── index/                   # RAG vector index
    ├── chunks.bin
    ├── metadata.json
    └── usearch.index
active_models.json
Stores currently active model selections:
{
  "llm": {
    "model_id": "qwen3-0.6b",
    "path": "~/Library/RCLI/models/llm/qwen3-0.6b-q4_k_m.gguf",
    "gpu_layers": 99,
    "ctx_size": 4096,
    "flash_attn": true
  },
  "stt": {
    "streaming_model": "zipformer",
    "offline_model": "whisper-base.en"
  },
  "tts": {
    "model_id": "piper-lessac",
    "voice_id": "en_US-lessac-medium",
    "speaker_id": 0
  }
}
This file is auto-generated. To change models, use rcli models or press M in the TUI.
action_states.json
Tracks which of the 43 macOS actions are enabled:
{
  "enabled_actions": [
    "open_app",
    "quit_app",
    "create_note",
    "play_on_spotify",
    "set_volume",
    "search_web"
  ],
  "disabled_actions": [
    "lock_screen",
    "send_message"
  ]
}
Actions can be toggled in the TUI (A → Select action → Space to toggle) or via CLI:
rcli actions enable open_app
rcli actions disable send_message
user_config.json
User preferences (planned for future release):
{
  "system_prompt": "You are a helpful voice assistant.",
  "conversation_history_turns": 10,
  "tts_speed": 1.0,
  "tool_call_trace_enabled": true,
  "mic_silence_threshold_ms": 800,
  "theme": "dark"
}
Environment Variables
Model Paths
export RCLI_MODELS_DIR="~/Library/RCLI/models"  # Default model directory
LLM Configuration
export RCLI_LLM_GPU_LAYERS=99     # GPU layers (0-99)
export RCLI_LLM_CTX_SIZE=4096     # Context window size
export RCLI_LLM_THREADS=1         # Decode threads
export RCLI_LLM_THREADS_BATCH=10  # Prompt eval threads
export RCLI_LLM_BATCH=2048        # Batch size
export RCLI_LLM_UBATCH=1024       # Micro-batch size
export RCLI_LLM_FLASH_ATTN=1      # Enable Flash Attention (1=on, 0=off)
export RCLI_LLM_USE_MLOCK=1       # Pin weights in RAM (1=on, 0=off)
STT Configuration
export RCLI_STT_SAMPLE_RATE=16000  # Audio sample rate (Hz)
export RCLI_STT_SILENCE_MS=800     # Silence detection threshold
TTS Configuration
export RCLI_TTS_SAMPLE_RATE=22050  # Audio sample rate (Hz)
export RCLI_TTS_SPEED=1.0          # Playback speed (0.5-2.0)
export RCLI_TTS_SPEAKER_ID=0       # Multi-speaker voice ID
RAG Configuration
export RCLI_RAG_INDEX_PATH="~/Library/RCLI/index"  # Index directory
export RCLI_RAG_TOP_K=5               # Number of results to retrieve
export RCLI_RAG_VECTOR_CANDIDATES=20  # Vector search candidates
export RCLI_RAG_BM25_CANDIDATES=20    # BM25 search candidates
export RCLI_RAG_RRF_K=60.0            # RRF fusion parameter
export RCLI_RAG_CACHE_SIZE_MB=256     # Embedding cache size
Memory Pool
export RCLI_POOL_SIZE_MB=64  # Pre-allocated pool size
Logging
export RCLI_LOG_LEVEL=INFO  # DEBUG, INFO, WARN, ERROR
export RCLI_VERBOSE=1       # Enable verbose logging
GPU Layers
Control how many LLM layers run on Metal GPU vs CPU.
Auto-Detection (Default)
RCLI auto-detects optimal GPU layers based on model size and available RAM:
// For models that fit in VRAM
if (model_size_mb < vram_budget_mb) {
    gpu_layers = 99;  // All layers to GPU
} else {
    gpu_layers = estimate_max_layers(model_size_mb, vram_budget_mb);
}
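The estimate_max_layers() helper above is not shown. A hedged sketch of what such a helper could look like; the total_layers parameter and the proportional split are assumptions for illustration, not RCLI's actual logic:

```cpp
#include <algorithm>

// Hypothetical sketch: offload layers in proportion to the VRAM budget.
// Assumes transformer layers are roughly uniform in size.
int estimate_max_layers(int model_size_mb, int vram_budget_mb, int total_layers) {
    if (model_size_mb <= 0) return 0;
    int layers = static_cast<int>(
        static_cast<long long>(total_layers) * vram_budget_mb / model_size_mb);
    return std::clamp(layers, 0, total_layers);
}
```

For example, a 28-layer model of 8000 MB with a 4000 MB budget would offload 14 layers.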
Manual Override
# CPU-only (slowest, lowest memory)
rcli --gpu-layers 0
# Hybrid: 20 layers on GPU, rest on CPU
rcli --gpu-layers 20
# All layers on GPU (fastest, default)
rcli --gpu-layers 99
Configuration        | First Token (ms) | Throughput (tok/s) | Memory (MB)
---------------------|------------------|--------------------|------------
CPU-only (0 layers)  | 145.2            | 62.3               | 680
Hybrid (20 layers)   | 38.6             | 187.4              | 1250
GPU-only (99 layers) | 22.5             | 249.8              | 1450
For large models (>4B params) on devices with <32GB RAM, reduce GPU layers to avoid out-of-memory errors.
Context Size
Context window determines how much conversation history and RAG context fits in the LLM.
Supported Context Sizes
Model        | Max Context | Recommended
-------------|-------------|------------
Qwen3 0.6B   | 32,768      | 4,096
Qwen3.5 0.8B | 32,768      | 4,096
Qwen3.5 2B   | 32,768      | 8,192
Qwen3.5 4B   | 262,144     | 16,384
LFM2 350M    | 131,072     | 4,096
LFM2 1.2B    | 131,072     | 8,192
LFM2 2.6B    | 131,072     | 16,384
Configuration
# Via CLI flag
rcli --ctx-size 8192
# Via environment variable
export RCLI_LLM_CTX_SIZE=8192
rcli
Memory Usage by Context Size
Context | KV Cache Memory | Total Memory (Qwen3 0.6B)
--------|-----------------|---------------------------
2,048   | 64 MB           | 520 MB
4,096   | 128 MB          | 584 MB
8,192   | 256 MB          | 712 MB
16,384  | 512 MB          | 968 MB
Use smaller context sizes (2048-4096) for faster inference. Increase to 8192+ only if you need long conversation history or large RAG context.
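KV cache size scales linearly with context length, following the standard transformer accounting (one K and one V tensor per layer per token). A generic calculator; the dimensions used in any example call are illustrative, not RCLI's actual model shapes:

```cpp
#include <cstddef>

// KV cache bytes = 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes/elem.
// bytes_per_elem is 2 for fp16, 4 for fp32.
std::size_t kv_cache_bytes(std::size_t n_layers, std::size_t n_ctx,
                           std::size_t n_kv_heads, std::size_t head_dim,
                           std::size_t bytes_per_elem) {
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem;
}
```

Doubling n_ctx doubles the result, which matches the scaling in the table above.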
Thread Configuration
Decode Threads
Used during token generation (autoregressive decode). Default: 1 thread (GPU-bound).
# Override (rarely needed)
export RCLI_LLM_THREADS=4
When GPU offload is enabled, LLM decode is GPU-bound. Using >1 thread can actually slow down inference due to synchronization overhead.
Prompt Eval Threads
Used during prompt processing (parallel matrix ops). Default: all P-cores.
# M3 Max (10 P-cores): auto-detected
export RCLI_LLM_THREADS_BATCH=10
# Override to use fewer cores
export RCLI_LLM_THREADS_BATCH=4
Batch Configuration
Batch Size
Number of tokens processed in parallel during prompt evaluation.
export RCLI_LLM_BATCH=2048  # Default (high RAM)
Larger batches = faster prompt processing, but more memory.
Micro-Batch Size
Internal subdivision for memory efficiency.
export RCLI_LLM_UBATCH=1024  # Half of batch size
Typically set to batch_size / 2.
Recommended Batch Sizes
RAM Tier | Batch | Ubatch | Use Case
---------|-------|--------|------------------
64+ GB   | 4096  | 2048   | Mac Studio / Pro
32-48 GB | 2048  | 1024   | M3 Max / M2 Ultra
16-24 GB | 1024  | 512    | M3 / M2 / M1
<16 GB   | 512   | 256    | M1 base
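The tiers map directly to a small lookup. The thresholds below are inferred from the ranges in the table and are not RCLI's actual detection code:

```cpp
#include <utility>

// Map total RAM (GB) to (batch, ubatch) per the tier table.
// Ubatch is always batch / 2, matching the defaults above.
std::pair<int, int> batch_for_ram(int ram_gb) {
    int batch;
    if (ram_gb >= 64)      batch = 4096;  // Mac Studio / Pro
    else if (ram_gb >= 32) batch = 2048;  // M3 Max / M2 Ultra
    else if (ram_gb >= 16) batch = 1024;  // M3 / M2 / M1
    else                   batch = 512;   // M1 base
    return {batch, batch / 2};
}
```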
Flash Attention
Flash Attention avoids materializing the O(n²) attention matrix, reducing attention memory to O(n) and speeding up self-attention by ~20%.
Enable/Disable
# Enable (default on Metal)
export RCLI_LLM_FLASH_ATTN=1
# Disable
export RCLI_LLM_FLASH_ATTN=0
Configuration      | KV Memory (8K ctx) | Attention Speed
-------------------|--------------------|----------------
Standard Attention | 512 MB             | 1.0x
Flash Attention    | 256 MB             | 1.2x
Flash Attention is automatically enabled on macOS with Metal GPU. Disable only if debugging attention issues.
Memory Locking (mlock)
Pins model weights in RAM to prevent swapping to disk.
Configuration
# Enable (default on systems with >=16GB RAM)
export RCLI_LLM_USE_MLOCK=1
# Disable
export RCLI_LLM_USE_MLOCK=0
Impact
Pros: Eliminates disk I/O latency from swap
Cons: Reduces available RAM for other apps
On systems with <16GB RAM, mlock can cause OOM errors. RCLI auto-disables mlock on low-RAM systems.
RAG Configuration
Vector Search Parameters
export RCLI_RAG_VECTOR_CANDIDATES=20  # HNSW candidates
export RCLI_RAG_HNSW_EF=50            # HNSW search effort
export RCLI_RAG_HNSW_M=16             # HNSW graph connectivity
Higher values = better accuracy, slower search.
BM25 Parameters
export RCLI_RAG_BM25_CANDIDATES=20  # BM25 top-k
export RCLI_RAG_BM25_K1=1.2         # Term saturation
export RCLI_RAG_BM25_B=0.75         # Length normalization
Hybrid Fusion
export RCLI_RAG_RRF_K=60.0  # Reciprocal Rank Fusion parameter
RRF formula:
score(doc) = Σ 1 / (k + rank_source(doc))
Lower k = stronger fusion. Typical range: 10-100.
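As a minimal sketch of this fusion (illustrative, not RCLI's implementation), with 1-based ranks and one ranking per source:

```cpp
#include <map>
#include <string>
#include <vector>

// Reciprocal Rank Fusion: score(doc) = sum over sources of 1 / (k + rank).
// Each input vector is one source's ranking, best document first.
std::map<std::string, double> rrf_fuse(
        const std::vector<std::vector<std::string>>& rankings, double k = 60.0) {
    std::map<std::string, double> scores;
    for (const auto& ranking : rankings) {
        for (std::size_t i = 0; i < ranking.size(); ++i) {
            scores[ranking[i]] += 1.0 / (k + static_cast<double>(i + 1));
        }
    }
    return scores;
}
```

A document ranked 1st by vector search and 2nd by BM25 scores 1/61 + 1/62 at the default k = 60.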
Embedding Cache
export RCLI_RAG_CACHE_SIZE_MB=256  # LRU cache size
Caches query embeddings to avoid recomputation. At 384-dim float32 (1.5 KB per embedding), 256 MB holds roughly 175K embeddings.
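The capacity estimate is straightforward arithmetic; a small helper, ignoring per-entry bookkeeping overhead:

```cpp
#include <cstddef>

// Number of embeddings that fit in a cache of cache_mb mebibytes.
std::size_t cache_capacity(std::size_t cache_mb, std::size_t dim,
                           std::size_t bytes_per_elem) {
    return (cache_mb * 1024 * 1024) / (dim * bytes_per_elem);
}
```

256 MiB / (384 × 4 B) ≈ 174,762 entries.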
Audio Configuration
Sample Rates
export RCLI_STT_SAMPLE_RATE=16000  # STT (16 kHz standard)
export RCLI_TTS_SAMPLE_RATE=22050  # TTS (22.05 kHz Piper default)
Buffer Sizes
export RCLI_AUDIO_CAPTURE_BUFFER=16384   # Capture ring buffer (samples)
export RCLI_AUDIO_PLAYBACK_BUFFER=44032  # Playback ring buffer (samples)
Larger buffers = more latency tolerance, higher memory.
Silence Detection
export RCLI_STT_SILENCE_MS=800  # End-of-speech silence threshold
Shorter values give faster responses but risk cutting speech off mid-sentence.
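Conceptually, end-of-speech detection accumulates consecutive silent milliseconds and fires once the threshold is reached. A hypothetical sketch, not RCLI's actual detector:

```cpp
#include <cmath>
#include <cstddef>

// End-of-speech sketch: fires once frame RMS stays below a floor for silence_ms.
class SilenceDetector {
public:
    SilenceDetector(float rms_floor, int silence_ms)
        : rms_floor_(rms_floor), silence_ms_(silence_ms) {}

    // Feed one audio frame; returns true when end-of-speech is detected.
    bool feed(const float* samples, std::size_t n, int frame_ms) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += samples[i] * samples[i];
        float rms = n ? static_cast<float>(std::sqrt(sum / n)) : 0.0f;
        silent_ms_ = (rms < rms_floor_) ? silent_ms_ + frame_ms : 0;  // reset on speech
        return silent_ms_ >= silence_ms_;
    }

private:
    float rms_floor_;
    int silence_ms_;
    int silent_ms_ = 0;
};
```

With RCLI_STT_SILENCE_MS=800 and 10 ms frames, 80 consecutive silent frames trigger end-of-speech.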
System Prompt
Customize the LLM’s system prompt:
export RCLI_SYSTEM_PROMPT="You are a helpful voice assistant specialized in macOS automation."
Or edit ~/Library/RCLI/config/user_config.json:
{
  "system_prompt": "You are a concise assistant. Respond in 1-2 sentences."
}
Logging
Log Levels
export RCLI_LOG_LEVEL=DEBUG  # DEBUG, INFO, WARN, ERROR
Verbose Mode
rcli -v # Enable verbose logging
rcli --verbose # Same
Outputs:
STT transcription results
LLM token generation stats
TTS synthesis latency
Tool call traces
Memory pool utilization
Log Files
Logs are written to stderr. Redirect to file:
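For example, with a POSIX shell (the destination path is arbitrary):

```shell
# Append RCLI's stderr log stream to a file; stdout stays on the terminal
rcli 2>> ~/Library/Logs/rcli.log

# Watch logs live while also saving them
rcli 2>&1 | tee rcli.log
```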
Advanced Tuning
Sentence Detection
Customize sentence boundary detection:
SentenceDetector detector(
    callback,
    /*min_words=*/3,       // Min words for primary break (. ! ?)
    /*max_words_sec=*/25,  // Max words before secondary break (; :)
    /*word_flush=*/7       // Flush at N words if no punctuation
);
See src/pipeline/sentence_detector.cpp:6.
VAD Threshold
Adjust voice activity detection sensitivity:
export RCLI_VAD_THRESHOLD=0.5  # 0.0-1.0 (higher = more strict)
Energy Floor
Minimum RMS energy to feed audio to STT:
constexpr float ENERGY_FLOOR = 0.005f;  // RMS threshold
See src/pipeline/orchestrator.cpp:778.
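The gate itself is a plain RMS comparison; a minimal sketch consistent with the constant above (the function name is illustrative):

```cpp
#include <cmath>
#include <cstddef>

constexpr float ENERGY_FLOOR = 0.005f;  // RMS threshold

// True when the frame is loud enough to forward to STT.
bool above_energy_floor(const float* samples, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += samples[i] * samples[i];
    return n && std::sqrt(sum / n) >= ENERGY_FLOOR;
}
```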
KV Cache Reuse
Disable KV cache reuse (forces full prompt re-eval):
export RCLI_LLM_NO_CACHE=1
Disabling KV cache increases time-to-first-token by 2-3x. Only disable for debugging.
Configuration Precedence
CLI flags (highest priority)
rcli --gpu-layers 20 --ctx-size 8192
Environment variables
export RCLI_LLM_GPU_LAYERS = 20
Config files (~/Library/RCLI/config/*.json)
Hardware auto-detection (lowest priority)
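The precedence chain can be sketched as a fall-through resolver (hypothetical helper, shown here for gpu_layers):

```cpp
#include <cstdlib>
#include <optional>

// Resolve a setting: CLI flag > environment variable > config file > auto-detect.
// cli_flag and config_value are std::optional to model "not provided".
int resolve_gpu_layers(std::optional<int> cli_flag,
                       const char* env_name,
                       std::optional<int> config_value,
                       int auto_detected) {
    if (cli_flag) return *cli_flag;                                 // 1. CLI flag
    if (const char* env = std::getenv(env_name)) return std::atoi(env);  // 2. Env var
    if (config_value) return *config_value;                         // 3. Config file
    return auto_detected;                                           // 4. Auto-detect
}
```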
Next Steps
Performance: Benchmark results and optimization techniques
Troubleshooting: Common issues, debugging, and logs