Global Options
Specify the models directory path. Useful for testing different model sets or managing multiple configurations.
Load the RAG index for document-grounded answers. When enabled, all queries are augmented with retrieved context from the indexed documents.
Number of LLM layers to offload to the GPU. When to adjust:
- Set to `0` for CPU-only testing
- Reduce if experiencing memory pressure
- Default `99` uses all available GPU layers (optimal for Apple Silicon)
LLM context window size (tokens). Tradeoffs:
- Larger context: more conversation history, higher memory usage
- Smaller context: faster inference, less memory
Disable TTS audio playback (text output only). Use cases:
- Silent operation (text-only responses)
- Faster benchmarking (skip TTS synthesis)
- Headless environments
Enable debug logging. Shows detailed logs from llama.cpp, sherpa-onnx, and RCLI internals.
Display help information.
Benchmark Options
These options are specific to `rcli bench`:
Benchmark suite to run. Available suites:
- `all` - All benchmarks (default)
- `stt` - Speech-to-text
- `llm` - Language model generation
- `tts` - Text-to-speech synthesis
- `e2e` - End-to-end pipeline (voice in → audio out)
- `tools` - Tool-calling accuracy and latency
- `rag` - RAG retrieval performance
- `memory` - Memory usage profiling
Number of measured runs per benchmark. More runs = more stable results, longer total runtime.
Export benchmark results to a JSON file.
Benchmark all installed LLM models.Compares performance across all downloaded language models.
Benchmark all installed TTS voices.Compares synthesis speed and quality across all downloaded voices.
Specify LLM model for benchmark (overrides active selection).
Specify TTS voice for benchmark (overrides active selection).
Specify STT model for benchmark (overrides active selection).
Examples
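A few illustrative invocations combining the flags documented above (these only show flag usage; exact output depends on your installed models):

```shell
# Fast, text-only benchmark run: full GPU offload, small context, no TTS
rcli bench --gpu-layers 99 --ctx-size 2048 --no-speak

# CPU-only run, useful when diagnosing memory pressure
rcli bench --gpu-layers 0
```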
Option Precedence
When the same option is specified multiple times, the effective value is resolved in this order:
1. Command-line flags (highest priority)
2. Config file settings
3. Default values (lowest priority)
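The precedence rules above can be sketched as a small resolver. This is an illustrative Python sketch, not RCLI internals; the function name `resolve_option` and the default `ctx-size` value are assumptions for the example (the `gpu-layers` default of `99` is documented above):

```python
def resolve_option(name, cli_flags, config, defaults):
    """Return the effective value for an option, checking sources in
    precedence order: CLI flag > config file > built-in default."""
    for source in (cli_flags, config, defaults):
        if name in source and source[name] is not None:
            return source[name]
    raise KeyError(f"unknown option: {name}")

defaults = {"ctx-size": 4096, "gpu-layers": 99}  # assumed defaults for illustration
config = {"ctx-size": 8192}                      # e.g. from a config file
cli_flags = {"ctx-size": 2048}                   # e.g. --ctx-size 2048

print(resolve_option("ctx-size", cli_flags, config, defaults))    # 2048 (flag wins)
print(resolve_option("gpu-layers", cli_flags, config, defaults))  # 99 (default)
```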
Performance Tips
Optimize for Speed
- Full GPU offload (`--gpu-layers 99`)
- Smaller context window (`--ctx-size 2048`)
- Skip TTS (`--no-speak`)
Optimize for Memory
- CPU-only inference (`--gpu-layers 0`)
- Small context (`--ctx-size 2048`)
- Use smaller models via `rcli models`
Optimize for Quality
- Large context window (`--ctx-size 8192`)
- RAG enabled for grounded answers
- Use larger models (Qwen3.5 4B) via `rcli upgrade-llm`
Related
- Commands - Complete command reference
- Environment - Environment variables and config files