Basic Usage
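With no flags, the CLI starts an interactive session. A minimal sketch of the invocation, assuming the script is launched as a module from the repository root (the file lives at scripts/chat_cli.py, referenced later in this page):

```bash
python -m scripts.chat_cli
```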
Run the chat interface with default settings.

Command-Line Options
Model Selection
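For example, to select between model sources (launcher path assumed; flag meanings from the All Flags Reference table below):

```bash
# chat with the SFT model (the default source)
python -m scripts.chat_cli -i sft

# chat with an RL model, optionally pinning a tag and training step
python -m scripts.chat_cli -i rl -g <model-tag> -s <step>
```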
Generation Parameters
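For example, a warmer sampling configuration (launcher path assumed; -t and -k are the temperature and top-k flags from the reference table below):

```bash
python -m scripts.chat_cli -t 0.8 -k 100
```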
Device Configuration
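For example, forcing CPU inference in full precision (launcher path assumed):

```bash
python -m scripts.chat_cli --device-type cpu -d float32
```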
Single Prompt Mode
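For example (launcher path assumed; -p is the single-prompt flag from the reference table below):

```bash
python -m scripts.chat_cli -p "Why is the sky blue?"
```

The CLI prints one completion and exits instead of entering the interactive loop.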
Get a single response without interactive mode.

Interactive Commands
When running in interactive mode:

| Command | Description |
|---|---|
| quit or exit | End the conversation and exit |
| clear | Start a new conversation (clears history) |
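The command handling above can be sketched as a simple read–eval loop. This is an illustrative reconstruction, not the actual chat_cli.py code; `generate_reply` is a hypothetical stand-in for the model call:

```python
def chat_loop(generate_reply, read_input=input):
    """Minimal REPL sketch: 'quit'/'exit' ends, 'clear' resets history."""
    history = []  # list of (role, text) turns
    while True:
        user_msg = read_input("You: ").strip()
        if user_msg in ("quit", "exit"):
            break
        if user_msg == "clear":
            history = []  # start a new conversation
            continue
        history.append(("user", user_msg))
        reply = generate_reply(history)
        history.append(("assistant", reply))
        print("Assistant:", reply)
```

Resetting `history` on clear is what makes the new conversation fresh: nothing from earlier turns is fed back to the model.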
Complete Examples
Standard Chat Session
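A plain session with the default SFT model and default sampling (temperature 0.6, top-k 50, per the flags table below); launcher path assumed:

```bash
python -m scripts.chat_cli
```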
High Temperature Creative Mode
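One plausible setting (launcher path assumed): raise the temperature toward 1.0 and widen top-k for more varied output:

```bash
python -m scripts.chat_cli -t 1.0 -k 200
```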
Low Temperature Deterministic Mode
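A sketch (launcher path assumed). Whether the sampler accepts a temperature of exactly 0.0 is an assumption; note that -k 1 alone already makes decoding greedy, so it is a safe companion to a very low temperature:

```bash
python -m scripts.chat_cli -t 0.0 -k 1
```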
Load Specific RL Model
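For example (launcher path assumed; `<step>` is a placeholder for a checkpoint step, per the --step flag in the reference table below):

```bash
python -m scripts.chat_cli -i rl -s <step>
```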
Technical Details
Conversation Format
The CLI maintains conversation state using special tokens:

- `<|user_start|>` and `<|user_end|>` wrap user messages
- `<|assistant_start|>` and `<|assistant_end|>` wrap assistant responses
- The conversation begins with the BOS token
See scripts/chat_cli.py:47-101 for the implementation.
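Put together, the rendering described above can be sketched like this (a string-level illustration only — the real script operates on token ids, and the `BOS` literal shown is a placeholder name, not the tokenizer's actual spelling):

```python
BOS = "<|bos|>"  # placeholder for the tokenizer's BOS token

def render_conversation(turns):
    """Render (role, text) turns into the special-token conversation format."""
    parts = [BOS]
    for role, text in turns:  # role is "user" or "assistant"
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    # leave an assistant turn open for the model to complete
    parts.append("<|assistant_start|>")
    return "".join(parts)
```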
KV Cache Efficiency
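A toy cost model (illustrative only, not the Engine's code) contrasting re-encoding the full history at every step with cached one-token-per-step decoding:

```python
def decode_cost_no_cache(n_new, prompt_len):
    """Tokens encoded when every step re-processes the whole history."""
    return sum(prompt_len + i for i in range(1, n_new + 1))

def decode_cost_with_cache(n_new, prompt_len):
    """Prefill the prompt once, then encode exactly one new token per step."""
    return prompt_len + n_new

# the gap grows quadratically vs. linearly as the conversation lengthens
print(decode_cost_no_cache(100, 50), decode_cost_with_cache(100, 50))  # → 10050 150
```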
The CLI uses the Engine with KV caching for efficient inference: each new token is generated in O(1) time relative to conversation length, rather than re-processing the entire history.

All Flags Reference
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --source | -i | str | sft | Model source: sft or rl |
| --model-tag | -g | str | None | Specific model tag to load |
| --step | -s | int | None | Training step to load |
| --prompt | -p | str | '' | Single prompt mode (non-interactive) |
| --temperature | -t | float | 0.6 | Sampling temperature |
| --top-k | -k | int | 50 | Top-k sampling parameter |
| --device-type | | str | auto | Device: cuda, cpu, or mps |
| --dtype | -d | str | bfloat16 | Precision: float32 or bfloat16 |