The CLI chat interface provides an interactive terminal-based way to chat with your trained NanoChat models.

Basic Usage

Run the chat interface with default settings:
python -m scripts.chat_cli
This loads the most recent SFT model and starts an interactive chat session.

Command-Line Options

Model Selection

# Load from SFT (default) or RL training
python -m scripts.chat_cli -i sft
python -m scripts.chat_cli -i rl

# Load a specific model tag
python -m scripts.chat_cli -g my-model-v2

# Load from a specific training step
python -m scripts.chat_cli -s 10000

Generation Parameters

# Set temperature (default: 0.6)
python -m scripts.chat_cli -t 0.8

# Set top-k sampling (default: 50)
python -m scripts.chat_cli -k 100

Device Configuration

# Auto-detect device (default)
python -m scripts.chat_cli

# Force specific device
python -m scripts.chat_cli --device-type cuda
python -m scripts.chat_cli --device-type cpu
python -m scripts.chat_cli --device-type mps

# Set precision (default: bfloat16)
python -m scripts.chat_cli -d float32
python -m scripts.chat_cli -d bfloat16

Single Prompt Mode

Get a single response without interactive mode:
python -m scripts.chat_cli -p "What is the capital of France?"
This runs the model once and exits after generating the response.
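Single-prompt mode makes the CLI easy to drive from other scripts. The helper below is a hypothetical sketch (not part of nanochat) that builds the documented flags and captures the response from stdout; it assumes you run it from the repo root with a trained model available:

```python
import subprocess
import sys

def build_command(prompt, temperature=0.6, top_k=50):
    # Flags as documented: -p single prompt, -t temperature, -k top-k.
    return [
        sys.executable, "-m", "scripts.chat_cli",
        "-p", prompt,
        "-t", str(temperature),
        "-k", str(top_k),
    ]

def ask(prompt, **kwargs):
    # Run the CLI once and capture the generated response from stdout.
    result = subprocess.run(
        build_command(prompt, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example (requires a trained model and the nanochat repo root as cwd):
# answer = ask("What is the capital of France?", temperature=0.2)
```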

Interactive Commands

When running in interactive mode:
Command         Description
quit or exit    End the conversation and exit
clear           Start a new conversation (clears history)

Complete Examples

Standard Chat Session

python -m scripts.chat_cli

NanoChat Interactive Mode
--------------------------------------------------
Type 'quit' or 'exit' to end the conversation
Type 'clear' to start a new conversation
--------------------------------------------------

User: What is machine learning?

Assistant: Machine learning is a subset of artificial intelligence...

User: clear
Conversation cleared.

User: Tell me a joke

Assistant: Why did the programmer quit his job?...

User: exit
Goodbye!

High Temperature Creative Mode

python -m scripts.chat_cli -t 1.0 -k 100
Higher temperature (1.0) and top-k (100) for more creative, diverse responses.

Low Temperature Deterministic Mode

python -m scripts.chat_cli -t 0.1 -k 20
Lower temperature (0.1) and top-k (20) for more focused, deterministic responses.
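To make the effect of these two knobs concrete, here is a self-contained sketch of temperature plus top-k sampling (illustrative only, not the exact nanochat implementation): top-k truncates the distribution to the k highest-scoring tokens, and temperature rescales the logits before softmax, so low values concentrate probability on the argmax while high values flatten the distribution.

```python
import math
import random

def sample_next_token(logits, temperature=0.6, top_k=50):
    # Keep only the top_k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    ranked = ranked[:top_k]
    # Temperature scales the logits: <1 sharpens, >1 flattens.
    scaled = [logits[i] / temperature for i in ranked]
    # Numerically stable softmax over the truncated set.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # Sample one token id from the truncated, renormalized distribution.
    return random.choices(ranked, weights=weights, k=1)[0]
```

With `top_k=1` or a very low temperature this reduces to greedy decoding (always the highest-scoring token), which is why `-t 0.1 -k 20` gives near-deterministic output.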

Load Specific RL Model

python -m scripts.chat_cli -i rl -g reward-tuned -s 5000
Load the RL model with tag “reward-tuned” at step 5000.

Technical Details

Conversation Format

The CLI maintains conversation state using special tokens:
  • <|user_start|> and <|user_end|> wrap user messages
  • <|assistant_start|> and <|assistant_end|> wrap assistant responses
  • Conversation begins with BOS token
From scripts/chat_cli.py:47-101:
conversation_tokens = [bos]

while True:
    user_input = input("\nUser: ").strip()
    
    # Add User message to the conversation
    conversation_tokens.append(user_start)
    conversation_tokens.extend(tokenizer.encode(user_input))
    conversation_tokens.append(user_end)
    
    # Kick off the assistant
    conversation_tokens.append(assistant_start)
    generate_kwargs = {
        "num_samples": 1,
        "max_tokens": 256,
        "temperature": args.temperature,
        "top_k": args.top_k,
    }
    response_tokens = []
    with autocast_ctx:
        for token_column, token_masks in engine.generate(conversation_tokens, **generate_kwargs):
            token = token_column[0]
            response_tokens.append(token)
            token_text = tokenizer.decode([token])
            print(token_text, end="", flush=True)
    
    if response_tokens[-1] != assistant_end:
        response_tokens.append(assistant_end)
    conversation_tokens.extend(response_tokens)
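The token layout the loop builds can be pictured with a toy string renderer (my own illustration, not nanochat code; the real tokenizer emits single token ids for each special marker, and `<|bos|>` here stands in for whatever BOS token the tokenizer uses):

```python
def render_conversation(turns):
    # Toy view of the conversation layout: BOS, then each turn wrapped
    # in its role's start/end markers. The real CLI operates on token
    # ids, not strings.
    parts = ["<|bos|>"]
    for role, text in turns:
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(parts)

# render_conversation([("user", "Hi"), ("assistant", "Hello!")])
```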

KV Cache Efficiency

The CLI uses the Engine with KV caching for efficient inference. Keys and values for the existing context are cached, so each new token requires a forward pass over just that one token; only the attention over the cached context grows with conversation length, rather than re-processing the entire history at every step.
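A toy cost model (my own sketch, not nanochat code) makes the difference concrete: without a cache, every generated token re-runs attention over the whole causal prefix, while with a cache only the newest query attends over the stored keys.

```python
def attention_reads(context_len, new_tokens, kv_cache=True):
    # Count key/value reads needed to generate `new_tokens` after a
    # prompt of `context_len` tokens (toy model: one layer, one head).
    reads = 0
    for t in range(new_tokens):
        seq_len = context_len + t + 1  # total tokens after this step
        if kv_cache:
            # Only the newest query attends over the cached keys.
            reads += seq_len
        else:
            # Recomputing from scratch: every position attends over
            # its own causal prefix.
            reads += seq_len * (seq_len + 1) // 2
    return reads
```

For a 100-token prompt and 50 generated tokens, the cached count grows linearly per step while the uncached count grows quadratically, which is exactly why long chat histories stay responsive.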

All Flags Reference

Flag           Short  Type   Default   Description
--source       -i     str    sft       Model source: sft or rl
--model-tag    -g     str    None      Specific model tag to load
--step         -s     int    None      Training step to load
--prompt       -p     str    ''        Single prompt mode (non-interactive)
--temperature  -t     float  0.6       Sampling temperature
--top-k        -k     int    50        Top-k sampling parameter
--device-type         str    auto      Device: cuda, cpu, or mps
--dtype        -d     str    bfloat16  Precision: float32 or bfloat16
