The CLI chat interface provides an interactive terminal-based way to chat with your trained NanoChat models.

Basic Usage

Run the chat interface with default settings:
python -m scripts.chat_cli
This loads the most recent SFT model and starts an interactive chat session.

Command-Line Options

Model Selection

# Load from SFT (default) or RL training
python -m scripts.chat_cli -i sft
python -m scripts.chat_cli -i rl

# Load a specific model tag
python -m scripts.chat_cli -g my-model-v2

# Load from a specific training step
python -m scripts.chat_cli -s 10000

Generation Parameters

# Set temperature (default: 0.6)
python -m scripts.chat_cli -t 0.8

# Set top-k sampling (default: 50)
python -m scripts.chat_cli -k 100

Device Configuration

# Auto-detect device (default)
python -m scripts.chat_cli

# Force specific device
python -m scripts.chat_cli --device-type cuda
python -m scripts.chat_cli --device-type cpu
python -m scripts.chat_cli --device-type mps

# Set precision (default: bfloat16)
python -m scripts.chat_cli -d float32
python -m scripts.chat_cli -d bfloat16

Single Prompt Mode

Get a single response without interactive mode:
python -m scripts.chat_cli -p "What is the capital of France?"
This runs the model once and exits after generating the response.
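Single-prompt mode makes the CLI easy to drive from other scripts. The helper below is a hypothetical sketch (not part of nanochat) that builds the documented flags and captures the response from stdout; it assumes you run it from the repo root with a trained model available:

```python
import subprocess
import sys

def build_command(prompt, temperature=0.6, top_k=50):
    # Flags as documented: -p single prompt, -t temperature, -k top-k.
    return [
        sys.executable, "-m", "scripts.chat_cli",
        "-p", prompt,
        "-t", str(temperature),
        "-k", str(top_k),
    ]

def ask(prompt, **kwargs):
    # Run the CLI once and capture the generated response from stdout.
    result = subprocess.run(
        build_command(prompt, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example (requires a trained model and the nanochat repo root as cwd):
# answer = ask("What is the capital of France?", temperature=0.2)
```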

Interactive Commands

When running in interactive mode:
Command         Description
quit or exit    End the conversation and exit
clear           Start a new conversation (clears history)

Complete Examples

Standard Chat Session

python -m scripts.chat_cli

NanoChat Interactive Mode
--------------------------------------------------
Type 'quit' or 'exit' to end the conversation
Type 'clear' to start a new conversation
--------------------------------------------------

User: What is machine learning?

Assistant: Machine learning is a subset of artificial intelligence...

User: clear
Conversation cleared.

User: Tell me a joke

Assistant: Why did the programmer quit his job?...

User: exit
Goodbye!

High Temperature Creative Mode

python -m scripts.chat_cli -t 1.0 -k 100
Higher temperature (1.0) and top-k (100) for more creative, diverse responses.

Low Temperature Deterministic Mode

python -m scripts.chat_cli -t 0.1 -k 20
Lower temperature (0.1) and top-k (20) for more focused, deterministic responses.
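To make the effect of these two knobs concrete, here is a self-contained sketch of temperature plus top-k sampling (illustrative only, not the exact nanochat implementation): top-k truncates the distribution to the k highest-scoring tokens, and temperature rescales the logits before softmax, so low values concentrate probability on the argmax while high values flatten the distribution.

```python
import math
import random

def sample_next_token(logits, temperature=0.6, top_k=50):
    # Keep only the top_k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    ranked = ranked[:top_k]
    # Temperature scales the logits: <1 sharpens, >1 flattens.
    scaled = [logits[i] / temperature for i in ranked]
    # Numerically stable softmax over the truncated set.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # Sample one token id from the truncated, renormalized distribution.
    return random.choices(ranked, weights=weights, k=1)[0]
```

With `top_k=1` or a very low temperature this reduces to greedy decoding (always the highest-scoring token), which is why `-t 0.1 -k 20` gives near-deterministic output.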

Load Specific RL Model

python -m scripts.chat_cli -i rl -g reward-tuned -s 5000
Load the RL model with tag “reward-tuned” at step 5000.

Technical Details

Conversation Format

The CLI maintains conversation state using special tokens:
  • <|user_start|> and <|user_end|> wrap user messages
  • <|assistant_start|> and <|assistant_end|> wrap assistant responses
  • Conversation begins with BOS token
From scripts/chat_cli.py:47-101:
conversation_tokens = [bos]

while True:
    user_input = input("\nUser: ").strip()
    
    # Add User message to the conversation
    conversation_tokens.append(user_start)
    conversation_tokens.extend(tokenizer.encode(user_input))
    conversation_tokens.append(user_end)
    
    # Kick off the assistant
    conversation_tokens.append(assistant_start)
    generate_kwargs = {
        "num_samples": 1,
        "max_tokens": 256,
        "temperature": args.temperature,
        "top_k": args.top_k,
    }
    response_tokens = []
    with autocast_ctx:
        for token_column, token_masks in engine.generate(conversation_tokens, **generate_kwargs):
            token = token_column[0]
            response_tokens.append(token)
            token_text = tokenizer.decode([token])
            print(token_text, end="", flush=True)
    
    if response_tokens[-1] != assistant_end:
        response_tokens.append(assistant_end)
    conversation_tokens.extend(response_tokens)
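The token layout the loop builds can be pictured with a toy string renderer (my own illustration, not nanochat code; the real tokenizer emits single token ids for each special marker, and `<|bos|>` here stands in for whatever BOS token the tokenizer uses):

```python
def render_conversation(turns):
    # Toy view of the conversation layout: BOS, then each turn wrapped
    # in its role's start/end markers. The real CLI operates on token
    # ids, not strings.
    parts = ["<|bos|>"]
    for role, text in turns:
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(parts)

# render_conversation([("user", "Hi"), ("assistant", "Hello!")])
```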

KV Cache Efficiency

The CLI uses the Engine with KV caching for efficient inference. Keys and values for the existing context are cached, so each new token requires a forward pass over just that one token; only the attention over the cached context grows with conversation length, rather than re-processing the entire history at every step.
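A toy cost model (my own sketch, not nanochat code) makes the difference concrete: without a cache, every generated token re-runs attention over the whole causal prefix, while with a cache only the newest query attends over the stored keys.

```python
def attention_reads(context_len, new_tokens, kv_cache=True):
    # Count key/value reads needed to generate `new_tokens` after a
    # prompt of `context_len` tokens (toy model: one layer, one head).
    reads = 0
    for t in range(new_tokens):
        seq_len = context_len + t + 1  # total tokens after this step
        if kv_cache:
            # Only the newest query attends over the cached keys.
            reads += seq_len
        else:
            # Recomputing from scratch: every position attends over
            # its own causal prefix.
            reads += seq_len * (seq_len + 1) // 2
    return reads
```

For a 100-token prompt and 50 generated tokens, the cached count grows linearly per step while the uncached count grows quadratically, which is exactly why long chat histories stay responsive.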

All Flags Reference

Flag           Short  Type   Default   Description
--source       -i     str    sft       Model source: sft or rl
--model-tag    -g     str    None      Specific model tag to load
--step         -s     int    None      Training step to load
--prompt       -p     str    ''        Single prompt mode (non-interactive)
--temperature  -t     float  0.6       Sampling temperature
--top-k        -k     int    50        Top-k sampling parameter
--device-type         str    auto      Device: cuda, cpu, or mps
--dtype        -d     str    bfloat16  Precision: float32 or bfloat16
