Overview
Supervised Fine-Tuning (SFT) adapts a pretrained base model to follow instructions and engage in conversation. The SFT stage:
- Loads a pretrained base model checkpoint
- Trains on a mixture of conversational datasets
- Teaches the model chat format, tool use, and task-specific skills
- Optionally warm-starts the optimizer from pretraining
Quick Start
Single GPU:
Loading Pretrained Models
Model tag to load from base_checkpoints/. If not specified, loads the default model.
Specific checkpoint step to load. If not specified, loads the latest checkpoint.
Warm-start optimizer from pretrained checkpoint (1 = yes, 0 = no). Recommended: keep at 1 to reuse momentum buffers from pretraining.
Training Data Mixture
The default SFT mixture includes:
Data Mixture Parameters
Number of epochs of MMLU in the training mixture. MMLU teaches multiple-choice question answering.
Number of epochs of GSM8K in the training mixture. GSM8K teaches math reasoning and tool use.
Training Horizon
Number of optimization steps. -1 = train for one full epoch through the data mixture.
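The -1 convention can be sketched as follows (the helper name and the token-count bookkeeping are illustrative assumptions, not the actual script's code):

```python
def resolve_num_iterations(num_iterations, mixture_tokens, total_batch_size):
    """Resolve the training horizon: -1 means one full epoch, i.e. enough
    optimization steps to consume the whole data mixture once."""
    if num_iterations == -1:
        # ceil division so a final partial batch still counts as a step
        return (mixture_tokens + total_batch_size - 1) // total_batch_size
    return num_iterations

# e.g. a hypothetical ~100M-token mixture at the typical 524288 tokens/step
print(resolve_num_iterations(-1, 100_000_000, 524288))  # -> 191
```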
Batch Size
By default, SFT inherits batch size settings from the pretrained checkpoint.
Maximum context length. None = inherit from pretrained checkpoint (typically 2048).
Per-device batch size. None = inherit from pretrained checkpoint (typically 32). Reduce if you encounter OOM errors.
Total batch size in tokens across all devices. None = inherit from pretrained checkpoint (typically 524288).
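The three settings above interact through gradient accumulation. A minimal sketch of the arithmetic, using the typical defaults quoted above (the single-GPU `world_size` here is a hypothetical value for illustration):

```python
# Hypothetical setup: the typical defaults above, on a single GPU.
max_seq_len = 2048         # tokens per row
device_batch_size = 32     # rows per device per forward/backward pass
total_batch_size = 524288  # tokens per optimization step, summed over devices
world_size = 1             # number of devices (assumed single GPU)

tokens_per_micro_step = device_batch_size * max_seq_len * world_size
assert total_batch_size % tokens_per_micro_step == 0, "must divide evenly"
grad_accum_steps = total_batch_size // tokens_per_micro_step
print(grad_accum_steps)  # 8 micro-steps accumulated per optimizer step
```

Lowering device_batch_size to dodge an OOM simply raises grad_accum_steps; the effective total batch size in tokens stays the same.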
Learning Rates
By default, SFT inherits learning rates from pretraining and scales them down:
Initial learning rate as fraction of pretrained LRs. Example: if pretraining used matrix_lr=0.02, SFT starts at 0.02 × 0.8 = 0.016.
Learning rate for transformer matrices (Muon optimizer). None = inherit from pretraining (typically 0.02).
Learning rate for input embedding (Adam). None = inherit from pretraining (typically 0.3).
Learning rate for output unembedding (Adam). None = inherit from pretraining (typically 0.004).
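The scaling is a single multiplication, sketched here with the typical values quoted above (the dict layout is illustrative, not the script's actual data structure):

```python
init_lr_frac = 0.8  # initial LR as a fraction of the pretrained LRs

# Typical pretrained values quoted above
pretrained_lrs = {
    "matrix_lr": 0.02,       # transformer matrices (Muon)
    "embedding_lr": 0.3,     # input embedding (Adam)
    "unembedding_lr": 0.004, # output unembedding (Adam)
}

# SFT starts from the pretrained LRs, uniformly scaled down
sft_lrs = {name: lr * init_lr_frac for name, lr in pretrained_lrs.items()}
# matrix_lr: 0.02 * 0.8 = 0.016, embedding_lr: 0.24, unembedding_lr: 0.0032
```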
Learning Rate Schedule
Fraction of iterations for linear LR warmup. 0.0 = no warmup.
Fraction of iterations for linear LR warmdown. 0.5 = decay in last half of training.
Final LR as fraction of initial LR. 0.0 = decay to zero.
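The three schedule knobs combine into a piecewise-linear multiplier on the initial LR. A sketch under the defaults above (exact boundary handling is an assumption):

```python
def lr_multiplier(step, num_iterations,
                  warmup_ratio=0.0, warmdown_ratio=0.5, final_lr_frac=0.0):
    """Sketch of the schedule described above: linear warmup, flat middle,
    then linear warmdown from 1.0 to final_lr_frac over the last
    warmdown_ratio of training."""
    warmup_steps = int(warmup_ratio * num_iterations)
    warmdown_steps = int(warmdown_ratio * num_iterations)
    if warmup_steps > 0 and step < warmup_steps:
        return (step + 1) / warmup_steps
    if warmdown_steps > 0 and step >= num_iterations - warmdown_steps:
        remaining = (num_iterations - step) / warmdown_steps  # 1.0 -> ~0.0
        return final_lr_frac + (1.0 - final_lr_frac) * remaining
    return 1.0

print(lr_multiplier(25, 100))  # flat phase -> 1.0
print(lr_multiplier(75, 100))  # halfway through warmdown -> 0.5
```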
Evaluation
Evaluate validation bits-per-byte every N steps. -1 to disable.
Number of tokens for validation evaluation (default: 40 × 524288).
Evaluate ChatCORE metric every N steps. -1 to disable.
ChatCORE measures performance on:
- ARC-Easy, ARC-Challenge (science Q&A)
- MMLU (knowledge)
- GSM8K (math reasoning)
- HumanEval (code generation)
- SpellingBee (spelling)
Max problems per categorical task (ARC, MMLU) for ChatCORE. -1 = no limit.
Max problems per generative task (GSM8K, HumanEval) for ChatCORE.
Logging
Wandb run name. Set to “dummy” to disable wandb logging.
Device type: cuda, cpu, or mps. Empty string = autodetect.
BOS-Aligned Best-Fit Packing
SFT uses a specialized dataloader that is:
- BOS-aligned: Each row starts with a <|bos|> token (beginning of conversation)
- Best-fit packing: Multiple conversations are packed into each sequence using a best-fit algorithm
- Padding instead of cropping: When no conversation fits, the row is padded (no tokens discarded)
- Target masking: Padding positions have targets set to -1 (ignored by loss)
This gives:
- Maximum token efficiency (no wasted computation)
- No information loss (all training tokens are seen)
- Clean conversation boundaries
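The packing logic above can be sketched as follows. This is an illustrative sketch, not the actual dataloader: the function name, greedy best-fit strategy, and pad id are assumptions, and a real loader would likely work on streaming token buffers.

```python
def pack_best_fit(conversations, seq_len, pad_id=0, ignore_index=-1):
    """Best-fit packing sketch (assumption): place each tokenized conversation
    into the open row with the least leftover space that still fits it; rows
    that cannot fit a conversation are padded out, never cropped, and padding
    positions get target -1 so the loss ignores them."""
    rows = []  # each row begins at a conversation boundary (its <|bos|> token)
    for conv in conversations:
        if len(conv) > seq_len:
            conv = conv[:seq_len]  # degenerate case; a real loader may differ
        # best fit: among rows with enough space, pick the fullest one
        candidates = [r for r in rows if seq_len - len(r) >= len(conv)]
        if candidates:
            min(candidates, key=lambda r: seq_len - len(r)).extend(conv)
        else:
            rows.append(list(conv))
    inputs, targets = [], []
    for row in rows:
        pad = seq_len - len(row)
        inputs.append(row + [pad_id] * pad)
        # targets are next-token ids, with padding positions masked to -1
        targets.append(row[1:] + [ignore_index] * (pad + 1))
    return inputs, targets
```

For example, packing token lists [9,1,2], [9,4], [9,5,6,7,8] at seq_len=5 yields two full rows ([9,1,2,9,4] and [9,5,6,7,8]) with no tokens discarded.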
Example Workflows
Basic SFT on pretrained d12 model
SFT with custom data mixture
SFT with custom learning rates
SFT without optimizer warm-start
SFT with fixed number of iterations
Output
Checkpoints are saved to $NANOCHAT_BASE_DIR/chatsft_checkpoints/{model_tag}/:
- step_{N}_model.pt - Model weights
- step_{N}_optimizer.pt - Optimizer state
- step_{N}_meta.json - Metadata (config, validation loss, ChatCORE scores)
Monitoring
Key metrics logged to console and wandb:
- train/loss - Training loss
- val/bpb - Validation bits per byte
- chatcore_metric - Overall ChatCORE score (centered mean across tasks)
- chatcore_cat - ChatCORE on categorical tasks only (ARC, MMLU)
- chatcore/[task] - Per-task accuracy (ARC-Easy, ARC-Challenge, MMLU, GSM8K, HumanEval, SpellingBee)
- train/epoch - Current epoch through the dataset
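The "centered mean" behind chatcore_metric can be sketched as below. The centering formula and per-task baselines are assumptions about CORE-style scoring (rescale each task so random guessing maps to 0 and perfect accuracy to 1, then average), not details taken from this page:

```python
def centered_mean(task_accs, baselines):
    """CORE-style centering (assumption): map each task's raw accuracy onto a
    0..1 scale where 0 is the random-guess baseline, then average the tasks."""
    centered = [(acc - b) / (1.0 - b) for acc, b in zip(task_accs, baselines)]
    return sum(centered) / len(centered)

# Hypothetical numbers: a 4-way multiple-choice task (0.25 random baseline)
# at 62.5% raw accuracy, and a generative task (0.0 baseline) at 30%.
score = centered_mean([0.625, 0.30], [0.25, 0.0])  # (0.5 + 0.3) / 2
```

Centering keeps a model that merely guesses at multiple-choice tasks from inflating the aggregate score.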
Identity Conversations
The SFT mixture includes synthetic identity conversations from identity_conversations.jsonl. These teach the model:
- Its name and identity
- How to respond to meta questions (“Who are you?”, “Who made you?”)
- Appropriate disclaimers and limitations
The file is located at $NANOCHAT_BASE_DIR/identity_conversations.jsonl.
Optimizer Warm-Starting
When --load-optimizer=1 (default):
- Loads optimizer state from pretrained checkpoint
- Preserves momentum buffers (useful for stability)
- Resets learning rates to SFT values (ignoring pretrained LRs which were warmed down to ~0)
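The three steps above can be sketched as a small helper. This is an illustrative sketch, not the actual script: the state-dict layout mirrors torch's {"state": ..., "param_groups": [...]} convention, and the per-group "name" key is an assumption.

```python
def warm_start_optimizer(pretrained_opt_state, sft_lrs):
    """Keep the pretrained checkpoint's momentum buffers (under "state"),
    but overwrite each param group's LR with the SFT value -- pretraining
    warmed its LRs down to ~0, so they must not be reused."""
    for group in pretrained_opt_state["param_groups"]:
        group["lr"] = sft_lrs[group["name"]]
    return pretrained_opt_state

# Hypothetical loaded state: LR decayed to ~0, momentum buffer intact.
state = {"state": {0: {"momentum": [0.1, -0.2]}},
         "param_groups": [{"name": "matrix_lr", "lr": 1e-6}]}
state = warm_start_optimizer(state, {"matrix_lr": 0.016})
print(state["param_groups"][0]["lr"])  # 0.016
```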
Weight Decay
Note: SFT uses weight_decay=0.0 because:
- Pretraining already ramped weight decay to zero by end of training
- SFT continues with zero weight decay for fine-grained adaptation
- No regularization is needed on the small SFT dataset after strong pretraining