Usage
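The script can be launched on a single device or across GPUs with torchrun. The module path `scripts.chat_sft` and the launcher shown here are assumptions about the repository layout, not confirmed by this page; check the repo's actual entry point.

```shell
# Single GPU (module path is an assumed placeholder)
python -m scripts.chat_sft

# Multi-GPU (e.g. 8 GPUs) via torchrun
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft
```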
Parameters
Logging
Weights & Biases run name. Use 'dummy' to disable wandb logging.
Runtime
Device type: cuda, cpu, or mps. Empty string enables autodetection.
Model Loading
Model tag to load from base checkpoints (e.g. d24).
Model step to load. If not specified, loads the last checkpoint.
Warm-start the optimizer from the pretrained checkpoint. 0 = no, 1 = yes.
Training Horizon
Number of optimization steps. -1 = one full epoch through the training dataset.
Batch Sizes
Defaults are inherited from the pretrained checkpoint if not specified.
Maximum context length. Default: inherit from pretrain.
Per-device batch size. Default: inherit from pretrain.
Total batch size in tokens. Default: inherit from pretrain.
Optimization
Defaults are inherited from the pretrained checkpoint if not specified.
Learning rate for embedding parameters (Adam). Default: inherit from pretrain.
Learning rate for unembedding parameters (Adam). Default: inherit from pretrain.
Learning rate for matrix parameters (Muon). Default: inherit from pretrain.
Initial learning rate as fraction of base learning rate.
Ratio of iterations for learning rate warmup.
Ratio of iterations for learning rate warmdown.
Final learning rate as fraction of initial learning rate.
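Taken together, the four schedule parameters above describe a trapezoid: ramp up linearly over the warmup fraction, hold, then ramp down over the warmdown fraction toward the final fraction. A minimal sketch of how they might compose; the function name and signature are illustrative, not the script's actual API:

```python
def lr_multiplier(step, num_steps, init_frac=1.0, warmup_ratio=0.0,
                  warmdown_ratio=0.2, final_frac=0.0):
    """Multiplier applied to the base learning rate at a given step (sketch).

    init_frac:  initial LR as a fraction of the base LR
    final_frac: final LR as a fraction of the initial LR
    """
    lr0 = init_frac
    lr_final = final_frac * lr0
    warmup = warmup_ratio * num_steps
    warmdown = warmdown_ratio * num_steps
    if warmup > 0 and step < warmup:
        return lr0 * (step + 1) / warmup          # linear warmup
    if warmdown > 0 and step > num_steps - warmdown:
        frac = (num_steps - step) / warmdown      # linear warmdown
        return lr_final + (lr0 - lr_final) * frac
    return lr0                                    # flat middle of the trapezoid
```

With the defaults above, the rate is flat for the first 80% of training and decays linearly to zero over the last 20%.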
Evaluation
Evaluate validation bits-per-byte every N steps. -1 = disabled.
Number of tokens to evaluate validation loss on (default: 40*524288).
Evaluate the ChatCORE metric every N steps. -1 = disabled.
Maximum problems per categorical task for ChatCORE. -1 = all problems.
Maximum problems per generative task for ChatCORE.
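Bits-per-byte normalizes validation loss by the byte length of the underlying text rather than the token count, which makes runs with different tokenizers comparable. The conversion from mean token-level cross-entropy is standard; the helper name below is mine, not the script's:

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """Convert mean cross-entropy (nats per token) to bits per UTF-8 byte.

    total_bytes is the byte length of the text that tokenized to total_tokens.
    """
    total_bits = mean_loss_nats * total_tokens / math.log(2)  # nats -> bits
    return total_bits / total_bytes
```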
Data Mixture
Number of epochs of MMLU in the training mixture (teaches multiple choice).
Number of epochs of GSM8K in the training mixture (teaches math and tool use).
Training Mixture
The SFT script uses a carefully balanced mixture of tasks:
- SmolTalk (460K rows): General conversations
- Identity Conversations (1K rows × 2 epochs): Synthetic identity conversations
- MMLU (100K rows × --mmlu-epochs): Multiple choice questions
- GSM8K (8K rows × --gsm8k-epochs): Math word problems with tool use
- Simple Spelling (200K rows): Basic spelling tasks
- Spelling Bee (80K rows): Character counting tasks
Examples
Basic SFT Training
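A minimal run keeps every default, so batch sizes and learning rates are inherited from the pretrained checkpoint. The module path and launcher are assumptions about the repository layout:

```shell
# All defaults inherit from the base checkpoint
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft
```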
Custom Data Mixture
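Only `--mmlu-epochs` and `--gsm8k-epochs` are named in this document; the entry point is again an assumed placeholder:

```shell
# Emphasize math and tool use: more GSM8K epochs, one MMLU epoch
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft \
  --mmlu-epochs 1 --gsm8k-epochs 4
```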
Override Learning Rate
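The three learning-rate parameters documented above (embedding and unembedding via Adam, matrix via Muon) can be set explicitly instead of inherited. The flag names below are illustrative placeholders, since this page does not give the exact spellings:

```shell
# Flag names are placeholders for the three documented LR parameters
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft \
  --embedding-lr 0.001 --unembedding-lr 0.001 --matrix-lr 0.01
```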
Fixed Number of Iterations
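Instead of the default of one full epoch (-1), the training horizon can be capped at a fixed step count. The flag name is an illustrative placeholder:

```shell
# Train for exactly 500 optimization steps rather than a full epoch
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft --num-iterations 500
```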
ChatCORE Metric
The ChatCORE metric evaluates the chat model across 6 tasks:
- ARC-Easy (categorical)
- ARC-Challenge (categorical)
- MMLU (categorical)
- GSM8K (generative)
- HumanEval (generative)
- SpellingBee (generative)