Usage
Parameters
Logging
Weights & Biases run name. Use
'dummy' to disable wandb logging.

Runtime
Device type:
cuda, cpu, or mps. Empty string enables autodetection.

FP8 Training
Enable FP8 training. Requires H100+ GPU and torchao.
FP8 scaling recipe:
tensorwise (faster, recommended) or rowwise (more accurate but slower).

Model Architecture
Depth of the Transformer model (number of layers).
Model dimension is calculated as depth * aspect_ratio.
Target head dimension for attention.
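A minimal sketch of how these three settings could interact; the variable names and the concrete values (aspect ratio of 64, head dimension of 128) are illustrative assumptions, not the script's actual defaults:

```python
# Illustrative only: model width follows from depth and an aspect ratio,
# and head count is chosen to approximate a target head dimension.
depth = 12             # number of Transformer layers (example value)
aspect_ratio = 64      # assumed example value
target_head_dim = 128  # target attention head dimension (example value)

model_dim = depth * aspect_ratio                # 12 * 64 = 768
n_heads = max(1, model_dim // target_head_dim)  # 768 // 128 = 6
head_dim = model_dim // n_heads                 # 128, matching the target

print(model_dim, n_heads, head_dim)  # 768 6 128
```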
Maximum context length (sequence length).
Sliding window pattern tiled across layers.
L = full context, S = half context (e.g. 'SSL').

Training Horizon
Only one is used, in order of precedence:
Explicit number of optimization steps.
-1 = disabled.
Calculate num_iterations to reach target FLOPs.
-1 = disabled.
Calculate num_iterations to maintain data:param ratio. Chinchilla = 20.
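The precedence might resolve as in this sketch; the function name, parameter names, and the 6*N*D training-FLOPs approximation are assumptions for illustration, not the script's exact code:

```python
# Hedged sketch: explicit steps win, then a FLOPs budget, then a
# data:param ratio. -1 marks a disabled option, as described above.
def resolve_num_iterations(num_iterations, target_flops, target_ratio,
                           num_params, tokens_per_step):
    if num_iterations > 0:            # 1) explicit step count
        return num_iterations
    if target_flops > 0:              # 2) reach a FLOPs budget, using the
        # common ~6 * N * D estimate for training N params on D tokens
        tokens = target_flops / (6 * num_params)
        return int(tokens // tokens_per_step)
    if target_ratio > 0:              # 3) data:param ratio (Chinchilla = 20)
        tokens = target_ratio * num_params
        return int(tokens // tokens_per_step)
    raise ValueError("no training horizon specified")

# e.g. Chinchilla ratio for a 124M-param model at 524288 tokens/step:
print(resolve_num_iterations(-1, -1, 20, 124_000_000, 524288))  # 4730
```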
-1 = disabled.

Optimization
Per-device batch size. Reduce to 16, 8, 4, etc. if you encounter OOM errors.
Total batch size in tokens. Good value: 524288.
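The total token batch is typically assembled from the per-device batch via gradient accumulation; a sketch under assumed example values (sequence length, world size, and variable names are illustrative):

```python
# Sketch: how many micro-steps of accumulation are needed to reach the
# total token batch. Reducing device_batch_size to dodge OOM errors
# simply raises grad_accum_steps; the effective batch stays the same.
total_batch_size = 524288   # tokens per optimization step
device_batch_size = 32      # sequences per device per micro-step
max_seq_len = 2048          # tokens per sequence (example value)
world_size = 1              # number of GPUs (example value)

tokens_per_micro_step = device_batch_size * max_seq_len * world_size
assert total_batch_size % tokens_per_micro_step == 0, "must divide evenly"
grad_accum_steps = total_batch_size // tokens_per_micro_step
print(grad_accum_steps)  # 524288 // (32 * 2048 * 1) = 8
```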
-1 = auto-compute optimal.
Learning rate for embedding parameters (Adam).
Learning rate for unembedding parameters (Adam).
Cautious weight decay for the Muon optimizer (for weights).
Learning rate for matrix parameters (Muon).
Learning rate for scalars (resid_lambdas, x0_lambdas).
Adam beta1 for embedding/unembedding.
Adam beta2 for embedding/unembedding.
Ratio of iterations for learning rate warmup.
Ratio of iterations for learning rate warmdown.
Final learning rate as fraction of initial learning rate.
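A sketch of the schedule these three ratios describe: linear warmup, a constant middle phase, then linear warmdown to the final fraction. The exact shape and defaults are assumptions; the script may differ:

```python
# Hedged sketch of a warmup / constant / warmdown LR multiplier.
def lr_multiplier(step, num_iterations, warmup_ratio=0.0,
                  warmdown_ratio=0.2, final_lr_frac=0.0):
    warmup = int(warmup_ratio * num_iterations)
    warmdown = int(warmdown_ratio * num_iterations)
    if warmup and step < warmup:               # linear warmup from ~0 to 1
        return (step + 1) / warmup
    if step >= num_iterations - warmdown:      # linear warmdown
        frac = (num_iterations - step) / warmdown
        return frac + (1 - frac) * final_lr_frac
    return 1.0                                 # constant in between

print(lr_multiplier(500, 1000))  # mid-training: 1.0
print(lr_multiplier(999, 1000))  # last step: near final_lr_frac
```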
Resume training from this step.
-1 = disabled.

Evaluation
Evaluate validation bits-per-byte every N steps.
-1 = disabled.
Number of tokens to evaluate validation loss on (default: 40*524288).
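Bits-per-byte can be derived from the mean next-token cross-entropy loss: convert nats to bits, then normalize by how many bytes a token covers on average. The normalization constant below is an assumed example, not the script's measured value:

```python
import math

# Hedged sketch of a bits-per-byte conversion from mean loss in nats.
def bits_per_byte(mean_loss_nats, avg_bytes_per_token=4.0):
    bits_per_token = mean_loss_nats / math.log(2)  # nats -> bits
    return bits_per_token / avg_bytes_per_token

# A loss of ln(16) nats is 4 bits/token; at 4 bytes/token that is
# 1 bit/byte.
print(round(bits_per_byte(math.log(16)), 4))  # 1.0
```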
Evaluate CORE metric every N steps.
-1 = disabled.
Maximum examples per task for CORE metric.
Sample from model every N steps.
-1 = disabled.
Save checkpoints every N steps.
-1 = only save at end.

Output
Override model tag for checkpoint directory name. If not provided, defaults to
d{depth} (e.g. d12).