All hyperparameters are set via environment variables and read by the Hyperparameters class at startup. The default values reflect the simple baseline configuration: 9 transformer blocks at width 512, 8 attention heads with 4 KV heads (GQA), 2x MLP expansion, vocab size 1024, sequence length 1024, tied embeddings, and a ~10-minute wallclock cap.
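The pattern is a standard env-var-backed config object. A minimal sketch of how such a class might read its values (illustrative only — the field names mirror the tables below, but the actual Hyperparameters class may differ):

```python
import os
import uuid
from dataclasses import dataclass, field


def _env(name, default, cast=str):
    """Read an env var, falling back to the documented default."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default


@dataclass
class Hyperparameters:
    # Data paths
    data_path: str = field(default_factory=lambda: _env("DATA_PATH", "./data/datasets/fineweb10B_sp1024"))
    run_id: str = field(default_factory=lambda: _env("RUN_ID", str(uuid.uuid4())))
    seed: int = field(default_factory=lambda: _env("SEED", 1337, int))
    # Model shape
    num_layers: int = field(default_factory=lambda: _env("NUM_LAYERS", 9, int))
    model_dim: int = field(default_factory=lambda: _env("MODEL_DIM", 512, int))
    # Training length
    max_wallclock_seconds: float = field(default_factory=lambda: _env("MAX_WALLCLOCK_SECONDS", 600.0, float))
```

Because the values are read at instantiation, overriding any parameter is just a matter of exporting the env var before launching the script.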
Data Paths
These parameters tell the training script where to find the tokenized dataset shards and the SentencePiece tokenizer model.
| Parameter | Env Var | Default | Description |
|---|---|---|---|
| data_path | DATA_PATH | ./data/datasets/fineweb10B_sp1024 | Root directory for dataset shards. Train and val glob patterns are derived from this path. |
| tokenizer_path | TOKENIZER_PATH | ./data/tokenizers/fineweb_1024_bpe.model | Path to the SentencePiece .model file used to compute BPB during validation. Must match VOCAB_SIZE. |
| run_id | RUN_ID | Random UUID | Identifier for this run. Used as the log filename under logs/. |
| seed | SEED | 1337 | Global random seed applied to Python, NumPy, and PyTorch RNGs before training begins. |
train_files and val_files are derived from data_path using the glob patterns fineweb_train_*.bin and fineweb_val_*.bin respectively. They are not independently configurable via environment variables.
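The derivation described above can be sketched as follows (the helper name is illustrative; the glob patterns are the fixed ones stated above):

```python
import os
from glob import glob


def derive_file_lists(data_path: str):
    """Derive shard lists from data_path using the fixed glob patterns.

    These lists are not independently configurable via env vars.
    """
    train_files = sorted(glob(os.path.join(data_path, "fineweb_train_*.bin")))
    val_files = sorted(glob(os.path.join(data_path, "fineweb_val_*.bin")))
    return train_files, val_files
```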
Validation
Validation always runs on the full fineweb_val_* split (the fixed first-50k-document set). These parameters control how often validation is computed and how many tokens are processed per validation pass.
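Validation loss is reported as bits per byte (BPB), which normalizes token-level cross-entropy by the raw byte length of the text so results are comparable across tokenizers. A sketch of the conversion under the usual definition (an assumption here — the script's exact bookkeeping may differ), taking the mean per-token loss in nats:

```python
import math


def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) into bits per byte."""
    total_bits = mean_loss_nats * total_tokens / math.log(2)  # nats -> bits, summed over tokens
    return total_bits / total_bytes
```

This is why the tokenizer model must match VOCAB_SIZE: the byte count comes from detokenized text, so a mismatched tokenizer would corrupt the metric.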
| Parameter | Env Var | Default | Description |
|---|---|---|---|
| val_batch_size | VAL_BATCH_SIZE | 524288 | Total token budget across all ranks for each validation batch. Must supply at least one full sequence per rank. |
| val_loss_every | VAL_LOSS_EVERY | 1000 | Run validation every N training steps. Set to 0 to skip periodic validation (final eval still runs). |
| train_log_every | TRAIN_LOG_EVERY | 200 | Print a training loss log line every N steps. Steps 1–10 are always logged. |
Training Length
These parameters jointly determine how long training runs. The wallclock cap (MAX_WALLCLOCK_SECONDS) takes effect as soon as the elapsed training time exceeds the threshold, triggering an early stop after the current step completes.
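The stopping rule can be sketched as a check evaluated after each completed step (function name illustrative):

```python
import time


def should_stop(step: int, start_time: float, iterations: int, max_wallclock_seconds: float) -> bool:
    """Return True once either the step budget or the wallclock cap is exhausted."""
    if step >= iterations:
        return True
    # A cap of 0 disables the wallclock limit entirely.
    if max_wallclock_seconds > 0 and time.time() - start_time > max_wallclock_seconds:
        return True
    return False
```

Because the check runs after the step completes, a run always finishes the step it was in when the cap fired.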
| Parameter | Env Var | Default | Description |
|---|---|---|---|
| iterations | ITERATIONS | 20000 | Maximum number of gradient update steps. Training stops when this is reached or the wallclock cap fires, whichever comes first. |
| warmdown_iters | WARMDOWN_ITERS | 1200 | Number of steps (or equivalent wallclock time) over which the learning rate linearly decays to zero at the end of training. |
| warmup_steps | WARMUP_STEPS | 20 | Number of “warmup” steps that prime the compiled forward/backward/optimizer kernels before actual training. Model and optimizer state are reset after warmup. |
| train_batch_tokens | TRAIN_BATCH_TOKENS | 524288 | Total tokens consumed per gradient update across all ranks. Gradient accumulation steps are derived as 8 // world_size. |
| train_seq_len | TRAIN_SEQ_LEN | 1024 | Sequence length used for both training and validation batches. |
| max_wallclock_seconds | MAX_WALLCLOCK_SECONDS | 600.0 | Hard cap on training time in seconds. Set to 0 to disable and let ITERATIONS alone determine when training ends. |
| qk_gain_init | QK_GAIN_INIT | 1.5 | Initial value for the per-head learnable q_gain parameter in attention. Scales query vectors before the dot product. |
For a quick smoke test, override MAX_WALLCLOCK_SECONDS=60 ITERATIONS=500 VAL_LOSS_EVERY=0 so training exits quickly and only prints the final validation metrics.
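The batch arithmetic implied by the table can be sketched as follows (assuming the fixed 8 // world_size gradient-accumulation rule stated above; the function name is illustrative):

```python
def derive_batch_config(train_batch_tokens: int, train_seq_len: int, world_size: int):
    """Decompose the global token budget into per-rank micro-batches."""
    grad_accum_steps = 8 // world_size
    # Tokens processed by one rank in one micro-batch (one forward/backward pass).
    micro_batch_tokens = train_batch_tokens // (world_size * grad_accum_steps)
    sequences_per_micro_batch = micro_batch_tokens // train_seq_len
    return grad_accum_steps, micro_batch_tokens, sequences_per_micro_batch
```

With the defaults (524288 tokens, sequence length 1024), a single GPU does 8 accumulation steps of 64 sequences each, while 8 GPUs do 1 step of 64 sequences per rank.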
Model Shape
These parameters define the architecture of the GPT model. Changing them affects both model quality and the number of parameters, which directly impacts the compressed artifact size.
| Parameter | Env Var | Default | Description |
|---|---|---|---|
| vocab_size | VOCAB_SIZE | 1024 | Vocabulary size. Must exactly match the SentencePiece tokenizer’s vocab size, or training will raise an error. |
| num_layers | NUM_LAYERS | 9 | Total number of transformer blocks. Split evenly into encoder and decoder halves for the U-Net-style skip connections. |
| num_kv_heads | NUM_KV_HEADS | 4 | Number of key/value heads for Grouped Query Attention (GQA). Must divide NUM_HEADS. |
| model_dim | MODEL_DIM | 512 | Hidden dimension (embedding width) of the model. Must be divisible by NUM_HEADS. |
| num_heads | NUM_HEADS | 8 | Number of query attention heads. Head dimension is MODEL_DIM // NUM_HEADS. |
| mlp_mult | MLP_MULT | 2 | MLP hidden-layer multiplier. The feedforward hidden size is MLP_MULT * MODEL_DIM. Uses a relu² activation. |
| tie_embeddings | TIE_EMBEDDINGS | 1 (true) | Whether to tie the input token embedding and the output projection weight. Set to 0 to use a separate lm_head. |
| rope_base | ROPE_BASE | 10000.0 | Base frequency for Rotary Position Embeddings (RoPE). |
| logit_softcap | LOGIT_SOFTCAP | 30.0 | Logit soft-cap value. Logits are passed through softcap * tanh(logits / softcap) before the cross-entropy loss. Must be positive. |
If TIE_EMBEDDINGS=0, a separate lm_head is initialized with zeros and trained with HEAD_LR via Adam. With tied embeddings, the token embedding is trained with TIED_EMBED_LR instead of EMBED_LR.
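The soft-cap formula from the table is simple enough to state directly. A minimal scalar sketch (the real implementation operates on tensors, but the math is identical):

```python
import math


def softcap_logits(logits, softcap: float = 30.0):
    """Smoothly bound each logit to the open interval (-softcap, softcap)."""
    return [softcap * math.tanh(x / softcap) for x in logits]
```

Near zero, tanh is approximately the identity, so small logits pass through almost unchanged; only large-magnitude logits are squashed toward ±softcap.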
Optimizer
The training script uses a mixed optimizer strategy:
- Token embedding (and tied lm_head): Adam with TIED_EMBED_LR (tied) or EMBED_LR (untied)
- Untied lm_head (when TIE_EMBEDDINGS=0): Adam with HEAD_LR
- 2D matrix parameters in transformer blocks: Muon with MATRIX_LR
- Scalars and vectors in transformer blocks: Adam with SCALAR_LR
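The routing above can be sketched as a split on parameter name and tensor rank (a simplification of the actual partitioning; the names and return strings here are illustrative, not the script's real group labels):

```python
def assign_optimizer(name: str, ndim: int, tie_embeddings: bool) -> str:
    """Route a parameter to its optimizer group, following the rules above."""
    if name == "embed":
        return "adam(TIED_EMBED_LR)" if tie_embeddings else "adam(EMBED_LR)"
    if name == "lm_head":
        return "adam(HEAD_LR)"  # only exists when TIE_EMBEDDINGS=0
    if ndim == 2:
        return "muon(MATRIX_LR)"  # 2D matrices in transformer blocks
    return "adam(SCALAR_LR)"  # norms, gains, biases, and other scalars/vectors
```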
| Parameter | Env Var | Default | Description |
|---|---|---|---|
| embed_lr | EMBED_LR | 0.6 | Adam learning rate for the token embedding when TIE_EMBEDDINGS=0. |
| head_lr | HEAD_LR | 0.008 | Adam learning rate for the untied lm_head when TIE_EMBEDDINGS=0. |
| tied_embed_lr | TIED_EMBED_LR | 0.05 | Adam learning rate for the token embedding when TIE_EMBEDDINGS=1. |
| tied_embed_init_std | TIED_EMBED_INIT_STD | 0.005 | Standard deviation for normal initialization of the tied embedding weight. |
| matrix_lr | MATRIX_LR | 0.04 | Muon learning rate for 2D (matrix) parameters in transformer blocks. |
| scalar_lr | SCALAR_LR | 0.04 | Adam learning rate for scalar/vector parameters (norms, scales, biases) in transformer blocks. |
| muon_momentum | MUON_MOMENTUM | 0.95 | Steady-state momentum for the Muon optimizer. |
| muon_backend_steps | MUON_BACKEND_STEPS | 5 | Number of Newton-Schulz iterations used to orthogonalize gradients in Muon. |
| muon_momentum_warmup_start | MUON_MOMENTUM_WARMUP_START | 0.85 | Initial Muon momentum value at step 0, linearly warmed up to MUON_MOMENTUM. |
| muon_momentum_warmup_steps | MUON_MOMENTUM_WARMUP_STEPS | 500 | Number of steps over which Muon momentum is linearly warmed from MUON_MOMENTUM_WARMUP_START to MUON_MOMENTUM. |
| beta1 | BETA1 | 0.9 | Adam β₁ (first-moment decay). Shared by all Adam optimizer groups. |
| beta2 | BETA2 | 0.95 | Adam β₂ (second-moment decay). Shared by all Adam optimizer groups. |
| adam_eps | ADAM_EPS | 1e-8 | Adam numerical stability epsilon. Shared by all Adam optimizer groups. |
| grad_clip_norm | GRAD_CLIP_NORM | 0.0 | Global gradient norm clip threshold. Set to 0.0 to disable gradient clipping. |
All learning rates are scaled by the same lr_mul schedule factor, which ramps down to zero during warmdown. The base_lr for each parameter group stores the unscaled learning rate so the schedule can be applied multiplicatively each step.
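The two schedules described in this section — the shared lr_mul factor and the Muon momentum warmup — can be sketched as follows (assuming simple linear interpolation, which matches the descriptions above; function names are illustrative):

```python
def lr_mul(step: int, iterations: int, warmdown_iters: int) -> float:
    """Schedule factor applied to every group's base_lr: flat, then linear decay to 0."""
    remaining = iterations - step
    if remaining >= warmdown_iters:
        return 1.0
    return remaining / warmdown_iters


def muon_momentum_at(step: int, start: float = 0.85, final: float = 0.95, warmup_steps: int = 500) -> float:
    """Linearly warm Muon momentum from start to final over the first warmup_steps steps."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (final - start)
```

With the defaults, lr_mul stays at 1.0 for the first 18,800 steps, then ramps to 0 over the final 1,200; momentum reaches its steady-state 0.95 after 500 steps.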