All hyperparameters are set via environment variables and read by the Hyperparameters class at startup. The default values reflect the simple baseline configuration: 9 transformer blocks at width 512, 8 attention heads with 4 KV heads (GQA), 2x MLP expansion, vocab size 1024, sequence length 1024, tied embeddings, and a ~10-minute wallclock cap.
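The env-var-driven pattern can be sketched as follows. The class name `Hyperparameters` comes from the text above; the helper, field names, and exact layout are assumptions for illustration, not the script's actual code.

```python
import os
from dataclasses import dataclass, field

def _env(name, default, cast):
    """Read an environment variable, falling back to the documented default."""
    return cast(os.environ.get(name, default))

@dataclass
class Hyperparameters:
    # A few representative fields; defaults match the baseline configuration.
    model_dim: int = field(default_factory=lambda: _env("MODEL_DIM", 512, int))
    num_heads: int = field(default_factory=lambda: _env("NUM_HEADS", 8, int))
    num_kv_heads: int = field(default_factory=lambda: _env("NUM_KV_HEADS", 4, int))
    vocab_size: int = field(default_factory=lambda: _env("VOCAB_SIZE", 1024, int))
    max_wallclock_seconds: float = field(
        default_factory=lambda: _env("MAX_WALLCLOCK_SECONDS", 600.0, float))
```

Using `default_factory` means the environment is read when the instance is created, so overrides exported before startup take effect.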

## Data Paths

These parameters tell the training script where to find the tokenized dataset shards and the SentencePiece tokenizer model.
| Parameter | Env Var | Default | Description |
| --- | --- | --- | --- |
| `data_path` | `DATA_PATH` | `./data/datasets/fineweb10B_sp1024` | Root directory for dataset shards. Train and val glob patterns are derived from this path. |
| `tokenizer_path` | `TOKENIZER_PATH` | `./data/tokenizers/fineweb_1024_bpe.model` | Path to the SentencePiece `.model` file used to compute BPB during validation. Must match `VOCAB_SIZE`. |
| `run_id` | `RUN_ID` | Random UUID | Identifier for this run. Used as the log filename under `logs/`. |
| `seed` | `SEED` | `1337` | Global random seed applied to Python, NumPy, and PyTorch RNGs before training begins. |
`train_files` and `val_files` are derived from `data_path` using the glob patterns `fineweb_train_*.bin` and `fineweb_val_*.bin` respectively. They are not independently configurable via environment variables.
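The derivation can be sketched with Python's `glob` module (the function name is illustrative; the patterns match those documented above):

```python
import os
from glob import glob

def derive_file_lists(data_path):
    """Derive train/val shard lists from DATA_PATH using the fixed glob patterns."""
    train_files = sorted(glob(os.path.join(data_path, "fineweb_train_*.bin")))
    val_files = sorted(glob(os.path.join(data_path, "fineweb_val_*.bin")))
    return train_files, val_files
```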

## Validation

Validation always runs on the full fineweb_val_* split (the fixed first-50k-document set). These parameters control how often validation is computed and how many tokens are processed per validation pass.
| Parameter | Env Var | Default | Description |
| --- | --- | --- | --- |
| `val_batch_size` | `VAL_BATCH_SIZE` | `524288` | Total token budget across all ranks for each validation batch. Must supply at least one full sequence per rank. |
| `val_loss_every` | `VAL_LOSS_EVERY` | `1000` | Run validation every N training steps. Set to `0` to skip periodic validation (final eval still runs). |
| `train_log_every` | `TRAIN_LOG_EVERY` | `200` | Print a training loss log line every N steps. Steps 1–10 are always logged. |
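The per-rank budget implied by `VAL_BATCH_SIZE` can be checked with a few lines of arithmetic. A world size of 8 is an illustrative assumption here, not a documented default:

```python
def val_sequences_per_rank(val_batch_size=524288, world_size=8, seq_len=1024):
    """Sequences each rank processes per validation batch.
    The documented constraint: at least one full sequence per rank."""
    per_rank_tokens = val_batch_size // world_size
    assert per_rank_tokens >= seq_len, \
        "VAL_BATCH_SIZE must supply at least one full sequence per rank"
    return per_rank_tokens // seq_len
```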

## Training Length

These parameters jointly determine how long training runs. The wallclock cap (MAX_WALLCLOCK_SECONDS) takes effect as soon as the elapsed training time exceeds the threshold, triggering an early stop after the current step completes.
| Parameter | Env Var | Default | Description |
| --- | --- | --- | --- |
| `iterations` | `ITERATIONS` | `20000` | Maximum number of gradient update steps. Training stops when this is reached or the wallclock cap fires, whichever comes first. |
| `warmdown_iters` | `WARMDOWN_ITERS` | `1200` | Number of steps (or equivalent wallclock time) over which the learning rate linearly decays to zero at the end of training. |
| `warmup_steps` | `WARMUP_STEPS` | `20` | Number of “warmup” steps that prime the compiled forward/backward/optimizer kernels before actual training. Model and optimizer state are reset after warmup. |
| `train_batch_tokens` | `TRAIN_BATCH_TOKENS` | `524288` | Total tokens consumed per gradient update across all ranks. Gradient accumulation steps are derived as `8 // world_size`. |
| `train_seq_len` | `TRAIN_SEQ_LEN` | `1024` | Sequence length used for both training and validation batches. |
| `max_wallclock_seconds` | `MAX_WALLCLOCK_SECONDS` | `600.0` | Hard cap on training time in seconds. Set to `0` to disable and let `ITERATIONS` alone determine when training ends. |
| `qk_gain_init` | `QK_GAIN_INIT` | `1.5` | Initial value for the per-head learnable `q_gain` parameter in attention. Scales query vectors before the dot product. |
For a quick smoke test, override `MAX_WALLCLOCK_SECONDS=60 ITERATIONS=500 VAL_LOSS_EVERY=0` so training exits quickly and only prints the final validation metrics.
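The interaction between `ITERATIONS` and `MAX_WALLCLOCK_SECONDS` can be sketched as a loop that checks the cap after each step completes. This is a hypothetical structure, not the actual script:

```python
import time

def run_training(step_fn, iterations, max_wallclock_seconds):
    """Run up to `iterations` steps; stop early once the wallclock cap is
    exceeded. The check happens after a step, so the in-flight step finishes."""
    start = time.monotonic()
    step = 0
    for step in range(1, iterations + 1):
        step_fn(step)
        elapsed = time.monotonic() - start
        if max_wallclock_seconds > 0 and elapsed > max_wallclock_seconds:
            break  # early stop: cap fired; setting the cap to 0 disables it
    return step
```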

## Model Shape

These parameters define the architecture of the GPT model. Changing them affects both model quality and the number of parameters, which directly impacts the compressed artifact size.
| Parameter | Env Var | Default | Description |
| --- | --- | --- | --- |
| `vocab_size` | `VOCAB_SIZE` | `1024` | Vocabulary size. Must exactly match the SentencePiece tokenizer’s vocab size, or training will raise an error. |
| `num_layers` | `NUM_LAYERS` | `9` | Total number of transformer blocks. Split evenly into encoder and decoder halves for the U-Net-style skip connections. |
| `num_kv_heads` | `NUM_KV_HEADS` | `4` | Number of key/value heads for Grouped Query Attention (GQA). Must divide `NUM_HEADS`. |
| `model_dim` | `MODEL_DIM` | `512` | Hidden dimension (embedding width) of the model. Must be divisible by `NUM_HEADS`. |
| `num_heads` | `NUM_HEADS` | `8` | Number of query attention heads. Head dimension is `MODEL_DIM // NUM_HEADS`. |
| `mlp_mult` | `MLP_MULT` | `2` | MLP hidden-layer multiplier. The feedforward hidden size is `MLP_MULT * MODEL_DIM`. Uses a relu² activation. |
| `tie_embeddings` | `TIE_EMBEDDINGS` | `1` (true) | Whether to tie the input token embedding and the output projection weight. Set to `0` to use a separate `lm_head`. |
| `rope_base` | `ROPE_BASE` | `10000.0` | Base frequency for Rotary Position Embeddings (RoPE). |
| `logit_softcap` | `LOGIT_SOFTCAP` | `30.0` | Logit soft-cap value. Logits are passed through `softcap * tanh(logits / softcap)` before the cross-entropy loss. Must be positive. |
If `TIE_EMBEDDINGS=0`, a separate `lm_head` is initialized with zeros and trained with `HEAD_LR` via Adam. With tied embeddings, the token embedding is trained with `TIED_EMBED_LR` instead of `EMBED_LR`.

## Optimizer

The training script uses a mixed optimizer strategy:
  • Token embedding (and tied `lm_head`): Adam with `TIED_EMBED_LR` or `EMBED_LR`
  • Untied `lm_head` (when `TIE_EMBEDDINGS=0`): Adam with `HEAD_LR`
  • 2D matrix parameters in transformer blocks: Muon with `MATRIX_LR`
  • Scalars and vectors in transformer blocks: Adam with `SCALAR_LR`
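The grouping rule above can be sketched as a mapping from a parameter's name and tensor rank to an optimizer group. The parameter names used here are illustrative, not the script's actual names:

```python
def assign_optimizer(name, ndim, tie_embeddings=True):
    """Map a parameter to its optimizer group per the documented strategy."""
    if "embed" in name:  # token embedding (tied lm_head shares this weight)
        return "adam/tied_embed_lr" if tie_embeddings else "adam/embed_lr"
    if "lm_head" in name:  # untied output projection
        return "adam/head_lr"
    # transformer-block parameters: matrices to Muon, the rest to Adam
    return "muon/matrix_lr" if ndim == 2 else "adam/scalar_lr"
```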
| Parameter | Env Var | Default | Description |
| --- | --- | --- | --- |
| `embed_lr` | `EMBED_LR` | `0.6` | Adam learning rate for the token embedding when `TIE_EMBEDDINGS=0`. |
| `head_lr` | `HEAD_LR` | `0.008` | Adam learning rate for the untied `lm_head` when `TIE_EMBEDDINGS=0`. |
| `tied_embed_lr` | `TIED_EMBED_LR` | `0.05` | Adam learning rate for the token embedding when `TIE_EMBEDDINGS=1`. |
| `tied_embed_init_std` | `TIED_EMBED_INIT_STD` | `0.005` | Standard deviation for normal initialization of the tied embedding weight. |
| `matrix_lr` | `MATRIX_LR` | `0.04` | Muon learning rate for 2D (matrix) parameters in transformer blocks. |
| `scalar_lr` | `SCALAR_LR` | `0.04` | Adam learning rate for scalar/vector parameters (norms, scales, biases) in transformer blocks. |
| `muon_momentum` | `MUON_MOMENTUM` | `0.95` | Steady-state momentum for the Muon optimizer. |
| `muon_backend_steps` | `MUON_BACKEND_STEPS` | `5` | Number of Newton-Schulz iterations used to orthogonalize gradients in Muon. |
| `muon_momentum_warmup_start` | `MUON_MOMENTUM_WARMUP_START` | `0.85` | Initial Muon momentum value at step 0, linearly warmed up to `MUON_MOMENTUM`. |
| `muon_momentum_warmup_steps` | `MUON_MOMENTUM_WARMUP_STEPS` | `500` | Number of steps over which Muon momentum is linearly warmed from `MUON_MOMENTUM_WARMUP_START` to `MUON_MOMENTUM`. |
| `beta1` | `BETA1` | `0.9` | Adam β₁ (first-moment decay). Shared by all Adam optimizer groups. |
| `beta2` | `BETA2` | `0.95` | Adam β₂ (second-moment decay). Shared by all Adam optimizer groups. |
| `adam_eps` | `ADAM_EPS` | `1e-8` | Adam numerical stability epsilon. Shared by all Adam optimizer groups. |
| `grad_clip_norm` | `GRAD_CLIP_NORM` | `0.0` | Global gradient norm clip threshold. Set to `0.0` to disable gradient clipping. |
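The Muon momentum warmup is a simple linear interpolation; a sketch of the documented schedule, using the defaults above:

```python
def muon_momentum(step, warmup_steps=500, warmup_start=0.85, momentum=0.95):
    """Linearly warm Muon momentum from `warmup_start` at step 0 to `momentum`
    at `warmup_steps`, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return warmup_start + frac * (momentum - warmup_start)
```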
All learning rates are scaled by the same `lr_mul` schedule factor, which ramps down to zero during warmdown. The `base_lr` for each parameter group stores the unscaled learning rate so the schedule can be applied multiplicatively each step.
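The schedule factor can be sketched as a function of the step count. The exact form is an assumption consistent with the descriptions above: constant at 1.0, then a linear ramp to zero over the final `WARMDOWN_ITERS` steps:

```python
def lr_mul(step, iterations=20000, warmdown_iters=1200):
    """Multiplicative LR schedule factor: 1.0 for most of training, then a
    linear decay to 0.0 over the last `warmdown_iters` steps."""
    remaining = iterations - step
    if remaining >= warmdown_iters:
        return 1.0
    return max(remaining / warmdown_iters, 0.0)
```

Each step, every group's effective learning rate would then be `base_lr * lr_mul(step)`.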