All configuration in train_gpt.py is driven by environment variables. This page is a consolidated reference organized by subsystem.

Quick-Start Example

RUN_ID=my_experiment \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
NUM_LAYERS=12 \
MODEL_DIM=512 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Training Hyperparameters

All variables in this section are read by the Hyperparameters class at process startup. Unset variables fall back to the listed defaults.

Data Paths

DATA_PATH
string
default:"./data/datasets/fineweb10B_sp1024"
Root directory for tokenized dataset shards. Train and val glob patterns (fineweb_train_*.bin and fineweb_val_*.bin) are derived from this path.
TOKENIZER_PATH
string
default:"./data/tokenizers/fineweb_1024_bpe.model"
Path to the SentencePiece .model file. Used to build look-up tables for the tokenizer-agnostic BPB metric. Must match VOCAB_SIZE exactly or training raises an error.
RUN_ID
string
default:"random UUID"
Human-readable identifier for this run. Determines the log filename at logs/<RUN_ID>.txt.
SEED
integer
default:"1337"
Global random seed applied to Python, NumPy, and PyTorch (including cuda.manual_seed_all) before training.

Validation

VAL_BATCH_SIZE
integer
default:"524288"
Total token budget across all ranks per validation pass. Must provide at least one full TRAIN_SEQ_LEN-length sequence per rank.
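The per-rank minimum follows directly from the rule above. A small illustrative helper (not part of train_gpt.py) using the defaults:

```python
def min_val_batch_size(world_size=8, seq_len=1024):
    """Smallest legal VAL_BATCH_SIZE: one full TRAIN_SEQ_LEN sequence per rank."""
    return world_size * seq_len

# With 8 ranks and TRAIN_SEQ_LEN=1024, the floor is 8192 tokens,
# well below the 524288-token default.
```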
VAL_LOSS_EVERY
integer
default:"1000"
Run validation every N training steps. Set to 0 to disable periodic validation (final evaluation still runs at the end).
TRAIN_LOG_EVERY
integer
default:"200"
Log a train_loss line every N steps. Steps 1–10 are always logged regardless of this setting.

Training Length

ITERATIONS
integer
default:"20000"
Maximum number of gradient update steps before training stops. The wallclock cap may cause an earlier stop.
WARMDOWN_ITERS
integer
default:"1200"
Number of steps (or equivalent wallclock duration) over which the learning rate linearly decays to zero at the end of training.
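As a sketch of the implied schedule, assuming the common constant-then-linear-decay shape (the script's exact implementation may differ):

```python
def lr_multiplier(step, total_iters=20000, warmdown_iters=1200):
    """Assumed schedule: constant LR, then a linear ramp to zero
    over the final warmdown_iters steps."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```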
WARMUP_STEPS
integer
default:"20"
Number of throwaway warmup steps run before training begins to prime compiled kernels. Model and optimizer state are fully reset after warmup completes, so effective training always starts from the true initialization.
TRAIN_BATCH_TOKENS
integer
default:"524288"
Total tokens consumed per gradient update across all ranks. Gradient accumulation steps = 8 // WORLD_SIZE.
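The per-rank micro-batch geometry follows from the accumulation rule above. A worked example with the defaults (an illustrative sketch, not code from train_gpt.py):

```python
def microbatch_shape(train_batch_tokens=524288, world_size=8, seq_len=1024):
    """Derive micro-batch geometry from TRAIN_BATCH_TOKENS and WORLD_SIZE."""
    grad_accum = 8 // world_size                    # accumulation steps per update
    tokens_per_micro = train_batch_tokens // (world_size * grad_accum)
    seqs_per_micro = tokens_per_micro // seq_len    # sequences per forward pass
    return grad_accum, tokens_per_micro, seqs_per_micro

# At any valid WORLD_SIZE, world_size * grad_accum == 8, so each forward
# pass always sees TRAIN_BATCH_TOKENS // 8 = 65536 tokens (64 sequences).
```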
TRAIN_SEQ_LEN
integer
default:"1024"
Sequence length for both training and validation. Affects memory usage and the minimum VAL_BATCH_SIZE.
MAX_WALLCLOCK_SECONDS
float
default:"600.0"
Hard cap on training time in seconds. When elapsed training time reaches this limit, training stops after the current step finishes. Set to 0 to disable the cap.
QK_GAIN_INIT
float
default:"1.5"
Initial value for the per-head learnable q_gain parameter in each attention block. Scales query vectors before the dot product.

Model Shape

VOCAB_SIZE
integer
default:"1024"
Vocabulary size. Must exactly match the SentencePiece tokenizer’s vocab size.
NUM_LAYERS
integer
default:"9"
Total number of transformer blocks. Split evenly into encoder and decoder halves for U-Net-style skip connections.
NUM_KV_HEADS
integer
default:"4"
Number of key/value heads for Grouped Query Attention (GQA). Must evenly divide NUM_HEADS.
MODEL_DIM
integer
default:"512"
Hidden/embedding dimension. Must be divisible by NUM_HEADS, and MODEL_DIM // NUM_HEADS must be even (required for RoPE).
NUM_HEADS
integer
default:"8"
Number of query attention heads.
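The divisibility constraints on MODEL_DIM, NUM_HEADS, and NUM_KV_HEADS can be checked up front. An illustrative helper (not part of train_gpt.py):

```python
def check_model_shape(model_dim=512, num_heads=8, num_kv_heads=4):
    """Sanity-check the shape constraints listed above; returns the head dim."""
    assert model_dim % num_heads == 0, "MODEL_DIM must be divisible by NUM_HEADS"
    head_dim = model_dim // num_heads
    assert head_dim % 2 == 0, "MODEL_DIM // NUM_HEADS must be even for RoPE"
    assert num_heads % num_kv_heads == 0, "NUM_KV_HEADS must evenly divide NUM_HEADS"
    return head_dim
```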
MLP_MULT
integer
default:"2"
MLP hidden-layer multiplier. The feedforward hidden size is MLP_MULT * MODEL_DIM.
TIE_EMBEDDINGS
integer
default:"1"
Set to 1 to tie input embedding and output projection weights (saves parameters). Set to 0 for a separate lm_head.
ROPE_BASE
float
default:"10000.0"
Base frequency for Rotary Position Embeddings.
LOGIT_SOFTCAP
float
default:"30.0"
Logit soft-cap. Applied as softcap * tanh(logits / softcap) before cross-entropy. Must be positive.
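The soft-cap formula above keeps logits in (-softcap, softcap) while remaining near-identity for small values. A minimal sketch of the stated formula:

```python
import math

def softcap(logit, cap=30.0):
    """Apply cap * tanh(logit / cap): near-identity for |logit| << cap,
    smoothly saturating toward +/-cap for large logits."""
    return cap * math.tanh(logit / cap)
```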

Optimizer

EMBED_LR
float
default:"0.6"
Adam learning rate for the token embedding when TIE_EMBEDDINGS=0.
HEAD_LR
float
default:"0.008"
Adam learning rate for the untied lm_head when TIE_EMBEDDINGS=0.
TIED_EMBED_LR
float
default:"0.05"
Adam learning rate for the token embedding when TIE_EMBEDDINGS=1.
TIED_EMBED_INIT_STD
float
default:"0.005"
Standard deviation for normal initialization of the tied embedding weight.
MATRIX_LR
float
default:"0.04"
Muon learning rate for 2D matrix parameters in transformer blocks.
SCALAR_LR
float
default:"0.04"
Adam learning rate for scalar and vector parameters (scales, norms, gains) in transformer blocks.
MUON_MOMENTUM
float
default:"0.95"
Steady-state momentum for the Muon optimizer.
MUON_BACKEND_STEPS
integer
default:"5"
Number of Newton-Schulz iterations used to orthogonalize gradient matrices in Muon.
MUON_MOMENTUM_WARMUP_START
float
default:"0.85"
Starting Muon momentum value at step 0, linearly warmed up to MUON_MOMENTUM over MUON_MOMENTUM_WARMUP_STEPS steps.
MUON_MOMENTUM_WARMUP_STEPS
integer
default:"500"
Steps over which Muon momentum is linearly warmed from MUON_MOMENTUM_WARMUP_START to MUON_MOMENTUM.
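The warmup described above is a plain linear ramp. A sketch of the implied schedule using the defaults (illustrative, not the script's code):

```python
def muon_momentum(step, start=0.85, final=0.95, warmup_steps=500):
    """Linearly ramp momentum from start to final over warmup_steps,
    then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return start + (final - start) * frac
```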
BETA1
float
default:"0.9"
Adam β₁ (first-moment decay). Applies to all Adam optimizer groups.
BETA2
float
default:"0.95"
Adam β₂ (second-moment decay). Applies to all Adam optimizer groups.
ADAM_EPS
float
default:"1e-8"
Adam numerical stability epsilon. Applies to all Adam optimizer groups.
GRAD_CLIP_NORM
float
default:"0.0"
Global gradient norm clip threshold. Set to 0.0 to disable gradient clipping.

Quantization

These variables control which tensors are kept in floating-point during int8 post-training quantization.
CONTROL_TENSOR_NAME_PATTERNS
string
Comma-separated list of name substrings. Any parameter whose name contains one of these patterns is treated as a “control tensor” — kept in fp32 during training and excluded from int8 quantization. These are typically low-dimensional scalar/vector parameters that are sensitive to precision loss.
INT8_KEEP_FLOAT_FP32_NAME_PATTERNS
string
default:"same as CONTROL_TENSOR_NAME_PATTERNS"
Comma-separated list of name substrings. Tensors matching these patterns are kept in full fp32 in the quantized artifact rather than being downcast to fp16. Defaults to the same value as CONTROL_TENSOR_NAME_PATTERNS.
Tensors with 65,536 elements or fewer are always kept as floating-point (stored as fp16) rather than quantized to int8, regardless of these patterns. Large 2D float tensors use per-row int8 quantization; other large float tensors use per-tensor int8 quantization.
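Per-row int8 quantization stores one floating-point scale per matrix row. A pure-Python sketch of the idea (the actual quantizer's code and storage layout may differ):

```python
def quantize_per_row(rows):
    """Symmetric per-row int8 quantization: each row is scaled by its
    max-abs value so values map into [-127, 127]."""
    out = []
    for row in rows:
        scale = max(abs(x) for x in row) / 127.0 or 1.0  # avoid div-by-zero rows
        q = [max(-127, min(127, round(x / scale))) for x in row]
        out.append((q, scale))
    return out

def dequantize_per_row(quantized):
    """Recover approximate floats by multiplying each row by its scale."""
    return [[q * scale for q in row] for row, scale in quantized]
```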

Distributed Training

These variables are set automatically by torchrun. You do not need to set them manually.
RANK
integer
Global rank of the current process across all nodes. Process 0 is the master process that writes logs and saves checkpoints.
WORLD_SIZE
integer
Total number of processes in the distributed job. Must divide 8 so that gradient accumulation steps (8 // WORLD_SIZE) remain an integer. Valid values: 1, 2, 4, 8.
LOCAL_RANK
integer
Rank of the current process on its local node. Used to select the CUDA device (cuda:<LOCAL_RANK>).

Data Pipeline

These variables configure the dataset download and tokenization scripts in data/.
MATCHED_FINEWEB_REPO_ID
string
default:"willdepueoai/parameter-golf"
Hugging Face dataset repository ID to download shards and tokenizers from.
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX
string
default:"datasets"
Subdirectory prefix within the HF repo under which dataset shards and manifest are stored.
MATCHED_FINEWEB_SP_BATCH_SIZE
integer
Batch size for SentencePiece tokenizer encoding during shard export. Useful for tuning CPU-heavy export throughput.
MATCHED_FINEWEB_TOKENIZER_THREADS
integer
Number of threads for the tokenizer encoding pool during shard export.
MATCHED_FINEWEB_TIKTOKEN_THREADS
integer
Number of threads for tiktoken encoding during shard export (used when tokenizing with the tiktoken backend).
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE
integer
Batch size for GPT-2 decoding when building the blobstore docs cache. Useful for tuning the memory vs. throughput tradeoff.

MLX-Only Variables

These variables are specific to train_gpt_mlx.py and have no effect on train_gpt.py.
MLX_MAX_MICROBATCH_TOKENS
integer
default:"8192"
Maximum tokens per sub-batch within each logical microbatch. MLX splits each microbatch into smaller chunks of at most this size to reduce peak memory pressure on Apple Silicon without changing the effective optimizer batch size.
GRAD_ACCUM_STEPS
integer
default:"8"
Number of gradient accumulation steps per optimizer update in train_gpt_mlx.py. In train_gpt.py this is always derived as 8 // WORLD_SIZE and is not independently configurable.
OUT_DIR
string
default:"logs"
Output directory for log files and model artifacts in train_gpt_mlx.py. In train_gpt.py the log directory is always logs/ and is not configurable.
LOGIT_CHUNK_TOKENS
integer
default:"0"
Number of tokens per logit computation chunk in train_gpt_mlx.py. Set to a positive value to reduce peak memory by computing the final projection and cross-entropy loss in chunks. 0 (default) computes all tokens in a single matmul.
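The chunking idea above can be sketched in pure Python: compute the final projection for a slice of tokens at a time instead of all at once (an illustrative toy with list-based matrices; the MLX implementation differs):

```python
def chunked_logits(hidden, w_head, chunk_tokens):
    """Compute logits[i][v] = dot(hidden[i], w_head[v]) in slices of at most
    chunk_tokens rows; chunk_tokens <= 0 means one full pass."""
    if chunk_tokens <= 0:
        chunk_tokens = len(hidden)
    out = []
    for start in range(0, len(hidden), chunk_tokens):
        for row in hidden[start:start + chunk_tokens]:
            out.append([sum(h * w for h, w in zip(row, col)) for col in w_head])
    return out
```

Chunking changes peak memory, not the result: every chunk size yields identical logits.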