train_gpt.py is driven by environment variables. This page is a consolidated reference organized by subsystem.
Quick-Start Example
Training Hyperparameters
All variables in this section are read by the Hyperparameters class at process startup. Unset variables fall back to the listed defaults.
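The underlying pattern is plain os.environ lookup with typed defaults; a minimal sketch of that pattern (the helper name and the example variables below are illustrative placeholders, not the script's actual identifiers or defaults):

```python
import os

def env_int(name: str, default: int) -> int:
    # Read an integer environment variable, falling back to a default when unset.
    return int(os.environ.get(name, default))

# Placeholder names and defaults, for illustration only.
MAX_STEPS_EXAMPLE = env_int("EXAMPLE_MAX_STEPS", 1000)
SEED_EXAMPLE = env_int("EXAMPLE_SEED", 1234)
```

Setting `EXAMPLE_MAX_STEPS=42` before launch would override the default; leaving it unset keeps 1000.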
Data Paths
Root directory for tokenized dataset shards. The train and validation glob patterns (fineweb_train_*.bin and fineweb_val_*.bin) are derived from this path.

Path to the SentencePiece .model file. Used to build look-up tables for the tokenizer-agnostic BPB metric. Its vocab size must match VOCAB_SIZE exactly, or training raises an error.

Human-readable identifier for this run. Determines the log filename at logs/<RUN_ID>.txt.

Global random seed applied to Python, NumPy, and PyTorch (including cuda.manual_seed_all) before training.

Validation
Total token budget across all ranks per validation pass. Must provide at least one full TRAIN_SEQ_LEN-length sequence per rank.

Run validation every N training steps. Set to 0 to disable periodic validation (final evaluation still runs at the end).

Log a train_loss line every N steps. Steps 1–10 are always logged regardless of this setting.

Training Length
Maximum number of gradient update steps before training stops. The wallclock cap may cause an earlier stop.
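How the step cap and the wallclock cap combine can be sketched as a stop predicate; this is an illustrative reconstruction of the described behavior, not the script's actual loop:

```python
import time

def should_stop(step: int, start_time: float, max_steps: int, time_limit_s: float) -> bool:
    # Stop once the step budget is exhausted, or (when a positive time limit
    # is configured) once elapsed wallclock time has reached the cap.
    if step >= max_steps:
        return True
    if time_limit_s > 0 and (time.time() - start_time) >= time_limit_s:
        return True
    return False
```

A time limit of 0 disables the wallclock check entirely, matching the documented "set to 0 to disable" convention.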
Number of steps (or equivalent wallclock duration) over which the learning rate linearly decays to zero at training end.
Number of pre-training “warmup” steps that prime compiled kernels. Model and optimizer state are fully reset after warmup completes, so effective training always starts from the true initialization.
Total tokens consumed per gradient update across all ranks. Gradient accumulation steps = 8 // WORLD_SIZE.

Sequence length for both training and validation. Affects memory usage and the minimum VAL_BATCH_SIZE.

Hard cap on training time in seconds. When elapsed training time reaches this limit, training stops after the current step finishes. Set to 0 to disable the cap.

Initial value for the per-head learnable q_gain parameter in each attention block. Scales query vectors before the dot product.

Model Shape
Vocabulary size. Must exactly match the SentencePiece tokenizer’s vocab size.
Total number of transformer blocks. Split evenly into encoder and decoder halves for U-Net-style skip connections.
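One common way to realize U-Net-style skips is to mirror-pair the encoder and decoder halves around the midpoint; the pairing below is an assumption for illustration only and may not match the script's exact wiring:

```python
def skip_pairs(num_layers: int) -> list[tuple[int, int]]:
    # Mirror-pair encoder block i with decoder block num_layers - 1 - i.
    # Assumed pairing; requires an even block count so halves split evenly.
    assert num_layers % 2 == 0, "blocks must split evenly into encoder/decoder halves"
    half = num_layers // 2
    return [(i, num_layers - 1 - i) for i in range(half)]
```

With 12 blocks this pairs block 0 with block 11, block 1 with block 10, and so on.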
Number of key/value heads for Grouped Query Attention (GQA). Must evenly divide NUM_HEADS.

Hidden/embedding dimension. Must be divisible by NUM_HEADS, and MODEL_DIM // NUM_HEADS must be even (required for RoPE).

Number of query attention heads.

MLP hidden-layer multiplier. The feedforward hidden size is MLP_MULT * MODEL_DIM.

Set to 1 to tie input embedding and output projection weights (saves parameters). Set to 0 for a separate lm_head.

Base frequency for Rotary Position Embeddings.
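The base frequency enters through the standard RoPE inverse-frequency schedule, which also shows why the per-head dimension must be even: channels are rotated in pairs. A sketch assuming the standard formulation (not verified against the script):

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    # Standard RoPE schedule: one rotation frequency per pair of channels,
    # geometrically spaced from 1 down to base**-(1 - 2/head_dim).
    assert head_dim % 2 == 0, "RoPE rotates channel pairs, so head_dim must be even"
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
```

A larger base stretches the longest wavelength, which is the usual lever for extending usable context length.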
Logit soft-cap. Applied as softcap * tanh(logits / softcap) before cross-entropy. Must be positive.

Optimizer
Adam learning rate for the token embedding when TIE_EMBEDDINGS=0.

Adam learning rate for the untied lm_head when TIE_EMBEDDINGS=0.

Adam learning rate for the token embedding when TIE_EMBEDDINGS=1.

Standard deviation for normal initialization of the tied embedding weight.
Muon learning rate for 2D matrix parameters in transformer blocks.
Adam learning rate for scalar and vector parameters (scales, norms, gains) in transformer blocks.
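The split between Muon (2D matrices) and Adam (scalars/vectors) is typically made on tensor rank; a hedged sketch of such a partition over plain shapes (the real grouping may add further rules, e.g. excluding embeddings):

```python
def partition_by_rank(named_shapes: dict[str, tuple[int, ...]]):
    # Route rank-2 (matrix) parameters to Muon; everything lower-rank
    # (scales, norms, gains) to Adam. Illustrative partition only.
    muon, adam = [], []
    for name, shape in named_shapes.items():
        (muon if len(shape) == 2 else adam).append(name)
    return muon, adam
```

This rank-based routing is why the reference distinguishes "2D matrix parameters" from "scalar and vector parameters" when listing learning rates.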
Steady-state momentum for the Muon optimizer.
Number of Newton-Schulz iterations used to orthogonalize gradient matrices in Muon.
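Newton-Schulz orthogonalization repeatedly applies a polynomial of X·Xᵀ to push all singular values of the gradient matrix toward 1. Shown here is the classic cubic iteration as an illustration; Muon implementations commonly use a tuned quintic polynomial, and nothing below is taken from the script:

```python
def matmul(a, b):
    # Tiny dense matmul on lists of lists (for a self-contained sketch).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=30):
    # Cubic Newton-Schulz: X <- 1.5 X - 0.5 (X X^T) X, after normalizing so the
    # spectral norm is <= 1 (Frobenius norm used as a cheap upper bound).
    fro = sum(v * v for row in g for v in row) ** 0.5 + 1e-7
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x
```

More iterations give a closer-to-orthogonal update at higher cost, which is the tradeoff this variable controls.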
Starting Muon momentum value at step 0, linearly warmed up to MUON_MOMENTUM over MUON_MOMENTUM_WARMUP_STEPS steps.

Steps over which Muon momentum is linearly warmed from MUON_MOMENTUM_WARMUP_START to MUON_MOMENTUM.

Adam β₁ (first-moment decay). Applies to all Adam optimizer groups.
Adam β₂ (second-moment decay). Applies to all Adam optimizer groups.
Adam numerical stability epsilon. Applies to all Adam optimizer groups.
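The linear Muon momentum warmup described above is a one-line schedule; a sketch (the numeric values in the test are placeholders, not the script's defaults):

```python
def muon_momentum(step: int, start: float, target: float, warmup_steps: int) -> float:
    # Linearly interpolate momentum from `start` at step 0 to `target` at
    # `warmup_steps`, holding it constant at `target` afterwards.
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    frac = step / warmup_steps
    return start + (target - start) * frac
```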
Global gradient norm clip threshold. Set to 0.0 to disable gradient clipping.

Quantization
These variables control which tensors are kept in floating point during int8 post-training quantization.

Comma-separated list of name substrings. Any parameter whose name contains one of these patterns is treated as a "control tensor": kept in fp32 during training and excluded from int8 quantization. These are typically low-dimensional scalar/vector parameters that are sensitive to precision loss.

Comma-separated list of name substrings. Tensors matching these patterns are kept in full fp32 in the quantized artifact rather than being downcast to fp16. Defaults to the same value as CONTROL_TENSOR_NAME_PATTERNS.

Regardless of these patterns, tensors with 65,536 elements or fewer are always kept as floating point (stored as fp16) rather than quantized to int8. Large 2D float tensors use per-row int8 quantization; other large float tensors use per-tensor int8 quantization.
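Per-row int8 quantization gives each matrix row its own scale, so one large-magnitude row cannot destroy precision elsewhere. A minimal sketch, assuming a symmetric scheme (the symmetric choice is an illustration, not confirmed from the script):

```python
def quantize_per_row_int8(matrix):
    # Symmetric per-row int8 quantization: each row gets its own scale so the
    # row's max-magnitude value maps to +/-127. All-zero rows get scale 1.0
    # to avoid division by zero.
    quantized, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / 127.0 or 1.0
        scales.append(scale)
        quantized.append([round(v / scale) for v in row])
    return quantized, scales
```

Dequantization is just `value * scale` per row; per-tensor quantization is the same idea with a single scale for the whole tensor.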
Distributed Training
These variables are set automatically by torchrun. You do not need to set them manually.
Global rank of the current process across all nodes. Process 0 is the master process that writes logs and saves checkpoints.
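Taken together, the three torchrun-provided variables are typically consumed like this (a sketch of the documented relationships; the script's actual wiring may differ):

```python
import os

# torchrun exports these; the defaults emulate a single-process run.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

assert 8 % world_size == 0, "WORLD_SIZE must divide 8"
grad_accum_steps = 8 // world_size          # from the Training Length section
device = f"cuda:{local_rank}"               # per-node device selection
is_master = rank == 0                       # master writes logs, saves checkpoints
```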
Total number of processes in the distributed job. Must divide 8 so that gradient accumulation steps (8 // WORLD_SIZE) remain an integer. Valid values: 1, 2, 4, 8.

Rank of the current process on its local node. Used to select the CUDA device (cuda:<LOCAL_RANK>).

Data Pipeline
These variables configure the dataset download and tokenization scripts in data/.
Hugging Face dataset repository ID to download shards and tokenizers from.
Subdirectory prefix within the HF repo under which dataset shards and manifest are stored.
Batch size for SentencePiece tokenizer encoding during shard export. Useful for tuning CPU-heavy export throughput.
Number of threads for the tokenizer encoding pool during shard export.
Number of threads for tiktoken encoding during shard export (used when tokenizing with the tiktoken backend).
Batch size for GPT-2 decoding during the blobstore docs-cache path. Useful for tuning memory vs. throughput tradeoff.
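The batch-size and thread-count knobs above interact in the usual producer/worker shape: documents are chunked into batches, and batches are encoded on a thread pool. A hedged sketch with the tokenizer's batch-encode call abstracted behind a placeholder (`encode_batch` stands in for a SentencePiece or tiktoken call and is not the scripts' actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_shard(texts, encode_batch, batch_size=256, num_threads=4):
    # Chunk documents into batches, then encode batches concurrently.
    # `encode_batch` is a placeholder for the real tokenizer's batch encoder.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(encode_batch, batches)  # preserves input order
    return [ids for batch in results for ids in batch]
```

Larger batches amortize per-call overhead; more threads help while encoding is CPU-bound, which is the tradeoff these variables tune.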
MLX-Only Variables
These variables are specific to train_gpt_mlx.py and have no effect on train_gpt.py.
Maximum tokens per sub-batch within each logical microbatch. MLX splits each microbatch into smaller chunks of at most this size to reduce peak memory pressure on Apple Silicon without changing the effective optimizer batch size.
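Splitting a microbatch into token-bounded sub-batches can be sketched as greedy packing; this is an illustrative reconstruction (the MLX script may instead split fixed-length sequences more simply):

```python
def split_microbatch(sequences, max_tokens):
    # Greedily pack whole sequences into sub-batches of at most max_tokens
    # tokens each, never splitting an individual sequence.
    sub_batches, current, current_tokens = [], [], 0
    for seq in sequences:
        if current and current_tokens + len(seq) > max_tokens:
            sub_batches.append(current)
            current, current_tokens = [], 0
        current.append(seq)
        current_tokens += len(seq)
    if current:
        sub_batches.append(current)
    return sub_batches
```

Gradients from the sub-batches are accumulated before the optimizer step, so the effective batch size is unchanged; only peak memory drops.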
Number of gradient accumulation steps per optimizer update in train_gpt_mlx.py. In train_gpt.py this is always derived as 8 // WORLD_SIZE and is not independently configurable.

Output directory for log files and model artifacts in train_gpt_mlx.py. In train_gpt.py the log directory is always logs/ and is not configurable.

Number of tokens per logit-computation chunk in train_gpt_mlx.py. Set to a positive value to reduce peak memory by computing the final projection and cross-entropy loss in chunks. 0 (default) computes all tokens in a single matmul.
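The chunked-loss idea is framework-agnostic: the final projection and cross-entropy are computed over a slice of tokens at a time, and the results averaged. A plain-Python sketch (not MLX, and not the script's implementation) showing that chunking changes memory, not the result:

```python
import math

def chunked_cross_entropy(hidden, weight, targets, chunk_tokens=0):
    # logits = hidden @ weight.T, then mean cross-entropy, processing
    # `chunk_tokens` tokens per pass; 0 means one pass over all tokens.
    n = len(hidden)
    step = chunk_tokens if chunk_tokens > 0 else n
    total = 0.0
    for start in range(0, n, step):
        for h, t in zip(hidden[start:start + step], targets[start:start + step]):
            logits = [sum(a * b for a, b in zip(h, w)) for w in weight]
            m = max(logits)  # stabilize the log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[t]
    return total / n
```

Only `step` rows of logits need to exist at once, which is the memory saving; the mean loss is identical for any chunk size.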