Overview
This page documents all configuration parameters available in `train.py`. Default values are designed to train GPT-2 (124M parameters) on OpenWebText.
All parameters can be overridden via config files or command-line arguments. See the configuration overview for usage details.
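The override mechanism can be sketched as follows. This is a minimal reimplementation of the pattern, not the actual configurator code: defaults live in a dict (in `train.py` they are module globals), and `--key=value` arguments override individual entries. The `parse_overrides` helper and the sample values are illustrative.

```python
import ast

# Defaults (a small illustrative subset)
config = {"batch_size": 12, "learning_rate": 6e-4, "out_dir": "out"}

def parse_overrides(argv, config):
    """Apply --key=value command-line overrides to a config dict."""
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue  # non-override arguments (e.g. config file paths) handled elsewhere
        key, val = arg[2:].split("=", 1)
        if key not in config:
            raise KeyError(f"unknown config key: {key}")
        try:
            val = ast.literal_eval(val)  # parse numbers, booleans, tuples, ...
        except (ValueError, SyntaxError):
            pass  # leave as a plain string, e.g. --out_dir=debug-run
        config[key] = val
    return config

parse_overrides(["--batch_size=8", "--out_dir=debug-run"], config)
```

With this pattern, an invocation like `python train.py config/train_gpt2.py --batch_size=8` first executes the config file, then applies the individual overrides on top.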
I/O parameters
Control where outputs are saved and how often evaluations run.

- `out_dir`: Directory where checkpoints and logs are saved. Created automatically if it doesn’t exist.
- `eval_interval`: Number of training iterations between evaluations. At each interval, the model is evaluated on the train/val sets and checkpoints are saved.
- `log_interval`: Number of iterations between logging loss and timing statistics to the console.
- `eval_iters`: Number of batches used when estimating train/val loss during evaluation. Higher values give more accurate estimates but take longer.
- `eval_only`: If `True`, the script exits immediately after the first evaluation. Useful for running evaluations without training.
- `always_save_checkpoint`: If `True`, saves a checkpoint after every evaluation interval. If `False`, only saves when the validation loss improves.
- `init_from`: How to initialize the model. Options:
  - `'scratch'`: random initialization
  - `'resume'`: load from `{out_dir}/ckpt.pt`
  - `'gpt2'`: load OpenAI GPT-2 124M weights
  - `'gpt2-medium'`: load GPT-2 350M
  - `'gpt2-large'`: load GPT-2 774M
  - `'gpt2-xl'`: load GPT-2 1.5B
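The `init_from` options above can be summarized with a small dispatcher. This is an illustrative sketch of the decision logic only, not code from `train.py`; the `describe_init` helper is hypothetical.

```python
def describe_init(init_from: str, out_dir: str = "out") -> str:
    """Map an init_from value to the weight source it implies."""
    if init_from == "scratch":
        return "random initialization"
    if init_from == "resume":
        # resume pulls weights, optimizer state, and iter_num from the checkpoint
        return f"checkpoint at {out_dir}/ckpt.pt"
    if init_from.startswith("gpt2"):
        # covers 'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
        return f"pretrained OpenAI weights: {init_from}"
    raise ValueError(f"unknown init_from: {init_from}")
```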
Weights & Biases logging
Integration with Weights & Biases for experiment tracking.

- `wandb_log`: Enable logging to Weights & Biases. Requires the `wandb` package to be installed.
- `wandb_project`: W&B project name under which runs are logged.
- `wandb_run_name`: Name for this specific W&B run. Use dynamic names (e.g. including a timestamp) for unique identification.
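The usual pattern guards the import so `wandb` is only required when logging is enabled. This is an illustrative sketch (the project and run names are made up); `wandb.init` accepts `project`, `name`, and `config` arguments.

```python
wandb_log = False            # set True to enable W&B logging
wandb_project = "owt"        # illustrative project name
wandb_run_name = "gpt2-run"  # illustrative run name

# Hyperparameters attached to the run, so they are searchable in the W&B UI.
config = {"batch_size": 12, "learning_rate": 6e-4, "n_layer": 12}

if wandb_log:
    import wandb  # imported lazily so the package stays optional
    wandb.init(project=wandb_project, name=wandb_run_name, config=config)
```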
Data parameters
Configure dataset loading and batch processing.

- `dataset`: Dataset name. Should match a directory in `data/` containing `train.bin` and `val.bin` files.
- `gradient_accumulation_steps`: Number of forward/backward passes before updating the weights. Simulates larger batch sizes. Effective batch size = `batch_size × gradient_accumulation_steps × num_gpus`.
- `batch_size`: Micro-batch size per GPU, i.e. the number of sequences processed in one forward pass.
- `block_size`: Maximum sequence length (context window) in tokens. Must be ≤ the model’s `block_size` (relevant when loading pretrained weights, which fix the model’s context length at 1024 for GPT-2).

Model parameters
Define the transformer architecture. See model parameters for detailed explanations.

- `n_layer`: Number of transformer blocks. GPT-2 uses 12 (124M), 24 (350M), 36 (774M), or 48 (1.5B).
- `n_head`: Number of attention heads per block. Must divide `n_embd` evenly.
- `n_embd`: Embedding dimension. GPT-2 base uses 768.
- `dropout`: Dropout probability. Use 0.0 for pretraining, 0.1+ for finetuning to prevent overfitting.
- `bias`: Whether to use bias in Linear and LayerNorm layers. `False` is slightly faster and better.

AdamW optimizer
Parameters for the AdamW optimizer.

- `learning_rate`: Maximum learning rate. This is the peak LR reached after warmup (if decay is enabled).
- `max_iters`: Total number of training iterations. Training stops when `iter_num > max_iters`.
- `weight_decay`: Weight decay coefficient. Applied only to 2D parameters (weights, not biases or norms).
- `beta1`: Adam beta1 parameter (exponential decay rate for the first moment estimates).
- `beta2`: Adam beta2 parameter (exponential decay rate for the second moment estimates).
- `grad_clip`: Gradient clipping value. Clips gradients to this maximum norm. Set to `0.0` to disable.

Learning rate decay
Cosine learning rate schedule with warmup.

- `decay_lr`: Whether to decay the learning rate. If `False`, a constant `learning_rate` is used.
- `warmup_iters`: Number of iterations of linear warmup from 0 to `learning_rate`.
- `lr_decay_iters`: Iterations over which to decay from `learning_rate` to `min_lr` on a cosine schedule. Per the Chinchilla paper, this should equal `max_iters`.
- `min_lr`: Minimum learning rate at the end of decay. Per Chinchilla, should be ~`learning_rate / 10`.

Distributed training

Distributed Data Parallel (DDP) settings.

- `backend`: DDP backend. Options:
  - `'nccl'`: NVIDIA GPUs (recommended)
  - `'gloo'`: CPU or cross-platform
  - `'mpi'`: MPI-based
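The warmup-plus-cosine schedule described under Learning rate decay above can be sketched as follows. This is a minimal reimplementation using the documented parameter names with illustrative default values, not the exact code from `train.py`:

```python
import math

def get_lr(it, learning_rate=6e-4, min_lr=6e-5,
           warmup_iters=2000, lr_decay_iters=600000):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to min_lr."""
    # 1) linear warmup from 0 up to learning_rate
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after decay finishes, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay in between: coeff goes from 1 down to 0
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```

Note the three regimes: below `warmup_iters` the LR ramps linearly, between `warmup_iters` and `lr_decay_iters` it follows the cosine curve, and beyond `lr_decay_iters` it stays flat at `min_lr`.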
System parameters
Hardware and compilation settings.

- `device`: Device to train on. Options:
  - `'cuda'`: first available GPU
  - `'cuda:0'`, `'cuda:1'`, etc.: a specific GPU
  - `'cpu'`: CPU only
  - `'mps'`: Apple Silicon GPU
- `dtype`: Training precision. Options:
  - `'float32'`: full precision (slower, more memory)
  - `'bfloat16'`: best for A100/H100-class GPUs
  - `'float16'`: good for older GPUs (uses GradScaler)
  The default automatically selects `'bfloat16'` if your GPU supports it, otherwise `'float16'`.
- `compile`: Whether to use `torch.compile()` for faster training. Requires PyTorch 2.0+.

Configuration examples
- Small model (debugging)
- Finetuning
- Full-scale training
config/debug.py
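A config file for the small-model debugging case might look like the sketch below. The values are illustrative (chosen only to make runs fast), and the file follows the parameters documented above; a config file is just Python assignments that override the defaults in `train.py`.

```python
# config/debug.py -- tiny model, short run, for smoke-testing the pipeline
# (illustrative values, not a tuned configuration)

out_dir = "out-debug"
eval_interval = 50       # evaluate often so problems surface quickly
eval_iters = 20
log_interval = 1

# tiny model
n_layer = 4
n_head = 4
n_embd = 128             # must be divisible by n_head
block_size = 64
dropout = 0.0

# small batches, few iterations
batch_size = 8
gradient_accumulation_steps = 1
max_iters = 200
lr_decay_iters = 200     # per the Chinchilla guidance above, equal to max_iters
learning_rate = 1e-3
min_lr = 1e-4            # ~learning_rate / 10
warmup_iters = 10

device = "cpu"           # runs anywhere; switch to 'cuda' if available
compile = False          # skip torch.compile for fast startup
```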
Related pages
- Configuration overview: learn how the configurator system works
- Model parameters: understand model architecture settings