
Overview

This page documents all configuration parameters available in train.py. Default values are designed to train GPT-2 (124M parameters) on OpenWebText.
All parameters can be overridden via config files or command-line arguments. See the configuration overview for usage details.

I/O parameters

Control where outputs are saved and how often evaluations run.
out_dir
str
default:"'out'"
Directory where checkpoints and logs are saved. Created automatically if it doesn’t exist.
out_dir = 'out-shakespeare-char'
eval_interval
int
default:"2000"
Number of training iterations between evaluations. At each interval, loss is estimated on the train/val sets and a checkpoint may be saved (see always_save_checkpoint).
eval_interval = 250  # Evaluate more frequently for small datasets
log_interval
int
default:"1"
Number of iterations between logging loss and timing statistics to console.
log_interval = 10  # Log every 10 iterations
eval_iters
int
default:"200"
Number of batches to use when estimating train/val loss during evaluation. Higher values give more accurate estimates but take longer.
eval_iters = 40  # Faster evaluation for finetuning
eval_only
bool
default:"False"
If True, the script exits immediately after the first evaluation. Useful for running evaluations without training.
eval_only = True  # Just evaluate, don't train
always_save_checkpoint
bool
default:"True"
If True, saves a checkpoint after every evaluation interval. If False, only saves when validation loss improves.
always_save_checkpoint = False  # Only save best checkpoints
init_from
str
default:"'scratch'"
How to initialize the model. Options:
  • 'scratch' - Random initialization
  • 'resume' - Load from {out_dir}/ckpt.pt
  • 'gpt2' - Load OpenAI GPT-2 124M weights
  • 'gpt2-medium' - Load GPT-2 350M
  • 'gpt2-large' - Load GPT-2 774M
  • 'gpt2-xl' - Load GPT-2 1.5B
init_from = 'gpt2-xl'  # Finetune from largest GPT-2
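Putting this together with the other parameters on this page, a finetuning config that starts from pretrained weights might look like the sketch below (the file name config/finetune_gpt2.py is hypothetical; the values are illustrative, not prescriptive):

```python
# config/finetune_gpt2.py  (hypothetical file name)
init_from = 'gpt2'              # start from OpenAI GPT-2 124M weights
bias = True                     # OpenAI checkpoints use bias terms
always_save_checkpoint = False  # only keep checkpoints that improve val loss
learning_rate = 3e-5            # much lower peak LR for finetuning
decay_lr = False                # constant LR is common when finetuning
dropout = 0.1                   # a little regularization against overfitting
```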

Weights & Biases logging

Integration with Weights & Biases for experiment tracking.
wandb_log
bool
default:"False"
Enable logging to Weights & Biases. Requires the wandb package to be installed.
wandb_log = True  # Enable W&B logging
wandb_project
str
default:"'owt'"
W&B project name where runs are logged.
wandb_project = 'shakespeare-char'
wandb_run_name
str
default:"'gpt2'"
Name for this specific W&B run. Use dynamic names for unique identification.
import time
wandb_run_name = 'gpt2-' + str(time.time())

Data parameters

Configure dataset loading and batch processing.
dataset
str
default:"'openwebtext'"
Dataset name. Should match a directory in data/ containing train.bin and val.bin files.
dataset = 'shakespeare_char'
gradient_accumulation_steps
int
default:"40"
Number of forward/backward passes accumulated before each weight update. Simulates larger batch sizes. Effective batch size = batch_size × gradient_accumulation_steps × num_gpus
gradient_accumulation_steps = 5 * 8  # For 8 GPUs
In DDP mode, this value is automatically divided by world size (train.py:95).
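To see what the defaults imply, here is a quick sketch of the effective-batch arithmetic using the formula above (the single-GPU count is an assumption for illustration):

```python
batch_size = 12                    # micro-batch per GPU (default)
block_size = 1024                  # tokens per sequence (default)
gradient_accumulation_steps = 40   # micro-steps per optimizer step (default)
num_gpus = 1                       # assumption: single-GPU run

# sequences processed per optimizer step
effective_batch = batch_size * gradient_accumulation_steps * num_gpus
# tokens processed per optimizer step
tokens_per_iter = effective_batch * block_size

print(effective_batch)   # 480
print(tokens_per_iter)   # 491520
```

With 8 GPUs (num_gpus = 8) the same settings would process roughly 3.9M tokens per optimizer step.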
batch_size
int
default:"12"
Micro-batch size per GPU. This is the number of sequences processed in one forward pass.
batch_size = 64  # Small model, can fit larger batches
block_size
int
default:"1024"
Maximum sequence length (context window) in tokens. Must be ≤ model’s block_size.
block_size = 256  # Shorter context for faster training

Model parameters

Define the transformer architecture. See model parameters for detailed explanations.
n_layer
int
default:"12"
Number of transformer blocks. GPT-2 uses 12 (124M), 24 (350M), 36 (774M), or 48 (1.5B).
n_layer = 6  # Smaller model for debugging
n_head
int
default:"12"
Number of attention heads per block. Must divide n_embd evenly.
n_head = 6  # Matches smaller n_embd
n_embd
int
default:"768"
Embedding dimension. GPT-2 base uses 768.
n_embd = 384  # Half size for baby GPT
dropout
float
default:"0.0"
Dropout probability. Use 0.0 for pretraining, 0.1+ for finetuning to prevent overfitting.
dropout = 0.2  # Regularization for small datasets
bias
bool
default:"False"
Whether to use bias in Linear and LayerNorm layers. False is slightly faster and often trains slightly better.
bias = True  # Required when loading OpenAI GPT-2 checkpoints

AdamW optimizer

Parameters for the AdamW optimizer.
learning_rate
float
default:"6e-4"
Maximum learning rate. This is the peak LR after warmup (if decay is enabled).
learning_rate = 1e-3  # Higher LR for small models
max_iters
int
default:"600000"
Total number of training iterations. Training stops when iter_num > max_iters.
max_iters = 5000  # Quick training for small datasets
weight_decay
float
default:"1e-1"
Weight decay coefficient (AdamW applies this as decoupled decay, not classic L2). Applied only to 2D parameters (weights, not biases or norms).
weight_decay = 1e-1  # Standard value
beta1
float
default:"0.9"
Adam beta1 parameter (exponential decay rate for first moment estimates).
beta1 = 0.9  # Standard value
beta2
float
default:"0.95"
Adam beta2 parameter (exponential decay rate for second moment estimates).
beta2 = 0.99  # Higher for small batch sizes
grad_clip
float
default:"1.0"
Gradient clipping value. Clips gradients to this max norm. Set to 0.0 to disable.
grad_clip = 1.0  # Prevent gradient explosion

Learning rate decay

Cosine learning rate schedule with warmup.
decay_lr
bool
default:"True"
Whether to use learning rate decay. If False, uses constant learning_rate.
decay_lr = False  # Constant LR for finetuning
warmup_iters
int
default:"2000"
Number of iterations for linear warmup from 0 to learning_rate.
warmup_iters = 100  # Quick warmup for small models
lr_decay_iters
int
default:"600000"
Iterations over which to decay from learning_rate to min_lr using a cosine schedule. Per the Chinchilla paper, this should equal max_iters.
lr_decay_iters = 5000  # Match max_iters
min_lr
float
default:"6e-5"
Minimum learning rate at end of decay. Per Chinchilla, should be ~learning_rate / 10.
min_lr = 1e-4  # 1/10 of learning_rate
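The four knobs in this section combine into a warmup-then-cosine schedule. The sketch below mirrors the shape of the schedule train.py implements, but is a simplified illustration rather than the exact source:

```python
import math

learning_rate = 6e-4     # peak LR (default)
min_lr = 6e-5            # floor after decay (default)
warmup_iters = 2000      # default
lr_decay_iters = 600000  # default; should match max_iters

def get_lr(it: int) -> float:
    # 1) linear warmup from 0 up to learning_rate
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0))        # 0.0 (start of warmup)
print(get_lr(2000))     # peak, ~6e-4 (end of warmup)
print(get_lr(600001))   # 6e-05 (floor)
```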

Distributed training

Distributed Data Parallel (DDP) settings.
backend
str
default:"'nccl'"
DDP backend. Options:
  • 'nccl' - NVIDIA GPUs (recommended)
  • 'gloo' - CPU or cross-platform
  • 'mpi' - MPI-based
backend = 'gloo'  # For CPU clusters

System parameters

Hardware and compilation settings.
device
str
default:"'cuda'"
Device to train on. Options:
  • 'cuda' - First available GPU
  • 'cuda:0', 'cuda:1', etc. - Specific GPU
  • 'cpu' - CPU only
  • 'mps' - Apple Silicon GPU
device = 'cpu'  # For MacBooks without CUDA
dtype
str
default:"'bfloat16' if supported else 'float16'"
Training precision. Options:
  • 'float32' - Full precision (slower, more memory)
  • 'bfloat16' - Best for A100/H100 GPUs
  • 'float16' - Good for older GPUs (uses GradScaler)
dtype = 'float32'  # Maximum precision
Default automatically selects 'bfloat16' if your GPU supports it, otherwise 'float16'.
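That default selection can be sketched as a plain function; train.py checks bfloat16 support through CUDA, which is abstracted here into boolean arguments for illustration:

```python
def pick_dtype(cuda_available: bool, bf16_supported: bool) -> str:
    # Mirrors the documented default: prefer bfloat16 when the GPU
    # supports it, otherwise fall back to float16 (with GradScaler).
    if cuda_available and bf16_supported:
        return 'bfloat16'
    return 'float16'

print(pick_dtype(True, True))    # bfloat16 (A100/H100-class GPUs)
print(pick_dtype(True, False))   # float16 (older GPUs)
```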
compile
bool
default:"True"
Whether to compile the model with torch.compile() for faster training. Requires PyTorch 2.0+.
compile = False  # Disable for debugging or PyTorch < 2.0
Compilation takes ~1 minute on first run but provides significant speedups.

Configuration examples

config/debug.py
# Quick training run for debugging
out_dir = 'out-debug'
eval_interval = 100
eval_iters = 20
log_interval = 1

batch_size = 8
block_size = 256

n_layer = 4
n_head = 4
n_embd = 256

max_iters = 1000
learning_rate = 1e-3

compile = False  # Faster startup

Related pages

  • Configuration overview - Learn how the configurator system works
  • Model parameters - Understand model architecture settings
