Overview
The GPT model architecture is defined by the GPTConfig dataclass in model.py:108-116. These parameters control the transformer's structure and capacity.
Model parameters can be set via the same configuration system as training parameters. See the configuration overview for details.
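As a sketch, the dataclass looks roughly like this, with field names and GPT-2 defaults as documented on this page (the authoritative definition lives in model.py:108-116):

```python
from dataclasses import dataclass

# Sketch of the GPTConfig dataclass described on this page.
# Defaults shown are the GPT-2 base values documented below.
@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length (context window)
    vocab_size: int = 50304  # GPT-2 vocab (50257) padded to a multiple of 64
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension (model width)
    dropout: float = 0.0     # dropout probability
    bias: bool = True        # bias terms in Linear and LayerNorm layers
```

Each field is explained in the sections below.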
GPTConfig parameters
These parameters define the model architecture. Changes require retraining from scratch (unless loading pretrained weights).

n_layer
Number of transformer blocks (layers) in the model. Each block contains one attention layer and one feedforward layer.
Impact: Controls model depth. More layers = more capacity but slower training.
GPT-2 variants:
- GPT-2 base: 12 layers (124M params)
- GPT-2 medium: 24 layers (350M params)
- GPT-2 large: 36 layers (774M params)
- GPT-2 XL: 48 layers (1.5B params)
n_head
Number of attention heads in each multi-head attention layer.
Constraint: Must divide n_embd evenly. Each head has dimension n_embd / n_head.
Impact: More heads allow attending to different positions simultaneously. Typical head dimension is 64.
GPT-2 variants:
- GPT-2 base: 12 heads
- GPT-2 medium: 16 heads
- GPT-2 large: 20 heads
- GPT-2 XL: 25 heads
n_embd
Embedding dimension (model width). This is the size of the hidden states throughout the model.
Impact: Controls model width. Larger values = more capacity and memory usage.
GPT-2 variants:
- GPT-2 base: 768
- GPT-2 medium: 1024
- GPT-2 large: 1280
- GPT-2 XL: 1600
The feedforward layer expands to 4 * n_embd internally (see model.py:82).

block_size
Maximum sequence length (context window) the model can process. This is the size of the positional embedding table.
Impact: Determines maximum context length. Larger values allow longer sequences but increase attention memory quadratically.
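The quadratic growth is easy to see with a back-of-envelope estimate (my own arithmetic, not code from model.py): without Flash Attention, each head materializes a block_size × block_size score matrix.

```python
# Back-of-envelope: bytes of attention-score storage per layer, per sample,
# assuming one (T x T) float32 matrix per head. This is an illustration of
# the quadratic scaling, not the actual allocation logic in model.py.
def attn_score_bytes(block_size: int, n_head: int, bytes_per_el: int = 4) -> int:
    return n_head * block_size * block_size * bytes_per_el

small = attn_score_bytes(1024, 12)  # GPT-2 base at block_size=1024
large = attn_score_bytes(2048, 12)  # doubling block_size quadruples this
```

Flash Attention (see below) avoids materializing these matrices, which is why it reduces memory use.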
vocab_size
Size of the vocabulary (number of unique tokens). This determines the size of the token embedding and output layers.
Default explained: GPT-2 uses 50257 tokens, padded to 50304 (the nearest multiple of 64) for computational efficiency.
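The padding rule is simple to reproduce (a sketch of the round-up arithmetic, not code from the repository):

```python
# Round a vocabulary size up to the nearest multiple of 64,
# matching the 50257 -> 50304 padding described above.
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    return ((vocab_size + multiple - 1) // multiple) * multiple

padded = pad_vocab(50257)  # 50304
```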
The training script automatically detects vocab_size from data/{dataset}/meta.pkl if available (train.py:138-144).

dropout
Dropout probability applied throughout the model:
- Attention dropout (model.py:39)
- Residual dropout after attention (model.py:40)
- MLP dropout (model.py:85)
- Embedding dropout (model.py:129)
Recommended values:
- Pretraining: Use 0.0 (no dropout)
- Finetuning: Use 0.1 to 0.2 to prevent overfitting on small datasets
bias
Whether to include bias terms in Linear layers and LayerNorm.
Impact:
- True: Compatible with OpenAI GPT-2 checkpoints; standard practice
- False: Slightly faster and often performs better (modern recommendation)
Architecture details
Model structure
The GPT model (model.py:118-331) consists of token and positional embeddings, embedding dropout, a stack of n_layer transformer blocks, a final LayerNorm, and a linear language-model head. Each Block contains a LayerNorm, a causal self-attention layer, a second LayerNorm, and an MLP, each wrapped in a residual connection.
Parameter count
The number of parameters is approximately:

| Model | n_layer | n_head | n_embd | Parameters |
|---|---|---|---|---|
| GPT-2 base | 12 | 12 | 768 | 124M |
| GPT-2 medium | 24 | 16 | 1024 | 350M |
| GPT-2 large | 36 | 20 | 1280 | 774M |
| GPT-2 XL | 48 | 25 | 1600 | 1558M |
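The table's figures can be reproduced to a good approximation with a back-of-envelope count (my own arithmetic: weight-tied embeddings plus the attention and MLP weight matrices per block, ignoring small LayerNorm and bias terms):

```python
# Rough parameter estimate for a GPT-2-style model.
# Assumes weight tying (lm_head shares the token embedding), a 4x MLP
# expansion, and ignores LayerNorm/bias parameters.
def approx_params(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    embeddings = vocab_size * n_embd + block_size * n_embd  # wte + wpe
    attn = 4 * n_embd ** 2       # Q, K, V, and output projections
    mlp = 8 * n_embd ** 2        # n_embd -> 4*n_embd -> n_embd
    return embeddings + n_layer * (attn + mlp)

gpt2_base = approx_params(12, 768, 50257, 1024)  # ~124M, matching the table
```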
Flash Attention
The model automatically uses Flash Attention if available (PyTorch ≥ 2.0):
- Faster attention computation
- Lower memory usage
- Exact numerical equivalence
Configuration examples
- Baby GPT
- Small GPT
- GPT-2 Base
- GPT-2 Medium
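As one concrete illustration of the presets listed above, a minimal "baby GPT" config file might look like this. The values are illustrative, in the spirit of nanoGPT's character-level configs, not copied from a specific shipped file:

```python
# Illustrative "baby GPT" architecture settings; pass via a config file
# or as command-line overrides. Values here are an example only.
n_layer = 6
n_head = 6
n_embd = 384      # head dim = 384 / 6 = 64
block_size = 256  # short context for a tiny dataset
dropout = 0.2     # small datasets benefit from dropout
bias = False      # modern recommendation (see the bias parameter above)
```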
Finetuning considerations
When finetuning pretrained models, certain parameters are fixed and cannot be changed.

Fixed parameters (resume/pretrained)
These are locked when using init_from='resume' or init_from='gpt2*' (train.py:166-167):
- n_layer
- n_head
- n_embd
- block_size
- bias
- vocab_size
Overridable parameters
Only dropout can be overridden when loading pretrained weights (train.py:184).
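A hypothetical standalone sketch of this restriction (the real check lives in train.py; names here are illustrative, not the repository's):

```python
# Sketch: when loading pretrained weights, only 'dropout' may differ
# from the checkpoint's architecture configuration.
ALLOWED_OVERRIDES = {"dropout"}

def apply_overrides(checkpoint_config: dict, overrides: dict) -> dict:
    for key in overrides:
        if key not in ALLOWED_OVERRIDES:
            raise ValueError(f"cannot override {key!r} on a pretrained model")
    return {**checkpoint_config, **overrides}
```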
You can reduce block_size via model surgery using crop_block_size() (train.py:190-192), but not increase it.

Advanced topics
Weight tying
The token embedding weights are shared (tied) with the output layer. This reduces the parameter count and often improves performance. See the Weight Tying paper.
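The saving is easy to quantify: tying removes one vocab_size × n_embd matrix that the output head would otherwise need. A quick back-of-envelope:

```python
# Parameters saved by tying the output head to the token embedding:
# one vocab_size x n_embd matrix.
def tying_savings(vocab_size: int, n_embd: int) -> int:
    return vocab_size * n_embd

saved = tying_savings(50304, 768)  # ~38.6M parameters for GPT-2 base
```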
Initialization
Weights are initialized with:
- Linear layers: Normal(0, 0.02)
- Embeddings: Normal(0, 0.02)
- Residual projections: Normal(0, 0.02/√(2*n_layer))
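The scaled residual standard deviation above can be computed directly (a small sketch of the formula, not the model.py code itself):

```python
import math

# Residual-projection init std from the list above: 0.02 / sqrt(2 * n_layer).
# Deeper models get a smaller std, keeping residual-stream variance in check.
def residual_init_std(n_layer: int, base_std: float = 0.02) -> float:
    return base_std / math.sqrt(2 * n_layer)

std_base = residual_init_std(12)  # GPT-2 base
std_xl = residual_init_std(48)    # GPT-2 XL, smaller because it is deeper
```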
See model.py:141-145 for the scaled initialization of residual connections.

Head dimension
Each attention head has dimension n_embd / n_head; 64 is the most common head dimension across transformer models. Common choices:

| n_embd | n_head | Head dim |
|---|---|---|
| 256 | 4 | 64 |
| 384 | 6 | 64 |
| 512 | 8 | 64 |
| 768 | 12 | 64 |
| 1024 | 16 | 64 |
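The divisibility constraint behind this table can be sketched as a small helper (hypothetical, mirroring the rule that n_head must divide n_embd evenly):

```python
# Head dimension check: n_embd must split evenly across n_head.
def head_dim(n_embd: int, n_head: int) -> int:
    assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
    return n_embd // n_head
```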
Memory requirements
Approximate GPU memory for training (batch_size=1, block_size=1024, mixed precision); scale these figures proportionally with batch_size and gradient_accumulation_steps:
| Model | Parameters | Training memory |
|---|---|---|
| 10M | 10M | ~2 GB |
| 124M | 124M | ~8 GB |
| 350M | 350M | ~16 GB |
| 774M | 774M | ~32 GB |
| 1.5B | 1.5B | ~48 GB |
Model surgery
You can reduce the context window after loading a checkpoint. This truncates the positional embeddings and attention bias buffers (model.py:195-204).

Validation
The model validates its configuration at initialization, for example checking that vocab_size and block_size are set and that n_embd is divisible by n_head.

Related pages
Configuration overview
Learn how to set these parameters via config files
Training parameters
Configure optimizer, learning rate, and training loop