GPTConfig is a dataclass that defines all configuration parameters for the GPT model architecture. It controls the model size, attention mechanism, and regularization settings.
Class definition
model.py:108-116
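The definition is a plain Python dataclass. The sketch below reproduces it with the field names and defaults matching the parameter descriptions that follow; treat it as a sketch of model.py:108-116, not a verbatim copy.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max sequence length
    vocab_size: int = 50304  # GPT-2 vocab of 50257, padded up to a multiple of 64
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block; must divide n_embd
    n_embd: int = 768        # embedding / hidden dimension
    dropout: float = 0.0     # 0.0 disables dropout
    bias: bool = True        # bias in Linear layers and LayerNorms
```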
Parameters
block_size
Maximum sequence length that the model can handle. This determines the size of the positional embeddings and the causal attention mask.
vocab_size
Size of the vocabulary. The default value of 50304 is GPT-2’s vocab_size of 50257 padded up to the nearest multiple of 64 for efficiency.
n_layer
Number of transformer blocks in the model. Standard GPT-2 uses 12 layers.
n_head
Number of attention heads in each transformer block. Must evenly divide n_embd.
n_embd
Dimensionality of the embeddings and hidden states throughout the model. Standard GPT-2 uses 768.
dropout
Dropout probability applied to attention weights, residual connections, and embeddings. Set to 0.0 to disable dropout.
bias
Whether to include bias terms in Linear layers and LayerNorms. Setting it to False can make the model slightly faster and may improve performance.
Usage
Create a custom configuration
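For instance, a small configuration for quick experiments might look like this. The GPTConfig definition is repeated so the snippet is self-contained, and the chosen values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:  # as defined in model.py
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

# A smaller model for quick experiments.
# Note that n_head (6) evenly divides n_embd (384), as required.
config = GPTConfig(
    block_size=256,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
)
```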
Use pretrained configurations
When loading pretrained models, the configuration is set automatically based on the model type.
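A sketch of how such a loader can map the model type to configuration values: the dictionary entries mirror the Model variants table below, while the helper function config_for is a hypothetical name used here for illustration.

```python
# Per-model settings selected by a from_pretrained-style loader.
config_args = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}

def config_for(model_type: str) -> dict:
    """Build the full set of config arguments for a GPT-2 checkpoint."""
    args = dict(config_args[model_type])
    # These are fixed for all GPT-2 checkpoints (unpadded vocab, 1024 context, biases on).
    args.update(vocab_size=50257, block_size=1024, bias=True)
    return args
```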
The n_head parameter must evenly divide n_embd, since each head operates on a slice of the embedding dimension of size n_embd // n_head.
Model variants
Here are the standard GPT-2 configurations:

| Model | n_layer | n_head | n_embd | Parameters |
|---|---|---|---|---|
| GPT-2 | 12 | 12 | 768 | 124M |
| GPT-2 Medium | 24 | 16 | 1024 | 350M |
| GPT-2 Large | 36 | 20 | 1280 | 774M |
| GPT-2 XL | 48 | 25 | 1600 | 1558M |
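The parameter counts in the table can be sanity-checked with a back-of-the-envelope estimate: each transformer block contributes roughly 12·n_embd² weights (4·n_embd² for the attention projections, 8·n_embd² for the MLP), plus token and position embeddings counted once, since GPT-2 ties the token embedding to the output head. This is an approximation that ignores LayerNorm and bias terms:

```python
def approx_params(n_layer: int, n_embd: int,
                  vocab_size: int = 50257, block_size: int = 1024) -> int:
    """Rough parameter count for a GPT-2-style model (LayerNorms/biases ignored)."""
    blocks = 12 * n_layer * n_embd ** 2              # 4*n_embd^2 attn + 8*n_embd^2 MLP per block
    embeddings = vocab_size * n_embd + block_size * n_embd
    return blocks + embeddings

for name, (n_layer, n_embd) in {
    'GPT-2': (12, 768), 'GPT-2 Medium': (24, 1024),
    'GPT-2 Large': (36, 1280), 'GPT-2 XL': (48, 1600),
}.items():
    print(f"{name}: ~{approx_params(n_layer, n_embd) / 1e6:.0f}M")
```

The estimate lands within a few million parameters of each figure in the table.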