GPTConfig is a dataclass that defines all configuration parameters for the GPT model architecture. It controls the model size, attention mechanism, and regularization settings.

Class definition

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True
Location: model.py:108-116

Parameters

block_size (int, default: 1024)
Maximum sequence length the model can handle. This determines the size of the positional embeddings and the causal attention mask.

vocab_size (int, default: 50304)
Size of the vocabulary. The default of 50304 is GPT-2's vocab_size of 50257 padded up to the nearest multiple of 64 for efficiency.

n_layer (int, default: 12)
Number of transformer blocks in the model. Standard GPT-2 uses 12 layers.

n_head (int, default: 12)
Number of attention heads in each transformer block. Must evenly divide n_embd.

n_embd (int, default: 768)
Dimensionality of the embeddings and hidden states throughout the model. Standard GPT-2 uses 768.

dropout (float, default: 0.0)
Dropout probability applied to attention weights, residual connections, and embeddings. Set to 0.0 to disable dropout.

bias (bool, default: True)
Whether to include bias terms in Linear layers and LayerNorms. Setting it to False can make the model slightly faster and may improve performance.
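The two arithmetic constraints above — vocab padding and head divisibility — can be checked in a few lines. A minimal sketch (illustrative only, not part of model.py; the helper name is hypothetical):

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple, e.g. 50257 -> 50304."""
    return ((n + multiple - 1) // multiple) * multiple

# The vocab_size default is GPT-2's 50257 padded up to a multiple of 64
assert pad_to_multiple(50257) == 50304

# n_head must evenly divide n_embd: each head operates on an
# n_embd // n_head slice of the embedding dimension
n_embd, n_head = 768, 12
assert n_embd % n_head == 0
head_dim = n_embd // n_head  # 64 for standard GPT-2
```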

Usage

Create a custom configuration

from model import GPTConfig

# Create a smaller GPT model
config = GPTConfig(
    block_size=512,
    vocab_size=50304,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
    bias=False
)
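Because GPTConfig is a plain dataclass, you can also derive variants from an existing configuration with dataclasses.replace instead of re-listing every field. A sketch that re-declares the class so the snippet runs standalone:

```python
from dataclasses import dataclass, replace

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

base = GPTConfig()
# Override only the fields that differ; the rest keep their defaults
small = replace(base, n_layer=6, n_head=6, n_embd=384)
```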

Use pretrained configurations

When loading pretrained models, the configuration is automatically set based on the model type:
config_args = {
    'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
    'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
    'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
    'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
}
The n_head parameter must evenly divide n_embd since each head operates on a portion of the embedding dimension (n_embd // n_head).
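A hedged sketch of how one of these presets might be expanded into a full GPTConfig — the actual loading function is not shown here, and the class is re-declared so the snippet runs standalone:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

config_args = {
    'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}

args = config_args['gpt2-medium']
# Pretrained GPT-2 checkpoints use the unpadded vocab of 50257
config = GPTConfig(**args, vocab_size=50257, block_size=1024, bias=True)
assert config.n_embd % config.n_head == 0
```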

Model variants

Here are the standard GPT-2 configurations:
Model         n_layer  n_head  n_embd  Parameters
GPT-2         12       12      768     124M
GPT-2 Medium  24       16      1024    350M
GPT-2 Large   36       20      1280    774M
GPT-2 XL      48       25      1600    1558M
