GPTConfig is a dataclass that defines all configuration parameters for the GPT model architecture. It controls the model size, attention mechanism, and regularization settings.
Class definition
model.py:108-116
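The definition is a plain Python dataclass. The sketch below reproduces it with the field names and defaults matching the parameter descriptions that follow; treat it as a sketch of model.py:108-116, not a verbatim copy.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max sequence length
    vocab_size: int = 50304  # GPT-2 vocab of 50257, padded up to a multiple of 64
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block; must divide n_embd
    n_embd: int = 768        # embedding / hidden dimension
    dropout: float = 0.0     # 0.0 disables dropout
    bias: bool = True        # bias in Linear layers and LayerNorms
```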
Parameters
block_size
Maximum sequence length that the model can handle. This determines the size of the positional embeddings and the causal attention mask.
vocab_size
Size of the vocabulary. The default value of 50304 is GPT-2’s vocab_size of 50257 padded up to the nearest multiple of 64 for efficiency.
n_layer
Number of transformer blocks in the model. Standard GPT-2 uses 12 layers.
n_head
Number of attention heads in each transformer block. Must evenly divide n_embd.
n_embd
Dimensionality of the embeddings and hidden states throughout the model. Standard GPT-2 uses 768.
dropout
Dropout probability applied to attention weights, residual connections, and embeddings. Set to 0.0 to disable dropout.
bias
Whether to include bias terms in Linear layers and LayerNorms. Setting it to False can make the model slightly faster and may improve performance.
Usage
Create a custom configuration
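For instance, a small configuration for quick experiments might look like this. The GPTConfig definition is repeated so the snippet is self-contained, and the chosen values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:  # as defined in model.py
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

# A smaller model for quick experiments.
# Note that n_head (6) evenly divides n_embd (384), as required.
config = GPTConfig(
    block_size=256,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
)
```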
Use pretrained configurations
When loading pretrained models, the configuration is set automatically based on the model type.
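A sketch of how such a loader can map the model type to configuration values: the dictionary entries mirror the Model variants table below, while the helper function config_for is a hypothetical name used here for illustration.

```python
# Per-model settings selected by a from_pretrained-style loader.
config_args = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}

def config_for(model_type: str) -> dict:
    """Build the full set of config arguments for a GPT-2 checkpoint."""
    args = dict(config_args[model_type])
    # These are fixed for all GPT-2 checkpoints (unpadded vocab, 1024 context, biases on).
    args.update(vocab_size=50257, block_size=1024, bias=True)
    return args
```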
The n_head parameter must evenly divide n_embd, since each head operates on a slice of the embedding dimension of size n_embd // n_head.
Model variants
Here are the standard GPT-2 configurations:

| Model | n_layer | n_head | n_embd | Parameters |
|---|---|---|---|---|
| GPT-2 | 12 | 12 | 768 | 124M |
| GPT-2 Medium | 24 | 16 | 1024 | 350M |
| GPT-2 Large | 36 | 20 | 1280 | 774M |
| GPT-2 XL | 48 | 25 | 1600 | 1558M |
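The parameter counts in the table can be sanity-checked with a back-of-the-envelope estimate: each transformer block contributes roughly 12·n_embd² weights (4·n_embd² for the attention projections, 8·n_embd² for the MLP), plus token and position embeddings counted once, since GPT-2 ties the token embedding to the output head. This is an approximation that ignores LayerNorm and bias terms:

```python
def approx_params(n_layer: int, n_embd: int,
                  vocab_size: int = 50257, block_size: int = 1024) -> int:
    """Rough parameter count for a GPT-2-style model (LayerNorms/biases ignored)."""
    blocks = 12 * n_layer * n_embd ** 2              # 4*n_embd^2 attn + 8*n_embd^2 MLP per block
    embeddings = vocab_size * n_embd + block_size * n_embd
    return blocks + embeddings

for name, (n_layer, n_embd) in {
    'GPT-2': (12, 768), 'GPT-2 Medium': (24, 1024),
    'GPT-2 Large': (36, 1280), 'GPT-2 XL': (48, 1600),
}.items():
    print(f"{name}: ~{approx_params(n_layer, n_embd) / 1e6:.0f}M")
```

The estimate lands within a few million parameters of each figure in the table.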