GPT class
The main GPT Language Model implementation. This class provides the complete transformer architecture with support for training, inference, and text generation.

Constructor
Configuration object specifying model architecture parameters
Attributes
The configuration object containing model hyperparameters
Container with the following components:
wte: Token embedding layer (vocab_size x n_embd)
wpe: Position embedding layer (block_size x n_embd)
drop: Dropout layer
h: ModuleList of transformer blocks
ln_f: Final layer normalization
Language model head that projects embeddings to vocabulary logits (n_embd x vocab_size, without bias)
Methods
forward
Input token indices of shape (batch_size, sequence_length)
Target token indices for computing loss. If provided, loss will be calculated using cross-entropy.
logits: Predicted logits of shape (batch_size, sequence_length, vocab_size) when targets are provided (training), or (batch_size, 1, vocab_size) during inference, where only the last position's logits are computed
loss: Cross-entropy loss if targets are provided, otherwise None
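To make the loss concrete: for each position, the cross-entropy loss is the negative log-likelihood of the target token under the softmax of that position's logits. A minimal pure-Python sketch of the per-position computation (illustrative only; the class itself uses PyTorch's `F.cross_entropy`):

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    log_prob = (logits[target] - m) - math.log(sum(exps))
    return -log_prob

# One position with vocab_size=4: the model strongly favors token 2.
logits = [0.1, 0.2, 3.0, -1.0]
loss_correct = cross_entropy(logits, 2)  # small loss: prediction matches target
loss_wrong = cross_entropy(logits, 3)    # large loss: prediction misses target
```

The reported loss is this quantity averaged over all (batch, position) pairs.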
generate
Conditioning sequence of token indices with shape (batch_size, sequence_length)
Number of new tokens to generate
Sampling temperature. Values < 1.0 make the model more confident (less random), values > 1.0 make it more diverse (more random).
If specified, only the top_k most likely tokens are considered for sampling. Others are set to zero probability.
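The temperature and top_k parameters combine as follows: logits are divided by the temperature, everything outside the top_k highest logits is masked out, and a token is sampled from the softmax of the result. A self-contained sketch of that filtering step (pure Python, not the class's actual PyTorch code):

```python
import math

def top_k_probs(logits, temperature=1.0, top_k=None):
    """Temperature-scale logits, optionally keep only the top_k entries,
    and return the normalized sampling distribution."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # tokens below the cutoff get zero probability
        scaled = [s if s >= cutoff else float('-inf') for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_probs(logits, temperature=0.8, top_k=2)
# Only the two most likely tokens keep nonzero probability.
```

Sampling then draws the next token from `probs` (e.g. via `torch.multinomial` in the actual implementation).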
Make sure the model is in eval mode (model.eval()) before calling this method for generation.

from_pretrained
One of: 'gpt2' (124M), 'gpt2-medium' (350M), 'gpt2-large' (774M), or 'gpt2-xl' (1558M)
Optional arguments to override. Currently only dropout can be overridden.

crop_block_size
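The four checkpoint names correspond to fixed architecture settings from the GPT-2 family. A sketch of the mapping applied when initializing from a checkpoint:

```python
# Architecture hyperparameters for each supported GPT-2 checkpoint name.
GPT2_CONFIGS = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}
# All four checkpoints share vocab_size=50257 and block_size=1024.
```

Note that every variant uses a head dimension of 64 (n_embd / n_head).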
New block size (must be ≤ current block_size)
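Conceptually, shrinking the block size amounts to keeping only the first new_block_size rows of the position-embedding table (the real method also trims the cached causal mask). A toy sketch of that idea, using plain lists rather than tensors:

```python
def crop_wpe(wpe, new_block_size):
    """Keep only the first new_block_size position-embedding rows.
    wpe is a list of per-position embedding vectors (block_size x n_embd)."""
    assert new_block_size <= len(wpe), "can only shrink the context window"
    return wpe[:new_block_size]

wpe = [[0.0] * 8 for _ in range(1024)]  # toy table: block_size=1024, n_embd=8
cropped = crop_wpe(wpe, 256)            # model now supports contexts up to 256
```

This is useful when loading a pretrained checkpoint (block_size 1024) but running with a smaller context to save memory and compute.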
get_num_params
If True (the default), excludes position embeddings from the count. Token embeddings are still counted, since weight tying reuses them in the final lm_head layer, whereas position embeddings are not.
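The count can be reproduced from the architecture alone. A sketch for the 124M 'gpt2' configuration (vocab_size=50257, block_size=1024, n_layer=12, n_embd=768, bias=True; the lm_head weight is tied to wte, so it adds no parameters):

```python
n_embd, n_layer, vocab, block = 768, 12, 50257, 1024

wte = vocab * n_embd                    # token embeddings (tied with lm_head)
wpe = block * n_embd                    # position embeddings
per_block = (
    2 * n_embd                          # ln_1 (weight + bias)
    + n_embd * 3 * n_embd + 3 * n_embd  # attn.c_attn (QKV projection)
    + n_embd * n_embd + n_embd          # attn.c_proj
    + 2 * n_embd                        # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd  # mlp.c_fc
    + 4 * n_embd * n_embd + n_embd      # mlp.c_proj
)
ln_f = 2 * n_embd                       # final layer norm

total = wte + wpe + n_layer * per_block + ln_f       # 124,439,808
non_embedding = total - wpe                          # 123,653,376 (~123.65M)
```

`non_embedding` is what get_num_params reports with the default argument.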
configure_optimizers
Weight decay coefficient (typically 0.1)
Learning rate for the optimizer
Beta coefficients for AdamW (typically (0.9, 0.95))
Device type ('cuda' or 'cpu'), used to determine whether the fused AdamW implementation is available
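The optimizer splits parameters into two groups by tensor rank: tensors with 2 or more dimensions (matmul weights, embeddings) receive weight decay, while 1-D tensors (biases, LayerNorm parameters) do not. A sketch of that split, operating on parameter shapes for illustration:

```python
def split_decay_groups(param_shapes):
    """Partition parameter names by tensor rank: >=2-D tensors get weight decay."""
    decay = [n for n, shape in param_shapes.items() if len(shape) >= 2]
    no_decay = [n for n, shape in param_shapes.items() if len(shape) < 2]
    return decay, no_decay

shapes = {
    'wte.weight': (50304, 768),        # embedding matrix: decayed
    'h.0.attn.c_attn.weight': (2304, 768),  # attention weight: decayed
    'h.0.attn.c_attn.bias': (2304,),   # bias: not decayed
    'ln_f.weight': (768,),             # LayerNorm weight: not decayed
}
decay, no_decay = split_decay_groups(shapes)
```

The two groups are then passed to AdamW with weight_decay and 0.0 respectively.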
estimate_mfu
Number of forward-backward passes per iteration (typically batch_size * gradient_accumulation_steps)
Time delta in seconds for the iteration
MFU calculation is based on the PaLM paper (Appendix B). A100 bfloat16 peak FLOPS is assumed to be 312 TFLOPS.
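The PaLM Appendix B estimate is flops per token ≈ 6N + 12·L·H·Q·T, where N is the parameter count, L the number of layers, H the number of heads, Q the head dimension, and T the sequence length. A sketch of the full calculation for the 124M configuration (the iteration numbers below are illustrative placeholders):

```python
N = 123_653_376               # non-embedding parameter count ('gpt2')
L, H, Q, T = 12, 12, 64, 1024 # layers, heads, head dim, context length

flops_per_token = 6 * N + 12 * L * H * Q * T
flops_per_fwdbwd = flops_per_token * T   # one forward-backward over T tokens
fwdbwd_per_iter = 40                     # e.g. batch 8 x grad accum 5 (example)
dt = 1.0                                 # measured seconds per iteration (example)

flops_achieved = flops_per_fwdbwd * fwdbwd_per_iter / dt
mfu = flops_achieved / 312e12            # fraction of A100 bfloat16 peak
```

An MFU around 0.3-0.5 is typical of well-tuned transformer training on A100s; values near 1.0 are unattainable in practice.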
GPTConfig
Dataclass containing model architecture configuration.

Maximum sequence length / context window size
Vocabulary size. Default is GPT-2's 50257 rounded up to the nearest multiple of 64 (50304) for efficiency.
Number of transformer blocks
Number of attention heads
Embedding dimension size
Dropout probability. Use 0.0 for pretraining, try 0.1+ for finetuning.
Whether to use bias in Linear and LayerNorm layers. True matches GPT-2, False is slightly better and faster.
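The fields above correspond to a dataclass along these lines (defaults as documented; a sketch, not necessarily field-for-field identical to the source):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max sequence length / context window
    vocab_size: int = 50304  # GPT-2's 50257 padded up to a multiple of 64
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension
    dropout: float = 0.0     # 0.0 for pretraining; try 0.1+ for finetuning
    bias: bool = True        # True matches GPT-2; False is slightly better/faster

config = GPTConfig(n_layer=6, n_embd=384)  # override defaults for a smaller model
```

Because it is a dataclass, any subset of fields can be overridden by keyword when constructing a model.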