GPT class

The main GPT Language Model implementation. This class provides the complete transformer architecture with support for training, inference, and text generation.

Constructor

GPT(config)
Initializes a new GPT model instance.
config
GPTConfig
required
Configuration object specifying model architecture parameters

Attributes

config
GPTConfig
The configuration object containing model hyperparameters
transformer
nn.ModuleDict
Container with the following components:
  • wte: Token embedding layer (vocab_size x n_embd)
  • wpe: Position embedding layer (block_size x n_embd)
  • drop: Dropout layer
  • h: ModuleList of transformer blocks
  • ln_f: Final layer normalization
lm_head
nn.Linear
Language model head that projects embeddings to vocabulary logits (n_embd x vocab_size, without bias). Its weight is tied to the token embedding wte.

Methods

forward

forward(idx, targets=None)
Performs a forward pass through the model.
idx
torch.LongTensor
required
Input token indices of shape (batch_size, sequence_length)
targets
torch.LongTensor
Target token indices of shape (batch_size, sequence_length). If provided, loss is computed with cross-entropy; positions set to -1 are ignored.
Returns: Tuple of (logits, loss)
  • logits: Predicted logits of shape (batch_size, sequence_length, vocab_size) during training, or (batch_size, 1, vocab_size) during inference
  • loss: Cross-entropy loss if targets are provided, otherwise None
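The training-time loss computation inside forward can be sketched with stand-in tensors (the shapes are illustrative; the real logits come from the transformer):

```python
import torch
import torch.nn.functional as F

B, T, V = 4, 8, 50304  # batch size, sequence length, vocab size (illustrative)

# Stand-in for the logits the model would produce during training
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# forward() flattens the batch and time dimensions before cross-entropy;
# target positions equal to -1 are ignored (ignore_index=-1)
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1), ignore_index=-1)
```

With random logits the loss is close to log(V) ≈ 10.8, the entropy of a uniform distribution over the vocabulary.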

generate

@torch.no_grad()
generate(idx, max_new_tokens, temperature=1.0, top_k=None)
Generates new tokens autoregressively from a conditioning sequence.
idx
torch.LongTensor
required
Conditioning sequence of token indices with shape (batch_size, sequence_length)
max_new_tokens
int
required
Number of new tokens to generate
temperature
float
default:"1.0"
Sampling temperature. Values < 1.0 make the model more confident (less random), values > 1.0 make it more diverse (more random).
top_k
int
If specified, only the top_k most likely tokens are considered for sampling. Others are set to zero probability.
Returns: torch.LongTensor of shape (batch_size, sequence_length + max_new_tokens) with generated tokens appended
Make sure the model is in eval mode (model.eval()) before calling this method for generation.
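A single sampling step inside the generation loop combines temperature scaling and top-k filtering; the sketch below uses random logits in place of a real model output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, temperature, top_k = 50304, 0.8, 50

# Stand-in for the last-position logits returned by the model
logits = torch.randn(1, vocab_size)

# Scale by temperature, then mask everything below the k-th largest logit
logits = logits / temperature
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = -float('Inf')

# Sample the next token from the remaining distribution; in generate()
# this token is appended to idx and the loop repeats
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)  # shape (1, 1)
```

Masked logits become exact zeros after the softmax, so only the top_k tokens can ever be sampled.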

from_pretrained

@classmethod
from_pretrained(cls, model_type, override_args=None)
Loads a pretrained GPT-2 model from OpenAI/HuggingFace.
model_type
str
required
One of: 'gpt2' (124M), 'gpt2-medium' (350M), 'gpt2-large' (774M), or 'gpt2-xl' (1558M)
override_args
dict
Optional arguments to override. Currently only dropout can be overridden.
Returns: GPT model instance with loaded pretrained weights
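The model_type string selects the architecture; the mapping follows the released GPT-2 family (all checkpoints use vocab_size=50257 and block_size=1024):

```python
# Architecture chosen by from_pretrained for each model_type
config_args = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}

# Typical usage (downloads weights via the HuggingFace transformers library):
# model = GPT.from_pretrained('gpt2', override_args=dict(dropout=0.0))
```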

crop_block_size

crop_block_size(block_size)
Reduces the model’s context length via model surgery.
block_size
int
required
New block size (must be ≤ current block_size)
This method modifies the model in-place by truncating position embeddings and attention bias matrices. Use this when you want to reduce context length, for example when loading GPT-2 (block size 1024) but using a smaller context window.
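The core of the surgery is truncating the position embedding table (the attention bias buffers, where present, are cropped the same way). A minimal sketch of the idea on a stand-in embedding:

```python
import torch
import torch.nn as nn

n_embd, old_block, new_block = 768, 1024, 256

# Stand-in for the model's position embedding table (transformer.wpe)
wpe = nn.Embedding(old_block, n_embd)

# Keep only the first new_block rows; positions beyond the new
# context length can never be indexed afterwards
wpe.weight = nn.Parameter(wpe.weight[:new_block])
print(wpe.weight.shape)  # torch.Size([256, 768])
```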

get_num_params

get_num_params(non_embedding=True)
Counts the total number of parameters in the model.
non_embedding
bool
default:"True"
If True, excludes position embeddings from the count. Token embeddings are still counted because weight tying reuses them as the lm_head projection.
Returns: int - Total number of parameters
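The count can be reproduced by hand for the pretrained GPT-2 (124M) architecture; summing the layer shapes gives the 123.65M non-embedding figure the model reports on load:

```python
# Parameter count for GPT-2 (124M): n_layer=12, n_embd=768, bias=True
n_layer, n_embd = 12, 768
vocab_size, block_size = 50257, 1024  # pretrained GPT-2 sizes

wte = vocab_size * n_embd                  # token embeddings (tied with lm_head)
wpe = block_size * n_embd                  # position embeddings
per_block = (
    2 * n_embd                             # ln_1 (weight + bias)
    + n_embd * 3 * n_embd + 3 * n_embd     # attn.c_attn
    + n_embd * n_embd + n_embd             # attn.c_proj
    + 2 * n_embd                           # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd     # mlp.c_fc
    + 4 * n_embd * n_embd + n_embd         # mlp.c_proj
)
ln_f = 2 * n_embd

total = wte + wpe + n_layer * per_block + ln_f  # 124,439,808
non_embedding = total - wpe                     # 123,653,376 ≈ 123.65M
```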

configure_optimizers

configure_optimizers(weight_decay, learning_rate, betas, device_type)
Creates an AdamW optimizer with weight decay applied only to certain parameters.
weight_decay
float
required
Weight decay coefficient (typically 0.1)
learning_rate
float
required
Learning rate for the optimizer
betas
tuple
required
Beta coefficients for AdamW (typically (0.9, 0.95))
device_type
str
required
Device type ('cuda' or 'cpu'), used to determine whether the fused AdamW kernel is available
Returns: torch.optim.AdamW optimizer
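The grouping rule is dimensional: every parameter tensor with two or more dimensions (matmul weights, embeddings) gets weight decay, while 1-D parameters (biases, norm gains) do not. A minimal sketch on a stand-in module:

```python
import torch
import torch.nn as nn

# Tiny stand-in model: one Linear layer plus one LayerNorm
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))

# Decay every parameter with >= 2 dimensions, leave 1-D params undecayed
params = [p for p in model.parameters() if p.requires_grad]
decay_params = [p for p in params if p.dim() >= 2]
nodecay_params = [p for p in params if p.dim() < 2]

optimizer = torch.optim.AdamW(
    [
        {'params': decay_params, 'weight_decay': 0.1},
        {'params': nodecay_params, 'weight_decay': 0.0},
    ],
    lr=6e-4, betas=(0.9, 0.95),
)
```

Here only the Linear weight is decayed; the Linear bias and both LayerNorm parameters land in the no-decay group.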

estimate_mfu

estimate_mfu(fwdbwd_per_iter, dt)
Estimates Model FLOPs Utilization (MFU) in units of A100 bfloat16 peak FLOPS.
fwdbwd_per_iter
int
required
Number of forward-backward passes per iteration (typically batch_size * gradient_accumulation_steps)
dt
float
required
Time delta in seconds for the iteration
Returns: float - MFU as a ratio (0.0 to 1.0+) where 1.0 represents 100% of A100 peak performance
MFU calculation is based on the PaLM paper (Appendix B). A100 bfloat16 peak FLOPS is assumed to be 312 TFLOPS.
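The arithmetic behind the estimate, worked through for the default 124M configuration (the iteration time dt is illustrative):

```python
# MFU estimate following PaLM Appendix B
n_layer, n_head, n_embd, block_size = 12, 12, 768, 1024
N = 123_653_376                          # non-embedding parameter count
L, H, Q, T = n_layer, n_head, n_embd // n_head, block_size

flops_per_token = 6 * N + 12 * L * H * Q * T
flops_per_fwdbwd = flops_per_token * T
flops_per_iter = flops_per_fwdbwd * 20   # e.g. fwdbwd_per_iter = 20

dt = 0.5                                 # seconds per iteration (illustrative)
flops_achieved = flops_per_iter / dt
mfu = flops_achieved / 312e12            # A100 bfloat16 peak = 312 TFLOPS
```

With these numbers the model achieves roughly 11% of A100 peak throughput.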

GPTConfig

Dataclass containing model architecture configuration.
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True
block_size
int
default:"1024"
Maximum sequence length / context window size
vocab_size
int
default:"50304"
Vocabulary size. Default is GPT-2’s 50257 rounded up to nearest multiple of 64 for efficiency.
n_layer
int
default:"12"
Number of transformer blocks
n_head
int
default:"12"
Number of attention heads
n_embd
int
default:"768"
Embedding dimension size
dropout
float
default:"0.0"
Dropout probability. Use 0.0 for pretraining, try 0.1+ for finetuning.
bias
bool
default:"True"
Whether to use bias in Linear and LayerNorm layers. True matches GPT-2, False is slightly better and faster.
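The dataclass is re-declared below to keep the sketch self-contained (in practice it is imported from the model module); dataclasses.replace is a convenient way to derive variants without mutating the default:

```python
from dataclasses import dataclass, replace

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

# Default 124M-scale configuration
config = GPTConfig()

# A smaller configuration for quick experiments
small = replace(config, n_layer=6, n_head=6, n_embd=384, block_size=256)

# model = GPT(config)  # the config object is what the constructor receives
```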