GPT class
The main GPT Language Model implementation. This class provides the complete transformer architecture with support for training, inference, and text generation.

Constructor
Configuration object specifying model architecture parameters
Attributes
The configuration object containing model hyperparameters
Container with the following components:
wte: Token embedding layer (vocab_size x n_embd)
wpe: Position embedding layer (block_size x n_embd)
drop: Dropout layer
h: ModuleList of transformer blocks
ln_f: Final layer normalization
Language model head that projects embeddings to vocabulary logits (n_embd x vocab_size, without bias)
Methods
forward
Input token indices of shape (batch_size, sequence_length)
Target token indices for computing loss. If provided, loss will be calculated using cross-entropy.
logits: Predicted logits of shape (batch_size, sequence_length, vocab_size) when targets are provided (training), or (batch_size, 1, vocab_size) during inference, where only the last position's logits are computed
loss: Cross-entropy loss if targets are provided, otherwise None
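To make the loss concrete: for each position, the cross-entropy loss is the negative log-likelihood of the target token under the softmax of that position's logits. A minimal pure-Python sketch of the per-position computation (illustrative only; the class itself uses PyTorch's `F.cross_entropy`):

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    log_prob = (logits[target] - m) - math.log(sum(exps))
    return -log_prob

# One position with vocab_size=4: the model strongly favors token 2.
logits = [0.1, 0.2, 3.0, -1.0]
loss_correct = cross_entropy(logits, 2)  # small loss: prediction matches target
loss_wrong = cross_entropy(logits, 3)    # large loss: prediction misses target
```

The reported loss is this quantity averaged over all (batch, position) pairs.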
generate
Conditioning sequence of token indices with shape (batch_size, sequence_length)
Number of new tokens to generate
Sampling temperature. Values < 1.0 make the model more confident (less random), values > 1.0 make it more diverse (more random).
If specified, only the top_k most likely tokens are considered for sampling. Others are set to zero probability.
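The temperature and top_k parameters combine as follows: logits are divided by the temperature, everything outside the top_k highest logits is masked out, and a token is sampled from the softmax of the result. A self-contained sketch of that filtering step (pure Python, not the class's actual PyTorch code):

```python
import math

def top_k_probs(logits, temperature=1.0, top_k=None):
    """Temperature-scale logits, optionally keep only the top_k entries,
    and return the normalized sampling distribution."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # tokens below the cutoff get zero probability
        scaled = [s if s >= cutoff else float('-inf') for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_probs(logits, temperature=0.8, top_k=2)
# Only the two most likely tokens keep nonzero probability.
```

Sampling then draws the next token from `probs` (e.g. via `torch.multinomial` in the actual implementation).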
Make sure the model is in eval mode (model.eval()) before calling this method for generation.

from_pretrained
One of: 'gpt2' (124M), 'gpt2-medium' (350M), 'gpt2-large' (774M), or 'gpt2-xl' (1558M)
Optional arguments to override. Currently only dropout can be overridden.

crop_block_size
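The four checkpoint names correspond to fixed architecture settings from the GPT-2 family. A sketch of the mapping applied when initializing from a checkpoint:

```python
# Architecture hyperparameters for each supported GPT-2 checkpoint name.
GPT2_CONFIGS = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}
# All four checkpoints share vocab_size=50257 and block_size=1024.
```

Note that every variant uses a head dimension of 64 (n_embd / n_head).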
New block size (must be ≤ current block_size)
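Conceptually, shrinking the block size amounts to keeping only the first new_block_size rows of the position-embedding table (the real method also trims the cached causal mask). A toy sketch of that idea, using plain lists rather than tensors:

```python
def crop_wpe(wpe, new_block_size):
    """Keep only the first new_block_size position-embedding rows.
    wpe is a list of per-position embedding vectors (block_size x n_embd)."""
    assert new_block_size <= len(wpe), "can only shrink the context window"
    return wpe[:new_block_size]

wpe = [[0.0] * 8 for _ in range(1024)]  # toy table: block_size=1024, n_embd=8
cropped = crop_wpe(wpe, 256)            # model now supports contexts up to 256
```

This is useful when loading a pretrained checkpoint (block_size 1024) but running with a smaller context to save memory and compute.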
get_num_params
If True (the default), excludes position embeddings from the count. Token embeddings are still counted, since weight tying reuses them in the final lm_head layer, whereas position embeddings are not.
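The count can be reproduced from the architecture alone. A sketch for the 124M 'gpt2' configuration (vocab_size=50257, block_size=1024, n_layer=12, n_embd=768, bias=True; the lm_head weight is tied to wte, so it adds no parameters):

```python
n_embd, n_layer, vocab, block = 768, 12, 50257, 1024

wte = vocab * n_embd                    # token embeddings (tied with lm_head)
wpe = block * n_embd                    # position embeddings
per_block = (
    2 * n_embd                          # ln_1 (weight + bias)
    + n_embd * 3 * n_embd + 3 * n_embd  # attn.c_attn (QKV projection)
    + n_embd * n_embd + n_embd          # attn.c_proj
    + 2 * n_embd                        # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd  # mlp.c_fc
    + 4 * n_embd * n_embd + n_embd      # mlp.c_proj
)
ln_f = 2 * n_embd                       # final layer norm

total = wte + wpe + n_layer * per_block + ln_f       # 124,439,808
non_embedding = total - wpe                          # 123,653,376 (~123.65M)
```

`non_embedding` is what get_num_params reports with the default argument.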
configure_optimizers
Weight decay coefficient (typically 0.1)
Learning rate for the optimizer
Beta coefficients for AdamW (typically (0.9, 0.95))
Device type ('cuda' or 'cpu'), used to determine whether the fused AdamW implementation is available
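The optimizer splits parameters into two groups by tensor rank: tensors with 2 or more dimensions (matmul weights, embeddings) receive weight decay, while 1-D tensors (biases, LayerNorm parameters) do not. A sketch of that split, operating on parameter shapes for illustration:

```python
def split_decay_groups(param_shapes):
    """Partition parameter names by tensor rank: >=2-D tensors get weight decay."""
    decay = [n for n, shape in param_shapes.items() if len(shape) >= 2]
    no_decay = [n for n, shape in param_shapes.items() if len(shape) < 2]
    return decay, no_decay

shapes = {
    'wte.weight': (50304, 768),        # embedding matrix: decayed
    'h.0.attn.c_attn.weight': (2304, 768),  # attention weight: decayed
    'h.0.attn.c_attn.bias': (2304,),   # bias: not decayed
    'ln_f.weight': (768,),             # LayerNorm weight: not decayed
}
decay, no_decay = split_decay_groups(shapes)
```

The two groups are then passed to AdamW with weight_decay and 0.0 respectively.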
estimate_mfu
Number of forward-backward passes per iteration (typically batch_size * gradient_accumulation_steps)
Time delta in seconds for the iteration
MFU calculation is based on the PaLM paper (Appendix B). A100 bfloat16 peak FLOPS is assumed to be 312 TFLOPS.
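The PaLM Appendix B estimate is flops per token ≈ 6N + 12·L·H·Q·T, where N is the parameter count, L the number of layers, H the number of heads, Q the head dimension, and T the sequence length. A sketch of the full calculation for the 124M configuration (the iteration numbers below are illustrative placeholders):

```python
N = 123_653_376               # non-embedding parameter count ('gpt2')
L, H, Q, T = 12, 12, 64, 1024 # layers, heads, head dim, context length

flops_per_token = 6 * N + 12 * L * H * Q * T
flops_per_fwdbwd = flops_per_token * T   # one forward-backward over T tokens
fwdbwd_per_iter = 40                     # e.g. batch 8 x grad accum 5 (example)
dt = 1.0                                 # measured seconds per iteration (example)

flops_achieved = flops_per_fwdbwd * fwdbwd_per_iter / dt
mfu = flops_achieved / 312e12            # fraction of A100 bfloat16 peak
```

An MFU around 0.3-0.5 is typical of well-tuned transformer training on A100s; values near 1.0 are unattainable in practice.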
GPTConfig
Dataclass containing model architecture configuration.

Maximum sequence length / context window size
Vocabulary size. Default is GPT-2's 50257 rounded up to the nearest multiple of 64 (50304) for efficiency.
Number of transformer blocks
Number of attention heads
Embedding dimension size
Dropout probability. Use 0.0 for pretraining, try 0.1+ for finetuning.
Whether to use bias in Linear and LayerNorm layers. True matches GPT-2, False is slightly better and faster.
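The fields above correspond to a dataclass along these lines (defaults as documented; a sketch, not necessarily field-for-field identical to the source):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max sequence length / context window
    vocab_size: int = 50304  # GPT-2's 50257 padded up to a multiple of 64
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension
    dropout: float = 0.0     # 0.0 for pretraining; try 0.1+ for finetuning
    bias: bool = True        # True matches GPT-2; False is slightly better/faster

config = GPTConfig(n_layer=6, n_embd=384)  # override defaults for a smaller model
```

Because it is a dataclass, any subset of fields can be overridden by keyword when constructing a model.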