Block
The Block class represents a single transformer block with a pre-normalization architecture.
Class definition
model.py:94-106
Parameters
Configuration object containing model hyperparameters.
Components
First layer normalization applied before the attention layer.
Multi-head causal self-attention mechanism.
Second layer normalization applied before the MLP layer.
Feedforward network applied after attention.
Architecture
The Block implements pre-normalization with residual connections:
- Apply LayerNorm to input
- Apply attention
- Add residual connection
- Apply LayerNorm
- Apply MLP
- Add residual connection
The block uses pre-normalization (LayerNorm before each sublayer) rather than post-normalization; pre-norm tends to be more stable during training.
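The forward pass described above can be sketched as follows. This is a minimal illustration of the pre-norm residual pattern, not the actual implementation: the attention and MLP sublayers are replaced with plain linear layers as stand-ins for the real CausalSelfAttention and MLP classes in model.py.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block (sketch). The real sublayers are
    CausalSelfAttention and MLP; plain Linear layers stand in here."""
    def __init__(self, n_embd):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.Linear(n_embd, n_embd)  # placeholder for self-attention
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Linear(n_embd, n_embd)   # placeholder for the MLP

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # normalize, attend, add residual
        x = x + self.mlp(self.ln_2(x))   # normalize, transform, add residual
        return x

block = Block(n_embd=8)
y = block(torch.randn(2, 4, 8))  # (batch, sequence, embedding)
```

Note that because each sublayer's output is added back to its input, the block preserves the input shape end to end.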
MLP
The MLP class is a two-layer feedforward network with GELU activation.
Class definition
model.py:78-92
Parameters
Configuration object containing model hyperparameters.
Components
First linear layer that expands dimensionality from n_embd to 4 * n_embd.
Gaussian Error Linear Unit (GELU) activation function.
Second linear layer that projects back down from 4 * n_embd to n_embd.
Dropout layer applied to the output.
Architecture
The MLP follows the standard transformer feedforward network design. The hidden dimension is 4x the embedding dimension, which is standard in transformer architectures.
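A minimal sketch of this design, assuming the config fields n_embd and dropout (the attribute names here are illustrative, not necessarily those used in model.py):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Standard transformer feedforward: expand 4x, GELU, project back, dropout."""
    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)    # expand: n_embd -> 4*n_embd
        self.gelu = nn.GELU()                        # GELU nonlinearity
        self.c_proj = nn.Linear(4 * n_embd, n_embd)  # project: 4*n_embd -> n_embd
        self.dropout = nn.Dropout(dropout)           # dropout on the output

    def forward(self, x):
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))

mlp = MLP(n_embd=8, dropout=0.1)
out = mlp(torch.randn(2, 4, 8))  # input and output share shape (2, 4, 8)
```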
LayerNorm
A custom LayerNorm implementation with an optional bias parameter.
Class definition
model.py:18-27
Parameters
Dimensionality to normalize over (typically config.n_embd).
Whether to include a learnable bias parameter. PyTorch’s standard LayerNorm doesn’t support bias=False.
Components
Learnable scale parameter initialized to ones with shape (ndim,).
Optional learnable bias parameter initialized to zeros with shape (ndim,); set to None if bias=False.
Why custom LayerNorm?
PyTorch’s built-in nn.LayerNorm doesn’t support disabling the bias parameter. This implementation allows you to set bias=False in the config for potentially better performance.
The epsilon value is fixed at 1e-5 for numerical stability during normalization.
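A sketch consistent with the description above: a learnable weight initialized to ones, a bias that becomes None when disabled, and the functional layer_norm with a fixed epsilon of 1e-5. Treat this as an illustration of the pattern rather than the exact code at model.py:18-27.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm with an optional bias, built on F.layer_norm."""
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))                    # scale, shape (ndim,)
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None   # None when bias=False

    def forward(self, x):
        # F.layer_norm accepts bias=None, which nn.LayerNorm (as described
        # above) does not allow; epsilon is fixed at 1e-5.
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)

ln = LayerNorm(8, bias=False)
out = ln(torch.randn(2, 8))  # each row normalized to zero mean, unit variance
```

Dropping the bias simply removes the additive shift after normalization; with weight at its initial value of ones, each output row has mean approximately zero.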