Overview
The MLP module provides multi-layer perceptron implementations for use in neural architectures. The primary implementation is `GatedMLP`, which uses a gating mechanism with activation functions for improved expressiveness.
GatedMLP
A gated multi-layer perceptron that splits the hidden representation into two paths: a value path and a gate path, then multiplies them together.

Architecture
Class Definition
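The class definition is not reproduced on this page. The sketch below is a plausible reconstruction from the documented parameters, not the library's actual source; the argument order and defaults (e.g., `multiple_of=1`) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Sketch of a gated MLP: fc1 produces value and gate paths,
    the activated gate multiplies the value, and fc2 projects back."""

    def __init__(self, in_features, hidden_features=None, out_features=None,
                 activation=F.silu, bias=True, multiple_of=1,
                 device=None, dtype=None):
        super().__init__()
        out_features = out_features if out_features is not None else in_features
        if hidden_features is None:
            hidden_features = int(8 * in_features / 3)  # ~2.67x expansion
        # Round hidden size up to a multiple of `multiple_of` (ceiling division)
        hidden_features = ((hidden_features + multiple_of - 1) // multiple_of) * multiple_of
        self.activation = activation
        # fc1 projects to 2 * hidden_features: one half value, one half gate
        self.fc1 = nn.Linear(in_features, 2 * hidden_features, bias=bias,
                             device=device, dtype=dtype)
        self.fc2 = nn.Linear(hidden_features, out_features, bias=bias,
                             device=device, dtype=dtype)

    def forward(self, x):
        value, gate = self.fc1(x).chunk(2, dim=-1)  # split the 2x projection
        return self.fc2(value * self.activation(gate))
```

Note that the activation is applied only to the gate half before the elementwise multiply, matching the gating mechanism described under Implementation Details.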
Parameters

- `in_features`: Number of input features.
- `hidden_features`: Number of hidden features in the MLP. If `None`, defaults to `int(8 * in_features / 3)`, which is commonly used in Transformer models (approximately 2.67× expansion).
- `out_features`: Number of output features. If `None`, uses `in_features` for a residual-compatible architecture.
- `activation`: Activation function to apply to the gate path. Common choices:
  - `F.silu` (Swish): smooth, non-monotonic activation
  - `F.gelu`: Gaussian Error Linear Unit
  - `F.relu`: Rectified Linear Unit
- `bias`: Whether to include bias terms in the linear layers. Typically set to `False` when using layer normalization before the MLP.
- `multiple_of`: Round `hidden_features` up to a multiple of this value for optimal hardware utilization. Common values are 128 or 256 for GPU efficiency.
- `device`: Device to place tensors on (e.g., `torch.device('cuda')`).
- `dtype`: Data type for tensors (e.g., `torch.float16`, `torch.bfloat16`).

Methods
forward
Parameters
Input tensor of shape `(..., in_features)`. Can handle arbitrary batch dimensions.

Returns

Output tensor of shape `(..., out_features)`.

Usage Examples
Basic Usage
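A minimal usage sketch. The condensed `GatedMLP` definition here is a stand-in so the snippet is self-contained; in practice you would import the real class from the MLP module.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class GatedMLP(nn.Module):  # stand-in definition; use the real import in practice
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 activation=F.silu, bias=True, multiple_of=1, device=None, dtype=None):
        super().__init__()
        out_features = out_features if out_features is not None else in_features
        hidden = hidden_features if hidden_features is not None else int(8 * in_features / 3)
        hidden = ((hidden + multiple_of - 1) // multiple_of) * multiple_of
        self.activation = activation
        self.fc1 = nn.Linear(in_features, 2 * hidden, bias=bias, device=device, dtype=dtype)
        self.fc2 = nn.Linear(hidden, out_features, bias=bias, device=device, dtype=dtype)
    def forward(self, x):
        v, g = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(v * self.activation(g))

mlp = GatedMLP(in_features=256, hidden_features=512)
x = torch.randn(4, 10, 256)   # (batch, sequence, features)
y = mlp(x)
print(y.shape)                # torch.Size([4, 10, 256])
```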
With Default Hidden Size
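A sketch of relying on the default hidden size (the condensed `GatedMLP` is a stand-in definition so the example runs):

```python
import torch, torch.nn as nn, torch.nn.functional as F

class GatedMLP(nn.Module):  # stand-in definition; use the real import in practice
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 activation=F.silu, bias=True, multiple_of=1, device=None, dtype=None):
        super().__init__()
        out_features = out_features if out_features is not None else in_features
        hidden = hidden_features if hidden_features is not None else int(8 * in_features / 3)
        hidden = ((hidden + multiple_of - 1) // multiple_of) * multiple_of
        self.activation = activation
        self.fc1 = nn.Linear(in_features, 2 * hidden, bias=bias, device=device, dtype=dtype)
        self.fc2 = nn.Linear(hidden, out_features, bias=bias, device=device, dtype=dtype)
    def forward(self, x):
        v, g = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(v * self.activation(g))

mlp = GatedMLP(in_features=768)   # hidden defaults to int(8 * 768 / 3) = 2048
print(mlp.fc1.out_features)       # 4096, i.e. 2 * 2048 for the value and gate paths
```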
Different Activation Functions
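A sketch of swapping the gate activation (the condensed `GatedMLP` is a stand-in definition so the example runs):

```python
import torch, torch.nn as nn, torch.nn.functional as F

class GatedMLP(nn.Module):  # stand-in definition; use the real import in practice
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 activation=F.silu, bias=True, multiple_of=1, device=None, dtype=None):
        super().__init__()
        out_features = out_features if out_features is not None else in_features
        hidden = hidden_features if hidden_features is not None else int(8 * in_features / 3)
        hidden = ((hidden + multiple_of - 1) // multiple_of) * multiple_of
        self.activation = activation
        self.fc1 = nn.Linear(in_features, 2 * hidden, bias=bias, device=device, dtype=dtype)
        self.fc2 = nn.Linear(hidden, out_features, bias=bias, device=device, dtype=dtype)
    def forward(self, x):
        v, g = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(v * self.activation(g))

# Any elementwise activation works; it is applied to the gate path only
for act in (F.silu, F.gelu, F.relu):
    mlp = GatedMLP(in_features=128, activation=act)
    y = mlp(torch.randn(2, 128))
    print(act.__name__, y.shape)
```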
Custom Output Dimension
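A sketch with an explicit output dimension, useful when the MLP is not followed by a residual connection (the condensed `GatedMLP` is a stand-in definition so the example runs):

```python
import torch, torch.nn as nn, torch.nn.functional as F

class GatedMLP(nn.Module):  # stand-in definition; use the real import in practice
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 activation=F.silu, bias=True, multiple_of=1, device=None, dtype=None):
        super().__init__()
        out_features = out_features if out_features is not None else in_features
        hidden = hidden_features if hidden_features is not None else int(8 * in_features / 3)
        hidden = ((hidden + multiple_of - 1) // multiple_of) * multiple_of
        self.activation = activation
        self.fc1 = nn.Linear(in_features, 2 * hidden, bias=bias, device=device, dtype=dtype)
        self.fc2 = nn.Linear(hidden, out_features, bias=bias, device=device, dtype=dtype)
    def forward(self, x):
        v, g = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(v * self.activation(g))

mlp = GatedMLP(in_features=512, hidden_features=1024, out_features=256)
y = mlp(torch.randn(8, 512))
print(y.shape)   # torch.Size([8, 256])
```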
Optimized for Hardware
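A sketch of the alignment option, combining `multiple_of` with `bias=False` (the condensed `GatedMLP` is a stand-in definition so the example runs):

```python
import torch, torch.nn as nn, torch.nn.functional as F

class GatedMLP(nn.Module):  # stand-in definition; use the real import in practice
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 activation=F.silu, bias=True, multiple_of=1, device=None, dtype=None):
        super().__init__()
        out_features = out_features if out_features is not None else in_features
        hidden = hidden_features if hidden_features is not None else int(8 * in_features / 3)
        hidden = ((hidden + multiple_of - 1) // multiple_of) * multiple_of
        self.activation = activation
        self.fc1 = nn.Linear(in_features, 2 * hidden, bias=bias, device=device, dtype=dtype)
        self.fc2 = nn.Linear(hidden, out_features, bias=bias, device=device, dtype=dtype)
    def forward(self, x):
        v, g = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(v * self.activation(g))

mlp = GatedMLP(in_features=1024, multiple_of=128, bias=False)
# Default hidden is int(8 * 1024 / 3) = 2730, rounded up to 2816 (22 * 128)
print(mlp.fc2.in_features)   # 2816
```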
Integration with Block
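A hypothetical pre-norm residual block wrapping the MLP; the real `Block` API is not shown on this page, so both `Block` and the condensed `GatedMLP` below are illustrative stand-ins:

```python
import torch, torch.nn as nn, torch.nn.functional as F

class GatedMLP(nn.Module):  # stand-in definition; use the real import in practice
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 activation=F.silu, bias=True, multiple_of=1, device=None, dtype=None):
        super().__init__()
        out_features = out_features if out_features is not None else in_features
        hidden = hidden_features if hidden_features is not None else int(8 * in_features / 3)
        hidden = ((hidden + multiple_of - 1) // multiple_of) * multiple_of
        self.activation = activation
        self.fc1 = nn.Linear(in_features, 2 * hidden, bias=bias, device=device, dtype=dtype)
        self.fc2 = nn.Linear(hidden, out_features, bias=bias, device=device, dtype=dtype)
    def forward(self, x):
        v, g = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(v * self.activation(g))

class Block(nn.Module):  # hypothetical block; the real Block API may differ
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = GatedMLP(in_features=dim, bias=False)  # bias off under pre-norm
    def forward(self, x):
        # Residual add works because out_features defaults to in_features
        return x + self.mlp(self.norm(x))

block = Block(64)
y = block(torch.randn(2, 5, 64))
print(y.shape)   # torch.Size([2, 5, 64])
```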
No MLP (Identity)
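One plausible reading of this option: when a configuration disables the MLP, substitute `nn.Identity` so the surrounding block's forward pass is structurally unchanged.

```python
import torch
import torch.nn as nn

mlp = nn.Identity()   # passes input through unchanged in place of a GatedMLP
x = torch.randn(2, 64)
assert torch.equal(mlp(x), x)
```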
Implementation Details
Gating Mechanism
The gated MLP uses a multiplicative gating mechanism:
- Project input to 2 × hidden_features dimensions
- Split into two equal parts: value and gate
- Apply activation to gate path
- Multiply value by activated gate
- Project back to output dimension
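The steps above can be sketched with plain tensor operations (the weights here are random placeholders, and biases are omitted for brevity):

```python
import torch
import torch.nn.functional as F

in_features, hidden_features = 16, 32
x = torch.randn(4, in_features)
W1 = torch.randn(2 * hidden_features, in_features)  # fc1 weight (no bias shown)
W2 = torch.randn(in_features, hidden_features)      # fc2 weight (no bias shown)

y = x @ W1.t()                     # 1. project to 2 * hidden_features
value, gate = y.chunk(2, dim=-1)   # 2. split into value and gate halves
gated = value * F.silu(gate)       # 3-4. activate gate, multiply with value
out = gated @ W2.t()               # 5. project back to the output dimension
print(out.shape)                   # torch.Size([4, 16])
```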
Hidden Dimension Calculation
The default hidden dimension formula `int(8 * in_features / 3)` gives approximately 2.67× expansion:
- For 768 features: 8 × 768 / 3 = 2048
- For 1024 features: 8 × 1024 / 3 = 2730, rounded to 2816 (with `multiple_of=128`)
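The calculation, including the rounding step, can be checked with a few lines of Python (`default_hidden` is a helper written for this example, not part of the module's API):

```python
def default_hidden(in_features, multiple_of=1):
    """int(8 * in_features / 3), rounded up to a multiple of `multiple_of`."""
    h = int(8 * in_features / 3)
    return ((h + multiple_of - 1) // multiple_of) * multiple_of

print(default_hidden(768))                     # 2048 (already a multiple of 128)
print(default_hidden(1024, multiple_of=128))   # 2730 rounded up to 2816
```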
Memory Alignment
The `multiple_of` parameter ensures hidden dimensions are multiples of the specified value (typically 128 or 256) for optimal GPU memory access patterns and tensor core utilization.
Performance Considerations
- Gating vs. Standard MLP: Gated MLPs typically provide better performance with similar parameter counts
- Activation Choice:
  - SiLU/Swish: smooth, often better gradient flow
  - GELU: similar to SiLU, common in BERT-style models
  - ReLU: fastest, but may have dying neuron issues
- Memory Alignment: Always use `multiple_of` for production models to ensure optimal GPU utilization
- Bias Terms: When using layer normalization, bias can often be disabled (`bias=False`) for efficiency
Notes
- The gating mechanism doubles the number of parameters in `fc1` compared to a standard MLP
- Hidden features are automatically rounded up to the nearest multiple of `multiple_of`
- The activation function is only applied to the gate path, not the value path
- Works with arbitrary input tensor shapes, not just 2D or 3D tensors
