BatchNorm1d
Batch Normalization layer for 2D `(batch, num_features)` or 3D `(batch, num_features, length)` inputs.

Constructor

- `numFeatures` - Number of features (`C` from the input shape)
- `options.eps` - Small constant for numerical stability (default: `1e-5`)
- `options.momentum` - Momentum for running statistics (default: `0.1`)
- `options.affine` - If `true`, learns scale (`gamma`) and shift (`beta`) parameters (default: `true`)
- `options.trackRunningStats` - If `true`, tracks running mean/variance (default: `true`)

Throws `InvalidParameterError` if `numFeatures` is invalid.
Formula

`y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta`

- `E[x]` is the batch mean
- `Var[x]` is the batch variance
- `gamma` and `beta` are learnable parameters (if `affine = true`)
Behavior

Training Mode:
- Uses batch statistics (mean and variance from the current batch)
- Updates running statistics with an exponential moving average

Evaluation Mode:
- Uses running statistics (accumulated during training)
- Provides consistent normalization for single samples
Shape

- Input: 2D `(batch, num_features)` or 3D `(batch, num_features, length)`
- Output: Same shape as input
Properties

- `weight` (gamma) - Learnable scale parameter of shape `(num_features,)`
- `bias` (beta) - Learnable shift parameter of shape `(num_features,)`
- `running_mean` - Running mean buffer of shape `(num_features,)`
- `running_var` - Running variance buffer of shape `(num_features,)`
Example
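As a minimal sketch of what the layer computes, the following standalone function applies the formula above in training mode (batch statistics, no running-stat update). `batchNorm1d` here is not this library's API, just the math:

```typescript
// Hypothetical standalone sketch of the BatchNorm1d forward pass in
// training mode. Normalizes each feature column using batch statistics.
function batchNorm1d(
  x: number[][],   // shape (batch, numFeatures)
  gamma: number[], // learnable scale, shape (numFeatures,)
  beta: number[],  // learnable shift, shape (numFeatures,)
  eps = 1e-5,
): number[][] {
  const batch = x.length;
  const features = x[0].length;
  const out = x.map((row) => row.slice());
  for (let f = 0; f < features; f++) {
    // Batch mean and (biased) variance for this feature column.
    const mean = x.reduce((s, row) => s + row[f], 0) / batch;
    const variance = x.reduce((s, row) => s + (row[f] - mean) ** 2, 0) / batch;
    const invStd = 1 / Math.sqrt(variance + eps);
    for (let b = 0; b < batch; b++) {
      out[b][f] = gamma[f] * (x[b][f] - mean) * invStd + beta[f];
    }
  }
  return out;
}

// Each feature column comes out with (approximately) zero mean and unit variance.
const y = batchNorm1d([[1, 10], [3, 30]], [1, 1], [0, 0]);
```

Note that the statistics are computed per feature across the batch, which is why BatchNorm needs a batch size greater than 1 during training.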
In Neural Networks
Benefits
- Faster Training: Allows higher learning rates
- Reduces Covariate Shift: Normalizes activations
- Regularization: Acts as a mild regularizer
- Gradient Flow: Helps prevent vanishing/exploding gradients
LayerNorm
Layer Normalization. Normalizes across features for each sample independently.

Constructor

- `normalizedShape` - Shape of the normalized dimensions (single number or array)
- `options.eps` - Small constant for numerical stability (default: `1e-5`)
- `options.elementwiseAffine` - If `true`, learns scale and shift (default: `true`)

Throws `InvalidParameterError` if `normalizedShape` is invalid.
Formula

`y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta`

- `E[x]` and `Var[x]` are computed over the normalized dimensions
- Computed independently for each sample (no batch statistics)
Shape

- Input: `(..., *normalized_shape)` - the input must end with the dimensions specified by `normalizedShape`
- Output: Same shape as input
Behavior
- Works the same in training and evaluation modes
- No running statistics needed
- Normalizes each sample independently
- Common in transformers and RNNs
Examples
1D Normalization
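A standalone sketch of the 1D case, where `normalizedShape` is a single number and the statistics are taken over one sample's feature vector. `layerNorm` is not this library's API, only the formula above applied per sample:

```typescript
// Hypothetical sketch: layer normalization over a single sample's features.
function layerNorm(
  x: number[],
  gamma: number[],
  beta: number[],
  eps = 1e-5,
): number[] {
  const n = x.length;
  // Mean and (biased) variance over this sample only - no batch involved.
  const mean = x.reduce((s, v) => s + v, 0) / n;
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  const invStd = 1 / Math.sqrt(variance + eps);
  return x.map((v, i) => gamma[i] * (v - mean) * invStd + beta[i]);
}

// Statistics come from this one sample, so batch size is irrelevant.
const z = layerNorm([1, 2, 3, 4], [1, 1, 1, 1], [0, 0, 0, 0]);
```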
Multi-dimensional Normalization
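When `normalizedShape` is an array, the mean and variance are taken over all of the trailing dimensions together. A hedged sketch (again not this library's API) for a `(rows, cols)` sample, i.e. `normalizedShape = [rows, cols]`:

```typescript
// Hypothetical sketch: normalize over the last two dimensions of one sample.
// Statistics are computed over every element of the (rows, cols) block.
function layerNorm2d(x: number[][], eps = 1e-5): number[][] {
  const flat = x.flat();
  const n = flat.length;
  const mean = flat.reduce((s, v) => s + v, 0) / n;
  const variance = flat.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  const invStd = 1 / Math.sqrt(variance + eps);
  // Affine parameters omitted for brevity (equivalent to gamma = 1, beta = 0).
  return x.map((row) => row.map((v) => (v - mean) * invStd));
}

const m = layerNorm2d([[1, 2], [3, 4]]);
```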
In Transformers
Benefits
- Sample Independence: No batch statistics, works with any batch size
- RNN Friendly: Good for sequences with varying lengths
- Transformer Standard: Used in BERT, GPT, etc.
- Training/Eval Consistency: Same behavior in both modes
Dropout
Dropout regularization layer.

Constructor

- `p` - Probability of an element being zeroed (0 ≤ p < 1)

Throws `InvalidParameterError` if `p` is not in the range [0, 1).
Formula

Training: `y = x * mask / (1 - p)`, where `mask` is a binary tensor with probability (1 - p) of being 1.

Evaluation: `y = x`
Behavior

Training Mode:
- Randomly zeros elements with probability `p`
- Scales remaining elements by `1 / (1 - p)` (inverted dropout)
- Provides regularization

Evaluation Mode:
- Returns input unchanged
- No randomness
Properties
- `dropoutRate: number` - The dropout probability
Examples
Basic Usage
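The behavior above can be sketched as a standalone function; `dropout` is not this library's API, just the inverted-dropout rule:

```typescript
// Hypothetical sketch of inverted dropout.
function dropout(x: number[], p: number, training: boolean): number[] {
  if (!training || p === 0) return x.slice(); // eval mode: identity
  const scale = 1 / (1 - p);
  // Zero each element with probability p, scale survivors by 1 / (1 - p)
  // so the expected activation is unchanged.
  return x.map((v) => (Math.random() < p ? 0 : v * scale));
}

const kept = dropout([1, 1, 1, 1], 0.5, false); // eval: input unchanged
const tr = dropout([1, 1, 1, 1], 0.5, true);    // training: each element is 0 or 2
```

The scaling during training is what lets evaluation mode be a plain identity: no rescaling is needed at test time.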
In Neural Networks
Different Dropout Rates
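One consequence of inverted dropout worth seeing concretely: the survivor scale factor `1 / (1 - p)` grows quickly with the rate, so higher `p` both zeroes more units and amplifies the rest harder:

```typescript
// Survivor scale factor 1 / (1 - p) for a few typical dropout rates.
const rates = [0.2, 0.5, 0.8];
const scales = rates.map((p) => 1 / (1 - p));
// p = 0.2 -> 1.25, p = 0.5 -> 2, p = 0.8 -> approximately 5
```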
Purpose
- Prevents Overfitting: Forces network to learn redundant representations
- Ensemble Effect: Approximates training ensemble of networks
- Robust Features: Prevents co-adaptation of neurons
- Improves Generalization: Better test performance
Best Practices
- Typical Rates: 0.2-0.5 for hidden layers, 0.5 for fully connected
- Not for Convolutions: Usually not applied to CNN layers (use sparingly)
- Training vs Eval: Always remember to set model.train() / model.eval()
- After Activations: Usually applied after activation functions
- Not on Output: Don’t use on final layer
Comparison
| Feature | BatchNorm1d | LayerNorm | Dropout |
|---------|-------------|-----------|---------|
| Normalizes | Across batch | Across features | N/A (zeros) |
| Statistics | Batch & Running | Per sample | N/A |
| Training/Eval | Different | Same | Different |
| Use Case | CNNs, MLPs | Transformers, RNNs | All networks |
| Parameters | gamma, beta | gamma, beta | None |
| Batch Size | Needs > 1 | Works with 1 | Any |

Complete Example
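A hedged end-to-end sketch (not this library's API) of a tiny Linear -> BatchNorm -> ReLU -> Dropout forward pass on a 2-sample batch, with dropout in evaluation mode so the result is deterministic:

```typescript
const EPS = 1e-5;

// Linear: y = x W^T + b, here a 2-feature -> 2-feature layer.
function linear(x: number[][], w: number[][], b: number[]): number[][] {
  return x.map((row) =>
    w.map((wRow, j) => wRow.reduce((s, wij, i) => s + wij * row[i], 0) + b[j]),
  );
}

// BatchNorm in training mode, no affine parameters: normalize each column.
function batchNorm(x: number[][]): number[][] {
  const out = x.map((r) => r.slice());
  for (let f = 0; f < x[0].length; f++) {
    const col = x.map((r) => r[f]);
    const mean = col.reduce((s, v) => s + v, 0) / col.length;
    const variance = col.reduce((s, v) => s + (v - mean) ** 2, 0) / col.length;
    const invStd = 1 / Math.sqrt(variance + EPS);
    x.forEach((r, b) => (out[b][f] = (r[f] - mean) * invStd));
  }
  return out;
}

const relu = (x: number[][]) => x.map((r) => r.map((v) => Math.max(0, v)));

// Dropout in evaluation mode is the identity.
const dropoutEval = (x: number[][]) => x;

const batch = [[1, 2], [3, 4]];
const out = dropoutEval(relu(batchNorm(linear(batch, [[1, 0], [0, 1]], [0, 0]))));
```

Note the ordering mirrors the recommendation below: linear transform, then normalization, then activation, then dropout.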
Tips
- BatchNorm: Use for CNNs and MLPs with batch training
- LayerNorm: Use for transformers, RNNs, and small batch sizes
- Dropout: Use everywhere for regularization, except:
  - Usually not in CNNs (BatchNorm provides regularization)
  - Never in the output layer
  - Not in batch norm layers
- Order: Linear -> Norm -> Activation -> Dropout
- Mode Switching: Always call `model.train()` / `model.eval()` appropriately
See Also
- Linear Layer - Fully connected layers
- Activation Functions - ReLU, etc.
- Convolutional Layers - Conv2d with BatchNorm
- Attention Layers - Transformers with LayerNorm