ReLU
Rectified Linear Unit activation function.
ReLU(x) = max(0, x)
Properties:
- Most widely used activation function
- Helps prevent vanishing gradient problem
- Computationally efficient
- Can cause “dying ReLU” problem (neurons always output 0)
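ReLU can be sketched in a few lines of plain Python (the function name is illustrative). The comment notes why "dying ReLU" happens: the gradient is exactly 0 for x < 0, so a neuron whose pre-activations are always negative stops updating.

```python
# Minimal ReLU sketch. For x < 0 both the output and the gradient are 0,
# which is the mechanism behind the "dying ReLU" problem.
def relu(x):
    return max(0.0, x)

print(relu(3.0), relu(-2.0))  # 3.0 0.0
```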
Sigmoid
Sigmoid activation function.
Sigmoid(x) = 1 / (1 + exp(-x))
Properties:
- Output range: (0, 1)
- Used for binary classification
- Can cause vanishing gradients for extreme values
- Outputs can be interpreted as probabilities
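A direct translation of the formula overflows when exp(-x) is computed for a large negative x. A common remedy is to branch on the sign, as in this sketch (the helper name is illustrative):

```python
import math

# Numerically safe sigmoid: never exponentiate a large positive argument.
def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)

print(sigmoid(0.0))  # 0.5
```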
Tanh
Hyperbolic Tangent activation function.
Tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Properties:
- Output range: (-1, 1)
- Zero-centered (better than sigmoid)
- Still suffers from vanishing gradients
- Common in RNNs and LSTMs
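The "zero-centered" claim follows from the identity tanh(x) = 2 * sigmoid(2x) - 1: tanh is a sigmoid rescaled from (0, 1) to (-1, 1). A quick check of the identity:

```python
import math

# tanh expressed as a shifted, rescaled sigmoid.
def tanh_via_sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-2.0 * x))
    return 2.0 * s - 1.0

print(abs(tanh_via_sigmoid(0.7) - math.tanh(0.7)) < 1e-12)  # True
```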
LeakyReLU
Leaky Rectified Linear Unit activation.
LeakyReLU(x) = max(alpha * x, x)
Parameters:
alpha - Slope for negative values (default: 0.01)
Properties:
- Prevents dying ReLU problem
- Allows small gradient when x < 0
- Common alternative to ReLU
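As a sketch (the function name is illustrative), note that for any alpha < 1 the branching form below is equivalent to max(alpha * x, x):

```python
# LeakyReLU: a small negative slope keeps gradients flowing for x < 0.
def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

print(leaky_relu(5.0), leaky_relu(-5.0))  # 5.0 -0.05
```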
ELU
Exponential Linear Unit activation.
ELU(x) = x if x > 0, else alpha * (exp(x) - 1)
Parameters:
alpha - Scale for negative values (default: 1.0)
Properties:
- Can produce negative outputs
- Pushes mean activations closer to zero
- Smooth function everywhere
- More computationally expensive than ReLU
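A minimal sketch of ELU (the function name is illustrative). Note that for large negative x the output saturates at -alpha rather than growing without bound:

```python
import math

# ELU: identity for x > 0, smooth exponential saturation toward -alpha for x < 0.
def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```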
GELU
Gaussian Error Linear Unit activation.
GELU(x) = x * Phi(x)
Where Phi(x) is the cumulative distribution function of the standard normal distribution.
Properties:
- Used in BERT and GPT models
- Smooth approximation of ReLU
- Better than ReLU for transformers
- State-of-the-art for many NLP tasks
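The exact form can be written with the error function, since Phi(x) = 0.5 * (1 + erf(x / sqrt(2))). Many implementations instead use a tanh-based approximation (popularized by BERT); both are sketched below with illustrative names:

```python
import math

# Exact GELU via the standard normal CDF.
def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Widely used tanh approximation of GELU.
def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))
```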
Softmax
Softmax activation function for multi-class classification.
Softmax(x_i) = exp(x_i) / sum(exp(x_j))
Parameters:
axis - Axis along which to compute softmax (default: -1, last axis)
Properties:
- Converts logits to probability distribution
- Output sums to 1.0
- Used in final layer for classification
- Numerically stable implementation
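The usual stability trick is to subtract the maximum before exponentiating; this leaves the result unchanged (numerator and denominator are scaled by the same factor) but prevents overflow for large logits. A sketch over a plain list:

```python
import math

# Numerically stable softmax: subtract max(xs) so no exp argument exceeds 0.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([1000.0, 1000.0]))  # [0.5, 0.5], no overflow
```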
LogSoftmax
Log Softmax activation function.
LogSoftmax(x_i) = log(exp(x_i) / sum(exp(x_j)))
Parameters:
axis - Axis along which to compute log-softmax (default: -1)
Properties:
- More numerically stable than log(softmax(x))
- Used with NLLLoss for classification
- Prevents numerical underflow
- Preferred over Softmax + Log
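The stability comes from rewriting the formula as x_i - logsumexp(x), computed with the log-sum-exp trick, rather than taking the log of an already-underflowed softmax value. A sketch with an illustrative name:

```python
import math

# Log-softmax via the log-sum-exp trick: shift by max(xs) before exponentiating.
def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]
```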
Softplus
Softplus activation function.
Softplus(x) = log(1 + exp(x))
Properties:
- Smooth approximation of ReLU
- Always positive output
- Differentiable everywhere
- Can cause numerical overflow for large x
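The overflow for large x can be avoided with the identity softplus(x) = max(x, 0) + log1p(exp(-|x|)), which only ever exponentiates non-positive values. A sketch:

```python
import math

# Overflow-safe softplus: exp argument is always <= 0.
def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```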
Swish
Swish (SiLU) activation function.
Swish(x) = x * sigmoid(x)
Properties:
- Also known as SiLU (Sigmoid Linear Unit)
- Self-gated activation
- Outperforms ReLU in some deep networks
- Used in EfficientNet and other architectures
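"Self-gated" means the input gates itself: the sigmoid factor smoothly scales x between 0 and x. A sketch using a branch-stable sigmoid (names are illustrative):

```python
import math

def _sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Swish/SiLU: the input x is gated by sigmoid(x).
def swish(x):
    return x * _sigmoid(x)
```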
Mish
Mish activation function.
Mish(x) = x * tanh(softplus(x))
Properties:
- Self-regularizing
- Smooth and non-monotonic
- Better than ReLU and Swish in some tasks
- More computationally expensive
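Mish composes directly from the definitions above; the extra cost comes from evaluating both softplus and tanh per element. A sketch with an overflow-safe softplus (names are illustrative):

```python
import math

def _softplus(x):
    # log(1 + exp(x)) written so the exp argument is always <= 0
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Mish: x gated by tanh of softplus(x).
def mish(x):
    return x * math.tanh(_softplus(x))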
Choosing an Activation Function
For Hidden Layers:
- ReLU - Default choice, fast and effective
- LeakyReLU - If dying ReLU is a problem
- GELU - For transformers and attention models
- Swish/Mish - For very deep networks
- Tanh - For RNNs and when zero-centered is important
For Output Layers:
- Softmax - Multi-class classification (mutually exclusive)
- Sigmoid - Binary classification or multi-label
- Linear (none) - Regression tasks
- Tanh - Regression with output in [-1, 1]
Example Network
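As an illustration of how these pieces combine, here is a minimal forward pass in plain Python, with no training loop: a small 3-4-2 classifier with a ReLU hidden layer and a softmax output. Layer sizes, weights, and helper names are all illustrative, not part of the library's API.

```python
import math
import random

random.seed(0)

def relu(x):
    return max(0.0, x)

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def linear(x, W, b):
    # W is a list of rows (one row per output unit)
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# randomly initialized 3-4-2 network
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
b2 = [0.0] * 2

def forward(x):
    h = [relu(z) for z in linear(x, W1, b1)]   # hidden layer: ReLU
    return softmax(linear(h, W2, b2))          # output layer: softmax

probs = forward([0.5, -1.2, 3.0])
print(probs, sum(probs))  # two probabilities summing to 1.0
```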
See Also
- Linear Layer - Fully connected layers
- Loss Functions - Training objectives
- Module - Base class for all layers