Overview
SwiGLU is a variant of the Gated Linear Unit (GLU) family that uses the Swish activation function. It has become the standard feedforward activation in modern large language models, replacing traditional activations such as ReLU and GELU.
Papers:
- Shazeer (2020) - GLU Variants Improve Transformer
- Chowdhery et al. (2022) - PaLM: Scaling Language Modeling with Pathways
Mathematical formulation
SwiGLU equation
Given an input x, SwiGLU computes:

SwiGLU(x) = (swish(x W_g) ⊙ (x W_v)) W_o

where:
- W_g: Gate projection (linear layer)
- W_v: Value projection (linear layer)
- W_o: Output projection (linear layer)
- ⊙: Element-wise multiplication (gating)
- swish(x) = x · sigmoid(x): Swish activation function
Breaking down the SwiGLU computation step by step, the key insight is that the swish-activated gate controls, element by element, how much information flows through from the value projection.
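The steps can be sketched in plain Python (a toy, list-based sketch for illustration; real implementations use tensor libraries):

```python
import math

def swish(z):
    # swish(z) = z * sigmoid(z) = z / (1 + e^(-z))
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    # Row-vector (length d) times matrix W (d rows x n cols)
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def swiglu(x, W_g, W_v, W_o):
    g = matvec(x, W_g)                             # step 1: gate pre-activation
    v = matvec(x, W_v)                             # step 2: value projection
    h = [swish(gi) * vi for gi, vi in zip(g, v)]   # step 3: element-wise gating
    return matvec(h, W_o)                          # step 4: output projection
```

With identity weight matrices the gate and value paths coincide, which makes the gating effect easy to inspect by hand.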
Implementation
Modern LLMs implement SwiGLU efficiently using PyTorch operations; the relevant source covers the full implementation, its usage in the decoder, and an efficiency optimization.
layers.py:58-114
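A minimal PyTorch sketch consistent with the description above (class and attribute names are illustrative; the actual code in layers.py:58-114 may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Sketch of a SwiGLU feedforward block."""
    def __init__(self, d_model, hidden_features, bias=False):
        super().__init__()
        # Fused gate + value projection: one wide matmul producing 2*hidden_features
        self.w_gv = nn.Linear(d_model, 2 * hidden_features, bias=bias)
        self.w_o = nn.Linear(hidden_features, d_model, bias=bias)

    def forward(self, x):
        gate, value = self.w_gv(x).chunk(2, dim=-1)
        return self.w_o(F.silu(gate) * value)  # silu == swish with beta=1
```

Fusing the gate and value projections into a single `nn.Linear` is a common efficiency choice: it replaces two matmuls over the input with one.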
Why SwiGLU works
Gating mechanism
The gating operation provides dynamic, input-dependent control over the information that passes through the feedforward block.
Information flow control
The gate learns to selectively pass or block information. Advantages over standard activations:
- ReLU/GELU: Fixed decision based only on activation value
- GLU: Separate pathway (value) can inform gating decision
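A tiny numeric illustration of input-dependent gating (values chosen for illustration only): the same value-pathway activation can be passed or blocked depending on the gate's pre-activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

v = 2.0                        # value-pathway activation
open_gate = swish(3.0) * v     # gate pre-activation strongly positive
shut_gate = swish(-3.0) * v    # gate pre-activation strongly negative
# open_gate passes most of v; shut_gate nearly blocks it
```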
Gradient flow
SwiGLU provides better gradient flow than standard activations because gradients reach the input through two paths:
- Through the gate activation
- Through the value (gated by activation)
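Concretely, with g = x W_g and v = x W_v, the element-wise derivatives of h = swish(g) ⊙ v are:

```latex
\frac{\partial h_i}{\partial g_i} = \operatorname{swish}'(g_i)\, v_i,
\qquad
\frac{\partial h_i}{\partial v_i} = \operatorname{swish}(g_i),
\qquad
\operatorname{swish}'(z) = \sigma(z) + z\,\sigma(z)\bigl(1 - \sigma(z)\bigr)
```

Because σ(z) > 0 everywhere, the value path is never fully blocked, unlike ReLU's hard zero region.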
Capacity and efficiency
SwiGLU balances model capacity with computational cost.

Traditional FFN (with GELU): FFN(x) = GELU(x W_1) W_2, with 2·d·h parameters.

SwiGLU: SwiGLU(x) = (swish(x W_g) ⊙ (x W_v)) W_o, with 3·d·h parameters.

Or equivalently: to match the parameter count of a standard FFN, the SwiGLU hidden dimension is scaled by 2/3 (e.g., (2/3)·4d ≈ 2.67d instead of 4d). For the same parameter count, SwiGLU is slightly more compute-efficient.
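A quick arithmetic check of the 2/3 scaling:

```python
def ffn_params(d, h):
    # Standard FFN: W_1 (d x h) and W_2 (h x d)
    return 2 * d * h

def swiglu_params(d, h):
    # SwiGLU: W_g and W_v (d x h each) plus W_o (h x d)
    return 3 * d * h

d = 768
h_ffn = 4 * d                 # 3072 hidden units in a standard FFN
h_swiglu = (2 * 4 * d) // 3   # 2048: the 4x width scaled by 2/3 for SwiGLU
```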
Empirical results
Performance comparison
From the Shazeer (2020) GLU Variants paper:

| Model | Activation | WikiText-103 PPL | Params | Training Cost |
|---|---|---|---|---|
| Transformer | GELU | 24.2 | 256M | 1.0× |
| Transformer | ReLU | 25.1 | 256M | 1.0× |
| Transformer | GLU | 23.8 | 256M | 1.15× |
| Transformer | SwiGLU | 23.5 | 256M | 1.15× |
| Transformer | GEGLU | 23.6 | 256M | 1.15× |
SwiGLU achieves the best perplexity with only 15% additional compute compared to standard GELU feedforward.
Adoption in modern LLMs
SwiGLU has been adopted by state-of-the-art models:
- PaLM (540B, Google, 2022): First major model to use SwiGLU at scale
- LLaMA (7B-65B, Meta, 2023): Uses SwiGLU exclusively
- LLaMA 2 (7B-70B, Meta, 2023): Continues with SwiGLU
- Falcon (7B-180B, TII, 2023): SwiGLU variant
- Mistral (7B, Mistral AI, 2023): SwiGLU
Configuration
Hidden dimension size
The hidden dimension is typically 4× the model dimension. Common ratios of hidden_features / d_model:

| Ratio | Use case | Example |
|---|---|---|
| 2× | Small models, efficiency | d=256, h=512 |
| 2.67× | Matched to standard FFN | d=768, h=2048 |
| 4× | Standard (most common) | d=768, h=3072 |
| 5.33× | High capacity | d=768, h=4096 |
| 8× | Maximum capacity | d=768, h=6144 |

Why 4×?
- Historical: Inherited from the original Transformer paper
- Empirical: Works well across many tasks and scales
- Balance: Good trade-off between capacity and efficiency
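The sizing rule can be written as a small helper (a hypothetical function; the round-up-to-a-multiple scheme follows the pattern used in the LLaMA codebase, which is an assumption here):

```python
def swiglu_hidden_size(d_model, ratio=4, multiple_of=256):
    # Scale the classic ratio*d width by 2/3 (SwiGLU has three matrices
    # instead of two), then round up to a hardware-friendly multiple.
    h = (2 * ratio * d_model) // 3
    return multiple_of * ((h + multiple_of - 1) // multiple_of)
```

For d_model=4096 this yields 11008, matching the published LLaMA-7B FFN width.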
Bias terms
SwiGLU can be used with or without bias terms. Modern trend: most recent LLMs (LLaMA, PaLM) use bias=False in all linear layers, including SwiGLU. This:
- Reduces parameters by ~0.1%
- Slightly simplifies optimization
- Has negligible impact on final performance
Computational cost
FLOPs analysis
For sequence length s, model dimension d, and hidden dimension h, each of the three projections (gate, value, output) costs about 2·s·d·h FLOPs, so a SwiGLU layer costs roughly 6·s·d·h FLOPs per forward pass; the element-wise swish and gating are negligible by comparison. A standard two-matrix FFN with the same h costs about 4·s·d·h.
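As a sketch, counting each matmul of shape (s×m)·(m×n) as 2·s·m·n FLOPs:

```python
def swiglu_flops(s, d, h):
    gate = 2 * s * d * h     # x @ W_g
    value = 2 * s * d * h    # x @ W_v
    out = 2 * s * h * d      # gated hidden @ W_o
    return gate + value + out   # = 6*s*d*h; element-wise ops are negligible
```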
Memory usage
Per-layer memory for SwiGLU:

| Component | Parameters | Activations (per token) |
|---|---|---|
| Gate + value projections (fused) | d × 2h | 2h |
| Output projection | h × d | d |
| Total | d(2h + h) = 3dh | 2h + d |
Example with d=768, h=3072:
- Parameters: 3 × 768 × 3072 = 7,077,888 ≈ 7.1M per layer
- Activations: 2 × 3072 + 768 = 6,912 values per token
Common issues and solutions
NaN losses with SwiGLU
Symptoms: Training loss becomes NaN after some steps.

Potential causes:
- Weight initialization too large
- Learning rate too high
- Gradient explosion in deep networks
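Two illustrative remedies in PyTorch (a sketch, not taken from the original source): scale down the weight initialization and clip gradients each step.

```python
import torch
import torch.nn as nn

layer = nn.Linear(768, 3072, bias=False)
# 1. Conservative initialization (smaller std than the PyTorch default)
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

# 2. Gradient clipping after backward, before the optimizer step
loss = layer(torch.randn(4, 768)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
```

Lowering the learning rate (often with a longer warmup) addresses the remaining cause listed above.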
Memory overflow
Error: CUDA out of memory during training.

SwiGLU-specific consideration: SwiGLU uses 1.5× the parameters of a standard FFN at the same hidden size.

Solutions:
- Reduce the hidden dimension
- Use activation checkpointing
- Reduce the batch size
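Activation checkpointing can be sketched with torch.utils.checkpoint: the 2h-wide hidden activations are recomputed during backward instead of being stored (tensor shapes here are toy values):

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def swiglu_ffn(x, w_gv, w_o):
    gate, value = (x @ w_gv).chunk(2, dim=-1)
    return (F.silu(gate) * value) @ w_o

x = torch.randn(2, 8, requires_grad=True)
w_gv = torch.randn(8, 32, requires_grad=True)   # fused gate+value weights
w_o = torch.randn(16, 8, requires_grad=True)
# Recompute swiglu_ffn's intermediates in backward instead of storing them
y = checkpoint(swiglu_ffn, x, w_gv, w_o, use_reentrant=False)
y.sum().backward()
```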
Slow training
Issue: Training is slower than expected with SwiGLU.

Check: Are you using the optimized implementation?

Additional optimizations:
- Use torch.compile() (PyTorch 2.0+)
- Enable CUDA graphs for static shapes
- Use fused kernels (e.g., the xFormers library)
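One common optimization is fusing the gate and value projections into a single wide matmul; a sketch of the naive and fused forms, which compute the same result:

```python
import torch
import torch.nn.functional as F

def swiglu_two_matmul(x, w_g, w_v, w_o):
    # Naive: two separate input projections
    return (F.silu(x @ w_g) * (x @ w_v)) @ w_o

def swiglu_fused(x, w_gv, w_o):
    # Optimized: one wide matmul, then split (fewer kernel launches)
    gate, value = (x @ w_gv).chunk(2, dim=-1)
    return (F.silu(gate) * value) @ w_o
```

torch.compile() and fused-kernel libraries further merge the remaining element-wise silu and multiply into fewer kernels.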
Variants and extensions
GeGLU
Replace Swish with the GELU activation: GeGLU(x) = (GELU(x W_g) ⊙ (x W_v)) W_o. Performance: slightly worse than SwiGLU (~0.1 PPL) but still better than a standard GELU FFN.
ReGLU
Replace Swish with ReLU (most efficient): ReGLU(x) = (ReLU(x W_g) ⊙ (x W_v)) W_o. Performance: slightly worse than SwiGLU/GeGLU but fastest to compute.
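The GLU family differs only in the choice of gate activation, which a sketch makes explicit:

```python
import torch
import torch.nn.functional as F

GLU_VARIANTS = {
    "swiglu": F.silu,       # swish gate (silu == swish with beta=1)
    "geglu": F.gelu,        # GELU gate
    "reglu": F.relu,        # ReLU gate, cheapest to compute
    "glu": torch.sigmoid,   # original GLU gate
}

def glu_ffn(x, w_g, w_v, w_o, variant="swiglu"):
    act = GLU_VARIANTS[variant]
    return (act(x @ w_g) * (x @ w_v)) @ w_o
```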
Gated MoE
Combine SwiGLU with Mixture of Experts routing, so each token is processed by only a subset of gated feedforward experts. Used in models like Switch Transformer and GLaM.
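A toy sketch of SwiGLU experts behind a top-1 router (illustrative only; production MoE systems use batched dispatch, capacity limits, and load-balancing losses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpertMoE(nn.Module):
    """Toy MoE: a router picks one SwiGLU expert per token (Switch-style)."""
    def __init__(self, d, h, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d, n_experts, bias=False)
        # Per-expert fused gate+value weights and output weights
        self.w_gv = nn.Parameter(torch.randn(n_experts, d, 2 * h) * 0.02)
        self.w_o = nn.Parameter(torch.randn(n_experts, h, d) * 0.02)

    def forward(self, x):                            # x: (tokens, d)
        probs = F.softmax(self.router(x), dim=-1)
        idx = probs.argmax(dim=-1)                   # top-1 expert per token
        out = torch.zeros_like(x)
        for e in range(self.w_gv.shape[0]):
            mask = idx == e
            if mask.any():
                gate, value = (x[mask] @ self.w_gv[e]).chunk(2, dim=-1)
                out[mask] = (F.silu(gate) * value) @ self.w_o[e]
        return out * probs.gather(-1, idx[:, None])  # scale by router prob
```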
References
GLU Variants Improve Transformer
Shazeer, 2020 - Original SwiGLU paper with extensive comparisons
PaLM: Scaling Language Modeling with Pathways
Chowdhery et al., 2022 - First major deployment of SwiGLU
LLaMA: Open and Efficient Foundation Language Models
Touvron et al., 2023 - SwiGLU in open-source LLMs
Searching for Activation Functions
Ramachandran et al., 2018 - Discovery of Swish activation
See also
Architecture overview
Learn about the full model architecture
RMSNorm
Efficient normalization that pairs well with SwiGLU
Configuration
Configure FFN hidden size and other hyperparameters