Overview
S4D (Structured State Space - Diagonal) is a simplified variant of S4 that uses a diagonal parameterization instead of the DPLR (Diagonal Plus Low-Rank) structure. This makes it faster and easier to implement while maintaining competitive performance. The key difference from S4 is that S4D uses a pure diagonal matrix for A, eliminating the low-rank correction term P.
Paper Reference
On the Parameterization and Initialization of Diagonal State Space Models
Original implementation: https://github.com/state-spaces/s4
Installation
Parameters
Model dimension - size of input and output features.
State dimension (N). Internal dimension of the SSM state space.
Maximum sequence length for the kernel. Required for FFT convolution.
Number of channels/heads for multi-headed processing.
Reduce dimension of inner layer. If specified, adds input projection.
Add multiplicative gating for enhanced expressiveness.
Activation after final linear layer: 'glu', 'id', or None.
Dropout probability.
Tie dropout mask across sequence length.
Input format: (B, H, L) if True, (B, L, H) if False.
SSM Configuration
Minimum timestep value.
Maximum timestep value.
Tie dt across all channels.
Transform applied to dt parameter.
Rank parameter (kept for compatibility, unused in diagonal mode).
Number of independent SSMs. Defaults to d_model.
Initialization method for A matrix: 'legs', 'hippo', etc.
Use deterministic initialization.
Transform for real part of A.
Transform for imaginary part of A.
Use real-valued (instead of complex) SSM.
S4D-specific: Discretization method. Options: 'zoh', 'bilinear'.
Learning rate for SSM parameters.
Weight decay for SSM parameters.
Print initialization information.
Usage Example
Basic Usage
With Custom Discretization
Autoregressive Inference
Multi-Channel Configuration
Key Differences from S4
Diagonal Parameterization
S4D uses a pure diagonal matrix:
- ✅ Faster computation
- ✅ Simpler implementation
- ✅ Easier to tune
- ⚠️ Slightly less expressive than DPLR (the gap is usually negligible)
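For reference, the two parameterizations of the state matrix can be written side by side. The symbols below are assumptions that follow the notation of the S4 papers rather than anything defined on this page: Λ is the diagonal part with entries λ_n, and P is the low-rank correction factor that S4D drops.

```latex
A_{\text{S4}} = \Lambda - P P^{*},
\qquad
A_{\text{S4D}} = \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N)
```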
Discretization Options
S4D explicitly supports multiple discretization methods via the disc parameter:
- ZOH (Zero-Order Hold): Default, good general-purpose discretization
- Bilinear: Better frequency response preservation
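The two discretization rules listed above have simple closed forms when A is diagonal, because every operation becomes elementwise. The following is a minimal NumPy sketch, not this library's implementation; the function name `discretize_diagonal` and its argument layout are hypothetical and chosen only for illustration.

```python
import numpy as np

def discretize_diagonal(A, B, dt, disc="zoh"):
    """Discretize the diagonal SSM x' = A x + B u with step size dt.

    A, B: arrays of shape (N,) holding the diagonal of A and the entries of B.
    Returns (A_bar, B_bar), each a complex array of shape (N,).
    """
    A = np.asarray(A, dtype=complex)
    B = np.asarray(B, dtype=complex)
    if disc == "zoh":
        # Zero-order hold: A_bar = exp(dt*A), B_bar = (exp(dt*A) - 1) / A * B.
        # Requires every diagonal entry of A to be nonzero.
        A_bar = np.exp(dt * A)
        B_bar = (A_bar - 1.0) / A * B
    elif disc == "bilinear":
        # Bilinear (Tustin): A_bar = (1 + dt*A/2) / (1 - dt*A/2),
        #                    B_bar = dt * B / (1 - dt*A/2).
        denom = 1.0 - 0.5 * dt * A
        A_bar = (1.0 + 0.5 * dt * A) / denom
        B_bar = dt * B / denom
    else:
        raise ValueError(f"unknown discretization: {disc!r}")
    return A_bar, B_bar

# Example: a small diagonal system with stable (negative real part) eigenvalues.
A = -0.5 + 1j * np.pi * np.arange(4)
B = np.ones(4)
A_zoh, B_zoh = discretize_diagonal(A, B, dt=0.01, disc="zoh")
A_bil, B_bil = discretize_diagonal(A, B, dt=0.01, disc="bilinear")
```

For small dt the two rules agree closely; they diverge as |dt * A| grows, which is where the choice of method starts to matter.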
Architecture Details
Forward Pass
Same structure as S4:
- Optional bottleneck/gating
- FFT-based SSM convolution (training)
- D skip connection
- GELU activation
- Output projection
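The convolution and skip-connection steps in the list above can be sketched for a single channel in NumPy. This is an illustrative sketch under stated assumptions, not the library's code: the helper names (`s4d_kernel`, `fft_conv`, `ssm_recurrence`) are hypothetical, the input is real-valued, the output is taken as the real part of C·x, and the gating, GELU, and output projection steps are omitted. The step-by-step recurrence is included only to check that it produces the same output as the FFT convolution.

```python
import numpy as np

def s4d_kernel(A_bar, B_bar, C, L):
    """Convolution kernel K[l] = Re(sum_n C_n * B_bar_n * A_bar_n**l), l = 0..L-1."""
    powers = A_bar[:, None] ** np.arange(L)[None, :]  # (N, L) Vandermonde matrix
    return ((C * B_bar) @ powers).real

def fft_conv(u, K):
    """Causal convolution of u with K via FFT, zero-padded to avoid wrap-around."""
    L = u.shape[-1]
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(K, n=n), n=n)
    return y[..., :L]

def ssm_recurrence(A_bar, B_bar, C, u):
    """Same SSM evaluated one step at a time: x_k = A_bar*x_{k-1} + B_bar*u_k."""
    x = np.zeros_like(A_bar)
    ys = []
    for u_k in u:
        x = A_bar * x + B_bar * u_k
        ys.append((C @ x).real)
    return np.array(ys)

rng = np.random.default_rng(0)
N, L, dt = 8, 64, 0.05
A = -0.5 + 1j * np.pi * np.arange(N)   # diagonal of A (illustrative values)
A_bar = np.exp(dt * A)                 # ZOH discretization with B = 1
B_bar = (A_bar - 1.0) / A
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
D = 1.0                                # skip-connection weight
u = rng.standard_normal(L)

y_conv = fft_conv(u, s4d_kernel(A_bar, B_bar, C, L)) + D * u
y_rec = ssm_recurrence(A_bar, B_bar, C, u) + D * u
print(np.allclose(y_conv, y_rec))
```

The convolutional form processes the whole sequence at once, which suits training; the recurrent form keeps only the N-dimensional state and suits step-by-step inference.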
Diagonal SSM
The diagonal structure enables efficient computation:
Initialization
- A: Diagonal complex matrix with HiPPO/LEGS initialization
- B, C: Random Gaussian scaled by dimensions
- dt: Log-spaced between dt_min and dt_max
- D: Random initialization
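Two of the initializations above can be sketched in NumPy. This is a hedged illustration, not the library's code. For A it uses the S4D-Lin rule from the referenced paper (A_n = -1/2 + iπn) as a simple stand-in for the HiPPO-based options. For dt it takes one reading of "log-spaced", namely sampling uniformly in log space. The function names and the default values of `dt_min` and `dt_max` are assumptions made for this example.

```python
import numpy as np

def init_s4d_lin(N):
    """S4D-Lin diagonal: A_n = -1/2 + i*pi*n for n = 0..N-1."""
    return -0.5 + 1j * np.pi * np.arange(N)

def init_log_dt(H, dt_min=1e-3, dt_max=1e-1, rng=None):
    """Sample log(dt) uniformly between log(dt_min) and log(dt_max), one per channel."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(np.log(dt_min), np.log(dt_max), size=H)

A = init_s4d_lin(4)
log_dt = init_log_dt(16, rng=np.random.default_rng(0))
dt = np.exp(log_dt)  # per-channel step sizes within [dt_min, dt_max]
```

Storing log(dt) rather than dt keeps the step size positive when the parameter is trained.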
Performance Comparison
| Model | Speed | Memory | Performance |
|---|---|---|---|
| S4 | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| S4D | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
Performance Tips
The disc parameter allows you to choose the discretization method. ZOH is recommended for most tasks, but bilinear can be better for frequency-domain applications.
When to Use S4D
✅ Use S4D when:
- You want a simple, efficient SSM
- Training speed is important
- You don’t need the extra expressiveness of DPLR
- You’re working with regular, fixed-interval sequences
❌ Consider an alternative when:
- You need input-dependent dynamics → Use Mamba
- You need ultra-minimal parameters → Use LRU
- You want to experiment with DPLR → Use S4
