Overview
S4 (Structured State Space Sequence model) is a foundational linear time-invariant (LTI) model that uses a DPLR (Diagonal Plus Low-Rank) parameterization for efficient computation. It employs complex diagonal state matrices with low-rank corrections to capture long-range dependencies while remaining computationally efficient. S4 uses FFT-based convolution for parallel training and supports efficient autoregressive inference through a recurrent mode.
Paper Reference
Efficiently Modeling Long Sequences with Structured State Spaces
Original implementation: https://github.com/state-spaces/s4
Installation
Parameters
- Model dimension: size of the input and output features.
- State dimension (N): internal dimension of the SSM state space. Higher values increase capacity but also computation.
- Maximum sequence length for the kernel. Must be specified for FFT convolution mode.
- Number of channels/heads, allowing multi-headed SSM processing.
- Bottleneck: reduce the dimension of the inner layer (e.g., as used in GSS). If specified, adds an input linear projection.
- Gating: add multiplicative gating (e.g., as used in GSS), creating a gated pathway for enhanced expressiveness.
- Activation after the final linear layer: 'glu', 'id' (no activation), or None (no linear layer).
- Dropout probability.
- Tie the dropout mask across the sequence length, emulating nn.Dropout1d.
- Backbone axis ordering: (B, H, L) if True, (B, L, H) if False.
SSM Configuration
- Minimum value for dt (timestep) initialization.
- Maximum value for dt initialization.
- Tie dt across channels: use the same timestep for all channels.
- Transformation applied to dt: 'exp', 'softplus', etc.
- Fast dt initialization mode.
- Rank of the low-rank correction in the DPLR parameterization.
- Number of independent SSMs; defaults to d_model if not specified.
- Initialization method for the A matrix: 'legs', 'lin', 'hippo', etc.
- Use deterministic initialization for reproducibility.
- Transformation for the real part of the A matrix.
- Transformation for the imaginary part of the A matrix.
- Whether to use real-valued SSMs instead of complex ones.
- Learning rate specific to the SSM parameters; useful for differential learning rates.
- Weight decay specific to the SSM parameters.
- Print initialization information during setup.
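The dt transformation options listed above map an unconstrained learned parameter to a strictly positive timestep. As a minimal numpy sketch of what the named options conventionally denote (the helper `transform_dt` is illustrative, not the library's function):

```python
import numpy as np

def transform_dt(dt_raw, mode="exp"):
    """Map an unconstrained parameter to a positive timestep dt."""
    if mode == "exp":
        # exp is always positive; dt_raw lives in log-space
        return np.exp(dt_raw)
    if mode == "softplus":
        # softplus(x) = log(1 + exp(x)) is positive and smooth near zero
        return np.log1p(np.exp(dt_raw))
    raise ValueError(f"unknown dt transform: {mode}")
```

Both transforms guarantee dt > 0 even for large negative raw values, which keeps the discretized system well defined.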
Usage Example
Basic Usage
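The original code block for this example is not preserved in this page, and the repo's exact module API is not reproduced here. As a self-contained numpy sketch of what the convolution-mode forward pass computes for one channel (the function name `s4_conv_forward` and the parameter choices are illustrative, not the library's API):

```python
import numpy as np

def s4_conv_forward(u, A, B, C, D, dt):
    """Minimal diagonal-SSM forward pass in convolution mode.

    u: (L,) input; A: (N,) complex diagonal state matrix;
    B, C: (N,) input/output vectors; D: scalar skip; dt: timestep.
    """
    L = len(u)
    # ZOH-style discretization of the diagonal system
    dA = np.exp(dt * A)                    # discretized A, (N,)
    dB = (dA - 1.0) / A * B                # discretized B, (N,)
    # Materialize the convolution kernel K[k] = C . dA^k . dB
    K = np.array([(C * dA**k * dB).sum().real for k in range(L)])
    # Causal convolution via FFT (zero-pad to 2L to avoid wrap-around)
    y = np.fft.irfft(np.fft.rfft(K, 2 * L) * np.fft.rfft(u, 2 * L))[:L]
    return y + D * u                       # D skip connection

rng = np.random.default_rng(0)
N, L = 16, 64
A = -0.5 + 1j * np.pi * np.arange(N)       # stable diagonal state matrix
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
u = rng.standard_normal(L)
y = s4_conv_forward(u, A, B, C, 0.1, dt=1.0 / L)
```

The real layer batches this over d_model channels and uses the DPLR structure to build K without ever materializing A densely; the sketch only shows the kernel-then-FFT data flow.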
Autoregressive Inference
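The original code block for this example is also missing. In recurrent mode the same discretized system is unrolled one timestep at a time, which is what makes autoregressive generation cheap. A numpy sketch (helper `s4_step` is illustrative) that produces exactly the same outputs as the convolutional forward pass:

```python
import numpy as np

def s4_step(x, u_t, dA, dB, C, D):
    """One recurrent step: x_t = dA*x + dB*u_t ; y_t = Re(C.x_t) + D*u_t."""
    x = dA * x + dB * u_t
    y_t = (C * x).sum().real + D * u_t
    return x, y_t

rng = np.random.default_rng(0)
N, L, dt = 16, 64, 1.0 / 64
A = -0.5 + 1j * np.pi * np.arange(N)
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
dA = np.exp(dt * A)                        # discretized state matrix
dB = (dA - 1.0) / A * B

u = rng.standard_normal(L)
x = np.zeros(N, dtype=complex)             # initial hidden state
ys = []
for u_t in u:                              # O(N) work per generated step
    x, y_t = s4_step(x, u_t, dA, dB, C, 0.1)
    ys.append(y_t)
```

Because the recurrence and the convolution kernel come from the same discretized (dA, dB, C, D), the step-by-step outputs match the FFT-convolution outputs, which is the correctness check the dual-mode design relies on.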
With Gating and Bottleneck
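The original code block for this example is missing as well. The bottleneck and gating options described in the parameter list add an input projection and a multiplicative pathway around the SSM. A shape-level numpy sketch of that wiring (all weight names are illustrative; the SSM itself is replaced by an identity stand-in to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)
H, H_inner, L = 8, 4, 32                   # bottleneck: 8 -> 4 features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

u = rng.standard_normal((L, H))
W_in = rng.standard_normal((H, H_inner))   # bottleneck input projection
W_gate = rng.standard_normal((H, H_inner)) # gating projection
W_out = rng.standard_normal((H_inner, H))  # output projection back to H

v = u @ W_in                               # bottleneck: (L, H_inner)
g = sigmoid(u @ W_gate)                    # multiplicative gate in (0, 1)
y_ssm = v                                  # stand-in for the SSM convolution
y = (g * y_ssm) @ W_out                    # gate the SSM output, project back
```

Running the SSM in the smaller H_inner dimension (as in GSS) trades capacity for compute, while the gate gives the layer a cheap input-dependent nonlinearity.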
Key Features
DPLR Parameterization
S4 uses a Diagonal Plus Low-Rank decomposition of the state matrix, A = Λ - PQ*, where:
- Λ is a complex diagonal matrix
- P is a low-rank correction matrix
- This enables O(N) computation instead of O(N²)
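The saving comes from never materializing A densely: only the diagonal and the low-rank factors are stored, and multiplying by A costs O(N·rank) instead of O(N²). A numpy sketch of the structure (P and Q here are generic rank-1 factors for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 8, 1                                 # state size, correction rank
Lam = -0.5 + 1j * np.pi * np.arange(N)      # complex diagonal part
P = rng.standard_normal((N, r))             # low-rank correction factors
Q = rng.standard_normal((N, r))

# Dense DPLR matrix: O(N^2) storage, only built here to check equivalence
A = np.diag(Lam) - P @ Q.conj().T

# Structured form stores only (Lam, P, Q): O(N*r) parameters,
# and a matvec costs O(N*r) instead of O(N^2)
x = rng.standard_normal(N)
Ax_structured = Lam * x - P @ (Q.conj().T @ x)
```

S4 additionally exploits this structure at kernel-computation time (via the Woodbury identity and Cauchy kernels), which is where the O(N) kernel claim comes from.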
FFT Convolution
For training, S4 materializes the full convolution kernel and applies it via FFT, turning the length-L convolution into an O(L log L) operation.
Recurrent Mode
For inference, S4 unrolls the SSM as a recurrence, updating the hidden state one step at a time: x_k = A'x_{k-1} + B'u_k and y_k = C x_k + D u_k, where A' and B' are the discretized state matrices.
Architecture Details
Forward Pass Structure
- Optional bottleneck projection: Reduce input dimension
- Optional input gating: Multiplicative gating pathway
- SSM convolution: Core state space computation
- D skip connection: Direct input-output connection
- Activation: GELU activation
- Optional output gating: Gate the SSM output
- Output projection: Final linear layer with optional GLU
Initialization
S4 uses specialized initialization:
- A matrix: HiPPO (e.g., LegS) initialization for long-range memory
- B, C matrices: Random initialization scaled by dimensions
- dt: Log-uniform spacing between dt_min and dt_max
- D: Random initialization
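The log-uniform dt initialization above means the timesteps are uniformly spaced in log-space, so the bank of SSMs covers timescales spanning several orders of magnitude. A numpy sketch (dt_min and dt_max values are illustrative defaults, not the library's):

```python
import numpy as np

rng = np.random.default_rng(0)
H, dt_min, dt_max = 256, 1e-3, 1e-1
# Sample uniformly in log-space, then exponentiate back
log_dt = rng.uniform(np.log(dt_min), np.log(dt_max), size=H)
dt = np.exp(log_dt)
```

A uniform sample in [dt_min, dt_max] would concentrate almost all channels near the large-dt end; log-uniform spacing gives each decade of timescales equal representation.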
Performance Tips
The rank parameter controls the expressiveness of the low-rank correction. Higher rank increases capacity but also computation; rank=1 is usually sufficient.