Overview

S4D (Structured State Space - Diagonal) is a simplified variant of S4 that uses a diagonal parameterization instead of the DPLR (Diagonal Plus Low-Rank) structure. This makes it faster and easier to implement while maintaining competitive performance. The key difference from S4 is that S4D uses a pure diagonal matrix for A, eliminating the low-rank correction term P.

Paper Reference

On the Parameterization and Initialization of Diagonal State Space Models

Original implementation: https://github.com/state-spaces/s4

Import

```python
from lrnnx.models.lti import S4D
```

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `d_model` | int | required | Model dimension; size of input and output features. |
| `d_state` | int | `64` | State dimension (N); internal dimension of the SSM state space. |
| `l_max` | int | `None` | Maximum sequence length for the kernel. Required for FFT convolution. |
| `channels` | int | `1` | Number of channels/heads for multi-headed processing. |
| `bottleneck` | int | `None` | Reduces the dimension of the inner layer. If specified, adds an input projection. |
| `gate` | int | `None` | Adds multiplicative gating for enhanced expressiveness. |
| `final_act` | str | `'glu'` | Activation after the final linear layer: `'glu'`, `'id'`, or `None`. |
| `dropout` | float | `0.0` | Dropout probability. |
| `tie_dropout` | bool | `False` | Tie the dropout mask across sequence length. |
| `transposed` | bool | `True` | Input format: `(B, H, L)` if `True`, `(B, L, H)` if `False`. |

SSM Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `dt_min` | float | `0.001` | Minimum timestep value. |
| `dt_max` | float | `0.1` | Maximum timestep value. |
| `dt_tie` | bool | `True` | Tie dt across all channels. |
| `dt_transform` | str | `'exp'` | Transform applied to the dt parameter. |
| `rank` | int | `1` | Rank parameter (kept for compatibility; unused in diagonal mode). |
| `n_ssm` | int | `None` | Number of independent SSMs. Defaults to `d_model`. |
| `init` | str | `'legs'` | Initialization method for the A matrix: `'legs'`, `'hippo'`, etc. |
| `deterministic` | bool | `False` | Use deterministic initialization. |
| `real_transform` | str | `'exp'` | Transform for the real part of A. |
| `imag_transform` | str | `'none'` | Transform for the imaginary part of A. |
| `is_real` | bool | `False` | Use a real-valued (instead of complex) SSM. |
| `disc` | str | `'zoh'` | S4D-specific: discretization method. Options: `'zoh'`, `'bilinear'`. |
| `lr` | float | `None` | Learning rate for SSM parameters. |
| `wd` | float | `0.0` | Weight decay for SSM parameters. |
| `verbose` | bool | `True` | Print initialization information. |

Usage Example

Basic Usage

```python
import torch
from lrnnx.models.lti import S4D

# Create S4D model
model = S4D(d_model=64, d_state=64, l_max=1024)

# Forward pass
x = torch.randn(2, 1024, 64)  # (batch, length, features)
y, state = model(x)

print(y.shape)  # torch.Size([2, 1024, 64])
```

With Custom Discretization

```python
model = S4D(
    d_model=128,
    d_state=64,
    l_max=2048,
    disc="bilinear",  # Use bilinear discretization
    dt_min=0.0001,
    dt_max=0.1,
)

x = torch.randn(4, 2048, 128)
y, state = model(x)
```

Autoregressive Inference

```python
import torch
from lrnnx.models.lti import S4D

model = S4D(d_model=64, d_state=64, l_max=1024)
batch_size = 2

# Allocate inference cache
cache = model.allocate_inference_cache(batch_size=batch_size)

# Generate sequence autoregressively
for t in range(100):
    x_t = torch.randn(batch_size, 64)
    y_t, cache = model.step(x_t, cache)
    # y_t.shape: (batch_size, 64)
```

Multi-Channel Configuration

```python
model = S4D(
    d_model=256,
    d_state=64,
    l_max=4096,
    channels=8,        # 8-headed SSM
    dropout=0.1,
    disc="zoh",
)

x = torch.randn(2, 4096, 256)
y, state = model(x)
```

Key Differences from S4

Diagonal Parameterization

S4D uses a pure diagonal matrix:
A = Λ  (diagonal only, no low-rank term)
Versus S4’s DPLR:
A = Λ - PP*  (diagonal plus low-rank)
This simplification:
  • ✅ Faster computation
  • ✅ Simpler implementation
  • ✅ Easier to tune
  • ⚠️ Slightly less expressive (but usually negligible)
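The practical payoff of dropping the low-rank term is that the convolution kernel of a diagonal SSM reduces to a Vandermonde-style sum evaluated with elementwise powers, with no matrix powers or the inverse-FFT machinery S4's DPLR kernel requires. A rough numpy sketch (illustrative, not lrnnx internals):

```python
import numpy as np

N, L = 4, 8
dA = np.array([0.9, 0.8, 0.7 + 0.1j, 0.7 - 0.1j])  # discretized diagonal of A
dB = np.ones(N, dtype=complex)
C = np.ones(N, dtype=complex)

# K[l] = sum_n C_n * dA_n^l * dB_n  -- only elementwise powers needed
powers = dA[None, :] ** np.arange(L)[:, None]      # (L, N) Vandermonde matrix
K = (powers * (C * dB)[None, :]).sum(axis=1).real  # (L,) convolution kernel

# Check against the naive definition with the dense matrix A = diag(dA)
A = np.diag(dA)
K_naive = np.array([(C @ np.linalg.matrix_power(A, l) @ dB).real for l in range(L)])
assert np.allclose(K, K_naive)
```

The `(L, N)` Vandermonde matrix costs O(NL) to materialize, versus the repeated dense matvecs of the naive formulation.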

Discretization Options

S4D explicitly supports multiple discretization methods via the disc parameter:
  • ZOH (Zero-Order Hold): Default, good general-purpose discretization
  • Bilinear: Better frequency response preservation
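For a diagonal A, both discretizations are simple elementwise formulas. As a hedged illustration (toy values, not the library's internal code):

```python
import numpy as np

# Toy diagonal SSM: A is stored as its diagonal entries (complex), dt is the timestep
A = np.array([-0.5 + 3.0j, -1.0 + 1.0j])  # diagonal of A
B = np.array([1.0 + 0.0j, 1.0 + 0.0j])
dt = 0.01

# ZOH (zero-order hold): exact for piecewise-constant inputs
dA_zoh = np.exp(dt * A)
dB_zoh = (dA_zoh - 1.0) / A * B

# Bilinear (Tustin): maps the continuous imaginary axis onto the unit circle
dA_bil = (1 + dt * A / 2) / (1 - dt * A / 2)
dB_bil = dt * B / (1 - dt * A / 2)
```

For small `dt` both reduce to `1 + dt * A` to first order; they differ in how they treat high frequencies relative to the sampling rate.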

Architecture Details

Forward Pass

Same structure as S4:
  1. Optional bottleneck/gating
  2. FFT-based SSM convolution (training)
  3. D skip connection
  4. GELU activation
  5. Output projection
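Steps 2-4 can be sketched in a few lines for a single channel (a simplified illustration with a toy kernel, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 128
x = rng.normal(size=L)      # input sequence
K = 0.9 ** np.arange(L)     # toy precomputed SSM kernel
D = 0.5                     # skip-connection parameter

# Step 2: FFT-based causal convolution (pad to 2L to avoid circular wraparound)
n = 2 * L
y_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(K, n), n)[:L]

# Step 3: D skip connection
y = y_conv + D * x

# Step 4: GELU activation (tanh approximation)
y = 0.5 * y * (1 + np.tanh(np.sqrt(2 / np.pi) * (y + 0.044715 * y**3)))
```

Padding to length 2L before the FFT is what turns circular convolution into the causal linear convolution the SSM defines.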

Diagonal SSM

The diagonal structure enables efficient computation:
```python
# Elementwise operations instead of dense matrix multiplication
h_t = A * h_prev + B * x_t   # A, B are vectors (the diagonal); O(N) per step
y_t = C @ h_t + D * x_t      # readout is a dot product; D is an elementwise skip
```
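A minimal runnable sketch of this recurrence (illustrative, not lrnnx internals), confirming that the elementwise update matches the equivalent dense formulation with `A = diag(a)`:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                         # state dimension
a = rng.uniform(0.5, 0.9, N)   # diagonal of the (discretized) A
b = rng.normal(size=N)         # input vector B
c = rng.normal(size=N)         # readout vector C
d = 0.5                        # skip connection D
x = rng.normal(size=32)        # scalar input sequence

# Diagonal recurrence: elementwise state update, O(N) per step
h = np.zeros(N)
ys = []
for x_t in x:
    h = a * h + b * x_t
    ys.append(c @ h + d * x_t)
ys = np.array(ys)

# Dense formulation with A = diag(a) produces identical outputs, at O(N^2) per step
A = np.diag(a)
h2 = np.zeros(N)
ys_dense = []
for x_t in x:
    h2 = A @ h2 + b * x_t
    ys_dense.append(c @ h2 + d * x_t)
assert np.allclose(ys, np.array(ys_dense))
```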

Initialization

  • A: Diagonal complex matrix with HiPPO/LEGS initialization
  • B, C: Random Gaussian scaled by dimensions
  • dt: Log-spaced between dt_min and dt_max
  • D: Random initialization
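The dt initialization can be sketched as follows (a hypothetical standalone helper mirroring the `dt_min`/`dt_max` parameters above, with the `'exp'` transform of `dt_transform` applied):

```python
import numpy as np

def init_dt(d_model, dt_min=0.001, dt_max=0.1, seed=0):
    """Sample timesteps log-uniformly in [dt_min, dt_max]."""
    rng = np.random.default_rng(seed)
    # Sample uniformly in log space, then exponentiate back
    log_dt = rng.uniform(np.log(dt_min), np.log(dt_max), size=d_model)
    return np.exp(log_dt)

dt = init_dt(64)
```

Sampling in log space gives each timescale decade roughly equal coverage, so the model starts with a spread of both fast and slow channels.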

Performance Comparison

| Model | Speed | Memory | Performance |
|-------|-------|--------|-------------|
| S4 | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| S4D | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |

S4D is typically 10-20% faster than S4 with similar performance.

Performance Tips

  • S4D is a great default choice: simpler than S4, with similar performance and faster training.
  • The disc parameter selects the discretization method. ZOH is recommended for most tasks, but bilinear can be better for frequency-domain applications.
  • Like all LTI models, S4D does not support variable timesteps. For event-driven data, use Mamba or S7.

When to Use S4D

Use S4D when:
  • You want a simple, efficient SSM
  • Training speed is important
  • You don’t need the extra expressiveness of DPLR
  • You’re working with regular, fixed-interval sequences
Consider alternatives when:
  • You need input-dependent dynamics → Use Mamba
  • You need ultra-minimal parameters → Use LRU
  • You want to experiment with DPLR → Use S4

See Also

  • S4 - Original DPLR variant
  • S5 - Even simpler implementation
  • LRU - Minimal diagonal SSM
