Overview
S7 (Selective and Simplified State Space Layer) is a linear time-varying (LTV) state space model that combines input-dependent state space parameters with a simplified architecture. It uses HiPPO initialization for the state transition matrix, and its A, B, C, and D parameters all vary with the input over time.
Key features:
- Fully time-varying parameters (A, B, C, D all depend on input)
- HiPPO-based initialization for stable long-range dependencies
- No convolution layer (simpler than Mamba/RGLRU)
- Gated output with residual connection
- Custom discretization scheme
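The combination above (input-dependent A, B, C, D, gating, and a residual) can be sketched with a toy numpy recurrence. The parameterization below is illustrative only, not the actual S7 implementation; all weight names are made up for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def s7_like_step(u, h, p):
    """One step of a toy fully input-dependent (LTV) SSM with gating and
    a residual. u: (H,) input, h: (N,) state, p: illustrative parameters."""
    dt = np.log1p(np.exp(u @ p["w_dt"]))   # softplus: input-dependent step size
    a = np.exp(dt * p["A0"])               # time-varying A; A0 < 0 keeps |a| < 1
    h = a * h + dt * (p["Wb"] @ u)         # input-dependent B path
    C = p["Wc"] * np.tanh(u @ p["w_c"])    # input-modulated readout C
    y = C @ h + p["d"] * u                 # D: per-channel feed-through
    y = sigmoid(p["Wg"] @ u) * y + u       # gated output plus residual connection
    return y, h

def s7_like_forward(x, p):
    """Run the toy recurrence over a sequence x of shape (L, H)."""
    h = np.zeros(p["A0"].shape[0])
    ys = []
    for u in x:
        y, h = s7_like_step(u, h, p)
        ys.append(y)
    return np.stack(ys)

H, N = 4, 8
rng = np.random.default_rng(0)
p = {
    "A0": -np.arange(1.0, N + 1),          # negative diagonal stand-in (HiPPO-like)
    "w_dt": 0.1 * rng.normal(size=H),
    "Wb": 0.1 * rng.normal(size=(N, H)),
    "Wc": 0.1 * rng.normal(size=(H, N)),
    "w_c": 0.1 * rng.normal(size=H),
    "Wg": 0.1 * rng.normal(size=(H, H)),
    "d": 0.1 * rng.normal(size=H),
}
y = s7_like_forward(rng.normal(size=(16, H)), p)
assert y.shape == (16, H)                  # sequence length and width preserved
```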
Import
Class Signature
Constructor
__init__

- d_model: Model dimension (input/output dimension).
- d_state: State dimension. Must be divisible by J.
- J: Number of blocks for HiPPO initialization. The state space is divided into J blocks, each initialized with HiPPO parameters.
- CUDA fast path: Whether to use the CUDA fast path if available. Enables a fused kernel implementation for better performance.
- layer_idx: Layer index for multi-layer models, used for caching in stacked architectures.
- device: Device for the model parameters. If None, uses the default device.
- dtype: Data type for the model parameters. If None, uses the default dtype.

Methods
forward

- Input tensor of shape (B, L, H), where B = batch size, L = sequence length, and H = model dimension (d_model).
- Timesteps for async/event-driven discretization, shape (B, L). Currently unused but kept for interface compatibility.
- Lengths of sequences for variable-length batches, shape (B,). Currently unused.
- Cache for autoregressive generation. If provided, must contain:
  - "lrnn_state": S7 state tensor
  - "seqlen_offset": current position in the sequence

Returns an output tensor of shape (B, L, H).

step
- Input at the current timestep, shape (B, 1, H).
- Cache dictionary containing:
  - "lrnn_state": S7 state, shape (B, N) where N = d_state
  - "seqlen_offset": current position in the sequence
- Additional keyword arguments (unused).

Returns a tuple containing:
- Output tensor at the current timestep, shape (B, 1, H)
- Updated cache dictionary
allocate_inference_cache
- The batch size for inference.
- Maximum sequence length. Unused but kept for interface consistency.
- Data type for allocated tensors. If None, uses the model's parameter dtype.
- Additional keyword arguments (unused).

Returns a cache dictionary containing:
- "lrnn_state": zero-initialized state, shape (B, N)
- "seqlen_offset": position counter, initialized to 0
Examples
Basic Usage
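A toy stand-in that mirrors the documented interface (the constructor arguments and the (B, L, H) shape contract); the real class, its import path, and its internals may differ.

```python
import numpy as np

class S7Sketch:
    """Toy stand-in mirroring the documented S7 interface (not the real layer)."""

    def __init__(self, d_model, d_state, J=1):
        assert d_state % J == 0, "d_state must be divisible by J"
        self.d_model, self.d_state = d_model, d_state
        rng = np.random.default_rng(0)
        self.A0 = -np.arange(1.0, d_state + 1)     # HiPPO-like diagonal stand-in
        self.Wb = 0.1 * rng.normal(size=(d_state, d_model))
        self.Wc = 0.1 * rng.normal(size=(d_model, d_state))
        self.d = 0.1 * rng.normal(size=d_model)

    def forward(self, x):
        """x: (B, L, H) -> (B, L, H) via a simple diagonal LTV recurrence."""
        B, L, H = x.shape
        h = np.zeros((B, self.d_state))
        out = np.empty_like(x)
        for t in range(L):
            u = x[:, t]                                           # (B, H)
            dt = np.log1p(np.exp(u.mean(axis=1, keepdims=True)))  # input-dep. step
            h = np.exp(dt * self.A0) * h + dt * (u @ self.Wb.T)
            out[:, t] = h @ self.Wc.T + u * self.d + u            # readout + residual
        return out

layer = S7Sketch(d_model=64, d_state=64, J=4)
x = np.random.default_rng(1).normal(size=(2, 32, 64))             # (B, L, H)
y = layer.forward(x)
assert y.shape == (2, 32, 64)                                     # shape-preserving
```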
Multi-Block Initialization
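A sketch of the documented J-block scheme: d_state is split into J blocks, each initialized with HiPPO parameters. The block matrix below uses the HiPPO-LegS variant as an assumption; the exact variant and parameterization in S7 may differ.

```python
import numpy as np

def hippo_legs(n):
    """HiPPO-LegS transition matrix (n x n): -sqrt(2i+1)*sqrt(2k+1) below the
    diagonal, -(i+1) on the diagonal, 0 above it."""
    i = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    A = -np.sqrt(2 * i + 1) * np.sqrt(2 * k + 1) * (i > k)
    A -= np.diag(np.arange(1.0, n + 1))
    return A

def block_hippo(d_state, J):
    """Block-diagonal initialization from J HiPPO blocks (illustrative)."""
    assert d_state % J == 0, "d_state must be divisible by J"
    n = d_state // J
    A = np.zeros((d_state, d_state))
    for j in range(J):
        A[j * n:(j + 1) * n, j * n:(j + 1) * n] = hippo_legs(n)
    return A

A = block_hippo(d_state=64, J=4)        # four 16x16 HiPPO blocks
assert A.shape == (64, 64)
assert not A[:16, 16:32].any()          # off-diagonal blocks are zero
```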
Autoregressive Inference
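A sketch of cache-based decoding using the cache layout documented above ("lrnn_state" of shape (B, N), "seqlen_offset" starting at 0). The recurrence itself is a toy fixed-step diagonal update, not the real S7 step.

```python
import numpy as np

def allocate_cache(batch_size, d_state):
    """Zero-initialized cache matching the documented layout."""
    return {"lrnn_state": np.zeros((batch_size, d_state)), "seqlen_offset": 0}

def step(u, cache, A0, Wb, Wc):
    """One decode step: u is (B, 1, H); returns (y, updated cache)."""
    h = cache["lrnn_state"]
    h = np.exp(A0) * h + u[:, 0] @ Wb.T      # toy diagonal state update
    y = (h @ Wc.T)[:, None]                  # (B, 1, H)
    cache["lrnn_state"] = h
    cache["seqlen_offset"] += 1
    return y, cache

B, L, H, N = 2, 8, 4, 16
rng = np.random.default_rng(0)
A0 = -np.linspace(0.1, 1.0, N)
Wb = 0.1 * rng.normal(size=(N, H))
Wc = 0.1 * rng.normal(size=(H, N))
x = rng.normal(size=(B, L, H))

cache = allocate_cache(B, N)
ys = []
for t in range(L):
    y, cache = step(x[:, t:t + 1], cache, A0, Wb, Wc)
    ys.append(y)
y_steps = np.concatenate(ys, axis=1)         # (B, L, H)
assert cache["seqlen_offset"] == L           # position advanced once per step
assert y_steps.shape == (B, L, H)
```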
Large State Space
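A rough sizing illustration for the larger state dimensions S7 favors; the argument names mirror the documented constructor (d_model, d_state, J) and the numbers are arbitrary examples.

```python
# Sizing sketch for a large state space; names mirror the documented
# constructor arguments (d_model, d_state, J) and are illustrative.
d_model, d_state, J = 512, 256, 8
assert d_state % J == 0           # documented constraint: d_state divisible by J
block_size = d_state // J         # each HiPPO block is 32-dimensional
state_bytes = d_state * 4         # fp32 recurrent state per sequence
print(block_size, state_bytes)    # -> 32 1024
```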
Architecture Details
Time-Varying Parameters
S7 computes all SSM parameters (A, B, C, and D) from the input at each timestep.

Residual and Gating
The output includes gating and a residual connection.

HiPPO Initialization
S7 uses HiPPO (High-order Polynomial Projection Operators) initialization for the base transition matrix, which provides:
- Stable long-range dependencies
- Theoretically-grounded initialization
- Better out-of-the-box performance
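The stability claim can be checked directly: the HiPPO-LegS matrix (assuming that variant) is lower triangular with negative diagonal entries, so every eigenvalue has negative real part and all state modes decay.

```python
import numpy as np

n = 16
i = np.arange(n)[:, None]
k = np.arange(n)[None, :]
# HiPPO-LegS: -sqrt(2i+1)*sqrt(2k+1) below the diagonal, -(i+1) on it.
A = -np.sqrt(2 * i + 1) * np.sqrt(2 * k + 1) * (i > k) - np.diag(np.arange(1.0, n + 1))
eig = np.linalg.eigvals(A)
assert (eig.real < 0).all()   # all modes decay: stable long-range dynamics
```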
State Space Dimensions
Unlike Mamba (typically d_state=16) or RGLRU (d_state=1), S7 commonly uses larger state dimensions (e.g., 64-256) to increase model capacity. The state dimension must be divisible by the number of blocks J.
References
- S7: Selective and Simplified State Space Layers for Sequence Modeling
- HiPPO: Recurrent Memory with Optimal Polynomial Projections
