Overview
Centaurus is an advanced state space model that introduces intra-state mode mixing through sub-states. Unlike traditional SSMs with a single state dimension, Centaurus decomposes each state into multiple sub-states that interact through a mixing matrix. Centaurus supports four different modes (neck, DWS, full, pointwise) that control how input and output channels interact with the state space, allowing flexible architecture design.
Paper Reference
Centaurus: Let SSMs be Conv Nets https://openreview.net/forum?id=PkpNRmBZ32
Installation
Parameters
d_model: Model dimension, i.e. the size of the input and output features.
d_state: Number of state channels. Each channel has sub_state_dim sub-states.
sub_state_dim: Number of sub-states per state channel. Controls intra-state mixing capacity.
Discretization method. Currently only ZOH is fully supported.
Architecture mode controlling channel interaction:
- 'neck': Bottleneck with dense projections
- 'dws': Depthwise-separable (one state per channel)
- 'full': Fully connected (state per input-output pair)
- 'pointwise'/'pw'/'s5': Pointwise bottleneck (flattened sub-states)
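Since the pointwise mode accepts three spellings, it can help to map mode strings onto one canonical name before branching on them. The following is a minimal sketch only: `MODE_ALIASES` and `normalize_mode` are hypothetical names, not part of the Centaurus API, and the case-insensitive matching is an assumption.

```python
# Hypothetical helper, not part of the Centaurus API: maps the mode strings
# listed above onto one canonical name each.
MODE_ALIASES = {
    "neck": "neck",
    "dws": "dws",
    "full": "full",
    "pointwise": "pointwise",
    "pw": "pointwise",
    "s5": "pointwise",
}


def normalize_mode(mode: str) -> str:
    """Return the canonical mode name, or raise ValueError for unknown strings."""
    key = mode.strip().lower()  # assumption: matching ignores case and padding
    if key not in MODE_ALIASES:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_ALIASES)}"
        )
    return MODE_ALIASES[key]
```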
Usage Example
Basic Usage
Different Modes
Using the Wrapper
Autoregressive Inference
Architecture Modes
Neck Mode (Bottleneck)
Use: General-purpose, balanced performance
- Dense input projection: B: (d_state, d_model)
- Dense output projection: C: (d_model, d_state)
- Bottleneck through state dimension
DWS Mode (Depthwise-Separable)
Use: Parameter efficiency, one state per channel
- Diagonal projections: one state per input/output channel
- d_state must equal d_model
- Minimal parameters
Full Mode (Fully Connected)
Use: Maximum expressiveness, state per connection
- State for each (input, output) pair
- d_state = d_model * d_model
- Highest capacity but most parameters
Pointwise Mode (Flattened Sub-states)
Use: Simplified variant, no E-mixing
- Flattens (d_state, sub_state_dim) into a single dimension
- No mixing matrix E
- Delta shared across sub-states
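The dimension constraints stated for each mode can be checked up front. The sketch below is illustrative only: `state_shape` is a hypothetical helper (not a Centaurus function) that enforces the two stated constraints and returns the state layout each mode implies, assuming canonical mode names.

```python
def state_shape(mode: str, d_model: int, d_state: int, sub_state_dim: int) -> tuple:
    """Hypothetical helper: validate the per-mode constraints listed above
    and return the shape of the state each mode would carry."""
    if mode == "dws" and d_state != d_model:
        raise ValueError("dws mode requires d_state == d_model")
    if mode == "full" and d_state != d_model * d_model:
        raise ValueError("full mode requires d_state == d_model * d_model")
    if mode == "pointwise":
        # Sub-states are flattened into a single dimension.
        return (d_state * sub_state_dim,)
    if mode in ("neck", "dws", "full"):
        # One row per state channel, one column per sub-state.
        return (d_state, sub_state_dim)
    raise ValueError(f"unknown mode {mode!r}")


neck = state_shape("neck", d_model=64, d_state=16, sub_state_dim=4)       # (16, 4)
flat = state_shape("pointwise", d_model=64, d_state=16, sub_state_dim=4)  # (64,)
```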
Key Features
Intra-State Mixing
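Centaurus’s key innovation is sub-state decomposition: each state channel is split into sub_state_dim sub-states that interact through a mixing matrix E.
ZOH Discretization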
Centaurus uses implicit Zero-Order Hold (ZOH) discretization.
Learned Time Scales
Each state channel has a learnable delta (timestep).
State Representation
Most Modes (Neck, DWS, Full)
- Each of the d_state channels has sub_state_dim sub-states
- Complex-valued for frequency modeling
- Mixed via matrix E before output
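One way to picture the list above is as a convolution kernel per state channel, built as a weighted sum of complex exponential modes. The NumPy sketch below is an illustration under explicit assumptions, not the library's implementation: `channel_kernels` is a hypothetical name, E is assumed to have shape (d_state, sub_state_dim), the dynamics are assumed diagonal, and the input-side discretization factor and the B/C projections are left out.

```python
import numpy as np


def channel_kernels(A, E, delta, length):
    """Hypothetical sketch of sub-state mixing.

    A, E  : complex arrays of shape (d_state, sub_state_dim)  (assumed shapes)
    delta : real array of shape (d_state,), one timestep per state channel
    Returns real kernels of shape (d_state, length).
    """
    # Discretized diagonal dynamics: every sub-state of a channel uses that channel's delta.
    A_bar = np.exp(delta[:, None] * A)                  # (d_state, sub_state_dim)
    t = np.arange(length)                               # (length,)
    powers = A_bar[:, :, None] ** t[None, None, :]      # (d_state, sub_state_dim, length)
    # E weights the sub-state modes and sums them into one kernel per channel.
    mixed = np.einsum("nm,nmt->nt", E, powers)          # (d_state, length)
    return mixed.real


rng = np.random.default_rng(0)
d_state, sub_state_dim = 3, 4
# Illustrative values: negative real part for decay, imaginary part for oscillation.
A = -rng.uniform(0.1, 1.0, (d_state, sub_state_dim)) \
    + 1j * rng.uniform(0.0, 3.0, (d_state, sub_state_dim))
E = rng.standard_normal((d_state, sub_state_dim)) \
    + 1j * rng.standard_normal((d_state, sub_state_dim))
delta = np.full(d_state, 0.1)
kernels = channel_kernels(A, E, delta, length=32)       # shape (3, 32)
```

Under these assumptions, a larger sub_state_dim gives each channel more decay rates and frequencies to combine.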
Pointwise Mode
- Flattened representation
- No E-mixing
- Simpler but less structured
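To make the two layouts concrete, the sketch below runs one recurrent step in both and compares the results. It uses the textbook ZOH formulas for a diagonal system, which may differ in detail from the implicit variant Centaurus uses. All names are hypothetical, and feeding each state channel's input to all of its sub-states is an assumption made only so the two layouts can be compared.

```python
import numpy as np


def zoh(A, delta):
    """Textbook ZOH for a diagonal system dx/dt = A x + u (elementwise).
    Returns (A_bar, B_bar), broadcast to a common shape."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A
    return A_bar, B_bar


d_state, sub_state_dim = 3, 4
shape = (d_state, sub_state_dim)
rng = np.random.default_rng(0)

# Illustrative complex dynamics with negative real part (stable).
A = -rng.uniform(0.1, 1.0, shape) + 1j * rng.uniform(0.0, 3.0, shape)
delta = rng.uniform(0.01, 0.1, d_state)        # one timestep per state channel
u = rng.standard_normal(d_state)               # one input value per state channel
h = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Structured layout: state of shape (d_state, sub_state_dim).
A_bar, B_bar = zoh(A, delta[:, None])
h_next = A_bar * h + B_bar * u[:, None]

# Flattened layout: same values in one dimension; delta is repeated so that
# all sub-states of a channel share it.
A_bar_f, B_bar_f = zoh(A.reshape(-1), np.repeat(delta, sub_state_dim))
h_flat_next = A_bar_f * h.reshape(-1) + B_bar_f * np.repeat(u, sub_state_dim)

# The state update is identical; the layouts differ in how the state is read out.
same = np.allclose(h_next.reshape(-1), h_flat_next)
```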
Parameter Count Comparison
| Mode | Parameters | Use Case |
|---|---|---|
| DWS | Minimal | Efficiency-critical |
| Neck | Moderate | General purpose |
| Pointwise | Moderate | Simplified variant |
| Full | Maximum | Expressiveness-critical |
- Neck: ~d_state * d_model * 2 + smaller terms
- DWS: ~2 * d_model (when d_state = d_model)
- Full: ~d_state (but d_state = d_model^2)
- Pointwise: ~(d_state * sub_state_dim) * d_model * 2
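The approximate counts above are easy to compare for a concrete configuration. The helper below is a hypothetical convenience function, not a Centaurus API: it evaluates only the leading terms listed above and ignores the "smaller terms", so the numbers are rough estimates.

```python
def approx_projection_params(mode: str, d_model: int, d_state: int,
                             sub_state_dim: int) -> int:
    """Hypothetical helper: leading-order parameter estimate per mode,
    using only the approximate formulas listed above."""
    if mode == "neck":
        return d_state * d_model * 2
    if mode == "dws":
        return 2 * d_model            # assumes d_state == d_model
    if mode == "full":
        return d_state                # assumes d_state == d_model ** 2
    if mode == "pointwise":
        return (d_state * sub_state_dim) * d_model * 2
    raise ValueError(f"unknown mode {mode!r}")


# Example configurations that respect each mode's dimension constraint.
estimates = {
    "neck": approx_projection_params("neck", 64, 16, 4),            # 2048
    "dws": approx_projection_params("dws", 64, 64, 4),              # 128
    "full": approx_projection_params("full", 8, 64, 4),             # 64
    "pointwise": approx_projection_params("pointwise", 64, 16, 4),  # 8192
}
```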
Performance Tips
The sub_state_dim parameter controls the multi-scale capacity. Values of 4-16 typically work well.
When to Use Each Mode
Neck Mode
✅ General-purpose tasks
✅ Balanced performance/efficiency
✅ When d_state < d_model (bottleneck)
DWS Mode
✅ Parameter efficiency critical
✅ Depthwise processing sufficient
✅ When d_model is moderate
Full Mode
✅ Maximum expressiveness needed
✅ Small d_model (e.g., 16-32)
✅ Complex input-output relationships
Pointwise Mode
✅ Simplified implementation
✅ When E-mixing not needed
✅ Compatibility with S5-style code
