Overview
Mamba is a selective state space model that uses input-dependent dynamics to filter and process sequences. Unlike LTI models with fixed dynamics, Mamba computes its input matrix B, output matrix C, and timestep delta from the input at each position, enabling selective memory and forgetting (the state matrix A is learned but not input-dependent). Mamba achieves state-of-the-art performance on language modeling while maintaining linear scaling with sequence length.

Paper Reference
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
GitHub: https://github.com/state-spaces/mamba
Original paper: https://arxiv.org/abs/2312.00752

Installation
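Assuming the official state-spaces/mamba package linked above, installation is typically:

```shell
# Install the official Mamba package (the fused kernels require a
# CUDA-capable PyTorch build; otherwise the slow path is used).
pip install mamba-ssm

# Optional companion package for the fastest causal Conv1d kernels.
pip install causal-conv1d
```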
Parameters

Parameter names below are reconstructed to match the conventions of the reference Mamba implementation; verify them against your library's signature.

- d_model: Model dimension; size of input and output features.
- d_state: SSM state dimension (N). Typically smaller than in LTI models (16-64 vs 64-256).
- d_conv: Convolution kernel size for temporal mixing before the SSM. Usually 3-4.
- expand: Expansion factor for the inner dimension; d_inner = expand * d_model.
- dt_rank: Rank of the delta (timestep) projection. 'auto' sets it to ceil(d_model / 16).
- dt_min: Minimum value for delta initialization.
- dt_max: Maximum value for delta initialization.
- dt_init: Delta initialization method: 'random' or 'constant'.
- dt_scale: Scale factor for dt initialization.
- dt_init_floor: Minimum floor value for dt initialization.
- conv_bias: Whether to use bias in the Conv1d layer.
- bias: Whether to use bias in linear projections.
- use_fast_path: Whether to use fused CUDA kernels when available. Significantly faster.
- layer_idx: Layer index for multi-layer caching during inference.
- discretization: Discretization method: 'mamba', 'zoh', 'bilinear', or 'dirac'.
- device: Device for model parameters.
- dtype: Data type for model parameters.
Usage Example
Basic Usage
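As a self-contained illustration (not the library's API; the names, shapes, and rank-1 delta projection are simplifications of my own), a single Mamba-style block in NumPy, batch dimension omitted:

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def mamba_block(x, p, d_state, d_conv):
    """Shape-level sketch: in_proj -> causal conv1d -> selective SSM -> gate -> out_proj."""
    L, _ = x.shape
    d_inner = p["A"].shape[0]

    xz = x @ p["in"]                                  # (L, 2*d_inner)
    u, z = xz[:, :d_inner], xz[:, d_inner:]           # SSM path / gate path

    # Causal depthwise convolution over time (kernel size d_conv).
    u_pad = np.vstack([np.zeros((d_conv - 1, d_inner)), u])
    u = np.stack([(u_pad[t:t + d_conv] * p["conv"].T).sum(0) for t in range(L)])
    u = silu(u)

    # Input-dependent SSM parameters (dt_rank fixed to 1 for brevity).
    dbc = u @ p["x"]                                  # (L, 1 + 2*d_state)
    delta = np.log1p(np.exp(dbc[:, :1] @ p["dt"]))    # softplus => positive timesteps
    B, C = dbc[:, 1:1 + d_state], dbc[:, 1 + d_state:]

    # Selective scan: h_t = exp(delta_t A) h_{t-1} + delta_t B_t u_t ; y_t = C_t h_t + D u_t
    h, ys = np.zeros((d_inner, d_state)), []
    for t in range(L):
        h = np.exp(delta[t][:, None] * p["A"]) * h \
            + (delta[t] * u[t])[:, None] * B[t][None, :]
        ys.append(h @ C[t] + p["D"] * u[t])
    return (np.stack(ys) * silu(z)) @ p["out"]        # gate, then out_proj

rng = np.random.default_rng(0)
L, d_model, d_state, d_conv, expand = 32, 8, 4, 4, 2
d_inner = expand * d_model
p = {  # randomly initialized stand-ins for learned weights
    "in": rng.normal(0, 0.1, (d_model, 2 * d_inner)),
    "conv": rng.normal(0, 0.1, (d_inner, d_conv)),
    "x": rng.normal(0, 0.1, (d_inner, 1 + 2 * d_state)),
    "dt": rng.normal(0, 0.1, (1, d_inner)),
    "A": -np.tile(np.arange(1.0, d_state + 1), (d_inner, 1)),  # S4D-real init
    "D": np.ones(d_inner),
    "out": rng.normal(0, 0.1, (d_inner, d_model)),
}
x = rng.normal(size=(L, d_model))
y = mamba_block(x, p, d_state, d_conv)
print(y.shape)  # (32, 8)
```

In the real module the same steps run batched on GPU tensors; this sketch only makes the data flow concrete.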
Language Modeling Configuration
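The typical language-modeling settings listed under Performance Tips can be collected into a constructor configuration; a sketch, with d_model illustrative and keyword names assuming the parameter list above:

```python
import math

# Typical Mamba settings for language modeling (values from Performance Tips below;
# d_model is illustrative and depends on the target model size).
lm_config = dict(
    d_model=768,
    d_state=16,       # small SSM state
    d_conv=4,         # short causal convolution
    expand=2,         # d_inner = 2 * d_model
    dt_rank="auto",   # resolves to ceil(d_model / 16) = 48 here
    use_fast_path=True,
)
# model = Mamba(**lm_config)  # hypothetical instantiation; adjust to your API
print(math.ceil(lm_config["d_model"] / 16))  # 48
```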
Autoregressive Inference
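For decoding, the block carries its state forward one token at a time instead of re-scanning the prefix. The NumPy sketch below (single channel, illustrative names) shows that stepping the recurrence token by token reproduces the full scan; caching the last d_conv inputs for the convolution works analogously:

```python
import numpy as np

def scan(u, delta, A, B, C):
    """Full-sequence selective scan for one channel with an N-dim state."""
    h, ys = np.zeros(A.shape[0]), []
    for t in range(len(u)):
        h = np.exp(delta[t] * A) * h + delta[t] * B[t] * u[t]
        ys.append(C[t] @ h)
    return np.array(ys)

def step(u_t, delta_t, A, B_t, C_t, h):
    """One autoregressive step: update the cached SSM state, emit one output."""
    h = np.exp(delta_t * A) * h + delta_t * B_t * u_t
    return C_t @ h, h

rng = np.random.default_rng(1)
L, N = 16, 4
u, delta = rng.normal(size=L), np.abs(rng.normal(size=L))
A = -np.arange(1.0, N + 1)                      # fixed (learned in practice)
B, C = rng.normal(size=(L, N)), rng.normal(size=(L, N))

y_full = scan(u, delta, A, B, C)
h, y_step = np.zeros(N), []
for t in range(L):                              # decode one token at a time
    y_t, h = step(u[t], delta[t], A, B[t], C[t], h)
    y_step.append(y_t)
assert np.allclose(y_full, np.array(y_step))    # recurrence matches the scan
```

This constant-memory step is why Mamba's inference cost per token is O(1) in sequence length.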
Event-Based Processing (Async Mode)
Key Features
Selective State Space
Mamba’s core innovation is input-dependent selection:
- Focus on relevant information
- Forget irrelevant details
- Adapt dynamics per timestep
Hardware-Efficient Design
- Conv1d: Short temporal convolution for local mixing
- SSM: Selective state space for long-range dependencies
- Gating: Multiplicative gating for expressiveness
S4D Initialization
Mamba uses S4D-style initialization for A: the real S4D variant starts the n-th state component of every channel at A_n = -n, giving a spectrum of stable, progressively faster-decaying modes.

Fast Path (CUDA Kernels)
When use_fast_path=True and CUDA kernels are available, the projection, convolution, selective scan, and gating are executed as fused kernels instead of separate framework ops.
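Regardless of which path runs, A starts from the S4D-real initialization; a minimal NumPy sketch, assuming the common parameterization that stores A_log and recovers A = -exp(A_log):

```python
import numpy as np

d_inner, d_state = 8, 4

# S4D-real: every channel's state eigenvalues start at -1, -2, ..., -d_state.
A_init = np.tile(np.arange(1.0, d_state + 1), (d_inner, 1))  # (d_inner, d_state)
A_log = np.log(A_init)        # the stored (learned) parameter
A = -np.exp(A_log)            # effective A: always negative, hence stable decay

assert np.allclose(A[0], [-1.0, -2.0, -3.0, -4.0])
assert np.all(A < 0)
```

Storing A_log rather than A keeps the effective A negative throughout training, so the recurrence cannot become unstable.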
Architecture Details
Forward Pass Structure
1. Input Projection: x, z = split(in_proj(x))
   - Creates two pathways: SSM and gate
2. Conv1d: x = conv1d(x)
   - Short temporal convolution (d_conv=4)
   - Causal (no future information)
3. SSM Projection: delta, B, C = x_proj(x)
   - Projects to input-dependent parameters
   - delta: (B, D, L), per-channel timesteps
   - B: (B, N, L), input matrix
   - C: (B, N, L), output matrix
4. Selective Scan: y = selective_scan(x, delta, A, B, C, D)
   - Core SSM computation
   - Uses input-dependent B, C, delta
   - Fixed A (learned, but not input-dependent)
5. Gating: y = y * silu(z)
   - Multiplicative gating with SiLU
6. Output: out = out_proj(y)
   - Final linear projection
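In equations, the Selective Scan step applies the discretized recurrence per channel (my notation: $\Delta_t$, $B_t$, $C_t$ are input-dependent, while $A$ and $D$ are fixed learned parameters):

$$h_t = \exp(\Delta_t A)\,h_{t-1} + \Delta_t B_t x_t$$
$$y_t = C_t h_t + D\,x_t$$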
Discretization Methods
Mamba supports multiple discretization schemes:

Mamba (Default)
Zero-Order Hold (ZOH)
Bilinear
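Each scheme maps the continuous parameters (A, B) and timestep delta to discrete (Abar, Bbar). A NumPy sketch of the first three for a single channel; these are my reconstructions of the usual textbook formulas ('dirac' is omitted), and the library's exact implementation may differ:

```python
import numpy as np

def disc_mamba(delta, A, B):
    """Default: ZOH for A, simplified Euler for B (as in the Mamba paper)."""
    return np.exp(delta * A), delta * B

def disc_zoh(delta, A, B):
    """Exact zero-order hold: Abar = e^{dA}, Bbar = (e^{dA} - 1)/A * B."""
    Abar = np.exp(delta * A)
    return Abar, (Abar - 1.0) / A * B

def disc_bilinear(delta, A, B):
    """Bilinear (Tustin): Abar = (1 + dA/2)/(1 - dA/2), Bbar = d/(1 - dA/2) * B."""
    denom = 1.0 - delta * A / 2.0
    return (1.0 + delta * A / 2.0) / denom, delta / denom * B

A, B = np.array([-1.0, -2.0]), np.array([0.5, 1.0])

for disc in (disc_mamba, disc_zoh, disc_bilinear):
    Abar, Bbar = disc(0.1, A, B)
    assert np.all(np.abs(Abar) < 1.0)   # stable for A < 0 and small delta
```

All three agree to first order in delta; they differ in how faithfully they integrate the continuous dynamics over each step.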
Event-Based Mode (Async)
When integration_timesteps is provided, the elapsed time between events scales the discretization step for each token, so the same block can process irregularly sampled (asynchronous) sequences.
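A NumPy sketch of that idea (illustrative only; I am assuming integration_timesteps multiplies the per-token delta before discretization):

```python
import numpy as np

def event_scan(u, delta, A, B, C, timesteps):
    """Selective scan with per-event integration timesteps (single channel)."""
    h, ys = np.zeros(A.shape[0]), []
    for t in range(len(u)):
        d = delta[t] * timesteps[t]          # longer gaps => larger effective step
        h = np.exp(d * A) * h + d * B[t] * u[t]
        ys.append(C[t] @ h)
    return np.array(ys)

rng = np.random.default_rng(2)
L, N = 8, 3
u, delta = rng.normal(size=L), np.full(L, 0.1)
A = -np.arange(1.0, N + 1)
B, C = rng.normal(size=(L, N)), rng.normal(size=(L, N))

y_uniform = event_scan(u, delta, A, B, C, np.ones(L))              # regular sampling
y_async = event_scan(u, delta, A, B, C, rng.uniform(0.5, 2.0, L))  # irregular gaps
assert y_uniform.shape == y_async.shape == (L,)
```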
State Representation
Mamba maintains:
- conv_state: Last d_conv timesteps for causal conv
- lrnn_state: SSM hidden state
- seqlen_offset: Current position (for autoregressive)
Performance Tips
For language modeling, typical values are:
- d_state=16 (small state)
- d_conv=4 (short convolution)
- expand=2 (2x expansion)
When to Use Mamba
✅ Use Mamba when:
- You need selective processing (focus on important info)
- Working on language modeling or NLP
- You want state-of-the-art performance
- Long sequences are common
- You have GPU/CUDA available

❌ Consider alternatives when:
- Training speed is critical → S4D (faster training)
- Minimal parameters needed → LRU
- Simpler model preferred → S5 or RG-LRU
Comparison with Other Models
| Model | Type | Selective | Speed (Train) | Speed (Infer) | Performance |
|---|---|---|---|---|---|
| Mamba | LTV | ✅ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| S4D | LTI | ❌ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| RG-LRU | LTV | ✅ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| S7 | LTV | ✅ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
