## Overview
RG-LRU (Real-Gated Linear Recurrent Unit) is a gated linear recurrent layer from the Griffin architecture that combines the efficiency of the diagonal LRU with input-dependent gating. It achieves performance competitive with Mamba while being simpler and faster. RG-LRU uses a recurrence gate to modulate state updates and an input gate to filter inputs, enabling selective processing without the cost of fully input-dependent transition matrices.

## Paper Reference
[Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models](https://arxiv.org/abs/2402.19427)

## Installation
## Parameters

- `d_model`: Model dimension; the size of the input and output features.
- Conv kernel size: Temporal convolution kernel size. Typically 3-4 for local mixing.
- `expand`: Expansion factor for the inner dimension, `d_inner = expand * d_model`. Usually 1 for RG-LRU.
- `c`: Fixed scalar for recurrent-gate scaling. Controls the range of gate values.
- `a_init_range`: Tuple `(lo, hi)` for initializing the recurrence base `a` uniformly in this range within (0, 1). Higher values enable longer memory.
- Conv bias: Whether the Conv1d layer uses a bias term.
- Bias: Whether the linear projections use bias.
- Fused kernel: Use the fused CUDA kernel when available for a significant speedup.
- Layer index: Layer index for multi-layer caching during inference.
- Device: Device for model parameters.
- Dtype: Data type for model parameters.
## Usage Example
### Basic Usage
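The original usage snippet appears to have been lost from this page. As a stand-in, here is a minimal pure-Python sketch of one RG-LRU channel processed over a sequence; the function name `rg_lru_forward` and the scalar gate weights `w_r`, `w_i` are illustrative, not the module's real API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rg_lru_forward(xs, a=0.95, c=8.0, w_r=1.0, w_i=1.0):
    """Run the RG-LRU recurrence over a 1-D input sequence (one channel).

    r_t = sigmoid(w_r * x_t)                            # recurrence gate
    i_t = sigmoid(w_i * x_t)                            # input gate
    a_t = a ** (c * r_t)                                # gated decay
    h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t)
    """
    h, ys = 0.0, []
    for x in xs:
        r = sigmoid(w_r * x)
        i = sigmoid(w_i * x)
        a_t = a ** (c * r)
        h = a_t * h + math.sqrt(1.0 - a_t * a_t) * (i * x)
        ys.append(h)
    return ys

ys = rg_lru_forward([0.5, -1.0, 2.0, 0.0])
print(ys)  # one hidden value per time step
```

The real layer applies this element-wise across `d_inner` channels; the scalar version above is just the per-channel math.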
### Language Modeling Configuration
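The configuration example is also missing. The dictionary below restates the defaults named elsewhere on this page (`expand = 1`, `c = 8.0`, `a_init_range = (0.9, 0.999)`); the key names mirror the Parameters section and the `d_model` value is illustrative, not a confirmed constructor signature.

```python
# Hypothetical configuration mirroring the Parameters section above.
lm_config = {
    "d_model": 1024,               # model width (illustrative value)
    "d_conv": 4,                   # temporal conv kernel, typically 3-4 (assumed key name)
    "expand": 1,                   # RG-LRU usually keeps d_inner == d_model
    "c": 8.0,                      # fixed recurrent-gate scaling (not learned)
    "a_init_range": (0.9, 0.999),  # recurrence base init, favoring long memory
}

d_inner = lm_config["expand"] * lm_config["d_model"]
print(d_inner)  # 1024
```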
### Autoregressive Inference
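The inference snippet is likewise lost; the sketch below shows the idea with the same pure-Python recurrence. During decoding, only the scalar state per channel (`d_state = 1`) has to be cached between steps. `rg_lru_step` is an illustrative name, not the module's API.

```python
import math

def rg_lru_step(x, h, a=0.95, c=8.0):
    """One decoding step: consume input x and cached state h, return the new h."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    r, i = sigmoid(x), sigmoid(x)   # recurrence and input gates (weights folded to 1 for brevity)
    a_t = a ** (c * r)
    return a_t * h + math.sqrt(1.0 - a_t * a_t) * (i * x)

h = 0.0                             # cache: one scalar per channel
for x in [0.5, -1.0, 2.0, 0.0]:    # tokens arriving one at a time
    h = rg_lru_step(x, h)
print(h)
```

Stepping with a zero input simply decays the cached state, since the input term vanishes and `a_t = a ** (c/2) < 1`.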
### Custom Memory Range
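No code survives under this heading either. One way to reason about a custom `a_init_range` is through the characteristic timescale of the decay, roughly `-1 / ln(a)` steps before the state shrinks by a factor of e:

```python
import math

def memory_horizon(a):
    """Approximate number of steps before the state decays by a factor of e."""
    return -1.0 / math.log(a)

for a in (0.9, 0.99, 0.999):
    print(f"a = {a}: ~{memory_horizon(a):.0f} steps")
# a = 0.9 remembers on the order of 10 steps; a = 0.999 on the order of 1000.
```

This is why higher `a_init_range` values give longer memory, as noted in the Parameters section.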
## Key Features

### Gated Recurrence
RG-LRU uses two gates, both computed from the current input:

- Recurrence gate: `r_t = sigmoid(W_r x_t + b_r)`, which controls how much past state is retained.
- Input gate: `i_t = sigmoid(W_i x_t + b_i)`, which filters the current input.

### Diagonal Recurrence

Like the LRU, the state transition is diagonal, so the update is element-wise: `h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t)`, with gated decay `a_t = a^(c * r_t)`. The `sqrt(1 - a_t^2)` term normalizes the update to maintain unit variance.
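A quick numerical check of the claim that the sqrt term maintains unit variance: with a fixed decay `a_t` and unit-variance white-noise input, the stationary variance of the state is `a_t**2 * Var(h) + (1 - a_t**2) * Var(x) = Var(x)`. A rough simulation (illustrative, with the gates held fixed):

```python
import math, random

random.seed(0)
a_t = 0.9
scale = math.sqrt(1.0 - a_t * a_t)   # the variance-preserving factor

h, samples = 0.0, []
for _ in range(200_000):
    h = a_t * h + scale * random.gauss(0.0, 1.0)
    samples.append(h)

var = sum(s * s for s in samples) / len(samples)
print(round(var, 2))  # close to 1.0
```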
### Two-Stream Architecture

The block runs two parallel streams, a gating stream and an RG-LRU stream, whose outputs are multiplied together before the output projection (see Forward Pass below).
## Architecture Details

### Forward Pass

1. Gate stream: `gate = gelu(gate_proj(x))`, a simple gating pathway.
2. RG-LRU stream:
   - Input projection: `x = in_proj(x)`
   - Conv1d: `x = conv1d(x)` (causal)
   - Gate projections: compute `recurrent_gate` and `input_gate`
   - Gated scan: update the state with the gated recurrence
3. Merge: `out = out_proj(gate * y)`, multiplying the two streams and projecting.
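The steps above can be sketched end-to-end in pure Python for a single channel, with the projections reduced to scalar weights and the causal conv omitted; all names are illustrative:

```python
import math

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def block_forward(xs, a=0.95, c=8.0):
    """Two-stream RG-LRU block, scalar per-channel version (conv1d omitted)."""
    h, out = 0.0, []
    for x in xs:
        gate = gelu(x)                                      # 1. gate stream
        r, i = sigmoid(x), sigmoid(x)                       # 2. gate projections (weights folded to 1)
        a_t = a ** (c * r)
        h = a_t * h + math.sqrt(1.0 - a_t ** 2) * (i * x)   #    gated scan
        out.append(gate * h)                                # 3. merge: multiply streams
    return out

out = block_forward([0.5, -1.0, 2.0])
print(out)
```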
### Recurrent Update

The core RG-LRU recurrence is `h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t)` with `a_t = a^(c * r_t)`, applied with `d_state = 1` (a scalar state per channel).
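The recurrence's selectivity is easiest to see at the gate's endpoints: when the recurrence gate is 0, `a_t = a**0 = 1` and `sqrt(1 - a_t**2) = 0`, so the token is skipped and the state passes through unchanged; when it is 1, decay is maximal at `a**c`. A tiny numeric check (illustrative helper name):

```python
import math

def update(h, x, r, i, a=0.95, c=8.0):
    """One RG-LRU state update with explicit gate values r, i in [0, 1]."""
    a_t = a ** (c * r)
    return a_t * h + math.sqrt(1.0 - a_t * a_t) * (i * x)

# Gate fully closed: the input is ignored and the state is preserved exactly.
print(update(h=0.7, x=5.0, r=0.0, i=1.0))   # 0.7

# Gate fully open: strongest decay a**c = 0.95**8 and largest input weight.
print(update(h=0.7, x=5.0, r=1.0, i=1.0))
```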
### Initialization

- `a` (base): uniform in `[a_init_range[0], a_init_range[1]]`, typically (0.9, 0.999). Higher values give longer memory.
- Gate projections: standard initialization, with bias (important for gates).
- `c` (scaling): fixed constant (default 8.0). Not learned; controls the gate range.
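Since `c` sets the exponent ceiling, it directly controls how far the gate can pull the effective decay below the base: `a_t` ranges over `[a**c, 1]`. For an illustrative base `a = 0.95`:

```python
a = 0.95
for c in (4.0, 8.0, 16.0):
    print(f"c = {c:>4}: a_t in [{a ** c:.3f}, 1.0]")
# Larger c widens the modulation range the gate can express.
```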
## Performance Characteristics

### Speed

✅ Fast:

- Simpler than Mamba
- Diagonal structure (element-wise ops)
- Efficient CUDA kernels available

### Memory

✅ Efficient:

- `d_state = 1` (minimal state)
- No large intermediate tensors
- Small cache for inference

### Accuracy

✅ Competitive:

- Close to Mamba on many tasks
- Strong on language modeling
- Scales well with model size
## Comparison with Other Models
| Model | Gating | State Dim | Speed | Accuracy |
|---|---|---|---|---|
| RG-LRU | ✅ Recurrent | 1 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Mamba | ✅ Selective | 16 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| LRU | ❌ | N | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| S7 | ✅ | N | ⭐⭐⭐ | ⭐⭐⭐⭐ |
## Performance Tips

The `c` parameter controls gate scaling. The default `c = 8.0` works well, but values in the 4.0-16.0 range are worth experimenting with for different behaviors.

## When to Use RG-LRU
✅ Use RG-LRU when:

- You want gated/selective processing
- You want something simpler than Mamba
- Fast training is important
- You are working on language modeling
- You need a strong baseline
## Advanced Usage
### Stacked Architecture
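The original stacking example is gone. Conceptually a stack is just residual composition of blocks; a pure-Python stand-in using the same scalar recurrence (all names illustrative):

```python
import math

def rg_lru_layer(xs, a=0.95, c=8.0):
    """Minimal per-channel RG-LRU pass used as a stand-in for one block."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    h, ys = 0.0, []
    for x in xs:
        a_t = a ** (c * sigmoid(x))
        h = a_t * h + math.sqrt(1.0 - a_t * a_t) * (sigmoid(x) * x)
        ys.append(h)
    return ys

def stack(xs, depth=4):
    """Residual stack: each layer adds its output back onto its input."""
    for _ in range(depth):
        ys = rg_lru_layer(xs)
        xs = [x + y for x, y in zip(xs, ys)]
    return xs

print(stack([0.5, -1.0, 2.0]))
```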
### Hybrid with Attention
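The hybrid example is likewise missing. Griffin interleaves recurrent blocks with local attention, roughly two recurrent blocks per local-attention block. A layout sketch showing the layer pattern only (no attention implementation; names illustrative):

```python
def griffin_layout(n_layers):
    """Repeat the pattern: two recurrent blocks, then one local-attention block."""
    pattern = ["rg_lru", "rg_lru", "local_attention"]
    return [pattern[i % 3] for i in range(n_layers)]

layout = griffin_layout(9)
print(layout)
```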
### Custom Memory Configuration
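A per-layer memory schedule is one plausible use of `a_init_range`: giving deeper layers a base closer to 1 yields longer horizons. The schedule below is an assumption for illustration, not a documented configuration:

```python
import math, random

random.seed(0)

def init_a(lo, hi, d_inner):
    """Sample the recurrence base a uniformly in (lo, hi), one value per channel."""
    return [random.uniform(lo, hi) for _ in range(d_inner)]

# Shorter memory early in the stack, longer memory deeper (assumed schedule).
ranges = [(0.5, 0.9), (0.9, 0.99), (0.99, 0.999)]
for depth, (lo, hi) in enumerate(ranges):
    a = init_a(lo, hi, d_inner=4)
    horizons = [-1.0 / math.log(v) for v in a]
    print(f"layer {depth}: mean horizon ~{sum(horizons) / len(horizons):.0f} steps")
```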
## RG-LRU vs LRU

### Key Differences
| Aspect | LRU | RG-LRU |
|---|---|---|
| Type | LTI | LTV |
| Gating | None | Recurrent + Input |
| State dim | d_state | 1 (per channel) |
| Dynamics | Fixed | Input-dependent |
| Expressiveness | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Speed (train) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Use case | General | Language |
### When to Use Which

Use LRU if:

- Simplicity is key
- An LTI model is sufficient
- Maximum speed is needed

Use RG-LRU if:

- You need selective processing
- You are doing language modeling
- You want input-dependent dynamics
