## Overview
RGLRU implements the Real-Gated Linear Recurrent Unit (RG-LRU), a linear time-varying (LTV) model based on the Griffin architecture. It uses a gated linear recurrence with learnable parameters for efficient sequence modeling.
Key features:
- Simple 1D state space (d_state=1)
- Dual-stream architecture: gate path and recurrent path
- Learnable recurrence base parameter in (0, 1)
- Causal temporal convolution for local context
- Efficient inference with state caching
## Import
## Class Signature
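The signature block itself did not survive extraction. Below is a sketch consistent with the constructor parameters documented in this section; parameter names not stated explicitly in the text (e.g. `a_init_range`, `conv_bias`, `use_fused_kernel`) and all default values shown are assumptions, not the library's confirmed API:

```python
from typing import Optional, Tuple

import torch


class RGLRU(torch.nn.Module):
    """Constructor sketch; names and defaults are inferred from the
    parameter descriptions below and may differ from the real API."""

    def __init__(
        self,
        d_model: int,
        d_conv: int = 4,                          # temporal conv kernel size
        expand: int = 2,                          # inner dim = d_model * expand
        c: float = 8.0,                           # fixed recurrent-gate scale
        a_init_range: Tuple[float, float] = (0.9, 0.999),
        conv_bias: bool = True,
        bias: bool = False,
        use_fused_kernel: bool = True,
        layer_idx: Optional[int] = None,          # for multi-layer caching
        device=None,
        dtype=None,
    ) -> None:
        super().__init__()
        self.d_model = d_model
        self.d_inner = d_model * expand           # inner dimension
        self.layer_idx = layer_idx
```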
## Constructor

### __init__
**Parameters:**

- `d_model`: Model dimension (input/output dimension).
- `d_conv`: Temporal convolution kernel size for local context modeling.
- `expand`: Expansion factor for the inner dimension. Inner dimension = `d_model * expand`.
- `c`: Fixed scalar for recurrent-gate scaling. Controls the effective timescale of the recurrence.
- `a_init_range`: Tuple `(lo, hi)` defining the initialization range for the recurrence base parameter `a` in (0, 1). Values closer to 1 create longer-range dependencies.
- `conv_bias`: Whether the Conv1D layer uses a bias term.
- `bias`: Whether linear projections use bias terms.
- `use_fused_kernel`: Use the fused CUDA kernel when available for improved performance.
- `layer_idx`: Layer index for multi-layer caching in stacked architectures.
- `device`: Device for parameters. If `None`, uses the default device.
- `dtype`: Data type for parameters. If `None`, uses the default dtype.

## Methods
### forward
**Parameters:**

- Input tensor of shape `(B, L, D)`, where `B` = batch size, `L` = sequence length, `D` = model dimension (`d_model`).
- Currently unused. Kept for LTV interface compatibility.
- Currently unused. Kept for interface compatibility.
- Cache dict for autoregressive generation. If provided, it must contain:
  - `"conv_state"`: convolution state tensor
  - `"lrnn_state"`: RG-LRU state tensor
  - `"seqlen_offset"`: current position in the sequence

**Returns:** Output tensor of shape `(B, L, D)`.

### step
**Parameters:**

- Input tensor of shape `(B, 1, D)` for a single timestep.
- Cache dictionary containing:
  - `"conv_state"`: convolution state, shape `(B, D_inner, d_conv)`
  - `"lrnn_state"`: RG-LRU state, shape `(B, D_inner, 1)`
  - `"seqlen_offset"`: current position in the sequence
- Additional keyword arguments (unused).

**Returns:** Tuple containing:

- Output tensor of shape `(B, 1, D)`
- Updated cache dictionary
### allocate_inference_cache
**Parameters:**

- Batch size for inference.
- Maximum sequence length. Unused, but kept for interface consistency.
- Data type for cache tensors. If `None`, uses the model's parameter dtype.
- Additional keyword arguments (unused).

**Returns:** Cache dictionary containing:

- `"conv_state"`: shape `(B, D_inner, d_conv)`
- `"lrnn_state"`: shape `(B, D_inner, 1)`
- `"seqlen_offset"`: initialized to `0`
## Examples
### Basic Usage
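The example code did not survive extraction. As a stand-in, here is a self-contained minimal re-implementation of the architecture described in this page, plus a basic forward call. `MinimalRGLRU` and every internal detail (gate layout, logit parameterization of `a`) are illustrative assumptions, not the library's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MinimalRGLRU(nn.Module):
    """Illustrative reference implementation, not the optimized kernel."""

    def __init__(self, d_model, d_conv=4, expand=2, c=8.0,
                 a_init_range=(0.9, 0.999)):
        super().__init__()
        self.d_inner = d_model * expand
        self.c = c
        # Dual-stream projections: gate path and recurrent path.
        self.gate_proj = nn.Linear(d_model, self.d_inner)
        self.in_proj = nn.Linear(d_model, self.d_inner)
        # Causal temporal convolution for local context (depthwise).
        self.conv = nn.Conv1d(self.d_inner, self.d_inner, d_conv,
                              padding=d_conv - 1, groups=self.d_inner)
        # Learnable recurrence base a in (0, 1), stored as a logit.
        lo, hi = a_init_range
        self.a_logit = nn.Parameter(
            torch.logit(torch.empty(self.d_inner).uniform_(lo, hi)))
        self.gate_r = nn.Linear(self.d_inner, self.d_inner)  # recurrence gate
        self.gate_i = nn.Linear(self.d_inner, self.d_inner)  # input gate
        self.out_proj = nn.Linear(self.d_inner, d_model)

    def forward(self, x):                       # x: (B, L, D)
        B, L, _ = x.shape
        gate = F.gelu(self.gate_proj(x))        # gate stream: Linear -> GeLU
        u = self.in_proj(x)                     # recurrent stream: Linear
        u = self.conv(u.transpose(1, 2))[..., :L].transpose(1, 2)  # causal conv
        a = torch.sigmoid(self.a_logit)         # base parameter in (0, 1)
        r = torch.sigmoid(self.gate_r(u))
        i = torch.sigmoid(self.gate_i(u))
        a_t = a.pow(self.c * r)                 # time-varying decay a^(c*r_t)
        h = torch.zeros(B, self.d_inner, device=x.device, dtype=x.dtype)
        hs = []
        for t in range(L):                      # RG-LRU scan, 1D state/channel
            h = a_t[:, t] * h + torch.sqrt(1 - a_t[:, t] ** 2) * (i[:, t] * u[:, t])
            hs.append(h)
        y = torch.stack(hs, dim=1)
        return self.out_proj(gate * y)          # gated combination


layer = MinimalRGLRU(d_model=32)
y = layer(torch.randn(2, 16, 32))  # (B, L, D) in -> (B, L, D) out
```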
### Custom Initialization
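The original snippet is gone. The sketch below illustrates what customizing `a_init_range` controls; the helper `init_base` and the logit parameterization are illustrative assumptions, not the library's internals:

```python
import torch


def init_base(d_inner, a_init_range=(0.9, 0.999)):
    """Draw the recurrence base a uniformly from (lo, hi) and store it as a
    logit, so the learnable parameter is unconstrained while
    sigmoid(parameter) stays in (0, 1)."""
    lo, hi = a_init_range
    a = torch.empty(d_inner).uniform_(lo, hi)
    return torch.nn.Parameter(torch.logit(a))


# Values of a closer to 1 decay more slowly, giving longer-range
# dependencies; the effective memory horizon is roughly 1 / (1 - a) steps.
short = torch.sigmoid(init_base(4, (0.5, 0.6)))    # fast-decaying state
long = torch.sigmoid(init_base(4, (0.99, 0.999)))  # long-range state
```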
### Autoregressive Inference
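The original example was lost. Below is a self-contained sketch of the caching idea using the bare recurrence rather than the full layer, so it runs without the library; the cache keys mirror `step()`'s contract. It checks that decoding one token at a time from a cached state reproduces the full-sequence scan:

```python
import torch

torch.manual_seed(0)
B, L, D = 2, 8, 4
x = torch.randn(B, L, D)
a_t = torch.sigmoid(torch.randn(B, L, D))  # per-step decay in (0, 1)


def scan(x, a_t, h0):
    """h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * x_t, resumable from h0."""
    h, out = h0, []
    for t in range(x.shape[1]):
        h = a_t[:, t] * h + torch.sqrt(1 - a_t[:, t] ** 2) * x[:, t]
        out.append(h)
    return torch.stack(out, 1), h


full, _ = scan(x, a_t, torch.zeros(B, D))

# Token-by-token decoding with a cached state, mimicking step()/seqlen_offset.
cache = {"lrnn_state": torch.zeros(B, D), "seqlen_offset": 0}
steps = []
for t in range(L):
    y, cache["lrnn_state"] = scan(x[:, t:t + 1], a_t[:, t:t + 1],
                                  cache["lrnn_state"])
    cache["seqlen_offset"] += 1
    steps.append(y)

assert torch.allclose(full, torch.cat(steps, 1), atol=1e-5)
```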
### Multi-Layer Stack
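The original example is missing. A minimal sketch of the per-layer cache bookkeeping that `layer_idx` enables, with shapes following `allocate_inference_cache` above (the surrounding model code is omitted, and the dict layout here is an assumption):

```python
import torch

# One cache entry per layer, keyed by layer_idx; each layer in the stack
# reads and writes only cache[self.layer_idx], so a single dict can be
# threaded through the whole model during decoding.
B, d_inner, d_conv, n_layers = 2, 64, 4, 4
cache = {
    layer_idx: {
        "conv_state": torch.zeros(B, d_inner, d_conv),
        "lrnn_state": torch.zeros(B, d_inner, 1),
        "seqlen_offset": 0,
    }
    for layer_idx in range(n_layers)
}
```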
## Architecture Details
### Dual-Stream Processing
RGLRU processes input through two parallel streams:

- **Gate stream:** `Linear → GeLU`, which produces multiplicative gating
- **Recurrent stream:** `Linear → Conv1D → RG-LRU`, which performs temporal processing
### Recurrence Equation
The RG-LRU recurrence is `h_t = a_t ⊙ h_{t−1} + √(1 − a_t²) ⊙ (i_t ⊙ x_t)`, with `a_t = a^(c · r_t)`, where:

- `a` is the learnable base parameter in (0, 1)
- `c` is the fixed scaling constant
- `r_t` and `i_t` are per-step recurrence and input gates of the form `σ(Linear(x_t))`, with `σ` the sigmoid function
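A small numeric sketch of one update under this recurrence (all values hypothetical):

```python
import torch

a = torch.tensor(0.95)                   # learnable base in (0, 1)
c = 8.0                                  # fixed scaling constant
x_t = torch.tensor(0.5)                  # current input
h_prev = torch.tensor(1.0)               # previous state
r_t = torch.sigmoid(torch.tensor(2.0))   # recurrence gate
i_t = torch.sigmoid(torch.tensor(0.0))   # input gate (= 0.5)

a_t = a ** (c * r_t)                     # effective decay for this step
h_t = a_t * h_prev + torch.sqrt(1 - a_t ** 2) * (i_t * x_t)

# As r_t -> 0, a_t -> 1: the state is carried through unchanged and the
# sqrt(1 - a_t^2) factor blocks the input. As r_t -> 1, a_t -> a^c:
# fastest forgetting and strongest input admission.
```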
