Overview
LRNNLMHeadModel is a complete language model architecture that combines the LRNN backbone with a language modeling head for next-token prediction tasks.
Class Definition
Parameters
Model dimension (hidden size).
State dimension for the LRNN layers.
Number of layers in the model.
Size of the vocabulary.
List of mixer type names for each layer. Must have length equal to n_layer. Available mixer types:
- "LRU": Linear Recurrent Unit
- "S4": Structured State Space (S4)
- "S4D": Diagonal State Space (S4D)
- "S5": Simplified State Space (S5)
- "Centaurus": Centaurus mixer
- "Mamba": Mamba (LTV)
- "RGLRU": Recurrent Gated Linear Recurrent Unit
- "S7": S7 (LTV)
- "attn": Multi-head attention
Example: ["LRU", "S5", "attn", "Mamba", ...]
Intermediate dimension for MLP layers. Set to 0 to disable the MLP.
Additional arguments for mixer layers. Can be:
- A dict mapping mixer type names to their kwargs, e.g. {"S5": {"dt_min": 0.001}, "attn": {"num_heads": 8}}
- A single dict applied to all mixers
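A minimal sketch of building the per-layer configuration described above. The variable names `layer_types` and `mixer_kwargs` mirror the parameters documented here, but the exact keyword names accepted by the real constructor may differ:

```python
# Hypothetical per-layer mixer configuration for a 6-layer model.
n_layer = 6
layer_types = ["Mamba", "Mamba", "attn", "Mamba", "Mamba", "attn"]
assert len(layer_types) == n_layer  # length must equal n_layer

# Keyed by mixer type name: each kwargs dict is passed to every
# layer of that type. A single flat dict would instead apply to all mixers.
mixer_kwargs = {"Mamba": {"d_state": 16}, "attn": {"num_heads": 8}}
```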
MLP class to use. Defaults to GatedMLP.
Epsilon value for layer normalization.
Whether to use RMSNorm instead of LayerNorm.
Whether to use fused add+norm operations (requires Triton kernels).
Whether to compute residuals in float32 precision.
Whether to tie input and output embeddings (weight sharing).
Pad vocabulary size to a multiple of this value for efficiency.
Configuration for weight initialization.
Device to place tensors on.
Data type for tensors.
Methods
forward
Input token IDs of shape (B, L).
Position IDs (unused; accepted for API compatibility).
Parameters for inference mode.
If > 0, only return logits for the last n tokens.
Timesteps for LTV models (shape: (B, L)).
Sequence lengths for variable-length sequences (shape: (B,)).
Returns a CausalLMOutput namedtuple with:
- logits (torch.Tensor): logits of shape (B, L, vocab_size)
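Since the LM head serves next-token prediction, the logits at position t score the token at position t+1. A small sketch of that label alignment (no real model is run; this only illustrates the shift):

```python
# Next-token prediction alignment: targets are the inputs shifted left by one.
input_ids = [5, 9, 2, 7]           # one sequence of length L
logits_positions = input_ids[:-1]  # positions whose logits are scored
labels = input_ids[1:]             # target at each scored position
# The last position's logits have no label during training, but they are
# exactly what decoding needs (cf. num_last_tokens=1 at inference time).
```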
step
Input token IDs of shape (B, 1), i.e. a single token per sequence.
Dictionary mapping layer indices to their cached states.
Integration timesteps for LTV models (shape: (B, 1) or (B,)).
Returns a CausalLMOutput namedtuple with:
- logits (torch.Tensor): logits of shape (B, 1, vocab_size)
allocate_inference_cache
Batch size for inference.
Maximum sequence length for inference.
Data type for cache tensors.
Dictionary mapping layer indices to their allocated caches.
tie_weights
save_pretrained
Directory path where model and config will be saved.
from_pretrained
Path to directory containing saved model and config.
Additional keyword arguments for mixer.
MLP class to use.
Configuration for weight initialization.
Device to place tensors on.
Data type for tensors.
Loaded model instance.
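A round-trip sketch of the save_pretrained / from_pretrained contract: a directory holding a config file plus weights, from which the class method rebuilds the model. `TinyModel` and the file name `config.json` are assumptions for illustration, not confirmed by this document:

```python
import json
import os
import tempfile

# Toy stand-in showing the save/load round trip (weights omitted for brevity).
class TinyModel:
    def __init__(self, d_model, n_layer):
        self.config = {"d_model": d_model, "n_layer": n_layer}

    def save_pretrained(self, save_directory):
        os.makedirs(save_directory, exist_ok=True)
        with open(os.path.join(save_directory, "config.json"), "w") as f:
            json.dump(self.config, f)

    @classmethod
    def from_pretrained(cls, pretrained_model_path, **kwargs):
        with open(os.path.join(pretrained_model_path, "config.json")) as f:
            config = json.load(f)
        config.update(kwargs)  # runtime overrides (device, dtype, ...)
        return cls(**config)

with tempfile.TemporaryDirectory() as d:
    TinyModel(d_model=256, n_layer=4).save_pretrained(d)
    model = TinyModel.from_pretrained(d)
```

The extra keyword arguments documented above (mixer kwargs, MLP class, device, dtype) play the role of the runtime overrides in this sketch: they are applied on top of the saved config when the model is rebuilt.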
