Overview

LRNNLMHeadModel is a complete language model architecture that combines the LRNN backbone with a language modeling head for next-token prediction tasks.

Class Definition

from lrnnx.architectures import LRNNLMHeadModel

model = LRNNLMHeadModel(
    d_model=512,
    d_state=64,
    n_layer=12,
    vocab_size=50257,
    mixer_types=["LRU", "S5", "attn"] * 4,
    d_intermediate=2048,
    tie_embeddings=True
)

Parameters

d_model
int
required
Model dimension (hidden size).
d_state
int
required
State dimension for the LRNN layers.
n_layer
int
required
Number of layers in the model.
vocab_size
int
required
Size of the vocabulary.
mixer_types
list
required
List of mixer type names, one per layer. Must have length equal to n_layer. Available mixer types:
  • "LRU" - Linear Recurrent Unit
  • "S4" - Structured State Space (S4)
  • "S4D" - Diagonal State Space (S4D)
  • "S5" - Simplified State Space (S5)
  • "Centaurus" - Centaurus mixer
  • "Mamba" - Mamba (LTV)
  • "RGLRU" - Real-Gated Linear Recurrent Unit (RG-LRU)
  • "S7" - S7 (LTV)
  • "attn" - Multi-head attention
Example: ["LRU", "S5", "attn", "Mamba", ...]
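Since the list must match n_layer exactly, a repeating layer pattern is a convenient way to build it. A minimal sketch (`build_mixer_types` is a hypothetical helper, not part of lrnnx):

```python
def build_mixer_types(pattern, n_layer):
    """Repeat a layer pattern to exactly n_layer entries."""
    reps = -(-n_layer // len(pattern))  # ceiling division
    return (pattern * reps)[:n_layer]

# A 12-layer hybrid stack: attention every fourth layer
mixer_types = build_mixer_types(["LRU", "S5", "Mamba", "attn"], n_layer=12)
print(len(mixer_types))  # → 12
print(mixer_types[3])    # → attn
```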
d_intermediate
int
default:"0"
Intermediate dimension for MLP layers. Set to 0 to disable MLP.
mixer_kwargs
dict
default:"None"
Additional arguments for mixer layers. Can be:
  • A dict mapping mixer type names to their kwargs: {"S5": {"dt_min": 0.001}, "attn": {"num_heads": 8}}
  • A single dict applied to all mixers
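The two accepted shapes can be illustrated with a small resolution helper. This is a hypothetical sketch of how a per-type mapping or a single shared dict might resolve for one layer's mixer, not lrnnx's actual internals:

```python
MIXER_TYPES = {"LRU", "S4", "S4D", "S5", "Centaurus", "Mamba", "RGLRU", "S7", "attn"}

def resolve_mixer_kwargs(mixer_kwargs, mixer_type):
    """Return the kwargs for one mixer, given either accepted shape."""
    if mixer_kwargs is None:
        return {}
    # Shape 1: a mapping from mixer type names to their kwargs
    if all(k in MIXER_TYPES for k in mixer_kwargs):
        return mixer_kwargs.get(mixer_type, {})
    # Shape 2: a single dict applied to every mixer
    return dict(mixer_kwargs)

per_type = {"S5": {"dt_min": 0.001}, "attn": {"num_heads": 8}}
print(resolve_mixer_kwargs(per_type, "attn"))       # → {'num_heads': 8}
print(resolve_mixer_kwargs({"bias": False}, "S5"))  # → {'bias': False}
```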
mlp_cls
type
default:"None"
MLP class to use. Defaults to GatedMLP.
norm_epsilon
float
default:"1e-5"
Epsilon value for layer normalization.
rms_norm
bool
default:"True"
Whether to use RMSNorm instead of LayerNorm.
fused_add_norm
bool
default:"True"
Whether to use fused add+norm operations (requires Triton kernels).
residual_in_fp32
bool
default:"False"
Whether to compute residuals in float32 precision.
tie_embeddings
bool
default:"True"
Whether to tie input and output embeddings (weight sharing).
pad_vocab_size_multiple
int
default:"8"
Pad vocabulary size to a multiple of this value for efficiency.
initializer_cfg
dict
default:"None"
Configuration for weight initialization.
device
torch.device
default:"None"
Device to place tensors on.
dtype
torch.dtype
default:"None"
Data type for tensors.

Methods

forward

output = model.forward(
    input_ids,
    position_ids=None,
    inference_params=None,
    num_last_tokens=0,
    integration_timesteps=None,
    lengths=None
)
Forward pass of the language model.
input_ids
torch.Tensor
required
Input token IDs of shape (B, L).
position_ids
torch.Tensor
default:"None"
Position IDs (unused, for compatibility).
inference_params
dict
default:"None"
Parameters for inference mode.
num_last_tokens
int
default:"0"
If > 0, only return logits for the last n tokens.
integration_timesteps
torch.Tensor
default:"None"
Timesteps for LTV models (shape: (B, L)).
lengths
torch.Tensor
default:"None"
Sequence lengths for variable-length sequences (shape: (B,)).
output
namedtuple
Returns a CausalLMOutput namedtuple with:
  • logits (torch.Tensor): Logits of shape (B, L, vocab_size)
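For training, the logits at position t score the token at position t + 1, so a standard next-token loss shifts logits against labels. A minimal sketch with a synthetic tensor standing in for output.logits (toy vocab size for brevity):

```python
import torch
import torch.nn.functional as F

B, L, V = 2, 8, 101  # batch, sequence length, toy vocab size
logits = torch.randn(B, L, V)            # stand-in for output.logits
input_ids = torch.randint(0, V, (B, L))  # the tokens passed to forward()

# Position t predicts token t + 1: drop the last logit and the first label
shift_logits = logits[:, :-1, :].reshape(-1, V)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.shape)  # → torch.Size([])
```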

step

output = model.step(
    input_ids,
    caches,
    integration_timesteps=None
)
Single-step inference for autoregressive generation.
input_ids
torch.Tensor
required
Input token IDs of shape (B, 1) — single token.
caches
dict
required
Dictionary mapping layer indices to their cached states.
integration_timesteps
torch.Tensor
default:"None"
Integration timesteps for LTV models (shape: (B, 1) or (B,)).
output
namedtuple
Returns a CausalLMOutput namedtuple with:
  • logits (torch.Tensor): Logits of shape (B, 1, vocab_size)
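The single-position logits returned by step drive the next-token choice. A framework-free sketch of greedy and temperature sampling over a 1-D list of logits (`sample_next_token` is a hypothetical helper, not part of lrnnx):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Pick a token id from a 1-D list of logits (the step output's last axis)."""
    if temperature == 0.0:  # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.Random(seed).random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(logits) - 1

print(sample_next_token([0.1, 2.0, -1.0], temperature=0.0))  # → 1
```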

allocate_inference_cache

caches = model.allocate_inference_cache(
    batch_size=4,
    max_seqlen=2048,
    dtype=torch.float16
)
Allocate inference cache for autoregressive generation.
batch_size
int
required
Batch size for inference.
max_seqlen
int
required
Maximum sequence length for inference.
dtype
torch.dtype
default:"None"
Data type for cache tensors.
caches
dict
Dictionary mapping layer indices to their allocated caches.

tie_weights

model.tie_weights()
Tie the input and output embeddings so that the embedding layer and the language modeling head share one weight matrix, a common practice that reduces parameter count and often improves performance.
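The savings are easy to quantify. Using the dimensions from the class-definition example above, and the default pad_vocab_size_multiple of 8, the shared matrix accounts for roughly 25.7M parameters that would otherwise be duplicated:

```python
d_model = 512
vocab_size = 50257
pad_multiple = 8

# pad_vocab_size_multiple rounds the vocabulary up to a multiple of 8
padded_vocab = -(-vocab_size // pad_multiple) * pad_multiple
print(padded_vocab)  # → 50264

# With tie_embeddings=True, embedding and lm_head share one weight matrix
shared_params = padded_vocab * d_model
print(shared_params)  # → 25735168
```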

save_pretrained

model.save_pretrained("./my_model")
Save the model and configuration to a directory.
save_directory
str
required
Directory path where model and config will be saved.

from_pretrained

model = LRNNLMHeadModel.from_pretrained(
    "./my_model",
    device="cuda",
    dtype=torch.bfloat16
)
Load a pretrained model from a directory (class method).
pretrained_model_path
str
required
Path to directory containing saved model and config.
mixer_kwargs
dict
default:"None"
Additional keyword arguments for mixer.
mlp_cls
type
default:"None"
MLP class to use.
initializer_cfg
dict
default:"None"
Configuration for weight initialization.
device
torch.device
default:"None"
Device to place tensors on.
dtype
torch.dtype
default:"None"
Data type for tensors.
model
LRNNLMHeadModel
Loaded model instance.

Example Usage

import torch
from lrnnx.architectures import LRNNLMHeadModel

# Create model
model = LRNNLMHeadModel(
    d_model=512,
    d_state=64,
    n_layer=6,
    vocab_size=50257,
    mixer_types=["LRU", "S5"] * 3,
    d_intermediate=2048,
    tie_embeddings=True
).cuda()

# Training forward pass
input_ids = torch.randint(0, 50257, (2, 128)).cuda()
output = model(input_ids)
logits = output.logits  # (2, 128, 50257)

# Autoregressive generation
caches = model.allocate_inference_cache(batch_size=1, max_seqlen=2048)
token_id = torch.tensor([[1]]).cuda()  # Start token

generated = [token_id]
for _ in range(100):
    output = model.step(token_id, caches)
    token_id = output.logits.argmax(dim=-1)  # greedy decode, shape (1, 1)
    generated.append(token_id)
sequence = torch.cat(generated, dim=1)  # (1, 101)

# Save and load
model.save_pretrained("./my_lrnn_model")
loaded_model = LRNNLMHeadModel.from_pretrained("./my_lrnn_model")