Overview

LRNNLMHeadModel is a complete language model architecture that combines the LRNN backbone with a language modeling head for next-token prediction tasks.

Class Definition

from lrnnx.architectures import LRNNLMHeadModel

model = LRNNLMHeadModel(
    d_model=512,
    d_state=64,
    n_layer=12,
    vocab_size=50257,
    mixer_types=["LRU", "S5", "attn"] * 4,
    d_intermediate=2048,
    tie_embeddings=True
)

Parameters

d_model
int
required
Model dimension (hidden size).
d_state
int
required
State dimension for the LRNN layers.
n_layer
int
required
Number of layers in the model.
vocab_size
int
required
Size of the vocabulary.
mixer_types
list
required
List of mixer type names, one per layer. Must have length equal to n_layer. Available mixer types:
  • "LRU" - Linear Recurrent Unit
  • "S4" - Structured State Space (S4)
  • "S4D" - Diagonal State Space (S4D)
  • "S5" - Simplified State Space (S5)
  • "Centaurus" - Centaurus mixer
  • "Mamba" - Mamba (LTV)
  • "RGLRU" - Real-Gated Linear Recurrent Unit (RG-LRU)
  • "S7" - S7 (LTV)
  • "attn" - Multi-head attention
Example: ["LRU", "S5", "attn", "Mamba", ...]
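Since the list must match n_layer exactly, a repeating layer pattern is a convenient way to build it. A minimal sketch (`build_mixer_types` is a hypothetical helper, not part of lrnnx):

```python
def build_mixer_types(pattern, n_layer):
    """Repeat a layer pattern to exactly n_layer entries."""
    reps = -(-n_layer // len(pattern))  # ceiling division
    return (pattern * reps)[:n_layer]

# A 12-layer hybrid stack: attention every fourth layer
mixer_types = build_mixer_types(["LRU", "S5", "Mamba", "attn"], n_layer=12)
print(len(mixer_types))  # → 12
print(mixer_types[3])    # → attn
```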
d_intermediate
int
default:"0"
Intermediate dimension for MLP layers. Set to 0 to disable MLP.
mixer_kwargs
dict
default:"None"
Additional arguments for mixer layers. Can be:
  • A dict mapping mixer type names to their kwargs: {"S5": {"dt_min": 0.001}, "attn": {"num_heads": 8}}
  • A single dict applied to all mixers
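The two accepted shapes can be illustrated with a small resolution helper. This is a hypothetical sketch of how a per-type mapping or a single shared dict might resolve for one layer's mixer, not lrnnx's actual internals:

```python
MIXER_TYPES = {"LRU", "S4", "S4D", "S5", "Centaurus", "Mamba", "RGLRU", "S7", "attn"}

def resolve_mixer_kwargs(mixer_kwargs, mixer_type):
    """Return the kwargs for one mixer, given either accepted shape."""
    if mixer_kwargs is None:
        return {}
    # Shape 1: a mapping from mixer type names to their kwargs
    if all(k in MIXER_TYPES for k in mixer_kwargs):
        return mixer_kwargs.get(mixer_type, {})
    # Shape 2: a single dict applied to every mixer
    return dict(mixer_kwargs)

per_type = {"S5": {"dt_min": 0.001}, "attn": {"num_heads": 8}}
print(resolve_mixer_kwargs(per_type, "attn"))       # → {'num_heads': 8}
print(resolve_mixer_kwargs({"bias": False}, "S5"))  # → {'bias': False}
```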
mlp_cls
type
default:"None"
MLP class to use. Defaults to GatedMLP.
norm_epsilon
float
default:"1e-5"
Epsilon value for layer normalization.
rms_norm
bool
default:"True"
Whether to use RMSNorm instead of LayerNorm.
fused_add_norm
bool
default:"True"
Whether to use fused add+norm operations (requires Triton kernels).
residual_in_fp32
bool
default:"False"
Whether to compute residuals in float32 precision.
tie_embeddings
bool
default:"True"
Whether to tie input and output embeddings (weight sharing).
pad_vocab_size_multiple
int
default:"8"
Pad vocabulary size to a multiple of this value for efficiency.
initializer_cfg
dict
default:"None"
Configuration for weight initialization.
device
torch.device
default:"None"
Device to place tensors on.
dtype
torch.dtype
default:"None"
Data type for tensors.

Methods

forward

output = model.forward(
    input_ids,
    position_ids=None,
    inference_params=None,
    num_last_tokens=0,
    integration_timesteps=None,
    lengths=None
)
Forward pass of the language model.
input_ids
torch.Tensor
required
Input token IDs of shape (B, L).
position_ids
torch.Tensor
default:"None"
Position IDs (unused, for compatibility).
inference_params
dict
default:"None"
Parameters for inference mode.
num_last_tokens
int
default:"0"
If > 0, only return logits for the last n tokens.
integration_timesteps
torch.Tensor
default:"None"
Timesteps for LTV models (shape: (B, L)).
lengths
torch.Tensor
default:"None"
Sequence lengths for variable-length sequences (shape: (B,)).
output
namedtuple
Returns a CausalLMOutput namedtuple with:
  • logits (torch.Tensor): Logits of shape (B, L, vocab_size)
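For training, the logits at position t score the token at position t + 1, so a standard next-token loss shifts logits against labels. A minimal sketch with a synthetic tensor standing in for output.logits (toy vocab size for brevity):

```python
import torch
import torch.nn.functional as F

B, L, V = 2, 8, 101  # batch, sequence length, toy vocab size
logits = torch.randn(B, L, V)            # stand-in for output.logits
input_ids = torch.randint(0, V, (B, L))  # the tokens passed to forward()

# Position t predicts token t + 1: drop the last logit and the first label
shift_logits = logits[:, :-1, :].reshape(-1, V)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.shape)  # → torch.Size([])
```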

step

output = model.step(
    input_ids,
    caches,
    integration_timesteps=None
)
Single-step inference for autoregressive generation.
input_ids
torch.Tensor
required
Input token IDs of shape (B, 1) — single token.
caches
dict
required
Dictionary mapping layer indices to their cached states.
integration_timesteps
torch.Tensor
default:"None"
Integration timesteps for LTV models (shape: (B, 1) or (B,)).
output
namedtuple
Returns a CausalLMOutput namedtuple with:
  • logits (torch.Tensor): Logits of shape (B, 1, vocab_size)
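The single-position logits returned by step drive the next-token choice. A framework-free sketch of greedy and temperature sampling over a 1-D list of logits (`sample_next_token` is a hypothetical helper, not part of lrnnx):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Pick a token id from a 1-D list of logits (the step output's last axis)."""
    if temperature == 0.0:  # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.Random(seed).random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(logits) - 1

print(sample_next_token([0.1, 2.0, -1.0], temperature=0.0))  # → 1
```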

allocate_inference_cache

caches = model.allocate_inference_cache(
    batch_size=4,
    max_seqlen=2048,
    dtype=torch.float16
)
Allocate inference cache for autoregressive generation.
batch_size
int
required
Batch size for inference.
max_seqlen
int
required
Maximum sequence length for inference.
dtype
torch.dtype
default:"None"
Data type for cache tensors.
caches
dict
Dictionary mapping layer indices to their allocated caches.

tie_weights

model.tie_weights()
Tie the input and output embeddings so that the embedding layer and the language modeling head share one weight matrix, a common practice that reduces parameter count and often improves performance.
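The savings are easy to quantify. Using the dimensions from the class-definition example above, and the default pad_vocab_size_multiple of 8, the shared matrix accounts for roughly 25.7M parameters that would otherwise be duplicated:

```python
d_model = 512
vocab_size = 50257
pad_multiple = 8

# pad_vocab_size_multiple rounds the vocabulary up to a multiple of 8
padded_vocab = -(-vocab_size // pad_multiple) * pad_multiple
print(padded_vocab)  # → 50264

# With tie_embeddings=True, embedding and lm_head share one weight matrix
shared_params = padded_vocab * d_model
print(shared_params)  # → 25735168
```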

save_pretrained

model.save_pretrained("./my_model")
Save the model and configuration to a directory.
save_directory
str
required
Directory path where model and config will be saved.

from_pretrained

model = LRNNLMHeadModel.from_pretrained(
    "./my_model",
    device="cuda",
    dtype=torch.bfloat16
)
Load a pretrained model from a directory (class method).
pretrained_model_path
str
required
Path to directory containing saved model and config.
mixer_kwargs
dict
default:"None"
Additional keyword arguments for mixer.
mlp_cls
type
default:"None"
MLP class to use.
initializer_cfg
dict
default:"None"
Configuration for weight initialization.
device
torch.device
default:"None"
Device to place tensors on.
dtype
torch.dtype
default:"None"
Data type for tensors.
model
LRNNLMHeadModel
Loaded model instance.

Example Usage

import torch
from lrnnx.architectures import LRNNLMHeadModel

# Create model
model = LRNNLMHeadModel(
    d_model=512,
    d_state=64,
    n_layer=6,
    vocab_size=50257,
    mixer_types=["LRU", "S5"] * 3,
    d_intermediate=2048,
    tie_embeddings=True
).cuda()

# Training forward pass
input_ids = torch.randint(0, 50257, (2, 128)).cuda()
output = model(input_ids)
logits = output.logits  # (2, 128, 50257)

# Autoregressive generation
caches = model.allocate_inference_cache(batch_size=1, max_seqlen=2048)
token_id = torch.tensor([[1]]).cuda()  # Start token

generated = [token_id]
for _ in range(100):
    output = model.step(token_id, caches)
    token_id = output.logits.argmax(dim=-1)  # greedy decode, shape (1, 1)
    generated.append(token_id)
sequence = torch.cat(generated, dim=1)  # (1, 101)

# Save and load
model.save_pretrained("./my_lrnn_model")
loaded_model = LRNNLMHeadModel.from_pretrained("./my_lrnn_model")