Overview
The `Block` class is a fundamental building block for constructing neural architectures in lrnnx. It wraps a mixer module (e.g., attention, LRNN) with layer normalization and residual connections, and can optionally include an MLP module.
This implementation is structured differently from a standard prenorm Transformer block:
- Standard: LN → MHA/MLP → Add
- This Block: Add → LN → Mixer
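The difference can be sketched in plain PyTorch (a minimal illustration only: `mixer` stands in for any attention or LRNN module, and all names here are illustrative, not the lrnnx API):

```python
import torch
import torch.nn as nn

dim = 16
norm = nn.LayerNorm(dim)
mixer = nn.Linear(dim, dim)  # stand-in for an attention / LRNN mixer

x = torch.randn(2, 5, dim)  # (batch_size, seq_len, dim)

# Standard prenorm block: LN -> Mixer -> Add
standard_out = x + mixer(norm(x))

# This Block: Add -> LN -> Mixer
# The residual is updated *before* normalization, and the mixer output
# is returned alongside the new residual for the next block to add in.
residual = x  # in the first block, the residual starts as the input
hidden = mixer(norm(residual))
# the next block would then compute: residual = residual + hidden, LN, Mixer

assert standard_out.shape == hidden.shape == (2, 5, dim)
```

Shifting the add to the front of the block is what allows the add and the normalization to be fused into a single kernel.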
Class Definition
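In outline, the class looks roughly like the sketch below. This is illustrative, not the lrnnx implementation: `dim`, `mlp_cls`, and `fused_add_norm` appear in the parameter descriptions, while `mixer_cls`, `norm_cls`, and `residual_in_fp32` are assumed names modeled on similar residual-block implementations. The Triton fused path is not shown.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Illustrative sketch only. `mixer_cls`, `norm_cls`, and
    `residual_in_fp32` are assumed parameter names; `dim`, `mlp_cls`,
    and `fused_add_norm` come from the documented parameters."""

    def __init__(self, dim, mixer_cls, mlp_cls=nn.Identity,
                 norm_cls=nn.LayerNorm, fused_add_norm=False,
                 residual_in_fp32=False):
        super().__init__()
        self.fused_add_norm = fused_add_norm  # Triton path omitted in this sketch
        self.residual_in_fp32 = residual_in_fp32
        self.norm = norm_cls(dim)
        self.mixer = mixer_cls(dim)  # mixer is constructed with dim first
        if mlp_cls is not nn.Identity:
            self.norm2 = norm_cls(dim)  # second norm when an MLP is present
            self.mlp = mlp_cls(dim)
        else:
            self.mlp = None

    def forward(self, hidden_states, residual=None):
        # Add -> LN -> Mixer: fold the previous output into the residual
        # stream first, then normalize and mix.
        residual = hidden_states if residual is None else residual + hidden_states
        if self.residual_in_fp32:
            residual = residual.to(torch.float32)
        hidden_states = self.mixer(self.norm(residual.to(self.norm.weight.dtype)))
        if self.mlp is not None:
            residual = residual + hidden_states
            hidden_states = self.mlp(self.norm2(residual.to(self.norm2.weight.dtype)))
        return hidden_states, residual
```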
Parameters
- `dim`: The hidden dimension size for the block.
- The mixer class to instantiate (e.g., MHA, LRNN). This class is initialized with `dim` as its first argument.
- `mlp_cls`: The MLP class to instantiate, or `nn.Identity` if no MLP is desired. When an MLP is used, a second normalization layer is applied.
- The normalization class to use. Supports `nn.LayerNorm` and `RMSNorm` (with Triton acceleration when available).
- `fused_add_norm`: Whether to use fused Triton add-and-normalization operations for improved performance. Only works with `nn.LayerNorm` and `RMSNorm`.
- Whether to keep the residual connection in fp32 precision for numerical stability.
Methods
forward
Parameters
- `hidden_states`: The sequence input to the block, of shape `(batch_size, seq_len, dim)`.
- `residual`: The residual connection from the previous block. If `None`, `hidden_states` is used as the residual. The computation is `hidden_states = Mixer(LN(residual))`.
- Parameters used during autoregressive generation/inference. Passed to the mixer's forward method if supported.
- Additional keyword arguments passed directly to the underlying mixer module's forward method.
Returns
- `hidden_states`: The output of the block after applying the mixer (and, optionally, the MLP), of shape `(batch_size, seq_len, dim)`.
- `residual`: The updated residual tensor to be passed to the next block, of shape `(batch_size, seq_len, dim)`.
allocate_inference_cache
Parameters
- The batch size for inference.
- The maximum sequence length for inference.
- The data type for the cache tensors. If `None`, the mixer's default dtype is used.
- Additional keyword arguments to pass to the mixer's cache allocation method.
Returns
The allocated cache object returned by the mixer. The structure depends on the specific mixer implementation.
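Because the cache structure is mixer-specific, the Block's method is essentially a pass-through to the mixer. A self-contained illustration with a dummy mixer (all names here are hypothetical, not the lrnnx API):

```python
import torch
import torch.nn as nn

class DummyMixer(nn.Module):
    """Stand-in mixer exposing a cache allocator (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None):
        # If no dtype is given, fall back to this mixer's default.
        dtype = dtype or torch.float32
        # A simple state buffer; real mixers define their own structure.
        return torch.zeros(batch_size, max_seqlen, self.dim, dtype=dtype)

mixer = DummyMixer(8)
cache = mixer.allocate_inference_cache(batch_size=2, max_seqlen=16)
```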
Usage Example
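A hedged sketch of the forward contract described above. So the example runs without lrnnx itself, a tiny stand-in function mirrors the documented `(hidden_states, residual)` flow; in real use you would construct a `Block` from the library instead:

```python
import torch
import torch.nn as nn

dim = 32
norm = nn.LayerNorm(dim)
mixer = nn.Linear(dim, dim)  # stand-in for an attention / LRNN mixer

def block_forward(hidden_states, residual=None):
    # Mirrors the documented contract: Add -> LN -> Mixer,
    # returning (hidden_states, residual) for the next block.
    residual = hidden_states if residual is None else residual + hidden_states
    return mixer(norm(residual)), residual

x = torch.randn(4, 10, dim)  # (batch_size, seq_len, dim)
hidden_states, residual = block_forward(x)                        # first block
hidden_states, residual = block_forward(hidden_states, residual)  # next block
```

Note that both return values must be threaded to the next block; the residual stream is carried separately from the mixer output.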
Architecture Integration
The `Block` class is designed to be stacked in sequence to build deep architectures:
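A minimal sketch of such a stack, assuming the documented `(hidden_states, residual)` contract (class and attribute names here are illustrative, not the lrnnx API):

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Minimal stand-in for lrnnx's Block: Add -> LN -> Mixer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = nn.Linear(dim, dim)  # stand-in mixer

    def forward(self, hidden_states, residual=None):
        residual = hidden_states if residual is None else residual + hidden_states
        return self.mixer(self.norm(residual)), residual

class Backbone(nn.Module):
    """Stacks blocks, threading (hidden_states, residual) through each."""
    def __init__(self, dim, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(TinyBlock(dim) for _ in range(n_layers))
        self.norm_f = nn.LayerNorm(dim)  # final norm over the residual stream

    def forward(self, x):
        hidden_states, residual = x, None
        for layer in self.layers:
            hidden_states, residual = layer(hidden_states, residual)
        # fold the last output into the residual before the final norm
        return self.norm_f(residual + hidden_states)

model = Backbone(dim=16, n_layers=4)
y = model(torch.randn(2, 8, 16))
```

Because each block performs the add at its entry, the final block's output must be folded into the residual once more before the closing normalization.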
Notes
- The residual connection pattern (Add → LN → Mixer) differs from standard prenorm Transformers in order to enable operator fusion.
- When `fused_add_norm=True`, the implementation uses optimized Triton kernels for better performance on GPUs.
- Some mixers (e.g., S4, S4D) may return `(output, state)` tuples; the Block handles this automatically by extracting the output.
- When `mlp_cls` is not `nn.Identity`, the MLP is applied after the mixer with its own normalization layer.
