ModernLLMConfig
Configuration for the custom decoder-only Transformer architecture. The fields mirror decoder-only language models such as GPT (Radford et al., 2018) and LLaMA (Touvron et al., 2023), capturing architectural toggles like RoPE, RMSNorm, SwiGLU, GQA, and MoE discussed in those papers.

Required parameters
- Size of the vocabulary. Must be positive.
- Hidden dimension of the model. Must be positive and divisible by n_heads.
- Number of transformer layers. Must be positive.
- Number of attention heads. Must divide d_model evenly.
- Hidden size of the feedforward layer. Must exceed d_model.
- Maximum sequence length for positional encoding. Must be positive.
Optional parameters
- Epsilon value for RMSNorm stability. Must be positive.
- Dropout probability. Must be in [0, 1).
- Standard deviation for weight initialization. Must be positive.
- Base frequency for Rotary Position Embeddings (RoPE).
- Scaling factor for RoPE to extend context length. Must be positive if provided.
- Whether to use Rotary Position Embeddings.
- Whether to use attention sinks for improved streaming inference.
- Number of attention sink tokens. Must be positive when use_attention_sinks is True.
- Whether to use SwiGLU activation in the feedforward layer.
- Multiplier for the hidden dimension when using SwiGLU.
- Whether to use Grouped Query Attention (GQA).
- Number of groups for GQA. Required when use_gqa is True. Must divide n_heads evenly.
- Whether to use Mixture-of-Experts feedforward layers.
- MoE configuration. Required when use_moe is True.
- Whether to share weights between input and output embeddings.
Example
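The parameter names in the original class definition are not shown above, so the sketch below uses assumed field names (vocab_size, d_model, n_layers, n_heads, d_ff, max_seq_len) chosen to match the identifiers that do appear in the parameter descriptions; it shows only a subset of the optional fields and enforces a few of the stated constraints in __post_init__:

```python
from dataclasses import dataclass

# Illustrative sketch only: field names are assumptions inferred from the
# identifiers mentioned in the parameter descriptions, not the documented API.
@dataclass
class ModernLLMConfig:
    vocab_size: int
    d_model: int
    n_layers: int
    n_heads: int
    d_ff: int
    max_seq_len: int
    dropout: float = 0.0      # subset of the optional parameters
    use_rope: bool = True

    def __post_init__(self) -> None:
        # A few of the documented constraints, as guard clauses.
        if self.vocab_size <= 0:
            raise ValueError("vocab_size must be positive")
        if self.d_model <= 0 or self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be positive and divisible by n_heads")
        if self.d_ff <= self.d_model:
            raise ValueError("d_ff must exceed d_model")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

# A small GPT-style configuration.
config = ModernLLMConfig(
    vocab_size=32_000,
    d_model=768,
    n_layers=12,
    n_heads=12,
    d_ff=3072,
    max_seq_len=2048,
)
```

Because validation runs in __post_init__, an invalid configuration (e.g. a d_model not divisible by n_heads) fails at construction time rather than deep inside model code.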
MoEConfig

Configuration for a Mixture-of-Experts feedforward sub-layer.

Parameters
- Number of expert networks. Must be positive.
- Number of experts to route each token to. Must be in [1, num_experts].
- Dropout probability for expert outputs. Must be in [0, 1).
- Capacity factor for expert load balancing. Must be >= 1.0.
Example
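A minimal construction sketch, again under assumed field names (the list above names num_experts, top_k, dropout, and capacity_factor, but the actual keyword names are not confirmed by the source):

```python
from dataclasses import dataclass

# Sketch only: field names are assumptions matching the parameter list above.
@dataclass
class MoEConfig:
    num_experts: int            # number of expert networks
    top_k: int                  # experts each token is routed to
    dropout: float = 0.0        # dropout on expert outputs
    capacity_factor: float = 1.25  # headroom for expert load balancing

# Route each token to 2 of 8 experts.
moe = MoEConfig(num_experts=8, top_k=2, dropout=0.1)
```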
Validation rules
- num_experts must be positive
- top_k must be between 1 and num_experts
- dropout must be in [0, 1)
- capacity_factor must be >= 1.0
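The four rules map directly to guard clauses. A sketch (the helper name and signature are hypothetical, not part of the documented API):

```python
def validate_moe_config(num_experts: int, top_k: int,
                        dropout: float, capacity_factor: float) -> None:
    """Hypothetical helper: raise ValueError if any MoEConfig rule is violated."""
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    if not 0.0 <= dropout < 1.0:
        raise ValueError("dropout must be in [0, 1)")
    if capacity_factor < 1.0:
        raise ValueError("capacity_factor must be >= 1.0")

validate_moe_config(8, 2, 0.1, 1.25)  # valid: no exception raised
```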