TopKRouter
Top-k gating distribution for routing tokens to experts. Implements the routing mechanism from Shazeer et al. (2017), which computes gating scores and selects the top-k experts for each token.
Mathematical formulation
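The formula itself is missing here; a sketch of the standard top-k gating from Shazeer et al. (2017), where $W_g$ is the bias-free gate projection and $E$ the number of experts:

```latex
\[
g(x) = \operatorname{softmax}\bigl(\operatorname{TopK}(x W_g,\; k)\bigr),
\qquad
\operatorname{TopK}(v, k)_i =
\begin{cases}
v_i & \text{if } v_i \text{ is among the } k \text{ largest entries of } v \in \mathbb{R}^{E},\\
-\infty & \text{otherwise.}
\end{cases}
\]
```

Setting non-selected logits to $-\infty$ before the softmax is what makes the kept scores sum to 1 per token, matching the return contract of forward below.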
Constructor
Input dimension matching the hidden states. Must be positive.
MoE configuration containing num_experts, top_k, and other routing parameters.
Attributes
Input feature dimension.
MoE configuration object.
Linear layer projecting from dim to num_experts (no bias).
forward
Input tensor of shape (batch, seq_len, dim).
Returns
Top-k gating scores of shape (batch, seq_len, top_k). Values sum to 1 per token.
Expert indices of shape (batch, seq_len, top_k) with values in range [0, num_experts).
Status
The routing logic will be implemented in a future phase. Currently raises NotImplementedError.
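Pending the real implementation, the behavior described above can be sketched functionally; the function and variable names here are illustrative, not the module's API:

```python
import torch
import torch.nn.functional as F

def top_k_route(hidden_states, gate_weight, top_k):
    """Sketch of the planned routing step (names are assumptions)."""
    # (batch, seq_len, dim) @ (dim, num_experts) -> (batch, seq_len, num_experts)
    logits = hidden_states @ gate_weight
    # Keep the k largest logits per token, then renormalize so scores sum to 1.
    top_logits, indices = logits.topk(top_k, dim=-1)
    scores = F.softmax(top_logits, dim=-1)
    return scores, indices  # both (batch, seq_len, top_k)

x = torch.randn(2, 5, 16)
w_gate = torch.randn(16, 8)  # 8 hypothetical experts
scores, indices = top_k_route(x, w_gate, top_k=2)
```

Note that the softmax is applied after the top-k selection, so only the selected experts' scores are normalized, consistent with the return shapes documented above.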
MixtureOfExperts
Sparse mixture-of-experts layer that routes tokens to specialized sub-networks. Follows the Switch Transformer architecture (Fedus et al., 2021), where each token is processed by only the top-k experts rather than all of them, enabling efficient model scaling.
Architecture
For each token:
- Router computes gating scores for all experts
- Select top-k experts based on scores
- Process token through selected experts
- Combine expert outputs weighted by routing scores
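The four steps above can be sketched as a naive gather-based loop (no capacity limiting, all names illustrative; a real implementation would use a vectorized dispatch instead of nested loops):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, gate_weight, experts, top_k):
    logits = x @ gate_weight                          # step 1: gating scores
    top_logits, top_idx = logits.topk(top_k, dim=-1)  # step 2: pick top-k experts
    weights = F.softmax(top_logits, dim=-1)           # normalize the kept scores
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = top_idx[..., slot] == e            # tokens whose slot chose expert e
            if mask.any():
                # steps 3-4: run the expert, scale by routing weight, accumulate
                out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
    return out

experts = [nn.Linear(16, 16) for _ in range(4)]
y = moe_forward(torch.randn(2, 5, 16), torch.randn(16, 4), experts, top_k=2)
```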
Constructor
Model dimension for input/output. Must be positive.
MoE configuration containing:
- num_experts: Number of expert networks
- top_k: Number of experts to activate per token
- capacity_factor: Expert capacity as a multiple of the average number of tokens per expert
- expert_capacity: Optional fixed capacity per expert
Attributes
Model dimension.
MoE configuration object.
Token routing module that selects top-k experts.
List of num_experts feedforward networks. Each expert is a simple 2-layer MLP with GELU activation:
- Linear: dim -> dim * 4
- GELU activation
- Linear: dim * 4 -> dim
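Building the expert list described above might look like the following (dim and expert count are placeholder values):

```python
import torch
import torch.nn as nn

def make_expert(dim):
    # 2-layer MLP with GELU, matching the layout listed above.
    return nn.Sequential(
        nn.Linear(dim, dim * 4),
        nn.GELU(),
        nn.Linear(dim * 4, dim),
    )

experts = nn.ModuleList(make_expert(32) for _ in range(8))
out = experts[0](torch.randn(2, 5, 32))  # each expert maps dim -> dim
```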
forward
Input tensor of shape (batch, seq_len, dim).
Returns
Expert-processed outputs of shape (batch, seq_len, dim).
Status
The forward pass with routing logic will be implemented in a future phase. Currently raises NotImplementedError.
Example usage (when implemented)
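Since the real class currently raises NotImplementedError, the intended call pattern is shown against a minimal stand-in with the same interface; the class and config names here are assumptions, and capacity limiting is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class MoEConfig:  # stand-in for the real configuration object
    num_experts: int = 8
    top_k: int = 2

class MixtureOfExperts(nn.Module):  # minimal illustrative stand-in
    def __init__(self, dim, config):
        super().__init__()
        self.top_k = config.top_k
        self.gate = nn.Linear(dim, config.num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(config.num_experts)
        )

    def forward(self, x):
        top_logits, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = MixtureOfExperts(dim=32, config=MoEConfig(num_experts=8, top_k=2))
out = moe(torch.randn(2, 16, 32))  # output shape matches the input
```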
Integration with transformer
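One common placement, as in Switch Transformer, is to substitute the MoE layer for the dense FFN in a pre-norm transformer block. A sketch, where the moe_layer argument stands in for a MixtureOfExperts instance (any module mapping (batch, seq, dim) to the same shape works here):

```python
import torch
import torch.nn as nn

class TransformerBlockWithMoE(nn.Module):
    def __init__(self, dim, num_heads, moe_layer):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.moe = moe_layer  # replaces the usual dense FFN sublayer

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)      # self-attention sublayer
        x = x + attn_out
        x = x + self.moe(self.norm2(x))       # MoE sublayer with residual
        return x

# A plain FFN stands in for the (not yet implemented) MoE layer.
ffn_stub = nn.Sequential(nn.Linear(32, 128), nn.GELU(), nn.Linear(128, 32))
block = TransformerBlockWithMoE(dim=32, num_heads=4, moe_layer=ffn_stub)
y = block(torch.randn(2, 10, 32))
```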
Benefits of MoE
- Efficient scaling: Increase model capacity without proportional compute increase
- Sparse computation: Each token uses only top-k experts (e.g., 2 out of 8)
- Specialization: Different experts can specialize in different types of content
- Memory efficiency: Experts can be distributed across devices