Transformer Architecture
Llama 2 uses an optimized transformer architecture with several key innovations that improve performance and efficiency. The architecture is available in three model sizes: 7B, 13B, and 70B parameters.
Model Sizes
| Model | Parameters | Heads | KV Heads | Context Length | GQA |
|---|---|---|---|---|---|
| Llama 2 7B | 7B | 32 | 32 | 4096 | No |
| Llama 2 13B | 13B | 40 | 40 | 4096 | No |
| Llama 2 70B | 70B | 64 | 8 | 4096 | Yes |
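As the table shows, only the 70B model uses grouped-query attention (GQA): its 64 query heads share just 8 key-value heads, so each KV head serves a group of 8 query heads. The mapping can be sketched with a small illustrative helper (the function name `kv_head_for_query_head` is hypothetical, not part of the Llama codebase):

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that query head `q_head` attends with.

    Under GQA, consecutive query heads are grouped so that each group of
    n_heads // n_kv_heads query heads shares one KV head. When
    n_kv_heads == n_heads this reduces to standard multi-head attention
    (each query head has its own KV head).
    """
    group_size = n_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size

# Llama 2 70B configuration: 64 query heads, 8 KV heads -> groups of 8.
# Query heads 0..7 share KV head 0, heads 8..15 share KV head 1, and so on.
print(kv_head_for_query_head(0, 64, 8))   # -> 0
print(kv_head_for_query_head(63, 64, 8))  # -> 7

# 7B/13B (no GQA): the mapping is the identity.
print(kv_head_for_query_head(5, 32, 32))  # -> 5
```

Sharing KV heads this way shrinks the KV cache by a factor of `n_heads // n_kv_heads` (8x for the 70B model), which is the main inference-time benefit of GQA.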
Architecture Components
ModelArgs Configuration
The model architecture is defined through ModelArgs, which specifies the hyperparameters:
- `dim`: Model dimension (4096 for 7B, 5120 for 13B, 8192 for 70B)
- `n_layers`: Number of transformer layers (32 for 7B, 40 for 13B, 80 for 70B)
- `n_heads`: Number of attention heads
- `n_kv_heads`: Number of key-value heads (for GQA)
- `max_seq_len`: Maximum sequence length during training (context window is 4096)
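A minimal sketch of such a configuration, using the fields listed above with defaults matching the 7B model. The real `ModelArgs` in Meta's reference implementation carries additional fields (vocabulary size, FFN sizing, norm epsilon, etc.), so treat this as an illustration of the structure rather than the full definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:
    """Hyperparameter sketch; defaults shown correspond to Llama 2 7B."""
    dim: int = 4096                   # model (embedding) dimension
    n_layers: int = 32                # number of transformer layers
    n_heads: int = 32                 # number of attention (query) heads
    n_kv_heads: Optional[int] = None  # KV heads for GQA; None -> same as n_heads
    max_seq_len: int = 4096           # training context window

    def effective_kv_heads(self) -> int:
        # With no GQA, every query head has its own KV head.
        return self.n_kv_heads if self.n_kv_heads is not None else self.n_heads


# 7B: defaults, no GQA
args_7b = ModelArgs()
print(args_7b.effective_kv_heads())   # -> 32

# 70B: 64 query heads sharing 8 KV heads
args_70b = ModelArgs(dim=8192, n_layers=80, n_heads=64, n_kv_heads=8)
print(args_70b.effective_kv_heads())  # -> 8
```

Leaving `n_kv_heads` as `None` by default lets the same config class describe both the GQA (70B) and non-GQA (7B/13B) variants without duplicating fields.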