Overview
Llama 4 is Meta's latest model generation, delivering industry-leading performance. SGLang has provided first-class support and optimizations for Llama models since v0.4.5.
Supported Llama Models
- Llama 4 Scout (109B) - Latest generation
- Llama 4 Maverick (400B) - Largest Llama model
- Llama 3.x series (1B, 3B, 8B, 70B) - Previous generation
- Llama 2 series (7B, 13B, 70B) - Foundation models
- Llama Vision (11B, 90B) - Multimodal variants
- Specialized variants: Classification, Embedding, Reward models
Quick Start
Basic Launch Command
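A minimal launch sketch (the model path is an example; any supported Llama checkpoint works):

```shell
# Start an OpenAI-compatible server for a Llama 3.1 8B checkpoint
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```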
Llama 4 Launch (8xH100/H200)
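A hedged sketch for Llama 4 Scout on 8 GPUs; --tp 8 shards the model across all eight devices, and the context length follows the hardware table below — adjust for your setup:

```shell
# Llama 4 Scout on 8xH100: tensor parallelism across 8 GPUs,
# context length capped to avoid OOM (see hardware table)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000
```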
Llama 4 Configuration
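A configuration sketch for Maverick; the chat template name follows SGLang's Llama 4 naming and is an assumption here, and the context length comes from the hardware table below:

```shell
# Llama 4 Maverick on 8xH200 with the Llama 4 chat template
# (template name is an assumption)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tp 8 \
  --chat-template llama-4 \
  --context-length 1000000
```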
Hardware Recommendations
| Model | Hardware | Context Length | Notes |
|---|---|---|---|
| Scout (109B) | 8×H100 | Up to 1M | Adjust --context-length to avoid OOM |
| Scout (109B) | 8×H200 | Up to 2.5M | Extended context support |
| Scout (109B) + Hybrid KV | 8×H100 | Up to 5M | With --swa-full-tokens-ratio |
| Scout (109B) + Hybrid KV | 8×H200 | Up to 10M | Maximum supported context |
| Maverick (400B) | 8×H200 | Up to 1M | Full precision |
| Maverick (400B) | 8×B200 | - | Optimal performance |
Configuration Tips
Attention Backend Auto-Selection
SGLang automatically selects the optimal attention backend based on your hardware:
- Blackwell GPUs (B200/GB200): trtllm_mha
- Hopper GPUs (H100/H200): fa3 (FlashAttention 3)
- AMD GPUs: aiter
- Intel XPU: intel_xpu
- Other platforms: triton (fallback)
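The auto-selection can also be overridden; a sketch pinning the backend explicitly (FlashAttention 3 on Hopper, for example):

```shell
# Force FlashAttention 3 instead of relying on auto-selection
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend fa3
```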
Context Length Management
Adjust --context-length to avoid GPU out-of-memory issues.
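For example (the value is illustrative; pick the largest length that fits your GPUs):

```shell
# Cap the context window at 512K tokens to bound KV-cache memory
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 524288
```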
Hybrid KV Cache
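A hedged sketch using the --swa-full-tokens-ratio flag from the hardware table above; the ratio and context-length values here are assumptions chosen to illustrate the flags:

```shell
# Hybrid KV cache: full KV is kept for a fraction of tokens, while
# Llama 4's local (sliding-window) attention layers cover the rest
# (ratio value is illustrative, not a tuned recommendation)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 5000000 \
  --swa-full-tokens-ratio 0.5
```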
Enable hybrid KV cache for extended context lengths using Llama 4's local attention layers.
Chat Template
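A sketch of applying the Llama 4 chat template at launch (the template name follows SGLang's naming and is an assumption here):

```shell
# Serve chat completions with the Llama 4 template applied
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --chat-template llama-4
```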
For chat completion tasks, add the Llama 4 chat template.
Multimodal Support
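A hedged sketch for a Llama 3.2 Vision checkpoint; the chat template name is an assumption based on SGLang's multimodal conventions:

```shell
# Llama 3.2 11B Vision with the vision chat template
# (template name is an assumption)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --chat-template llama_3_vision
```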
For Llama Vision models, include the vision chat template when launching.
EAGLE Speculative Decoding
Llama 4 Maverick (400B) supports EAGLE speculative decoding for accelerated inference.
Launch with EAGLE
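A launch sketch for EAGLE3; the speculative step, top-k, and draft-token values are illustrative, not tuned recommendations:

```shell
# EAGLE3 speculative decoding for Maverick with the public draft model
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tp 8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```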
Note that the EAGLE3 draft model (nvidia/Llama-4-Maverick-17B-128E-Eagle3) only recognizes conversations in chat mode.
Benchmarks
Accuracy Test (MMLU Pro)
SGLang achieves accuracy matching or exceeding official benchmarks:
| Model | Official Benchmark | SGLang | Hardware |
|---|---|---|---|
| Llama-4-Scout-17B-16E-Instruct | 74.3 | 75.2 | 8×H100 |
| Llama-4-Maverick-17B-128E-Instruct | 80.5 | 80.7 | 8×H100 |
Running Accuracy Tests
Llama-4-Scout
Llama-4-Maverick
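A hedged sketch for either model: launch the server as above, then point an eval run at it. The eval entrypoint, eval name, and port are assumptions based on SGLang's test utilities:

```shell
# Run an MMLU Pro-style eval against the running server
# (entrypoint and eval name are assumptions)
python -m sglang.test.run_eval --port 30000 --eval-name mmlu_pro
```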
Llama 3.x Models
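A launch sketch for this family; the model path and GPU count are examples (smaller variants fit on a single GPU):

```shell
# Llama 3.1 70B sharded across 4 GPUs
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4
```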
Llama 3.x models (1B, 3B, 8B, 70B) are also fully supported.
Specialized Llama Variants
SGLang supports specialized Llama model variants:
Embedding Models
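A hedged sketch: --is-embedding switches the server into embedding mode. The model path below is a placeholder; substitute your Llama-based embedding checkpoint:

```shell
# Serve a Llama-based embedding model (path is a placeholder)
python -m sglang.launch_server \
  --model-path your-org/llama-embedding-model \
  --is-embedding
```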
Classification Models
Reward Models
Llama Vision (Multimodal)
Llama 3.2 includes vision-enabled variants (11B, 90B). See the Multimodal Models guide for detailed usage.
Advanced Features
EAGLE Decoding for Llama 3
Quantization
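A sketch using FP8 quantization at load time; --quantization is SGLang's standard flag, and methods such as awq or gptq instead load pre-quantized checkpoints:

```shell
# FP8 online quantization for a Llama 4 checkpoint
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --quantization fp8
```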
SGLang supports various quantization methods for Llama models.
Resources
Troubleshooting
Out of Memory (OOM)
Reduce --context-length so the KV cache fits in GPU memory.
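For example (values are illustrative; lowering --mem-fraction-static can also free memory for activations):

```shell
# Drop from 1M to 256K tokens and leave more headroom for activations
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 262144 \
  --mem-fraction-static 0.8
```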
