Meta’s Llama series is one of the most widely used families of open-source large language models, ranging from 7B to 400B parameters across the Llama 2, Llama 3, and Llama 4 generations.

Overview

Llama 4 is Meta’s latest generation with industry-leading performance. SGLang has provided first-class support and optimizations for Llama models since v0.4.5.

Supported Llama Models

  • Llama 4 Scout (109B total, 17B active) - Latest generation
  • Llama 4 Maverick (400B total, 17B active) - Largest Llama model
  • Llama 3.x series (1B, 3B, 8B, 70B) - Previous generation
  • Llama 2 series (7B, 13B, 70B) - Foundation models
  • Llama Vision (11B, 90B) - Multimodal variants
  • Specialized variants: Classification, Embedding, Reward models

Quick Start

Basic Launch Command

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000
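
Once launched, the server exposes an OpenAI-compatible API under /v1. A minimal sketch of a chat request against it (the helper name is ours; host and port come from the launch command above):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build a request body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = chat_payload("meta-llama/Llama-3.2-1B-Instruct", "Say hello.")

# With the server above running on port 30000, POST the payload:
# req = urllib.request.Request(
#     "http://localhost:30000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```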

Llama 4 Launch (8xH100/H200)

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000

Llama 4 Configuration

Hardware Recommendations

| Model                    | Hardware | Context Length | Notes                                 |
|--------------------------|----------|----------------|---------------------------------------|
| Scout (109B)             | 8×H100   | Up to 1M       | Adjust --context-length to avoid OOM  |
| Scout (109B)             | 8×H200   | Up to 2.5M     | Extended context support              |
| Scout (109B) + Hybrid KV | 8×H100   | Up to 5M       | With --swa-full-tokens-ratio          |
| Scout (109B) + Hybrid KV | 8×H200   | Up to 10M      | Maximum supported context             |
| Maverick (400B)          | 8×H200   | Up to 1M       | Full precision                        |
| Maverick (400B)          | 8×B200   | -              | Optimal performance                   |

Configuration Tips

Attention Backend Auto-Selection

SGLang automatically selects the optimal attention backend based on your hardware:
  • Blackwell GPUs (B200/GB200): trtllm_mha
  • Hopper GPUs (H100/H200): fa3 (FlashAttention 3)
  • AMD GPUs: aiter
  • Intel XPU: intel_xpu
  • Other platforms: triton (fallback)
To override auto-selection:
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --attention-backend fa3
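
The auto-selection above amounts to a simple hardware dispatch. A sketch of that mapping (the function name and GPU-name matching are our illustration; the backend names are from the list above):

```python
def select_attention_backend(gpu_name: str) -> str:
    """Map a GPU/platform name to SGLang's default attention backend (illustrative)."""
    if gpu_name in ("B200", "GB200"):     # Blackwell
        return "trtllm_mha"
    if gpu_name in ("H100", "H200"):      # Hopper
        return "fa3"                       # FlashAttention 3
    if gpu_name.startswith("AMD"):
        return "aiter"
    if gpu_name.startswith("Intel"):
        return "intel_xpu"
    return "triton"                        # fallback for other platforms
```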

Context Length Management

Adjust --context-length to avoid GPU out-of-memory issues:
# Scout on 8×H100 - up to 1M tokens
--context-length 1000000

# Scout on 8×H200 - up to 2.5M tokens
--context-length 2500000

Hybrid KV Cache

Enable hybrid KV cache for extended context lengths using Llama 4’s local attention layers:
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 5000000 \
  --swa-full-tokens-ratio 0.8  # Ratio of SWA layer KV tokens (default: 0.8, range: 0-1)
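
As a rough illustration of what the ratio controls (the exact accounting is SGLang-internal; this helper is ours): sliding-window-attention (SWA) layers are allotted a fraction of the KV tokens that the full-attention layers keep, which is what frees memory for longer contexts.

```python
def swa_kv_tokens(full_attention_tokens: int, swa_full_tokens_ratio: float = 0.8) -> int:
    """KV tokens budgeted per SWA layer, as a fraction of the full-attention
    layers' budget (illustrative sketch, not SGLang internals)."""
    if not 0.0 <= swa_full_tokens_ratio <= 1.0:
        raise ValueError("--swa-full-tokens-ratio must be in [0, 1]")
    return int(full_attention_tokens * swa_full_tokens_ratio)
```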

Chat Template

For chat completion tasks, add the Llama 4 chat template:
--chat-template llama-4

Multimodal Support

For Llama Vision models:
--enable-multimodal

EAGLE Speculative Decoding

Llama 4 Maverick (400B) supports EAGLE speculative decoding for accelerated inference.

Launch with EAGLE

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --trust-remote-code \
  --tp 8 \
  --context-length 1000000
Note: The Llama 4 EAGLE draft model (nvidia/Llama-4-Maverick-17B-128E-Eagle3) only recognizes conversations in chat mode.
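
The three speculative flags are related: with --speculative-eagle-topk 1 the draft model proposes one token per step (a chain rather than a tree), so the verifier handles the drafted chain plus the root token. A sketch of that arithmetic (our illustration, chain case only):

```python
def num_draft_tokens_chain(speculative_num_steps: int) -> int:
    """Draft tokens verified per round when --speculative-eagle-topk is 1:
    one drafted token per step, plus the root token (illustrative)."""
    return speculative_num_steps + 1
```

This is why the launch command above pairs --speculative-num-steps 3 with --speculative-num-draft-tokens 4.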

Benchmarks

Accuracy Test (MMLU Pro)

SGLang achieves accuracy matching or exceeding official benchmarks:
| Model                              | Official Benchmark | SGLang | Hardware |
|------------------------------------|--------------------|--------|----------|
| Llama-4-Scout-17B-16E-Instruct     | 74.3               | 75.2   | 8×H100   |
| Llama-4-Maverick-17B-128E-Instruct | 80.5               | 80.7   | 8×H100   |

Running Accuracy Tests

Llama-4-Scout

# Launch server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536

# Run lm_eval
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0

Llama-4-Maverick

# Launch server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536

# Run lm_eval
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0

Llama 3.x Models

Llama 3.x models (1B, 3B, 8B, 70B) are also fully supported:
# Llama 3.2 1B (lightweight)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --port 30000

# Llama 3.1 8B (popular size)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# Llama 3.1 70B (multi-GPU)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --port 30000

Specialized Llama Variants

SGLang supports specialized Llama model variants:

Embedding Models

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Embedding \
  --port 30000

Classification Models

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Classification \
  --port 30000

Reward Models

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Reward \
  --port 30000

Llama Vision (Multimodal)

Llama 3.2 includes vision-enabled variants (11B, 90B). See the Multimodal Models guide for detailed usage.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --enable-multimodal \
  --port 30000

Advanced Features

EAGLE Decoding for Llama 3

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path <eagle-draft-model> \
  --speculative-num-steps 3 \
  --tp 4

Quantization

SGLang supports various quantization methods for Llama models:
# FP8 quantization
--quantization fp8

# AWQ quantization
--quantization awq

# GPTQ quantization
--quantization gptq
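
The main motivation for quantization is weight memory. A back-of-the-envelope estimate (our helper; it ignores KV cache, activations, and per-tensor overhead):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate model weight footprint: parameters x bits / 8, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

# Llama 4 Scout (109B total parameters):
# BF16 (16-bit): weight_memory_gb(109, 16) -> 218.0 GB
# FP8  (8-bit):  weight_memory_gb(109, 8)  -> 109.0 GB
```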

Troubleshooting

Out of Memory (OOM)

Reduce --context-length:
--context-length 512000  # Reduce from 1M to 512K
Or reduce memory fraction:
--mem-fraction-static 0.8  # Reduce from default 0.9
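
To pick a safe --context-length, it helps to estimate the KV cache itself: per token it costs 2 (K and V) × layers × KV heads × head dim × bytes per element. A sketch (our helper; plug in your model's actual config values):

```python
def kv_cache_gb(tokens: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB: K and V, per layer, per token."""
    return tokens * num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem / 1e9

# Example with illustrative config values (48 layers, 8 KV heads, head dim 128, bf16):
# kv_cache_gb(1_000_000, 48, 8, 128) -> about 196.6 GB across all GPUs
```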

Slow Model Loading

Increase timeout:
--watchdog-timeout 1200  # Increase to 20 minutes
Enable parallel weight loading:
--model-loader-extra-config '{"enable_multithread_load": true}'