Meta’s Llama series is one of the most widely used families of open-source large language models, ranging from 7B to 400B parameters across the Llama 2, Llama 3, and Llama 4 generations.

Overview

Llama 4 is Meta’s latest generation with industry-leading performance. SGLang has provided first-class support and optimizations for Llama models since v0.4.5.

Supported Llama Models

  • Llama 4 Scout (109B total, 17B active) - Latest generation
  • Llama 4 Maverick (400B total, 17B active) - Largest Llama model
  • Llama 3.x series (1B, 3B, 8B, 70B) - Previous generation
  • Llama 2 series (7B, 13B, 70B) - Foundation models
  • Llama Vision (11B, 90B) - Multimodal variants
  • Specialized variants: Classification, Embedding, Reward models

Quick Start

Basic Launch Command

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000
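
Once launched, the server exposes an OpenAI-compatible API under /v1. A minimal sketch of a chat request against it (the helper name is ours; host and port come from the launch command above):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build a request body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = chat_payload("meta-llama/Llama-3.2-1B-Instruct", "Say hello.")

# With the server above running on port 30000, POST the payload:
# req = urllib.request.Request(
#     "http://localhost:30000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```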

Llama 4 Launch (8xH100/H200)

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000

Llama 4 Configuration

Hardware Recommendations

| Model                    | Hardware | Context Length | Notes                                 |
|--------------------------|----------|----------------|---------------------------------------|
| Scout (109B)             | 8×H100   | Up to 1M       | Adjust --context-length to avoid OOM  |
| Scout (109B)             | 8×H200   | Up to 2.5M     | Extended context support              |
| Scout (109B) + Hybrid KV | 8×H100   | Up to 5M       | With --swa-full-tokens-ratio          |
| Scout (109B) + Hybrid KV | 8×H200   | Up to 10M      | Maximum supported context             |
| Maverick (400B)          | 8×H200   | Up to 1M       | Full precision                        |
| Maverick (400B)          | 8×B200   | -              | Optimal performance                   |

Configuration Tips

Attention Backend Auto-Selection

SGLang automatically selects the optimal attention backend based on your hardware:
  • Blackwell GPUs (B200/GB200): trtllm_mha
  • Hopper GPUs (H100/H200): fa3 (FlashAttention 3)
  • AMD GPUs: aiter
  • Intel XPU: intel_xpu
  • Other platforms: triton (fallback)
To override auto-selection:
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --attention-backend fa3
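
The auto-selection above amounts to a simple hardware dispatch. A sketch of that mapping (the function name and GPU-name matching are our illustration; the backend names are from the list above):

```python
def select_attention_backend(gpu_name: str) -> str:
    """Map a GPU/platform name to SGLang's default attention backend (illustrative)."""
    if gpu_name in ("B200", "GB200"):     # Blackwell
        return "trtllm_mha"
    if gpu_name in ("H100", "H200"):      # Hopper
        return "fa3"                       # FlashAttention 3
    if gpu_name.startswith("AMD"):
        return "aiter"
    if gpu_name.startswith("Intel"):
        return "intel_xpu"
    return "triton"                        # fallback for other platforms
```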

Context Length Management

Adjust --context-length to avoid GPU out-of-memory issues:
# Scout on 8×H100 - up to 1M tokens
--context-length 1000000

# Scout on 8×H200 - up to 2.5M tokens
--context-length 2500000

Hybrid KV Cache

Enable hybrid KV cache for extended context lengths using Llama 4’s local attention layers:
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 5000000 \
  --swa-full-tokens-ratio 0.8  # Ratio of SWA layer KV tokens (default: 0.8, range: 0-1)
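
As a rough illustration of what the ratio controls (the exact accounting is SGLang-internal; this helper is ours): sliding-window-attention (SWA) layers are allotted a fraction of the KV tokens that the full-attention layers keep, which is what frees memory for longer contexts.

```python
def swa_kv_tokens(full_attention_tokens: int, swa_full_tokens_ratio: float = 0.8) -> int:
    """KV tokens budgeted per SWA layer, as a fraction of the full-attention
    layers' budget (illustrative sketch, not SGLang internals)."""
    if not 0.0 <= swa_full_tokens_ratio <= 1.0:
        raise ValueError("--swa-full-tokens-ratio must be in [0, 1]")
    return int(full_attention_tokens * swa_full_tokens_ratio)
```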

Chat Template

For chat completion tasks, add the Llama 4 chat template:
--chat-template llama-4

Multimodal Support

For Llama Vision models:
--enable-multimodal

EAGLE Speculative Decoding

Llama 4 Maverick (400B) supports EAGLE speculative decoding for accelerated inference.

Launch with EAGLE

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --trust-remote-code \
  --tp 8 \
  --context-length 1000000
Note: The Llama 4 EAGLE draft model (nvidia/Llama-4-Maverick-17B-128E-Eagle3) only recognizes conversations in chat mode.
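
The three speculative flags are related: with --speculative-eagle-topk 1 the draft model proposes one token per step (a chain rather than a tree), so the verifier handles the drafted chain plus the root token. A sketch of that arithmetic (our illustration, chain case only):

```python
def num_draft_tokens_chain(speculative_num_steps: int) -> int:
    """Draft tokens verified per round when --speculative-eagle-topk is 1:
    one drafted token per step, plus the root token (illustrative)."""
    return speculative_num_steps + 1
```

This is why the launch command above pairs --speculative-num-steps 3 with --speculative-num-draft-tokens 4.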

Benchmarks

Accuracy Test (MMLU Pro)

SGLang achieves accuracy matching or exceeding official benchmarks:
| Model                              | Official Benchmark | SGLang | Hardware |
|------------------------------------|--------------------|--------|----------|
| Llama-4-Scout-17B-16E-Instruct     | 74.3               | 75.2   | 8×H100   |
| Llama-4-Maverick-17B-128E-Instruct | 80.5               | 80.7   | 8×H100   |

Running Accuracy Tests

Llama-4-Scout

# Launch server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536

# Run lm_eval
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0

Llama-4-Maverick

# Launch server
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536

# Run lm_eval
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0

Llama 3.x Models

Llama 3.x models (1B, 3B, 8B, 70B) are also fully supported:
# Llama 3.2 1B (lightweight)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --port 30000

# Llama 3.1 8B (popular size)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# Llama 3.1 70B (multi-GPU)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --port 30000

Specialized Llama Variants

SGLang supports specialized Llama model variants:

Embedding Models

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Embedding \
  --port 30000

Classification Models

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Classification \
  --port 30000

Reward Models

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Reward \
  --port 30000

Llama Vision (Multimodal)

Llama 3.2 includes vision-enabled variants (11B, 90B). See the Multimodal Models guide for detailed usage.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --enable-multimodal \
  --port 30000

Advanced Features

EAGLE Decoding for Llama 3

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path <eagle-draft-model> \
  --speculative-num-steps 3 \
  --tp 4

Quantization

SGLang supports various quantization methods for Llama models:
# FP8 quantization
--quantization fp8

# AWQ quantization
--quantization awq

# GPTQ quantization
--quantization gptq
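
The main motivation for quantization is weight memory. A back-of-the-envelope estimate (our helper; it ignores KV cache, activations, and per-tensor overhead):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate model weight footprint: parameters x bits / 8, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

# Llama 4 Scout (109B total parameters):
# BF16 (16-bit): weight_memory_gb(109, 16) -> 218.0 GB
# FP8  (8-bit):  weight_memory_gb(109, 8)  -> 109.0 GB
```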

Troubleshooting

Out of Memory (OOM)

Reduce --context-length:
--context-length 512000  # Reduce from 1M to 512K
Or reduce memory fraction:
--mem-fraction-static 0.8  # Reduce from default 0.9
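
To pick a safe --context-length, it helps to estimate the KV cache itself: per token it costs 2 (K and V) × layers × KV heads × head dim × bytes per element. A sketch (our helper; plug in your model's actual config values):

```python
def kv_cache_gb(tokens: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB: K and V, per layer, per token."""
    return tokens * num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem / 1e9

# Example with illustrative config values (48 layers, 8 KV heads, head dim 128, bf16):
# kv_cache_gb(1_000_000, 48, 8, 128) -> about 196.6 GB across all GPUs
```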

Slow Model Loading

Increase timeout:
--watchdog-timeout 1200  # Increase to 20 minutes
Enable parallel weight loading:
--model-loader-extra-config '{"enable_multithread_load": true}'