Our speculative decoding is considered among the fastest in open-source LLM engines.
## Performance Highlights

Tested on LLaMA-3.1-8B-Instruct with MT-Bench (1× H100):

| Method | Throughput | Speedup |
|---|---|---|
| SGLang baseline | 158.34 tokens/s | 1.0× |
| SGLang + EAGLE-2 | 244.10 tokens/s | 1.54× |
| SGLang + EAGLE-3 | 373.25 tokens/s | 2.36× |
## Quick Guidance

- **Best speed (recommended):** EAGLE-3, via `--speculative-algorithm EAGLE3`
- **Broad compatibility:** EAGLE-2, via `--speculative-algorithm EAGLE`
- **MTP-enabled models:** Multi-Token Prediction, via speculative decoding
- **No draft model:** N-gram, via `--speculative-algorithm NGRAM` (CUDA-only)

## Method Comparison
| Method | Draft Source | Separate Model? | How to Enable | Notes |
|---|---|---|---|---|
| EAGLE-2 | EAGLE draft model | Yes | `--speculative-algorithm EAGLE` | Tune `num-steps`, `eagle-topk`, `num-draft-tokens` |
| EAGLE-3 | EAGLE-3 draft model | Yes | `--speculative-algorithm EAGLE3` | Best throughput |
| MTP | Built-in heads | Often no | See MTP section | Model-specific |
| STANDALONE | Smaller draft LLM | Yes | `--speculative-algorithm STANDALONE` | Token-level drafting |
| NGRAM | N-gram cache | No | `--speculative-algorithm NGRAM` | CUDA-only; no DP attention |
## EAGLE-2 Decoding

EAGLE-2 uses a specialized draft model that predicts feature vectors (hidden states) instead of tokens directly, enabling more accurate speculation.

### Basic Setup
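A minimal launch sketch. The target and draft model paths below are illustrative placeholders; substitute the weights for your own deployment, and note that the tuning values simply mirror the Llama defaults listed under Key Parameters.

```shell
# Launch SGLang with EAGLE-2 speculative decoding.
# Model and draft-model paths are illustrative; substitute your own.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 8
```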
### Making Requests
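Speculative decoding is transparent to clients, so requests need no special fields. A sketch against the native `/generate` endpooint is shown below, assuming the server is on the default port 30000:

```shell
# Query the server as usual; draft/verify happens server-side.
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0, "max_new_tokens": 32}
  }'
```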
### Key Parameters

| Parameter | Description | Default |
|---|---|---|
| `--speculative-num-steps` | Depth of autoregressive drafting | Auto (5 for Llama, 3 for others) |
| `--speculative-eagle-topk` | Branching factor per step | Auto (4 for Llama, 1 for others) |
| `--speculative-num-draft-tokens` | Max parallel verification capacity | Auto (8 for Llama, 4 for others) |
| `--speculative-draft-model-path` | Path to draft model weights | Required |
## EAGLE-3 Decoding

EAGLE-3 improves upon EAGLE-2 by:

- Removing the feature prediction objective
- Incorporating low- and mid-layer features
- Training in an on-policy manner
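A minimal launch sketch; the draft-model path is an illustrative placeholder for an EAGLE-3 checkpoint matching your target model:

```shell
# Launch with EAGLE-3 (paths are illustrative).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B
```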
For training your own EAGLE-3 models, see SpecForge, the SGLang team’s training framework.
## Advanced EAGLE Features

### torch.compile Optimization

Enable kernel-level optimizations for the draft model:
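A sketch adding `--enable-torch-compile` on top of an EAGLE-2 setup (model paths are illustrative):

```shell
# Compile the draft model's kernels with torch.compile.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
  --enable-torch-compile
```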
**When does torch.compile help?** The benefit depends on hardware, model architecture, and batch size. On H100 with small draft models and CUDA graphs enabled, the improvement may be negligible. Always benchmark on your specific setup.
### FR-Spec (Frequency-Ranked Speculation)

Reduce `lm_head` overhead by using a truncated high-frequency token vocabulary:
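A sketch using `--speculative-token-map`; the token-map path is an illustrative example of a published FR-Spec frequency table, so substitute the one matching your model:

```shell
# FR-Spec: restrict draft sampling to a high-frequency token subset.
# The token-map path is illustrative.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
  --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt
```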
## Multi-Token Prediction (MTP)
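For DeepSeek-style models whose MTP (NextN) heads double as a built-in draft, launches follow the EAGLE pattern with a shallow draft tree. The flag values and `--tp 8` below are assumptions for a DeepSeek-V3-class deployment; check your model's documentation:

```shell
# MTP via speculative decoding (DeepSeek-style; values are assumptions).
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --tp 8
```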
Some models have built-in multi-token prediction heads, which speculative decoding can leverage directly.

## Standalone Draft Model
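A sketch pairing a target model with a smaller sibling as the token-level draft; both model paths are illustrative:

```shell
# STANDALONE: a smaller LLM drafts tokens for the larger target.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm STANDALONE \
  --speculative-draft-model-path meta-llama/Llama-3.2-1B-Instruct \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4
```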
Use a smaller model as a draft for token-level speculation.

## N-gram Speculation
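A minimal launch sketch; no draft weights are needed since drafts come from an n-gram cache of previously generated text (model path illustrative, CUDA-only):

```shell
# NGRAM: draft tokens from an n-gram cache; no separate draft model.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM
```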
Use n-gram matching from previous generations (no separate draft model required).

### N-gram Parameters
| Parameter | Description | Default |
|---|---|---|
| `--speculative-ngram-min-match-window-size` | Minimum matching window | 1 |
| `--speculative-ngram-max-match-window-size` | Maximum matching window | 12 |
| `--speculative-ngram-min-bfs-breadth` | Minimum BFS breadth | 1 |
| `--speculative-ngram-max-bfs-breadth` | Maximum BFS breadth | 10 |
| `--speculative-ngram-capacity` | Cache capacity | 10,000,000 |
## Speculative Decoding V2 (Experimental)

Enable the overlap scheduler for improved pipelining.

## OOM Troubleshooting
Speculative decoding increases memory usage. If you encounter OOM errors, work through the following steps.

### Step 1: Lower Static Memory Fraction
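A sketch lowering `--mem-fraction-static`; 0.7 is an illustrative starting value (the default depends on the model), and the model paths are placeholders:

```shell
# Reserve less GPU memory for the static pool (weights + KV cache).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B \
  --mem-fraction-static 0.7
```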
### Step 2: Reduce CUDA Graph Batch Size

Lower `--cuda-graph-max-bs` so fewer, smaller CUDA graphs are captured.
### Step 3: Reduce Draft Tree Size

Decrease `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to shrink the draft tree.
### Step 4: Limit Concurrent Requests

Cap concurrency with `--max-running-requests`.
### Quick Recovery Recipe
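A conservative sketch combining all four mitigations; every value here is an illustrative starting point to tune upward once the server is stable, and the model paths are placeholders:

```shell
# Minimal-memory configuration: low static fraction, tiny CUDA graphs,
# shallow draft tree, capped concurrency.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B \
  --mem-fraction-static 0.6 \
  --cuda-graph-max-bs 4 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --max-running-requests 16
```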
If you encounter OOM, start from a minimal conservative configuration and relax the limits as memory allows.

## Implementation Details
### EAGLE Process
- **Feature Prediction**: The draft model predicts the next feature vector (last hidden state) from the feature sequence and token sequence
- **Tree Expansion**: Branches out multiple continuations with branching factor `--speculative-eagle-topk`
- **Token Sampling**: Samples tokens from `lm_head(features)`
- **Verification**: The target model verifies all draft tokens in parallel
- **Acceptance**: Accepts the longest valid prefix
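The verification and acceptance steps can be sketched as follows. This is a toy illustration over a single draft chain; SGLang's actual implementation verifies a token tree and lives in `python/sglang/srt/speculative/`:

```python
# Toy sketch of verify-and-accept: the target model scores all draft
# tokens in parallel, then the longest agreeing prefix is accepted.
def accept_longest_prefix(draft_tokens, target_tokens):
    """Return the longest prefix of draft_tokens matching target_tokens."""
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # first mismatch ends acceptance
        accepted.append(drafted)
    return accepted

# Draft proposed 4 tokens; the target agrees with the first 3.
print(accept_longest_prefix([11, 42, 7, 99], [11, 42, 7, 13]))  # [11, 42, 7]
```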
The `SpeculativeAlgorithm` enum is defined in python/sglang/srt/speculative/spec_info.py:15.
## Training EAGLE Models

For training your own EAGLE draft models:

- EAGLE-2: see the EAGLE repo
- EAGLE-3: see SpecForge and the accompanying blog post
## Full Parameter Reference

### Core Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-algorithm` | str | None | EAGLE, EAGLE3, STANDALONE, NGRAM |
| `--speculative-draft-model-path` | str | None | Path to draft model |
| `--speculative-num-steps` | int | Auto | Drafting depth |
| `--speculative-eagle-topk` | int | Auto | Branching factor |
| `--speculative-num-draft-tokens` | int | Auto | Verification capacity |
### Advanced Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-token-map` | str | None | FR-Spec token map path |
| `--speculative-draft-model-quantization` | str | Same as target | Draft quantization |
| `--speculative-attention-mode` | str | "prefill" | "prefill" or "decode" |
| `--enable-torch-compile` | bool | False | Enable torch.compile |
| `--enable-multi-layer-eagle` | bool | Auto | Multi-layer EAGLE |
## References
- EAGLE-2 Paper
- EAGLE-3 Paper
- FR-Spec Paper
- S-LoRA Paper (tensor sharding strategy)
- SpecForge Training Framework
