SGLang provides industry-leading speculative decoding implementations, including EAGLE-2/EAGLE-3, Multi-Token Prediction (MTP), standalone draft models, and n-gram speculation.
Our speculative decoding is considered among the fastest in open-source LLM engines.

Performance Highlights

Tested on LLaMA-3.1-8B-Instruct with MT-Bench (1× H100):
| Method | Throughput | Speedup |
|---|---|---|
| SGLang baseline | 158.34 tokens/s | 1.0× |
| SGLang + EAGLE-2 | 244.10 tokens/s | 1.54× |
| SGLang + EAGLE-3 | 373.25 tokens/s | 2.36× |
For details, see the EAGLE-3 paper.

Quick Guidance

Best Speed (Recommended)

EAGLE-3 - Use --speculative-algorithm EAGLE3

Broad Compatibility

EAGLE-2 - Use --speculative-algorithm EAGLE

MTP-Enabled Models

Multi-Token Prediction - Use MTP via speculative decoding

No Draft Model

N-gram - Use --speculative-algorithm NGRAM (CUDA-only)

Method Comparison

| Method | Draft Source | Separate Model? | How to Enable | Notes |
|---|---|---|---|---|
| EAGLE-2 | EAGLE draft model | Yes | --speculative-algorithm EAGLE | Tune num-steps, eagle-topk, num-draft-tokens |
| EAGLE-3 | EAGLE-3 draft model | Yes | --speculative-algorithm EAGLE3 | Best throughput |
| MTP | Built-in heads | Often no | See MTP section | Model-specific |
| STANDALONE | Smaller draft LLM | Yes | --speculative-algorithm STANDALONE | Token-level drafting |
| NGRAM | N-gram cache | No | --speculative-algorithm NGRAM | CUDA-only, no DP attention |
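All of these methods share the same draft-then-verify structure: a cheap drafter proposes a few tokens, and the target model checks them in a single parallel pass, keeping the longest correct prefix. A minimal method-agnostic sketch with toy stand-in models (the sequential verify loop below simulates what is really one batched forward pass):

```python
def speculative_step(prefix, draft_fn, target_fn, num_draft_tokens):
    """One round of speculative decoding with greedy verification."""
    # Draft phase: propose tokens autoregressively with the cheap model.
    proposal = []
    ctx = list(prefix)
    for _ in range(num_draft_tokens):
        tok = draft_fn(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: the target checks every position (in a real engine
    # this is one parallel forward pass, not a Python loop).
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        correct = target_fn(ctx)
        if tok == correct:           # match: accept the draft token
            accepted.append(tok)
            ctx.append(tok)
        else:                        # mismatch: take the target's token, stop
            accepted.append(correct)
            break
    else:
        accepted.append(target_fn(ctx))  # free bonus token if all accepted
    return accepted
```

Every accepted token beyond the first is a decoding step saved; the worst case still makes progress, because the target's own token is emitted at the first mismatch.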

EAGLE-2 Decoding

EAGLE-2 uses a specialized draft model that predicts feature vectors (hidden states) instead of tokens directly, enabling more accurate speculation.

Basic Setup

python -m sglang.launch_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 16 \
    --mem-fraction-static 0.7 \
    --cuda-graph-max-bs 8

Making Requests

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)

print(response.choices[0].message.content)

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| --speculative-num-steps | Depth of autoregressive drafting | Auto (5 for Llama, 3 for others) |
| --speculative-eagle-topk | Branching factor per step | Auto (4 for Llama, 1 for others) |
| --speculative-num-draft-tokens | Max parallel verification capacity | Auto (8 for Llama, 4 for others) |
| --speculative-draft-model-path | Path to draft model weights | Required |
Use bench_speculative.py to find optimal parameter combinations for your workload.
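As a back-of-envelope aid for tuning, these three flags bound each other: a draft tree of depth num-steps with branching factor eagle-topk contains at most topk + topk² + … candidate tokens, and num-draft-tokens caps how many of them are sent to the target for verification. A simplified sketch of that upper bound (not SGLang's actual tree-selection logic, which prunes low-probability branches):

```python
def max_tree_nodes(num_steps, topk):
    # Upper bound on candidate draft tokens: a tree of depth `num_steps`
    # with branching factor `topk` has at most topk + topk^2 + ... nodes
    # (root excluded).
    return sum(topk ** d for d in range(1, num_steps + 1))

def verification_batch(num_steps, topk, num_draft_tokens):
    # --speculative-num-draft-tokens caps how many candidates are
    # actually verified by the target model in parallel.
    return min(max_tree_nodes(num_steps, topk), num_draft_tokens)
```

For the basic setup above (num-steps 3, topk 4, num-draft-tokens 16), the tree can hold up to 84 candidates, so the 16-token verification cap is the binding constraint.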

EAGLE-3 Decoding

EAGLE-3 improves upon EAGLE-2 by:
  • Removing the feature prediction objective
  • Incorporating low and mid-layer features
  • Training in an on-policy manner
python -m sglang.launch_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 16 \
    --mem-fraction-static 0.7 \
    --dtype float16
For training your own EAGLE-3 models, see SpecForge, the SGLang team’s training framework.

Advanced EAGLE Features

torch.compile Optimization

Enable kernel-level optimizations for the draft model:
python -m sglang.launch_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
    --enable-torch-compile \
    --torch-compile-max-bs 8
The benefit depends on hardware, model architecture, and batch size. On H100 with small draft models and CUDA graphs enabled, the improvement may be negligible. Always benchmark on your specific setup.

FR-Spec (Frequency-Ranked Speculation)

Reduce lm_head overhead by using a truncated high-frequency token vocabulary:
python -m sglang.launch_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
    --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --dtype float16
For more details, see the FR-Spec paper.
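The core idea is that the draft's lm_head only needs to score the retained high-frequency tokens. A toy sketch of that truncation (plain Python with hypothetical list-based weights, not SGLang's fused kernel):

```python
def frspec_greedy_token(hidden, lm_head_weight, freq_token_ids):
    """Greedy draft token over a truncated high-frequency vocabulary.

    lm_head_weight: full (V x d) head as a list of rows.  Only the rows
    for the retained high-frequency tokens are scored, cutting the
    draft-side lm_head cost from O(V*d) to O(k*d) for k << V.
    """
    best_id, best_logit = None, float("-inf")
    for tok_id in freq_token_ids:
        row = lm_head_weight[tok_id]
        logit = sum(w * h for w, h in zip(row, hidden))  # dot product
        if logit > best_logit:
            best_id, best_logit = tok_id, logit
    return best_id  # id in the *full* vocabulary
```

Only the draft is restricted this way; verification still uses the target model's full vocabulary, so output quality is unaffected and only the draft's acceptance rate can change.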

Multi-Token Prediction (MTP)

Some models have built-in multi-token prediction heads. Use speculative decoding to leverage them:
python -m sglang.launch_server \
    --model XiaomiMiMo/MiMo-7B-RL \
    --trust-remote-code \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 2
For DeepSeek MTP usage, see the DeepSeek V3.2 documentation.
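The parameter choice above follows from the model architecture: a single MTP head drafts exactly one extra token per step from the same hidden state, so the draft depth is 1 and at most two tokens are verified per step. A toy sketch of that one-head case (callable stand-ins, not the real model heads):

```python
def mtp_step(hidden_state, main_head, mtp_head):
    # The main head commits the next token as usual; the single MTP head
    # drafts one additional token from the same hidden state.  One head
    # means a draft of length 1, so at most 2 tokens are verified per
    # step -- matching num-steps 1, eagle-topk 1, num-draft-tokens 2.
    next_token = main_head(hidden_state)
    draft_token = mtp_head(hidden_state)
    return next_token, draft_token
```

Models with more MTP heads would support correspondingly deeper drafts.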

Standalone Draft Model

Use a smaller model as a draft for token-level speculation:
python -m sglang.launch_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --speculative-algorithm STANDALONE \
    --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 2 \
    --speculative-num-draft-tokens 7
Standalone speculative decoding does not support --enable-dp-attention.

N-gram Speculation

Use n-gram matching from previous generations (no separate draft model required):
python -m sglang.launch_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --speculative-algorithm NGRAM \
    --speculative-num-draft-tokens 16 \
    --speculative-ngram-max-match-window-size 12 \
    --speculative-ngram-max-bfs-breadth 10
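Conceptually, the drafter matches the longest recent suffix of the sequence against text it has already seen and replays the cached continuation. A simplified sketch (a flat dict lookup stands in for SGLang's BFS over a tree of continuations):

```python
def ngram_draft(tokens, cache, max_window, num_draft_tokens):
    """Propose a draft by suffix-matching against previously seen text.

    cache maps tuple(suffix) -> continuation observed after that suffix.
    """
    # Try the longest suffix first, shrinking the window until a match
    # is found (bounded by --speculative-ngram-max-match-window-size).
    for w in range(min(max_window, len(tokens)), 0, -1):
        key = tuple(tokens[-w:])
        if key in cache:
            return cache[key][:num_draft_tokens]
    return []  # no match: fall back to normal decoding this step
```

This is why n-gram speculation shines on repetitive workloads (code editing, summarization with quoting) and contributes little on novel text.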

N-gram Parameters

| Parameter | Description | Default |
|---|---|---|
| --speculative-ngram-min-match-window-size | Minimum matching window | 1 |
| --speculative-ngram-max-match-window-size | Maximum matching window | 12 |
| --speculative-ngram-min-bfs-breadth | Minimum BFS breadth | 1 |
| --speculative-ngram-max-bfs-breadth | Maximum BFS breadth | 10 |
| --speculative-ngram-capacity | Cache capacity | 10,000,000 |
  • N-gram only supports CUDA
  • Does not support --enable-dp-attention
  • Disables overlap scheduler and mixed chunked prefill

Speculative Decoding V2 (Experimental)

Enable overlap scheduler for improved pipelining:
SGLANG_ENABLE_SPEC_V2=True python -m sglang.launch_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --speculative-algorithm STANDALONE \
    --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5
SpecV2 only supports --speculative-eagle-topk 1. Always set this explicitly when using SpecV2.

OOM Troubleshooting

Speculative decoding increases memory usage. If you encounter OOM errors:

Step 1: Lower Static Memory Fraction

--mem-fraction-static 0.5
This is the most effective adjustment.

Step 2: Reduce CUDA Graph Batch Size

--cuda-graph-max-bs 4  # or even 2

Step 3: Reduce Draft Tree Size

--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4

Step 4: Limit Concurrent Requests

--max-running-requests 4

Quick Recovery Recipe

If OOM, start with this minimal config:
python -m sglang.launch_server \
    --model <your-model> \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path <draft-model> \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --cuda-graph-max-bs 2 \
    --mem-fraction-static 0.5 \
    --max-running-requests 4
Then gradually increase parameters.

Implementation Details

EAGLE Process

  1. Feature Prediction: Draft model predicts next feature vector (last hidden state) using feature sequence and token sequence
  2. Tree Expansion: Branches out multiple continuations with speculative-eagle-topk branching factor
  3. Token Sampling: Samples tokens from lm_head(features)
  4. Verification: Target model verifies all draft tokens in parallel
  5. Acceptance: Accepts longest valid prefix
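Step 5 can be sketched in isolation: each root-to-leaf path in the draft tree is one candidate continuation, and the engine keeps the path sharing the longest prefix with the target model's own tokens. A toy version of that selection (greedy acceptance over token lists, not SGLang's actual tree kernels):

```python
def accept_longest_prefix(candidate_paths, target_tokens):
    """Keep the candidate path with the longest prefix agreeing with
    the target model's (greedy) verification tokens, truncated at the
    first disagreement."""
    def prefix_len(path):
        n = 0
        for a, b in zip(path, target_tokens):
            if a != b:
                break
            n += 1
        return n
    best = max(candidate_paths, key=prefix_len)
    return best[:prefix_len(best)]
```

If no path matches at all, the accepted prefix is empty and decoding proceeds with the target's token alone, so correctness never depends on draft quality.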

SpeculativeAlgorithm Enum

class SpeculativeAlgorithm(Enum):
    EAGLE = auto()    # EAGLE-2
    EAGLE3 = auto()   # EAGLE-3
    STANDALONE = auto()  # Draft model
    NGRAM = auto()    # N-gram matching
    NONE = auto()
Source: python/sglang/srt/speculative/spec_info.py:15

Training EAGLE Models

For training your own EAGLE draft models, see SpecForge, the SGLang team's training framework.

Full Parameter Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| --speculative-algorithm | str | None | EAGLE, EAGLE3, STANDALONE, NGRAM |
| --speculative-draft-model-path | str | None | Path to draft model |
| --speculative-num-steps | int | Auto | Drafting depth |
| --speculative-eagle-topk | int | Auto | Branching factor |
| --speculative-num-draft-tokens | int | Auto | Verification capacity |
| --speculative-token-map | str | None | FR-Spec token map path |
| --speculative-draft-model-quantization | str | Same as target | Draft quantization |
| --speculative-attention-mode | str | "prefill" | "prefill" or "decode" |
| --enable-torch-compile | bool | False | Enable torch.compile |
| --enable-multi-layer-eagle | bool | Auto | Multi-layer EAGLE |

References