Our speculative decoding is considered among the fastest in open-source LLM engines.
## Performance Highlights

Tested on LLaMA-3.1-8B-Instruct with MT-Bench (1× H100):

| Method | Throughput | Speedup |
|---|---|---|
| SGLang baseline | 158.34 tokens/s | 1.0× |
| SGLang + EAGLE-2 | 244.10 tokens/s | 1.54× |
| SGLang + EAGLE-3 | 373.25 tokens/s | 2.36× |
## Quick Guidance

- **Best speed (recommended):** EAGLE-3, via `--speculative-algorithm EAGLE3`
- **Broad compatibility:** EAGLE-2, via `--speculative-algorithm EAGLE`
- **MTP-enabled models:** Multi-Token Prediction, via speculative decoding
- **No draft model:** N-gram, via `--speculative-algorithm NGRAM` (CUDA-only)

## Method Comparison
| Method | Draft Source | Separate Model? | How to Enable | Notes |
|---|---|---|---|---|
| EAGLE-2 | EAGLE draft model | Yes | `--speculative-algorithm EAGLE` | Tune `num-steps`, `eagle-topk`, `num-draft-tokens` |
| EAGLE-3 | EAGLE-3 draft model | Yes | `--speculative-algorithm EAGLE3` | Best throughput |
| MTP | Built-in heads | Often no | See MTP section | Model-specific |
| STANDALONE | Smaller draft LLM | Yes | `--speculative-algorithm STANDALONE` | Token-level drafting |
| NGRAM | N-gram cache | No | `--speculative-algorithm NGRAM` | CUDA-only; no DP attention |
## EAGLE-2 Decoding

EAGLE-2 uses a specialized draft model that predicts feature vectors (hidden states) instead of tokens directly, enabling more accurate speculation.

### Basic Setup
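A minimal launch sketch. The target and draft model paths below are illustrative placeholders; substitute the weights for your own deployment, and note that the tuning values simply mirror the Llama defaults listed under Key Parameters.

```shell
# Launch SGLang with EAGLE-2 speculative decoding.
# Model and draft-model paths are illustrative; substitute your own.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 8
```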
### Making Requests
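Speculative decoding is transparent to clients, so requests need no special fields. A sketch against the native `/generate` endpooint is shown below, assuming the server is on the default port 30000:

```shell
# Query the server as usual; draft/verify happens server-side.
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0, "max_new_tokens": 32}
  }'
```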
### Key Parameters

| Parameter | Description | Default |
|---|---|---|
| `--speculative-num-steps` | Depth of autoregressive drafting | Auto (5 for Llama, 3 for others) |
| `--speculative-eagle-topk` | Branching factor per step | Auto (4 for Llama, 1 for others) |
| `--speculative-num-draft-tokens` | Max parallel verification capacity | Auto (8 for Llama, 4 for others) |
| `--speculative-draft-model-path` | Path to draft model weights | Required |
## EAGLE-3 Decoding

EAGLE-3 improves upon EAGLE-2 by:

- Removing the feature prediction objective
- Incorporating low- and mid-layer features
- Training in an on-policy manner
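A minimal launch sketch; the draft-model path is an illustrative placeholder for an EAGLE-3 checkpoint matching your target model:

```shell
# Launch with EAGLE-3 (paths are illustrative).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B
```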
For training your own EAGLE-3 models, see SpecForge, the SGLang team’s training framework.
## Advanced EAGLE Features

### torch.compile Optimization

Enable kernel-level optimizations for the draft model:
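A sketch adding `--enable-torch-compile` on top of an EAGLE-2 setup (model paths are illustrative):

```shell
# Compile the draft model's kernels with torch.compile.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
  --enable-torch-compile
```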
**When does torch.compile help?** The benefit depends on hardware, model architecture, and batch size. On H100 with small draft models and CUDA graphs enabled, the improvement may be negligible. Always benchmark on your specific setup.
### FR-Spec (Frequency-Ranked Speculation)

Reduce `lm_head` overhead by using a truncated high-frequency token vocabulary:
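A sketch using `--speculative-token-map`; the token-map path is an illustrative example of a published FR-Spec frequency table, so substitute the one matching your model:

```shell
# FR-Spec: restrict draft sampling to a high-frequency token subset.
# The token-map path is illustrative.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
  --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt
```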
## Multi-Token Prediction (MTP)
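For DeepSeek-style models whose MTP (NextN) heads double as a built-in draft, launches follow the EAGLE pattern with a shallow draft tree. The flag values and `--tp 8` below are assumptions for a DeepSeek-V3-class deployment; check your model's documentation:

```shell
# MTP via speculative decoding (DeepSeek-style; values are assumptions).
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3-0324 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --tp 8
```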
Some models have built-in multi-token prediction heads, which speculative decoding can leverage directly.

## Standalone Draft Model
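A sketch pairing a target model with a smaller sibling as the token-level draft; both model paths are illustrative:

```shell
# STANDALONE: a smaller LLM drafts tokens for the larger target.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm STANDALONE \
  --speculative-draft-model-path meta-llama/Llama-3.2-1B-Instruct \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4
```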
Use a smaller model as a draft for token-level speculation.

## N-gram Speculation
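A minimal launch sketch; no draft weights are needed since drafts come from an n-gram cache of previously generated text (model path illustrative, CUDA-only):

```shell
# NGRAM: draft tokens from an n-gram cache; no separate draft model.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM
```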
Use n-gram matching from previous generations (no separate draft model required).

### N-gram Parameters
| Parameter | Description | Default |
|---|---|---|
| `--speculative-ngram-min-match-window-size` | Minimum matching window | 1 |
| `--speculative-ngram-max-match-window-size` | Maximum matching window | 12 |
| `--speculative-ngram-min-bfs-breadth` | Minimum BFS breadth | 1 |
| `--speculative-ngram-max-bfs-breadth` | Maximum BFS breadth | 10 |
| `--speculative-ngram-capacity` | Cache capacity | 10,000,000 |
## Speculative Decoding V2 (Experimental)

Enable the overlap scheduler for improved pipelining.

## OOM Troubleshooting
Speculative decoding increases memory usage. If you encounter OOM errors, work through the following steps.

### Step 1: Lower Static Memory Fraction
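A sketch lowering `--mem-fraction-static`; 0.7 is an illustrative starting value (the default depends on the model), and the model paths are placeholders:

```shell
# Reserve less GPU memory for the static pool (weights + KV cache).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B \
  --mem-fraction-static 0.7
```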
### Step 2: Reduce CUDA Graph Batch Size

Lower `--cuda-graph-max-bs` so fewer, smaller CUDA graphs are captured.
### Step 3: Reduce Draft Tree Size

Decrease `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to shrink the draft tree.
### Step 4: Limit Concurrent Requests

Cap concurrency with `--max-running-requests`.
### Quick Recovery Recipe
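A conservative sketch combining all four mitigations; every value here is an illustrative starting point to tune upward once the server is stable, and the model paths are placeholders:

```shell
# Minimal-memory configuration: low static fraction, tiny CUDA graphs,
# shallow draft tree, capped concurrency.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B \
  --mem-fraction-static 0.6 \
  --cuda-graph-max-bs 4 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --max-running-requests 16
```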
If you encounter OOM, start from a minimal conservative configuration and relax the limits as memory allows.

## Implementation Details
### EAGLE Process
- **Feature Prediction**: The draft model predicts the next feature vector (last hidden state) from the feature sequence and token sequence
- **Tree Expansion**: Branches out multiple continuations with branching factor `--speculative-eagle-topk`
- **Token Sampling**: Samples tokens from `lm_head(features)`
- **Verification**: The target model verifies all draft tokens in parallel
- **Acceptance**: Accepts the longest valid prefix
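The verification and acceptance steps can be sketched as follows. This is a toy illustration over a single draft chain; SGLang's actual implementation verifies a token tree and lives in `python/sglang/srt/speculative/`:

```python
# Toy sketch of verify-and-accept: the target model scores all draft
# tokens in parallel, then the longest agreeing prefix is accepted.
def accept_longest_prefix(draft_tokens, target_tokens):
    """Return the longest prefix of draft_tokens matching target_tokens."""
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # first mismatch ends acceptance
        accepted.append(drafted)
    return accepted

# Draft proposed 4 tokens; the target agrees with the first 3.
print(accept_longest_prefix([11, 42, 7, 99], [11, 42, 7, 13]))  # [11, 42, 7]
```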
The `SpeculativeAlgorithm` enum is defined in python/sglang/srt/speculative/spec_info.py:15.
## Training EAGLE Models

For training your own EAGLE draft models:

- EAGLE-2: see the EAGLE repo
- EAGLE-3: see SpecForge and the accompanying blog post
## Full Parameter Reference

### Core Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-algorithm` | str | None | EAGLE, EAGLE3, STANDALONE, NGRAM |
| `--speculative-draft-model-path` | str | None | Path to draft model |
| `--speculative-num-steps` | int | Auto | Drafting depth |
| `--speculative-eagle-topk` | int | Auto | Branching factor |
| `--speculative-num-draft-tokens` | int | Auto | Verification capacity |
### Advanced Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-token-map` | str | None | FR-Spec token map path |
| `--speculative-draft-model-quantization` | str | Same as target | Draft quantization |
| `--speculative-attention-mode` | str | "prefill" | "prefill" or "decode" |
| `--enable-torch-compile` | bool | False | Enable torch.compile |
| `--enable-multi-layer-eagle` | bool | Auto | Multi-layer EAGLE |
## References
- EAGLE-2 Paper
- EAGLE-3 Paper
- FR-Spec Paper
- S-LoRA Paper (tensor sharding strategy)
- SpecForge Training Framework
