## How it works
Speculative decoding works by:

- A fast draft model generates multiple candidate tokens
- The target model verifies all candidates in parallel in a single forward pass
- Accepted tokens are kept; rejected tokens are resampled
- The process repeats until the sequence is complete
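The draft-then-verify loop can be sketched in plain Python. This is a simplified greedy-verification variant (the full algorithm uses rejection sampling over token probabilities); token IDs are illustrative:

```python
def speculative_step(draft_tokens, target_tokens):
    """One draft-then-verify step (greedy variant).

    draft_tokens:  k candidate tokens proposed by the draft model
    target_tokens: the k+1 tokens the target model emits at each position,
                   obtained from a single parallel forward pass
    Returns the tokens actually kept this step.
    """
    accepted = []
    for i, d in enumerate(draft_tokens):
        if d == target_tokens[i]:
            # Candidate matches the target model's choice: keep it.
            accepted.append(d)
        else:
            # First mismatch: take the target's token instead and stop.
            accepted.append(target_tokens[i])
            return accepted
    # All k candidates accepted: the same forward pass also yields
    # one "bonus" token from the target model.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# Draft proposes 4 tokens; the target disagrees at position 2.
print(speculative_step([5, 9, 3, 7], [5, 9, 4, 7, 8]))  # -> [5, 9, 4]
```

Note that even in the worst case (the first candidate is rejected) the step still produces one valid token, so progress is never slower than one token per target forward pass.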
vLLM's speculative decoding is algorithmically lossless: the rejection sampler is validated to match the target model's distribution, up to floating-point precision. Minor output variations may still occur because hardware numerics and batch-size changes affect numerical stability.
## Speculation methods
vLLM supports multiple speculative decoding methods optimized for different scenarios.

### EAGLE (best latency)

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) uses a lightweight auto-regressive model trained to predict next tokens from the target model's hidden states.

- Best latency reduction (typically 2-3x speedup)
- Requires trained EAGLE model for your target model
- Higher memory overhead during peak traffic
### Draft model

Use a smaller model from the same family as the target model as the draft.

- Good latency reduction (1.5-2.5x speedup)
- No special training required - use any smaller model
- Works well when draft and target models are from the same family
### MLP speculator

Uses a multi-layer perceptron trained on the target model's embeddings to predict tokens.

- Minimal memory overhead
- Requires trained MLP speculator
- Good for memory-constrained scenarios
### N-gram (no additional model)

Uses n-gram matching to find repeated patterns in the prompt and predict likely continuations.

- No additional model or memory required
- Modest speedup (1.2-1.5x)
- Works best with repetitive text
- Safe to use during peak traffic
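The n-gram idea above can be illustrated with a toy matcher. This is a sketch of the general technique, not vLLM's internal implementation; parameter names (`n`, `k`) are illustrative:

```python
def ngram_propose(tokens, n=3, k=4):
    """Propose up to k continuation tokens by matching the last n tokens
    of the context against an earlier occurrence in the same context."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Whatever followed that occurrence becomes the draft.
            return tokens[start + n:start + n + k]
    return []

ctx = [10, 11, 12, 13, 10, 11, 12]   # the trigram (10, 11, 12) repeats
print(ngram_propose(ctx, n=3, k=4))  # -> [13, 10, 11, 12]
```

Because the proposals come from string matching alone, this method costs no extra model memory, which is why it stays safe under peak traffic.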
### Suffix decoding

Optimized for scenarios where outputs share common suffixes.

- Effective for templated outputs
- No additional model needed
- Best for structured generation tasks
## Configuration parameters

### Number of speculative tokens

How many tokens the draft model generates before verification.
- Higher values: more potential speedup but lower acceptance rate
- Lower values: higher acceptance rate but less parallelism
- Typical range: 3-5 for EAGLE/draft models, 2-3 for MLP
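The tradeoff between speculation depth and acceptance rate can be made concrete with a simplified model from the speculative decoding literature: if each drafted token is accepted independently with probability `a`, the expected number of tokens emitted per target forward pass with `k` speculative tokens is the geometric sum `(1 - a^(k+1)) / (1 - a)`. A sketch:

```python
def expected_tokens_per_step(a, k):
    """Expected tokens emitted per target forward pass, assuming each
    drafted token is accepted independently with probability a
    (a simplification; real acceptance is position-dependent)."""
    if a == 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)

# Diminishing returns: raising k helps less and less at a fixed
# acceptance rate, while each extra token still costs draft compute.
for k in (2, 3, 5):
    print(k, round(expected_tokens_per_step(0.7, k), 2))
```

This is why the typical ranges above are small: beyond a few tokens, the marginal speedup per extra speculative token shrinks quickly.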
### Parallel drafting

Enable parallel token generation in the draft model.
- Can improve throughput for certain draft models
- May increase memory usage
### Disable padded drafter batch

Disable padding in draft model batches for potentially better performance.
## Complete example

Here's a complete example using speculative decoding with EAGLE; see `examples/offline_inference/spec_decode.py` in the vLLM repository.

## Performance metrics
Monitor these metrics to evaluate speculative decoding effectiveness:

- Acceptance rate: Percentage of drafted tokens accepted by the target model
- Mean acceptance length: Average number of tokens accepted per draft
- Speedup: Actual latency reduction achieved
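The complete EAGLE example mentioned above amounts to passing a speculative config when constructing the engine. A minimal launch sketch, assuming a Llama target with a matching EAGLE checkpoint (both model names are illustrative placeholders; substitute your own target model and an EAGLE speculator trained for it):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # illustrative target model
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # illustrative EAGLE checkpoint
        "num_speculative_tokens": 5,
    },
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

This is a GPU-dependent launch fragment, not a runnable-anywhere script; consult `examples/offline_inference/spec_decode.py` for the authoritative version.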
## Best practices

### When to use speculative decoding

✅ Use when:

- You have medium-to-low query per second (QPS) workloads
- Latency is more important than throughput
- Memory is available for draft model
- Outputs are relatively short to medium length
❌ Avoid when:

- Running at maximum throughput/capacity
- Memory is constrained
- Generating very long sequences (decoding time dominates)
- Batch sizes are already large
### Method selection guide
- Lowest latency: EAGLE or draft model
- No memory overhead: N-gram or suffix decoding
- Peak traffic safe: N-gram, suffix, or MLP
- Easiest setup: N-gram (no additional model needed)
### Tuning acceptance rate

If acceptance rate is low (<40%):

- Reduce `num_speculative_tokens`
- Try a different draft model (if using the draft model method)
- Consider whether your workload suits speculative decoding

If acceptance rate is high:

- Increase `num_speculative_tokens` for more speedup
- Consider using a more aggressive draft model
## Known limitations

- Speculative decoding with draft models is not supported in vLLM <= 0.10.0
- The `min_p` and `logit_bias` sampling parameters are not yet supported with speculative decoding
- Output variations may occur due to floating-point precision differences
## Lossless guarantees

vLLM's speculative decoding provides three levels of losslessness:

- Theoretical: Lossless up to floating-point precision limits
- Algorithmic: Rejection sampler validated to match the target distribution
- Practical: Greedy sampling with speculative decoding matches greedy sampling without it
## Training your own speculators

To train custom EAGLE or MLP speculators optimized for your target model, see vllm-project/speculators for training scripts and integration guides.

## Related resources

- Speculative Decoding paper - Original research
- Sampling parameters - Configure generation behavior
- Source: docs/features/speculative_decoding/README.md
- Example: examples/offline_inference/spec_decode.py