## How It Works

Speculative decoding operates on a simple principle:

1. A draft model (smaller, faster) generates candidate tokens.
2. The target model (larger, more accurate) verifies all candidates in parallel.
3. Matching tokens are accepted; non-matching tokens trigger rejection and resampling.
4. The process repeats until the sequence is complete.
Speculation is most effective at low batch sizes, where GPU utilization is not saturated; speedups are therefore typically observed when the batch size is small.
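The loop above can be sketched in plain Python. This is a toy greedy version, not TensorRT-LLM's implementation: the "models" are stand-in callables mapping a token sequence to the next token, and the parallel verification step is simulated with a list comprehension rather than a batched forward pass.

```python
# Toy sketch of the greedy draft/verify loop described above.
# A "model" here is any callable: token sequence -> next token.

def speculative_decode(draft_model, target_model, prompt, max_draft_len, num_tokens):
    """Generate num_tokens tokens, accepting draft tokens the target agrees with."""
    seq = list(prompt)
    while len(seq) < len(prompt) + num_tokens:
        # 1. Draft model proposes up to max_draft_len candidate tokens.
        draft = []
        for _ in range(max_draft_len):
            draft.append(draft_model(seq + draft))
        # 2. Target model scores every candidate position; a real system
        #    does this in one batched forward pass over all positions.
        targets = [target_model(seq + draft[:i]) for i in range(len(draft) + 1)]
        # 3. Accept the longest matching prefix; the first mismatch is
        #    replaced by the target's own token (rejection + resampling).
        accepted = 0
        while accepted < len(draft) and draft[accepted] == targets[accepted]:
            accepted += 1
        seq.extend(draft[:accepted])
        seq.append(targets[accepted])  # the target always contributes one token
    return seq[len(prompt):][:num_tokens]
```

When the draft agrees with the target, each iteration emits up to `max_draft_len + 1` tokens for a single (batched) target pass; when it never agrees, the loop degrades to ordinary one-token-per-step decoding.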
## Supported Methods
TensorRT-LLM supports multiple speculative decoding algorithms:

- **Draft/Target**: use any smaller model as a draft for the target model
- **EAGLE 3**: specialized draft models trained for speculative decoding
- **N-gram**: prompt lookup decoding using token prefix matching
- **MTP**: multi-token prediction for DeepSeek models
## Quick Start Examples
### Draft/Target Decoding

The simplest form of speculative decoding uses an arbitrary draft model.
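A draft/target setup can be expressed in the `config.yaml` format covered later in this guide. The `max_draft_len` and `speculative_config.speculative_model` fields appear elsewhere in this document; the `decoding_type` string and the model path are illustrative assumptions to check against your TensorRT-LLM version.

```yaml
# Hypothetical config.yaml for draft/target decoding.
# decoding_type value and model path are assumptions.
speculative_config:
  decoding_type: DraftTarget
  max_draft_len: 4
  speculative_model: <path or HF id of the smaller draft model>
```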
### EAGLE 3 Decoding

EAGLE 3 uses specialized draft models trained specifically for speculative decoding.

#### Available EAGLE 3 Models
- LLaMA 3
- LLaMA 4 Maverick
- Other Models
Use checkpoints from the original EAGLE 3 authors:

- yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
- yuhuili/EAGLE3-LLaMA3.1-Instruct-70B
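As a sketch, an EAGLE 3 run could be configured like the fragment below; the `max_draft_len` field and the checkpoint name come from this guide, while the `decoding_type` string is an assumption to verify against your TensorRT-LLM version.

```yaml
# Hypothetical config.yaml for EAGLE 3 decoding.
# decoding_type value is an assumption.
speculative_config:
  decoding_type: Eagle
  max_draft_len: 4
  speculative_model: yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
```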
TensorRT-LLM supports a modified version of EAGLE 3. Tree structures for draft sequences are not supported; instead, each request uses a single sequence of draft tokens with length `max_draft_len`.

### N-gram Decoding (Prompt Lookup)

N-gram decoding maintains a map from token prefixes to candidate draft sequences.

#### N-gram Configuration Options
| Parameter | Type | Description |
|---|---|---|
| `max_draft_len` | int | Maximum draft candidate length |
| `max_matching_ngram_size` | int | Maximum prompt suffix length to match with keys in the pool |
| `is_public_pool` | bool | If true, a single n-gram pool is shared across all requests |
| `is_keep_all` | bool | If true, all draft candidates are retained forever; otherwise, only the longest candidate is retained |
| `is_use_oldest` | bool | If true, use the oldest draft candidate; only applicable when `is_keep_all == True` |
N-gram decoding works particularly well for repetitive text patterns and when prompts share common prefixes.
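The pool can be sketched in plain Python. This is a simplified interpretation, not TensorRT-LLM's implementation: it keeps only the most recent candidate per key (i.e. it ignores the `is_keep_all` / `is_use_oldest` policies), and the parameter names mirror the table above.

```python
# Toy sketch of an n-gram pool for prompt lookup decoding:
# keys are token prefixes (n-grams seen so far), values are the
# tokens that followed them, used as draft candidates.

def build_pool(tokens, max_matching_ngram_size, max_draft_len):
    """Map each n-gram key to the candidate tokens that followed it."""
    pool = {}
    for n in range(1, max_matching_ngram_size + 1):
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            candidate = tuple(tokens[i + n:i + n + max_draft_len])
            pool[key] = candidate  # most recent occurrence wins (simplification)
    return pool

def propose_draft(tokens, pool, max_matching_ngram_size):
    """Match the longest suffix of the sequence against pool keys."""
    for n in range(max_matching_ngram_size, 0, -1):
        key = tuple(tokens[-n:])
        if key in pool:
            return list(pool[key])
    return []  # no match: fall back to normal decoding for this step
```

Because lookups are pure dictionary matches, drafting costs no extra model forward passes, which is why this method shines on repetitive text.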
### MTP (Multi-Token Prediction)

MTP is currently supported only for DeepSeek models.

#### MTP Configuration Options
| Parameter | Type | Description |
|---|---|---|
| `max_draft_len` | int | Maximum draft candidate length (must match `num_nextn_predict_layers`) |
| `num_nextn_predict_layers` | int | Number of MTP modules to use |
| `use_relaxed_acceptance_for_thinking` | bool | Use relaxed decoding for reasoning models during the thinking phase |
| `relaxed_topk` | int | Top-K tokens sampled for the relaxed decoding candidate set |
| `relaxed_delta` | float | Delta threshold for filtering relaxed decoding candidates |
Relaxed acceptance mode allows draft tokens to be accepted if they appear in a candidate set, providing more flexibility during the reasoning phase.
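One plausible reading of how `relaxed_topk` and `relaxed_delta` interact can be sketched as follows; this is a simplified interpretation for intuition, not the exact TensorRT-LLM implementation.

```python
# Sketch of relaxed acceptance: a draft token is accepted if it lands
# in a candidate set built from the target's top-K tokens, filtered by
# how far each token's log-probability falls below the best one.

def relaxed_candidate_set(logprobs, relaxed_topk, relaxed_delta):
    """Tokens eligible for relaxed acceptance: take the top-K tokens,
    then drop any whose log-prob trails the best by more than delta."""
    ranked = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:relaxed_topk]
    best_logprob = top[0][1]
    return {tok for tok, lp in top if best_logprob - lp <= relaxed_delta}

def accept_draft_token(draft_token, logprobs, relaxed_topk, relaxed_delta):
    """Relaxed check: membership in the candidate set, not exact match."""
    return draft_token in relaxed_candidate_set(logprobs, relaxed_topk, relaxed_delta)
```

A larger `relaxed_topk` or `relaxed_delta` widens the candidate set, trading strict fidelity to the target distribution for a higher acceptance rate during the thinking phase.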
### User-Provided Drafting

For advanced use cases, you can implement custom drafting logic.

## Using with trtllm-bench and trtllm-serve
Speculative decoding options must be specified via `--config config.yaml` for both `trtllm-bench` and `trtllm-serve`.
### YAML Configuration

Create a `config.yaml` file:
- EAGLE 3
- Draft/Target
- N-gram
- MTP
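For the n-gram and MTP cases, the configuration might look like the fragments below. The parameter names are taken from the tables earlier in this guide; the `decoding_type` strings are assumptions to check against your TensorRT-LLM version.

```yaml
# Hypothetical config.yaml fragments; decoding_type strings are assumptions.

# N-gram:
speculative_config:
  decoding_type: NGram
  max_draft_len: 4
  max_matching_ngram_size: 2

# MTP (DeepSeek models) -- alternative contents for the same key:
# speculative_config:
#   decoding_type: MTP
#   num_nextn_predict_layers: 1
#   use_relaxed_acceptance_for_thinking: true
```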
### Running Benchmarks

### Serving with Speculation
The field name `speculative_model_dir` can also be used as an alias for `speculative_config.speculative_model`.

## Performance Considerations
### When to Use Speculative Decoding
Speculative decoding is most effective when:
- Batch size is small (1-4 requests)
- GPU utilization is low
- Latency reduction is more important than throughput
- Draft model is significantly faster than target model
### Choosing `max_draft_len`
- Higher values (4-6): Better speedup potential, but lower acceptance rate
- Lower values (2-3): Higher acceptance rate, but less speedup per iteration
- Optimal value depends on model similarity and task
- Start with 3 and tune based on acceptance rate metrics
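This tradeoff can be made concrete with a back-of-the-envelope model (an assumption following the standard speculative sampling analysis, not a TensorRT-LLM metric: each draft token is accepted independently with a fixed rate):

```python
# Expected tokens emitted per target verification pass, assuming each
# draft token is accepted independently with probability acceptance_rate.
# This is the geometric sum 1 + a + a^2 + ... + a^max_draft_len.

def expected_tokens_per_step(acceptance_rate, max_draft_len):
    return sum(acceptance_rate ** k for k in range(max_draft_len + 1))
```

Under this model, at an acceptance rate of 0.5 raising the draft length from 3 to 6 adds only about 0.11 expected tokens per step, while at 0.8 the same change adds roughly 1.0; this is why the draft length should be tuned against measured acceptance rates.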
### Draft Model Selection
- Use models from the same family (better tokenizer match)
- Smaller draft models (1-3B) work well with larger targets (7B+)
- EAGLE-trained models typically achieve higher acceptance rates
- Test acceptance rate on representative prompts
## Backend Support
## Additional Resources

- **EAGLE 3 Paper**: technical paper, EAGLE-3: Scaling up Inference Acceleration
- **Speculative Decoding Modules**: browse NVIDIA’s draft model collection
- **Prompt Lookup Decoding**: original N-gram implementation reference