Speculative decoding is a technique for accelerating LLM inference at low batch sizes. A lightweight drafting mechanism proposes candidate tokens, and the target model verifies them in a single forward pass. Tokens that match are accepted, reducing the number of sequential forward passes needed.

How It Works

Speculative decoding operates on a simple principle:
  1. A draft model (smaller/faster) generates candidate tokens
  2. The target model (larger/accurate) verifies all candidates in parallel
  3. Matching tokens are accepted, non-matching tokens trigger rejection and resampling
  4. The process repeats until the sequence is complete
Speculation is most effective at low batch sizes, where GPU utilization is not yet saturated and spare compute is available to verify draft tokens in parallel; expect diminishing returns as batch size grows.
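The steps above can be sketched as a toy greedy loop. The draft and target models are stand-in callables here, not the TensorRT-LLM implementation:

```python
def speculative_decode(prompt_ids, draft_step, target_forward,
                       max_draft_len=3, max_new_tokens=8):
    """Toy greedy speculative decoding loop.

    draft_step(ids) -> candidate token ids proposed by the draft model
    target_forward(seq) -> preds, where preds[j] is the target model's
    greedy prediction for the token at position j + 1 of seq.
    """
    ids = list(prompt_ids)
    produced = 0
    while produced < max_new_tokens:
        # 1. The draft model proposes up to max_draft_len candidate tokens.
        draft = draft_step(ids)[:max_draft_len]
        # 2. The target model scores prompt + draft in a single forward pass.
        preds = target_forward(ids + draft)
        # 3. Accept the longest prefix of the draft that matches the target.
        n_accept = 0
        for i, tok in enumerate(draft):
            if preds[len(ids) + i - 1] == tok:
                n_accept += 1
            else:
                break
        ids += draft[:n_accept]
        # 4. The target's own next-token prediction is always usable, so each
        #    iteration makes at least one token of progress even on rejection.
        ids.append(preds[len(ids) - 1])
        produced += n_accept + 1
    return ids

# Toy stand-ins: the target greedily continues n -> (n + 1) % 10,
# and the draft model happens to agree with it.
target_forward = lambda seq: [(t + 1) % 10 for t in seq]
draft_step = lambda ids: [(ids[-1] + k) % 10 for k in (1, 2, 3)]
```

With a perfectly matching draft, each iteration accepts all three candidates plus the target's bonus token, so four tokens are produced per target forward pass instead of one.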

Supported Methods

TensorRT-LLM supports multiple speculative decoding algorithms:

Draft/Target

Use any smaller model as a draft for the target model

EAGLE 3

Specialized draft models trained for speculative decoding

N-gram

Prompt lookup decoding using token prefix matching

MTP

Multi-token prediction for DeepSeek models

Quick Start Examples

Draft/Target Decoding

The simplest form of speculative decoding uses an arbitrary draft model:
from tensorrt_llm.llmapi import DraftTargetDecodingConfig
from tensorrt_llm import LLM

# Option 1: Use a HuggingFace Hub model ID (auto-downloaded).
# Any standalone model that shares the target's tokenizer works as a draft,
# e.g. a smaller model from the same family:
speculative_config = DraftTargetDecodingConfig(
    max_draft_len=3,
    speculative_model="meta-llama/Llama-3.2-1B-Instruct"
)

# Option 2: Use a local path
# speculative_config = DraftTargetDecodingConfig(
#     max_draft_len=3,
#     speculative_model="/path/to/draft_model"
# )

llm = LLM(
    "/path/to/target_model",
    speculative_config=speculative_config,
    disable_overlap_scheduler=True
)

outputs = llm.generate("What is the capital of France?")
Make sure the draft and target models use the same tokenizer. Mismatched tokenizers result in extremely low acceptance rates and degraded performance.
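A quick sanity check for tokenizer compatibility might look like the sketch below. It compares vocabulary size and the encoding of a probe string, which catches most mismatches but is not exhaustive:

```python
def tokenizers_match(target_tok, draft_tok, probe="The quick brown fox"):
    """Rough compatibility check between a target and draft tokenizer:
    same vocab size and identical encoding of a probe string."""
    return (target_tok.vocab_size == draft_tok.vocab_size
            and target_tok.encode(probe) == draft_tok.encode(probe))

# With Hugging Face tokenizers this would be used as:
#   from transformers import AutoTokenizer
#   tokenizers_match(AutoTokenizer.from_pretrained(target_path),
#                    AutoTokenizer.from_pretrained(draft_path))
```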

EAGLE 3 Decoding

EAGLE 3 uses specialized draft models trained specifically for speculative decoding:
from tensorrt_llm.llmapi import Eagle3DecodingConfig
from tensorrt_llm import LLM

model = "meta-llama/Llama-3.1-8B-Instruct"
speculative_model = "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"

speculative_config = Eagle3DecodingConfig(
    max_draft_len=3,
    speculative_model=speculative_model
)

llm = LLM(model, speculative_config=speculative_config)
outputs = llm.generate("Explain quantum computing")

Available EAGLE 3 Models

Use checkpoints from the original EAGLE 3 authors:
  • yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
  • yuhuili/EAGLE3-LLaMA3.1-Instruct-70B
TensorRT-LLM supports a modified version of EAGLE 3. Tree structures for draft sequences are not supported; instead, each request uses a single sequence of draft tokens with length max_draft_len.

N-gram Decoding (Prompt Lookup)

N-gram decoding maintains a map from token prefixes to candidate draft sequences:
from tensorrt_llm.llmapi import NGramDecodingConfig
from tensorrt_llm import LLM

speculative_config = NGramDecodingConfig(
    max_draft_len=3,
    max_matching_ngram_size=4,
    is_public_pool=True
)

llm = LLM(
    "/path/to/target_model",
    speculative_config=speculative_config,
    disable_overlap_scheduler=True
)

outputs = llm.generate("The quick brown fox jumps over the lazy dog")

N-gram Configuration Options

Parameter                 Type   Description
max_draft_len             int    Maximum draft candidate length
max_matching_ngram_size   int    Maximum prompt suffix length to match with keys in the pool
is_public_pool            bool   If true, a single n-gram pool is shared across all requests
is_keep_all               bool   If true, draft candidates are retained forever; otherwise, only the largest candidate is retained
is_use_oldest             bool   If true, use the oldest draft candidate; only applicable when is_keep_all is true
N-gram decoding works particularly well for repetitive text patterns and when prompts share common prefixes.
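The prefix-matching idea can be sketched as a toy lookup over a single token sequence. The real implementation maintains a persistent pool across decoding steps and, optionally, across requests:

```python
def ngram_draft(tokens, max_matching_ngram_size=4, max_draft_len=3):
    """Toy prompt-lookup drafting: find the most recent earlier occurrence
    of the longest suffix of `tokens` (up to max_matching_ngram_size) and
    propose the tokens that followed it as draft candidates."""
    for n in range(min(max_matching_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan right-to-left for the most recent earlier match of the suffix.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + max_draft_len]
                if continuation:
                    return continuation
    return []  # no match: fall back to ordinary decoding
```

For a sequence like [1, 2, 3, 4, 1, 2], the suffix [1, 2] matches its earlier occurrence, so the tokens that followed it ([3, 4, 1]) become the draft, which is why repetitive text accepts so well.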

MTP (Multi-Token Prediction)

MTP is currently only supported for DeepSeek models:
from tensorrt_llm.llmapi import MTPDecodingConfig
from tensorrt_llm import LLM

speculative_config = MTPDecodingConfig(
    max_draft_len=3,
    num_nextn_predict_layers=3
)

llm = LLM(
    "/path/to/deepseek_model",
    speculative_config=speculative_config
)

outputs = llm.generate("Solve the following problem: 2 + 2 = ?")

MTP Configuration Options

Parameter                            Type    Description
max_draft_len                        int     Maximum draft candidate length (must match num_nextn_predict_layers)
num_nextn_predict_layers             int     Number of MTP modules to use
use_relaxed_acceptance_for_thinking  bool    Use relaxed decoding for reasoning models in the thinking phase
relaxed_topk                         int     Top K tokens sampled for the relaxed decoding candidate set
relaxed_delta                        float   Delta threshold for filtering relaxed decoding candidates
Relaxed acceptance mode allows draft tokens to be accepted if they appear in a candidate set, providing more flexibility during the reasoning phase.
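A toy acceptance check illustrates how the two parameters interact. This is one plausible reading of relaxed_topk and relaxed_delta (accept a draft token if it is in the target's top-k and its probability is within delta of the best candidate), not the exact TensorRT-LLM rule:

```python
def relaxed_accept(draft_token, target_probs, relaxed_topk=5, relaxed_delta=0.1):
    """Toy relaxed acceptance: build a candidate set from the target's
    top-k tokens, keep those whose probability is within relaxed_delta
    of the best one, and accept the draft token if it is in that set."""
    ranked = sorted(range(len(target_probs)),
                    key=lambda i: target_probs[i], reverse=True)
    top = ranked[:relaxed_topk]
    best = target_probs[top[0]]
    candidates = {i for i in top if best - target_probs[i] <= relaxed_delta}
    return draft_token in candidates
```

The effect is that near-ties with the target's favorite token are not rejected, which keeps acceptance rates up during long chains of reasoning where many continuations are almost equally plausible.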

User-Provided Drafting

For advanced use cases, you can implement custom drafting logic:
from tensorrt_llm.llmapi import UserProvidedDecodingConfig
from tensorrt_llm import LLM

# Implement your custom drafter (sketch; match the exact method
# signature of the Drafter interface in your TensorRT-LLM version)
class MyDrafter:
    def prepare_draft_tokens(self, *args, **kwargs):
        # Your custom drafting logic goes here
        pass

speculative_config = UserProvidedDecodingConfig(
    max_draft_len=3,
    drafter=MyDrafter()
)

llm = LLM(
    "/path/to/target_model",
    speculative_config=speculative_config
)

Using with trtllm-bench and trtllm-serve

Speculative decoding options must be specified via --config config.yaml for both trtllm-bench and trtllm-serve.

YAML Configuration

Create a config.yaml file:
speculative_config:
  decoding_type: Eagle3
  max_draft_len: 4
  speculative_model: yuhuili/EAGLE3-LLaMA3.1-Instruct-8B

Running Benchmarks

trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /path/to/dataset.json \
  --config config.yaml

Serving with Speculation

trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
  --config config.yaml \
  --port 8000
The field name speculative_model_dir can also be used as an alias for speculative_config.speculative_model.

Performance Considerations

Speculative decoding is most effective when:
  • Batch size is small (1-4 requests)
  • GPU utilization is low
  • Latency reduction is more important than throughput
  • Draft model is significantly faster than target model
Choosing max_draft_len:
  • Higher values (4-6): better speedup potential, but lower acceptance rate
  • Lower values (2-3): higher acceptance rate, but less speedup per iteration
  • The optimal value depends on model similarity and task
  • Start with 3 and tune based on acceptance rate metrics

Choosing a draft model:
  • Use models from the same family (better tokenizer match)
  • Smaller draft models (1-3B) work well with larger targets (7B+)
  • EAGLE-trained models typically achieve higher acceptance rates
  • Test acceptance rate on representative prompts
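The draft-length tradeoff can be made concrete. Under the simplifying assumption that each draft token is accepted independently with probability alpha, the expected number of tokens generated per target forward pass with draft length k is (1 - alpha^(k+1)) / (1 - alpha), a standard result from the speculative decoding literature (ignoring draft-model cost):

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens generated per target forward pass, assuming each
    draft token is accepted independently with probability alpha.
    Closed form of 1 + alpha + alpha**2 + ... + alpha**k."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

With alpha = 0.8, going from k = 3 (~2.95 tokens/pass) to k = 5 (~3.69) still helps; with alpha = 0.5 the curve flattens quickly (~1.88 at k = 3 vs ~1.97 at k = 5), so longer drafts mostly waste draft-model work.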

Backend Support

The PyTorch backend currently only supports EAGLE 3. The decoding_type: Eagle is accepted as a backward-compatible alias for Eagle3, but EAGLE v1/v2 draft checkpoints are incompatible.

Additional Resources

EAGLE 3 Paper

Technical paper: EAGLE-3: Scaling up Inference Acceleration

Speculative Decoding Modules

Browse NVIDIA’s draft model collection

Prompt Lookup Decoding

Original N-gram implementation reference
