Overview
Speculative decoding works by generating draft tokens quickly and then verifying them with the target model in a single batch. When the draft predictions are frequently correct, this approach provides substantial speedups.

How It Works
1. Draft Generation: a smaller, faster draft model (or pattern matcher) generates multiple candidate tokens
2. Batch Verification: the main model verifies all draft tokens in a single forward pass (like prompt processing)
Quick Start
With Draft Model
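A minimal invocation might look like the following; the model paths are placeholders, and `-md` / `--model-draft` selects the draft model:

```shell
# Serve a large target model with a small draft model for speculation.
# Model paths are illustrative; -ngl / -ngld offload both models to the GPU.
llama-server \
  -m  models/llama-3.1-70b-q4_k_m.gguf \
  -md models/llama-3.2-1b-q4_k_m.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-min 4
```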
Without Draft Model (Pattern Matching)
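Pattern matching needs no second model. The flag that selects the implementation is shown here as `--spec-type`, which is a placeholder; check `llama-server --help` for the exact name in your build. The n-gram size flags are the ones used elsewhere on this page:

```shell
# Pattern-based speculation: no draft model required.
# NOTE: --spec-type is a placeholder for the implementation-selection flag;
# verify the exact flag name with `llama-server --help`.
llama-server \
  -m models/llama-3.1-70b-q4_k_m.gguf \
  -ngl 99 \
  --spec-type ngram-simple \
  --spec-ngram-size-n 4 --spec-ngram-size-m 8
```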
Implementations
llama-server supports several speculative decoding implementations that can be mixed.

Draft Model
A smaller model generates draft tokens. This is the most common approach.

- `-md, --model-draft` — path to the draft model file (GGUF format)
- `--draft-max` — number of tokens to draft per iteration
- `--draft-min` — minimum number of draft tokens to use
- `--draft-p-min` — minimum probability for accepting draft tokens (greedy threshold)
N-gram Simple
Searches the token history for the last matching n-gram and uses the following m tokens as the draft.

Best for: code refactoring, iterating over similar text

- Minimal overhead
- No additional model needed
- Relies on patterns already in context
- Works well when text has repetitive structure
N-gram Map (Key)
Looks up the current n-gram in the token history and creates drafts from frequently repeated sequences.

Best for: repetitive tasks, structured output

- `--spec-ngram-size-n` — length of the lookup n-gram (how many tokens to look back)
- `--spec-ngram-size-m` — length of the draft m-gram (how many tokens to draft)
- `--spec-ngram-min-hits` — minimum occurrences before a sequence is used as a draft

- Uses an internal hash map of n-grams
- Tracks acceptance statistics
- Configurable minimum-occurrences threshold
N-gram Map Key-4-Values (Experimental)
Tracks up to four possible continuations for each n-gram key and selects the most frequent.

Best for: scenarios with multiple common continuations

- Experimental implementation
- Tracks multiple possible continuations
- Useful for longer repetitions
N-gram Mod
Uses a hash pool with an LCG (linear congruential generator) for n-gram storage.

Best for: long-running servers, reasoning models, summarization

- Lightweight (~16 MB memory)
- Constant memory and complexity
- Variable draft lengths
- Shared hash pool across all server slots (different requests benefit each other)
- Iterating over blocks of text/code
- Reasoning models (repeating thinking in final answer)
- Summarization tasks
N-gram Cache
Maintains statistics about short n-gram sequences and can load external statistics from files.

- Computes drafts using probability statistics
- Can improve with external data
- Memory overhead for statistics
Configuration
Draft Model Settings
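Combining the draft-model options from above (the values shown are illustrative starting points, not tuned recommendations):

```shell
# Draft between 4 and 16 tokens per pass, accepting them only above p = 0.75.
llama-server \
  -m  models/target.gguf \
  -md models/draft.gguf \
  --draft-max 16 \
  --draft-min 4 \
  --draft-p-min 0.75
```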
Threading
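Assuming the `-td` / `--threads-draft` option (confirm with `llama-server --help` in your build), thread counts can be set per model:

```shell
# -t sets threads for the main model, -td for the draft model
# (-td / --threads-draft is assumed here; verify with --help).
llama-server -m models/target.gguf -md models/draft.gguf -t 8 -td 4
```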
KV Cache for Draft
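A sketch using the draft-cache flags referenced in the troubleshooting section (`-ctkd`, `-ctvd`, `-cd`); the values are illustrative:

```shell
# Quantize the draft model's KV cache to q8_0 and cap its context at 2048
# to reduce memory use.
llama-server -m models/target.gguf -md models/draft.gguf \
  -ctkd q8_0 -ctvd q8_0 -cd 2048
```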
Choosing an Implementation
Decision Matrix
| Use Case | Recommended Implementation | Reason |
|---|---|---|
| General speedup | Draft model | Best overall performance |
| Code refactoring | ngram-simple | Repeated patterns |
| Structured output | ngram-map-k | Frequent sequences |
| Long sessions | ngram-mod | Shared learning across requests |
| Reasoning models | ngram-mod | Captures thinking patterns |
| Limited memory | ngram-simple | Minimal overhead |
| Maximum speed | Draft model + ngram-mod | Hybrid approach |
Combining Implementations
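A hybrid setup might pair a draft model with the shared ngram-mod pool. The implementation-selection flag is shown as a placeholder (`--spec-type`); check `llama-server --help` for the exact name:

```shell
# Draft model plus ngram-mod; draftless drafts take precedence.
# NOTE: --spec-type is a placeholder flag name; verify with --help.
llama-server \
  -m  models/target.gguf \
  -md models/draft.gguf \
  --spec-type ngram-mod
```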
You can mix a draft model with draftless decoding; the draftless drafts take precedence.

Examples
Code Generation with Draft Model
Code Refactoring with Pattern Matching
Reasoning Model
High-Speed Server
Performance Monitoring
Speculative decoding prints statistics to help tune performance.

Example Output
Metrics Explained
- acceptance rate: Proportion of draft tokens accepted by main model (higher is better)
- #calls(b,g,a): Number of calls for begin/generation/accumulation
- #gen drafts: Number of draft sequences generated
- #acc drafts: Number of drafts partially/fully accepted
- #gen tokens: Total tokens generated (including rejected)
- #acc tokens: Tokens accepted by main model
- dur(b,g,a): Durations in milliseconds for begin/generation/accumulation
Tuning Tips
High acceptance rate (>60%): good configuration; consider increasing `--draft-max` for more speedup
Low acceptance rate (<40%): try:

- Decrease `--draft-max`
- Increase `--draft-p-min` (more conservative)
- Choose a different draft model
- Switch to a pattern-based method
- Increase `--spec-ngram-size-n` for longer patterns
- Adjust `--spec-ngram-min-hits` based on repetition
Selecting a Draft Model
Requirements
Good draft models:

- Are much smaller than the main model (5-20x smaller)
- Use the same tokenizer as the main model
- Come from the same or a similar architecture family
Recommended Pairs
| Main Model | Draft Model | Speedup |
|---|---|---|
| Llama 3.1 70B | Llama 3.2 1B | ~2-3x |
| Llama 3.1 70B | Llama 3.2 3B | ~2-2.5x |
| Qwen2.5 72B | Qwen2.5 7B | ~2-2.5x |
| DeepSeek Coder 33B | Qwen2.5 Coder 1.5B | ~2-3x |
| Mixtral 8x7B | Mistral 7B | ~1.5-2x |
Using Pre-configured Pairs
Some llama-server flags load pre-configured model pairs.

Advanced Configuration
Token Replacement
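A hedged sketch using llama-server's `--spec-replace` option, which maps a target-model token string to the draft model's equivalent; the token strings below are purely illustrative:

```shell
# Map a target-model special token to the draft model's equivalent
# so drafts verify correctly despite tokenizer differences.
# The token strings are illustrative placeholders.
llama-server -m models/target.gguf -md models/draft.gguf \
  --spec-replace "<|im_start|>" "<|begin|>"
```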
Use token replacement when the main and draft models have incompatible tokenizers.

MoE Models
Additional options apply to Mixture-of-Experts draft models.

Benchmarking
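One simple approach, assuming the standard llama-server `/completion` endpoint and its `timings` response field: start the server with and without speculation, send an identical request to each, and compare the reported generation speed.

```shell
# Query a running llama-server and print tokens/sec for generation.
# Repeat against a server started with and without speculative decoding.
curl -s http://localhost:8080/completion \
  -d '{"prompt": "Write a quicksort in Python.", "n_predict": 256}' \
  | jq '.timings.predicted_per_second'
```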
To measure speculative decoding effectiveness, compare generation speed with and without speculation enabled.

Troubleshooting
Low Acceptance Rate
Issue: draft tokens are frequently rejected

Solutions:
- Verify the draft model uses the same tokenizer
- Try a different draft model
- Reduce `--draft-max`
- Increase `--draft-p-min`
- Check whether the task suits speculative decoding
No Speedup or Slowdown
Issue: performance is worse with speculation

Solutions:
- Draft model too large (should be 5-20x smaller)
- Ensure both models are on the GPU: `-ngl 99 -ngld 99`
- Reduce `--draft-max`
- Try a pattern-based method instead
- The task may not have predictable patterns
Memory Issues
Issue: out of memory with a draft model

Solutions:
- Use a smaller draft model
- Quantize the draft model's KV cache: `-ctkd q8_0 -ctvd q8_0`
- Keep the draft model on CPU, main model on GPU
- Reduce the draft context size: `-cd 1024`
- Use a pattern-based method (no draft model)
Pattern Methods Not Working
Issue: ngram methods show no speedup

Solutions:
- Increase the context size (patterns need history)
- Adjust `--spec-ngram-size-n` and `--spec-ngram-size-m`
- Try a different ngram implementation
- The task may lack repetitive patterns
- Use a draft model instead
Performance Tips
- Start conservative: begin with `--draft 8` and increase based on acceptance rate
- Monitor acceptance: aim for a >50% acceptance rate for worthwhile speedup
- GPU both models: put both the main and draft models on the GPU for best performance
- Match context: the draft model's context should be sufficient for the current task
- Profile different methods: test multiple implementations for your use case
- Combine methods: mix a draft model with pattern matching for a hybrid approach
See Also
Server
Server configuration and API
CLI Tool
Command-line usage
Model Quantization
Optimize draft models
Performance Guide
General optimization tips

