Why Use Dictionaries?
Traditional compression algorithms rely on finding repetitive patterns within a single file. Small files often don’t have enough repetition to compress well. Dictionaries solve this by:
- Pre-learning common patterns from sample data
- Sharing patterns across files that have similar structures
- Often improving the compression ratio by 2x or more for small files
- Reducing per-file header overhead by supplying precomputed entropy tables
Training a Dictionary
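Before using dictionary compression, you need to train a dictionary on representative samples.
Using the Command Line
The easiest way to create a dictionary is with the zstd CLI: the --train option takes the sample files and -o names the output dictionary, as in zstd --train samples/* -o dictionary (both paths are placeholders).
Using the API
You can also train dictionaries programmatically using the zdict.h API, whose basic entry point is ZDICT_trainFromBuffer().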
Training Guidelines
- Sample count: Provide a few thousand samples, with a combined size of roughly 100x the target dictionary size
- Dictionary size: About 100 KB is a reasonable default
- Sample quality: Use representative data similar to what you’ll compress
- Similarity: Samples should share common structures or patterns
Compressing with a Dictionary
Once you have a trained dictionary, use it for compression.
Load the dictionary
Create a ZSTD_CDict object from the dictionary file. Load the dictionary once and reuse it for multiple compressions.
Complete Example
See examples/dictionary_compression.c in the zstd repository for the full program.
Decompressing with a Dictionary
Decompression requires the same dictionary used for compression.
Verify dictionary ID
Optionally verify that the dictionary matches the one the frame was compressed with. Zstd writes the dictionary ID into the frame header by default.
Complete Example
See examples/dictionary_decompression.c in the zstd repository for the full program.
Advanced Dictionary Training
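For more control over dictionary training, use the advanced API. Functions such as ZDICT_trainFromBuffer_cover() and ZDICT_optimizeTrainFromBuffer_cover() are configured through a ZDICT_cover_params_t structure; zdict.h declares them only when ZDICT_STATIC_LINKING_ONLY is defined.
Raw Content Dictionaries
You can use any buffer as a raw content dictionary without training, for example by passing it to ZSTD_compress_usingDict() and ZSTD_decompress_usingDict().
Performance Tips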
- Reuse dictionary objects: Create ZSTD_CDict/ZSTD_DDict once and reuse for multiple operations
- Match compression level: Train dictionaries at the compression level you’ll use in production
- Update periodically: Retrain dictionaries as your data evolves
- Test effectiveness: Use zstd -b to benchmark with and without the dictionary