The Small Data Problem
The smaller the amount of data to compress, the more difficult it is to compress. This problem is common to all compression algorithms. Why? Compression algorithms learn from past data how to compress future data. But at the beginning of a new data set, there is no “past” to build upon.
From README.md:86-87
How Dictionary Training Works
Zstandard offers a training mode which can be used to tune the algorithm for a selected type of data by providing it with a few samples (one file per sample). The result of this training is stored in a file called a “dictionary”, which must be loaded before compression and decompression.
From README.md:88-90
When to Use Dictionaries
Dictionary training works best when:
- There is correlation in a family of small data samples
- Each data sample is small, on the order of 1 KB
- You have multiple samples of the same data type
README.md:102-103
Effectiveness Range
Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file.
From README.md:104-105
Training a Dictionary
Step 1: Create the Dictionary
Use the zstd --train command with a set of training samples:
zstd --train FullPathToTrainingSet/* -o dictionaryName
README.md:110 and programs/README.md:122
Advanced Training Options
The CLI provides several training algorithms:
- Cover Algorithm (--train-cover)
- Fast Cover Algorithm (--train-fastcover)
- Legacy Algorithm (--train-legacy)
programs/README.md:239-245
Dictionary Size Limits
You can control the maximum dictionary size with --maxdict=# (default: 112640 bytes):
programs/README.md:246
Using Dictionaries
Step 2: Compress with Dictionary
zstd -D dictionaryName FILE
README.md:114
Step 3: Decompress with Dictionary
zstd -D dictionaryName --decompress FILE.zst
README.md:118
Programming API
Simple Dictionary Compression
examples/dictionary_compression.c:28-36,53-56
Bulk Processing
When compressing multiple messages or blocks using the same dictionary, it’s recommended to digest the dictionary only once:
lib/zstd.h:987-992
Dictionary Decompression
lib/zstd.h:1023,1030-1036
Performance Gains
Real-world example using the github-users sample set (roughly 10K records of ~1KB each):
Compression Ratio Improvements
Dictionary compression achieves dramatically better compression ratios on small data than compression without a dictionary.
Speed Improvements
These compression gains are achieved while simultaneously providing faster compression and decompression speeds.
From README.md:100
The dictBuffer can be released after ZSTD_CDict creation, because its content is copied within the CDict.
lib/zstd.h:993
Dictionary ID
You can control the dictionary ID recorded in the frame header with --dictID=# (default: random):
programs/README.md:247
Querying Dictionary IDs
lib/zstd.h:1047,1053,1059,1071
Command Line Dictionary Options
In compression:
programs/README.md:163,255