Dictionary Training

Dictionary training is a powerful feature in Zstandard designed to dramatically improve compression ratios for small data, while simultaneously providing faster compression and decompression speeds.

The Small Data Problem

The smaller the amount of data to compress, the more difficult it is to compress. This problem is common to all compression algorithms. Why? Compression algorithms learn from past data how to compress future data. But at the beginning of a new data set, there is no “past” to build upon. From README.md:86-87

How Dictionary Training Works

Zstandard offers a training mode which can be used to tune the algorithm for a selected type of data by providing it with a few samples (one file per sample). The result of this training is stored in a file called a “dictionary”, which must be loaded before compression and decompression. From README.md:88-90
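The command-line tool reads one file per sample. At the library level, the trainer declared in zdict.h (ZDICT_trainFromBuffer) takes the same samples as one concatenated buffer plus an array of per-sample sizes. The sketch below shows only that packing step, under the assumption that the samples are already in memory. pack_samples is a hypothetical helper name, not part of zstd, and the sketch does not call the trainer itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a zstd function): concatenate nbSamples samples
 * into one malloc'ed buffer and copy each sample's size into sizesOut, which
 * the caller provides with room for nbSamples entries.
 * Returns the packed buffer (the caller frees it), or NULL if the total size
 * is zero or the allocation fails. The total size is written to *totalOut. */
static void* pack_samples(const void* const* samples, const size_t* sampleSizes,
                          unsigned nbSamples, size_t* sizesOut, size_t* totalOut)
{
    size_t total = 0;
    for (unsigned i = 0; i < nbSamples; i++) total += sampleSizes[i];
    *totalOut = total;
    if (total == 0) return NULL;

    char* const packed = (char*)malloc(total);
    if (packed == NULL) return NULL;

    size_t pos = 0;
    for (unsigned i = 0; i < nbSamples; i++) {
        /* Skip empty samples so memcpy never sees a possibly-NULL source. */
        if (sampleSizes[i] > 0) memcpy(packed + pos, samples[i], sampleSizes[i]);
        sizesOut[i] = sampleSizes[i];
        pos += sampleSizes[i];
    }
    assert(pos == total);  /* every byte of every sample was copied */
    return packed;
}
```

The returned buffer and sizesOut have the shape of the samplesBuffer and samplesSizes arguments of ZDICT_trainFromBuffer, with nbSamples passed alongside them.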

When to Use Dictionaries

Dictionary training works best when:

The samples belong to one family of data and are correlated with each other
Each sample is small, on the order of 1 KB
You have multiple samples of the same data type to train on

There is no universal dictionary. The more data-specific a dictionary is, the more efficient it is. Deploy one dictionary per type of data for greatest benefits.

From README.md:102-103

Effectiveness Range

Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file. From README.md:104-105

Training a Dictionary

Step 1: Create the Dictionary

Use the zstd --train command with a set of training samples:

zstd --train FullPathToTrainingSet/* -o dictionaryName

From README.md:110 and programs/README.md:122

Advanced Training Options

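The CLI provides several training algorithms, each tuned through comma-separated parameters. In the examples below, k is the segment size, d is the dmer size, steps is the number of steps used when optimizing the parameters, f is the log of the frequency array size, accel is the acceleration level (higher is faster but less accurate), and s is the selectivity of the legacy builder: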

Cover Algorithm

zstd --train-cover=k=1024,d=8,steps=256 samples/* -o dict

Fast Cover Algorithm

zstd --train-fastcover=k=1024,d=8,f=20,steps=256,accel=5 samples/* -o dict

Legacy Algorithm

zstd --train-legacy=s=9 samples/* -o dict

From programs/README.md:239-245

Dictionary Size Limits

You can control the maximum dictionary size:

zstd --train samples/* -o dict --maxdict=65536

Default maximum dictionary size is 112640 bytes. From programs/README.md:246
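When choosing a value for --maxdict, it can help to check that the training set is large enough for the requested size. The sketch below assumes a rule of thumb of roughly 100 times the dictionary size in total sample bytes, which is how the comments in zdict.h describe a reasonable training set. Treat that ratio as an assumption and a guideline only: zstd does not enforce it, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Default --maxdict value quoted in the text above. */
#define DEFAULT_MAX_DICT_BYTES ((size_t)112640)

/* ASSUMPTION: "about 100x the dictionary size" is a rule of thumb taken from
 * the trainer's header comments, not a limit enforced by zstd. */
#define SAMPLES_TO_DICT_RATIO ((size_t)100)

/* Total sample bytes suggested by the rule of thumb for a given dictionary size. */
static size_t recommended_training_bytes(size_t maxDictBytes)
{
    assert(maxDictBytes <= SIZE_MAX / SAMPLES_TO_DICT_RATIO);  /* no overflow */
    return maxDictBytes * SAMPLES_TO_DICT_RATIO;
}

/* Returns 1 if the training set meets the rule of thumb, 0 otherwise. */
static int training_set_is_large_enough(size_t totalSampleBytes, size_t maxDictBytes)
{
    return totalSampleBytes >= recommended_training_bytes(maxDictBytes);
}
```

For the default size this suggests about 11 MB of samples. A smaller --maxdict lowers the amount of training data the rule of thumb asks for.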

Using Dictionaries

Step 2: Compress with Dictionary

zstd -D dictionaryName FILE

From README.md:114

Step 3: Decompress with Dictionary

zstd -D dictionaryName --decompress FILE.zst

From README.md:118

Programming API

Simple Dictionary Compression

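// Load the dictionary file into memory.
// mallocAndLoadFile_orDie() is a helper from the zstd examples (examples/common.h).
size_t dictSize;
void* const dictBuffer = mallocAndLoadFile_orDie(dictFileName, &dictSize);

// Digest the dictionary once into a reusable ZSTD_CDict.
ZSTD_CDict* const cdict = ZSTD_createCDict(dictBuffer, dictSize, cLevel);
if (cdict == NULL) { fprintf(stderr, "ZSTD_createCDict() failed\n"); exit(1); }

// Compress fBuff (fSize bytes) into cBuff (capacity cBuffSize) using the dictionary.
ZSTD_CCtx* const cctx = ZSTD_createCCtx();
if (cctx == NULL) { fprintf(stderr, "ZSTD_createCCtx() failed\n"); exit(1); }
size_t const cSize = ZSTD_compress_usingCDict(
    cctx, cBuff, cBuffSize,
    fBuff, fSize,
    cdict
);
if (ZSTD_isError(cSize)) {
    fprintf(stderr, "compression failed: %s\n", ZSTD_getErrorName(cSize));
    exit(1);
}

// Cleanup: release the context, the digested dictionary, and the raw buffer.
ZSTD_freeCCtx(cctx);
ZSTD_freeCDict(cdict);
free(dictBuffer);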

From examples/dictionary_compression.c:28-36,53-56

Bulk Processing

When compressing multiple messages or blocks with the same dictionary, it is recommended to digest the dictionary only once with ZSTD_createCDict and to reuse the resulting ZSTD_CDict for every compression call:

ZSTD_CDict* ZSTD_createCDict(
    const void* dictBuffer, 
    size_t dictSize,
    int compressionLevel
);

ZSTD_CDict can be created once and shared by multiple threads concurrently, since its usage is read-only.

From lib/zstd.h:987-992

Dictionary Decompression

ZSTD_DDict* ZSTD_createDDict(const void* dictBuffer, size_t dictSize);

size_t ZSTD_decompress_usingDDict(
    ZSTD_DCtx* dctx,
    void* dst, size_t dstCapacity,
    const void* src, size_t srcSize,
    const ZSTD_DDict* ddict
);

From lib/zstd.h:1023,1030-1036

Performance Gains

The gains described below were measured on the github-users sample set (roughly 10K records of ~1KB each).

Compression Ratio Improvements

On this sample set, compressing with a dictionary achieves a markedly higher compression ratio than compressing the same small records without one.

Speed Improvements

These compression gains are achieved while simultaneously providing faster compression and decompression speeds. From README.md:100

A note on memory when using the API: the dictBuffer passed to ZSTD_createCDict can be released as soon as the ZSTD_CDict has been created, because its content is copied into the CDict.

From lib/zstd.h:993

Dictionary ID

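You can set the dictionary ID at training time. The ID is stored in the dictionary, and by default frames compressed with that dictionary carry it in their frame header, so a decoder can tell which dictionary is required: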

zstd --train samples/* -o dict --dictID=12345

By default, a random dictionary ID is assigned. From programs/README.md:247

Querying Dictionary IDs

unsigned ZSTD_getDictID_fromDict(const void* dict, size_t dictSize);
unsigned ZSTD_getDictID_fromCDict(const ZSTD_CDict* cdict);
unsigned ZSTD_getDictID_fromDDict(const ZSTD_DDict* ddict);
unsigned ZSTD_getDictID_fromFrame(const void* src, size_t srcSize);

From lib/zstd.h:1047,1053,1059,1071
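To make it concrete where these IDs live, the sketch below reads them straight from the bytes, following the layout in the Zstandard format specification: a dictionary starts with a 4-byte magic number followed by a 4-byte little-endian ID, and a frame stores an optional 0, 1, 2 or 4-byte ID after its frame header descriptor. The function names are hypothetical and the code is an illustration only. Real programs should call the library functions listed above, which likewise return 0 when no ID is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DICT_MAGIC  0xEC30A437u  /* magic number of a zstd dictionary */
#define FRAME_MAGIC 0xFD2FB528u  /* magic number of a zstd frame */

/* Read an n-byte little-endian unsigned value, n <= 4. */
static uint32_t read_le(const unsigned char* p, size_t n)
{
    assert(n <= 4);
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}

/* Sketch of what ZSTD_getDictID_fromDict reports; 0 means "no ID found". */
static unsigned dict_id_from_dict(const void* dict, size_t dictSize)
{
    const unsigned char* const p = (const unsigned char*)dict;
    if (dictSize < 8) return 0;
    if (read_le(p, 4) != DICT_MAGIC) return 0;  /* raw content, not a trained dictionary */
    return read_le(p + 4, 4);
}

/* Sketch of what ZSTD_getDictID_fromFrame reports; 0 means "no ID found". */
static unsigned dict_id_from_frame(const void* src, size_t srcSize)
{
    static const size_t idFieldSize[4] = { 0, 1, 2, 4 };
    const unsigned char* const p = (const unsigned char*)src;
    if (srcSize < 5) return 0;
    if (read_le(p, 4) != FRAME_MAGIC) return 0;
    unsigned const descriptor = p[4];
    /* Bits 0-1 select the size of the Dictionary_ID field. */
    size_t const idSize = idFieldSize[descriptor & 3];
    /* Bit 5 (Single_Segment) set means there is no Window_Descriptor byte. */
    size_t const idPos = 5 + (((descriptor >> 5) & 1) ? 0 : 1);
    if (idSize == 0 || srcSize < idPos + idSize) return 0;
    return read_le(p + idPos, idSize);
}
```

A return value of 0 from the frame reader does not necessarily mean that no dictionary was used: the ID can be omitted from the frame, in which case the decoder has to know which dictionary to load by other means.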

Command Line Dictionary Options

In compression:

zstd FILE -D dictionaryName

In decompression:

zstd --decompress FILE.zst -D dictionaryName

In benchmark mode:

zstd -b -D dictionary files

From programs/README.md:163,255
