Dictionary Training

Dictionary training analyzes sample data to create optimized dictionaries for compressing similar files. The resulting dictionary can substantially improve compression ratios for small data, where a single input is too short for the compressor to learn its patterns.

ZDICT_trainFromBuffer()

Train a dictionary from an array of samples using the fast COVER algorithm.

size_t ZDICT_trainFromBuffer(
    void* dictBuffer,
    size_t dictBufferCapacity,
    const void* samplesBuffer,
    const size_t* samplesSizes,
    unsigned nbSamples
);

dictBuffer

void*

Output buffer where the trained dictionary will be stored.

dictBufferCapacity

size_t

Maximum size of the output dictionary buffer. Recommended: ~100 KB.

samplesBuffer

const void*

Input buffer containing all samples concatenated together.

samplesSizes

const size_t*

Array containing the size of each sample, in order.

nbSamples

unsigned

Number of samples provided. Recommended: a combined sample size of about 100x the target dictionary size.

Returns

Size of dictionary stored into dictBuffer (<= dictBufferCapacity), or an error code which can be tested with ZDICT_isError().

Notes

This function redirects to ZDICT_optimizeTrainFromBuffer_fastCover(), single-threaded, with default parameters (d=8, steps=4, f=20, accel=1).
Memory usage is about 6 MB.
Training will fail if there are not enough samples or if most samples are too small (8 bytes is the lower limit).
A few thousand samples are recommended, with a combined size of about 100x the target dictionary size.
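
The sizing guidance in these notes can be checked before calling the trainer. The following is a minimal sketch, not part of zdict.h: `SampleReport` and `checkSamples` are hypothetical names, and the 8-byte and 100x thresholds are taken from the notes above rather than from any library constant.

```c
#include <stddef.h>

/* Hypothetical helper, not part of zdict.h: summarizes a sample set
 * against the guidance in the notes (8-byte minimum, ~100x total size). */
typedef struct {
    size_t   totalSize;    /* sum of all sample sizes */
    unsigned tinySamples;  /* number of samples shorter than 8 bytes */
    int      enoughData;   /* 1 if totalSize >= 100 * dictCapacity */
} SampleReport;

static SampleReport checkSamples(const size_t* samplesSizes,
                                 unsigned nbSamples,
                                 size_t dictCapacity)
{
    SampleReport r = { 0, 0, 0 };
    for (unsigned i = 0; i < nbSamples; i++) {
        r.totalSize += samplesSizes[i];
        if (samplesSizes[i] < 8) r.tinySamples++;
    }
    r.enoughData = (r.totalSize >= 100 * dictCapacity);
    return r;
}
```

A caller might warn when `tinySamples` is a large share of `nbSamples` or when `enoughData` is 0. Training may still succeed in either case, so this is a hint and not a gate.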

Example

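#include <stdio.h>
#include <stdlib.h>
#include <zdict.h>

#define DICT_SIZE (100 * 1024)  // 100 KB

// samplesBuffer holds all samples back to back;
// sampleSizes[i] is the size in bytes of sample i.
// Returns a malloc'd dictionary (caller frees), or NULL on failure.
void* trainDictionary(const void* samplesBuffer, const size_t* sampleSizes,
                      unsigned nbSamples, size_t* dictSizeOut)
{
    void* dictBuffer = malloc(DICT_SIZE);
    if (dictBuffer == NULL) return NULL;

    size_t dictSize = ZDICT_trainFromBuffer(
        dictBuffer, DICT_SIZE,
        samplesBuffer, sampleSizes, nbSamples
    );

    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "Dictionary training failed: %s\n",
                ZDICT_getErrorName(dictSize));
        free(dictBuffer);
        return NULL;
    }
    *dictSizeOut = dictSize;  // actual size, <= DICT_SIZE
    return dictBuffer;
}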
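
The samplesBuffer and samplesSizes parameters expect all samples laid out back to back, with one size entry per sample. When samples start out as separate buffers, they need to be packed first. The sketch below shows one way to do that; `packSamples` is a hypothetical helper, not part of zdict.h.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of zdict.h: copies nbSamples separate
 * buffers into one contiguous allocation, in order.
 * Returns NULL on allocation failure; the caller frees the result. */
static void* packSamples(const void* const* samples,
                         const size_t* sizes,
                         unsigned nbSamples,
                         size_t* totalSizeOut)
{
    size_t total = 0;
    for (unsigned i = 0; i < nbSamples; i++) total += sizes[i];

    /* Allocate at least 1 byte so an empty set still yields a valid pointer. */
    char* packed = malloc(total ? total : 1);
    if (packed == NULL) return NULL;

    size_t offset = 0;
    for (unsigned i = 0; i < nbSamples; i++) {
        memcpy(packed + offset, samples[i], sizes[i]);
        offset += sizes[i];
    }
    *totalSizeOut = total;
    return packed;
}
```

The returned pointer can be passed as samplesBuffer, and the same `sizes` array as samplesSizes, since the copy preserves sample order.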

ZDICT_finalizeDictionary()

Convert raw dictionary content into a zstd dictionary by adding headers and entropy tables.

size_t ZDICT_finalizeDictionary(
    void* dstDictBuffer,
    size_t maxDictSize,
    const void* dictContent,
    size_t dictContentSize,
    const void* samplesBuffer,
    const size_t* samplesSizes,
    unsigned nbSamples,
    ZDICT_params_t parameters
);

dstDictBuffer

void*

Output buffer for the finalized dictionary. Can overlap with dictContent.

maxDictSize

size_t

Maximum size of the output dictionary. Must be >= max(dictContentSize, ZDICT_DICTSIZE_MIN).

dictContent

const void*

Raw dictionary content (can be from any source, not just zstd training).

dictContentSize

size_t

Size of the raw dictionary content.

samplesBuffer

const void*

Buffer containing concatenated samples for building entropy tables.

samplesSizes

const size_t*

Array of sizes for each sample.

nbSamples

unsigned

Number of samples provided.

parameters

ZDICT_params_t

Dictionary parameters:

compressionLevel: Optimize for specific compression level (0 = default)
notificationLevel: Log verbosity (0-4, where 0 = none)
dictID: Force specific dictionary ID (0 = auto-generate random ID)

Returns

Size of dictionary stored into dstDictBuffer (<= maxDictSize), or an error code which can be tested with ZDICT_isError().

Notes

Adds zstd header with magic number, dictionary ID, and entropy tables
Samples are used to construct statistics for the compression level specified
If header + content doesn't fit in maxDictSize, the beginning of the content is truncated to make room, since the most profitable content is presumed to be at the end of the dictionary
May fail if not enough samples, samples are uncompressible, or all samples are identical

Example

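#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zdict.h>

#define MAX_DICT_SIZE (110 * 1024)

// Convert raw content to a zstd dictionary.
// rawDict: custom dictionary content; samples: concatenated samples;
// sampleSizes[i]: size in bytes of sample i.
// Returns a malloc'd dictionary (caller frees), or NULL on failure.
void* finalizeRawDict(const void* rawDict, size_t rawDictSize,
                      const void* samples, const size_t* sampleSizes,
                      unsigned nbSamples, size_t* dictSizeOut)
{
    ZDICT_params_t params;
    memset(&params, 0, sizeof(params));
    params.compressionLevel = 3;
    params.notificationLevel = 2;  // Show progress
    params.dictID = 0;  // Auto-generate

    void* dictBuffer = malloc(MAX_DICT_SIZE);
    if (dictBuffer == NULL) return NULL;

    size_t dictSize = ZDICT_finalizeDictionary(
        dictBuffer, MAX_DICT_SIZE,
        rawDict, rawDictSize,
        samples, sampleSizes, nbSamples,
        params
    );

    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "Failed: %s\n", ZDICT_getErrorName(dictSize));
        free(dictBuffer);
        return NULL;
    }
    *dictSizeOut = dictSize;  // actual size, <= MAX_DICT_SIZE
    return dictBuffer;
}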

Helper Functions

ZDICT_getDictID()

Extract the dictionary ID from a dictionary buffer.

unsigned ZDICT_getDictID(const void* dictBuffer, size_t dictSize);

Returns the dictionary ID, or 0 if the buffer is not a valid zstd dictionary.
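
To illustrate what this lookup involves: per the zstd format specification, a dictionary starts with a 4-byte little-endian magic number (0xEC30A437) followed by a 4-byte little-endian dictionary ID. The sketch below reads those two fields by hand. `peekDictID` and `readLE32` are hypothetical names for illustration only; production code should call ZDICT_getDictID() itself.

```c
#include <stddef.h>
#include <stdint.h>

#define DICT_MAGIC 0xEC30A437u  /* dictionary magic number from the zstd format spec */

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t readLE32(const unsigned char* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical stand-in for ZDICT_getDictID(): returns the ID, or 0 if
 * the buffer is too short or does not start with the magic number. */
static unsigned peekDictID(const void* dictBuffer, size_t dictSize)
{
    const unsigned char* p = (const unsigned char*)dictBuffer;
    if (dictSize < 8) return 0;
    if (readLE32(p) != DICT_MAGIC) return 0;
    return readLE32(p + 4);
}
```

A return value of 0 therefore covers both raw content that was never finalized and buffers too short to hold a header.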

ZDICT_isError()

Test if a return value indicates an error.

unsigned ZDICT_isError(size_t errorCode);

Returns 1 if error, 0 otherwise.

ZDICT_getErrorName()

Get a human-readable error message.

const char* ZDICT_getErrorName(size_t errorCode);

Returns a string describing the error.
