Overview
The LZ77 compression module provides functions to compress data by replacing repeated sequences with references to their previous occurrences. Each reference contains an offset and length component, with a maximum distance of 32,768 bytes and maximum length of 258 bytes.Data Structures
lz_token_t
Represents either a literal byte or a length-distance pair in the LZ77 compressed stream.Flag indicating token type: 1 for literal byte, 0 for length-distance pair
The literal byte value (only valid when
is_literal == 1). Set to 0 for length-distance pairsLength component of the match (only valid when
is_literal == 0). Range: 3-258 bytes. Set to 0 for literalsDistance/offset to the start of the repeated sequence (only valid when
is_literal == 0). Range: 1-32768 bytes. Set to 0 for literalsConstants
Functions
lz_compress_tokens
Compresses input data using the LZ77 algorithm and returns an array of tokens.Pointer to the input data buffer to compress
Length of the input data in bytes
Output parameter that receives the number of tokens in the returned array
Pointer to a dynamically allocated array of
lz_token_t structures on success, or NULL on errorBehavior
- Searches for the longest matching substring within the previous 32,768 bytes of the output buffer
- If a match of at least 3 bytes is found, creates a length-distance token
- If no match of at least 3 bytes is found, creates literal byte tokens
- The compressor must find the longest possible match within the sliding window
Error Conditions
- Returns
NULLifdataisNULL - Returns
NULLif illegal distance codes are encountered - Returns
NULLon any invalid input data
Memory Management
The returned array is dynamically allocated usingmalloc(). The caller is responsible for freeing the memory using free() when done.
Example
lz_to_length_distance_codes
Converts raw data to length-distance code representation.Pointer to the input data buffer
Length of the input data in bytes
Output parameter that receives the length of the encoded output
Pointer to a dynamically allocated buffer containing the encoded data, or
NULL on errorDecompression
LZ77 decompression is performed within thehuffman_decode() function (see Huffman API). The decompression process:
- Reads length-distance pairs from the Huffman-decoded stream
- Copies bytes from the output buffer at the specified distance
- Handles cases where the match extends beyond the sliding window
Sliding Window Overflow
When a length-distance pair references beyond the current position (e.g., distance=1, length=5):- Copy the available bytes from the output buffer (n bytes)
- Continue copying with the new position until all length bytes are copied
- This effectively repeats the pattern byte-by-byte
Invalid Distance Codes
Distance codes 30 and 31 are reserved and invalid. The decompressor must return an error if these codes are encountered.Implementation Notes
- The sliding window size is fixed at 32,768 bytes (32 KB)
- Minimum match length is 3 bytes
- Maximum match length is 258 bytes
- The compressor should use an efficient search algorithm (e.g., hash table) to find matches
- All multi-byte values are stored in little-endian format