Overview
This page covers the implementation of LZ77 compression and decompression inlz.c. The LZ77 layer is the first compression step in DEFLATE and the last decompression step in INFLATE.
Required Functions
You must implement inlz.c:
Compression Function
Helper Function
Decompression
LZ77 decompression is performed within thehuffman_decode() function (see Part 3), since the data is Huffman-encoded before LZ77 decoding.
Data Structures
LZ77 Token
Fromlz.h:8-13:
Constants
Fromlz.h:5-6:
Compression Implementation
Function Signature
data- Input data to compresslen- Length of input data in bytesnum_tokens- Output parameter: number of tokens generated
- Pointer to malloc’d array of
lz_token_ton success NULLon error (NULL pointers, invalid data, etc.)
Algorithm
Allocate token array
Allocate space for tokens. In the worst case, every byte becomes a literal token, so allocate
len tokens.Find longest match
Use
find_match() to search the previous 32,768 bytes for the longest matching sequenceEmit token
- If match length ≥ 3: emit match token, advance by match length
- If match length < 3: emit literal token, advance by 1 byte
Pseudocode
Constraints
- Maximum distance: 32,768 bytes
- Minimum match length: 3 bytes
- Maximum match length: 258 bytes
- Return NULL on invalid data: NULL pointers, illegal values, etc.
Match Finding Implementation
Function Signature
data- Input data bufferpos- Current position in the datalen- Total length of dataoffset_limit- Maximum distance to search back (typically 32768 orpos)out_offset- Output parameter: distance to the best match found
- Length of the longest match found (0 if no match)
- Sets
*out_offsetto the distance of the best match
Algorithm
Update best match
If this match is longer than the current best, update best_length and best_distance
Pseudocode
Optimization Strategies
Hash Table Approach
Hash Table Approach
Instead of checking every position:
- Build a hash table mapping byte sequences to positions
- For the current position, hash the next 3 bytes
- Look up positions in the hash table
- Only check positions that hash to the same value
- Update hash table as you progress
Early Termination
Early Termination
If you find a match of length 258 (the maximum), stop searching - you can’t do better.
Lazy Matching
Lazy Matching
After finding a match, look ahead one position to see if a better match exists:
- If the next position has a longer match, use that instead
- This can improve compression ratio but requires more computation
Decompression Implementation
LZ77 decompression is performed withinhuffman_decode() in huff.c, because the Huffman layer must be decoded first to obtain the LZ77 tokens.
Decompression Logic
Whenhuffman_decode() produces a code:
Handling Wrap-Around
Whendistance < length, the copy wraps around:
By copying one byte at a time and incrementing the position, newly copied bytes become available for subsequent iterations. This creates run-length encoding.
Cross-Block References
Fromhuff.h:9-10:
history parameter provides access to previously decompressed data:
Invalid Distance Codes
From README line 257:
Although distance values of 30 and 31 can technically be expressed, your lz_decompress function should stop and return an error if it encounters them.
Length and Distance Code Tables
LZ77 lengths and distances are encoded using special codes plus extra bits.Length Codes (257-285)
From README lines 404-416:Distance Codes (0-29)
From README lines 418-430:Decoding Lengths and Distances
Helper functions inhuff.c:
Error Handling
Compression Errors
ReturnNULL if:
datais NULLnum_tokensis NULL- Memory allocation fails
- Invalid distance code generated
Decompression Errors
ReturnNULL if:
- Invalid distance code (30 or 31) encountered
- Distance exceeds available history + output
- Length + position exceeds buffer size
- Invalid length code
Testing LZ77
Test Cases
Simple Repetition
Simple Repetition
Input: Or:
"AAAAAA"Expected tokens:Text Repetition
Text Repetition
Input:
"the quick brown fox jumps over the lazy"Expected tokens:
Mostly literals, but the second “the” should be a match:No Matches
No Matches
Input:
"ABCDEFGH" (random data)Expected tokens:
All literals - no matches possibleMaximum Distance
Maximum Distance
Create data with a match exactly 32,768 bytes apart to test maximum distance.
Maximum Length
Maximum Length
Create data with 258+ consecutive matching bytes to test maximum length constraint.
Validation
For compression testing:- Compress data into tokens
- Manually decompress tokens
- Verify output matches input
Memory Management
Summary
Key implementation points:lz_compress_tokens()converts raw data to LZ77 tokensfind_match()searches for the longest match in the sliding window- LZ77 decompression happens inside
huffman_decode() - Minimum match length: 3 bytes
- Maximum match length: 258 bytes
- Maximum distance: 32,768 bytes
- Handle wrap-around case (distance < length)
- Reject distance codes 30 and 31
- Return NULL on all error conditions
- Free allocated memory
Next Steps
- Huffman Algorithm - Next compression layer
- Huffman Implementation - Implementing Huffman coding
- DEFLATE - Combining LZ77 and Huffman