What is LZ77?
Lempel-Ziv 77 (LZ77) is a lossless data compression algorithm that works by replacing repeated sequences of data with references to earlier occurrences of that data. The algorithm was published by Abraham Lempel and Jacob Ziv in 1977 and forms the foundation of many modern compression formats including GZIP, ZIP, and PNG.How LZ77 Works
Basic Concept
LZ77 uses a “sliding window” to keep track of previously seen data. When the compressor encounters a sequence that matches earlier data, it outputs a reference instead of repeating the data. Each reference contains:- Distance - How far back the matching sequence starts (1-32,768 bytes)
- Length - How many bytes to copy from that position (3-258 bytes)
Example
Consider compressing the text:"the quick brown fox jumps over the lazy dog"
- Distance: 31 (31 characters back)
- Length: 3 (3 characters long)
t, h, e again, we store <31, 3> which takes fewer bits.
LZ77 Tokens
The LZ77 algorithm outputs a stream of tokens. Each token is either:Literal Token
A single byte that had no match or where a match wasn’t worthwhile.Match Token
A reference to previously seen data.The LZ77 Token Structure
Fromlz.h:8-13:
Compression Constraints
Sliding Window
The maximum distance for a match is 32,768 bytes (32 KB). This is the size of the “sliding window” that the algorithm can look back through.Why 32,768? This value balances compression ratio with memory usage. It’s small enough to fit in memory but large enough to find most repeated sequences.
Minimum Match Length
Matches must be at least 3 bytes long to be worthwhile. Reasoning: A match token needs to store both distance and length. If we only save 1-2 bytes, the overhead of encoding the distance/length pair would make the “compressed” data larger! For matches shorter than 3 bytes, output them as literals instead.Maximum Match Length
Matches can be at most 258 bytes long. Reasoning: This is defined by the DEFLATE specification to allow efficient encoding of the length value. Longer matches must be split into multiple match tokens.The Compression Process
Decide: match or literal?
- If match length ≥ 3: output a match token
- If match length < 3: output a literal token
Example Compression
Let’s compress the string"ABCABCABC":
Lit('A') Lit('B') Lit('C') Match(3,3) Match(3,3)
Finding the Best Match
The compressor should find the longest matching substring within the sliding window.Naive Approach
For each position, compare against every position in the previous 32,768 bytes:Greedy vs. Optimal
The compressor must decide: take the first good match found, or search for an even better match? Greedy approach: Take the longest match immediately Optimal approach: Look ahead to see if waiting might yield better overall compression For this assignment, a greedy approach is acceptable.Decompression Process
Decompression is much simpler than compression:Decompression Example
Given tokens:Lit('A') Lit('B') Lit('C') Match(3,3) Match(3,3)
The Wrap-Around Case
A special case occurs whendistance < length. This means we’re copying data that we’re currently writing!
Example: Run-Length Encoding
Consider the tokenMatch(distance=1, length=5) applied to output buffer [A]:
When distance < length, copy bytes one at a time, because each copied byte becomes available for subsequent copies.
Invalid Distance Codes
From the README (line 257):
Although distance values of 30 and 31 can technically be expressed, your lz_decompress function should stop and return an error if it encounters them.
Why LZ77?
Advantages
- Simple to decompress - Only needs a buffer and basic copy operations
- No dictionary needed - The dictionary is the recently decompressed data itself
- Adaptive - Automatically adapts to the data being compressed
- Lossless - Perfect reconstruction of original data
Limitations
- Limited look-back - Can only reference the previous 32 KB
- No long-range matches - Can’t reference data from earlier in large files
- Compression speed - Finding optimal matches can be slow
Combination with Huffman
LZ77 alone provides good compression, but it can be improved further:- LZ77 reduces redundancy by eliminating repeated sequences
- Huffman coding then compresses the LZ77 output by using fewer bits for common symbols
Position in DEFLATE Pipeline
- Compression (DEFLATE)
- Decompression (INFLATE)
The Lempel-Ziv scheme is the lowest-level compression scheme in the algorithm, so it is run first when compressing and last when decompressing.
Summary
LZ77 compression:- Replaces repeated sequences with (distance, length) references
- Uses a 32 KB sliding window
- Requires minimum 3-byte matches
- Allows maximum 258-byte matches
- Outputs a stream of literal and match tokens
- Handles wrap-around (distance < length) for run-length encoding
- Rejects invalid distance codes 30 and 31
Next Steps
Now that you understand the LZ77 algorithm, proceed to:- LZ77 Implementation - How to implement compression and decompression
- Huffman Algorithm - The next layer of compression