What is Huffman Coding?
Huffman coding is a lossless data compression algorithm that assigns variable-length bit codes to symbols based on their frequency. Frequently occurring symbols get shorter codes, while rare symbols get longer codes. The algorithm was developed by David A. Huffman in 1952 while he was a Ph.D. student at MIT.
How Huffman Coding Works
Basic Concept
In normal (uncompressed) data, each byte is represented by 8 bits. But if we know some bytes occur more frequently than others, we can use fewer bits for common bytes and more bits for rare bytes, achieving overall compression.
Example
Consider the string: "AAAAAABBBCCD"
Frequency count:
- A: 6 times
- B: 3 times
- C: 2 times
- D: 1 time
Assigned codes:
- A: 0 (1 bit) - most frequent
- B: 10 (2 bits)
- C: 110 (3 bits)
- D: 111 (3 bits) - least frequent
Encoded: 000000 (6 A’s) + 101010 (3 B’s) + 110110 (2 C’s) + 111 (1 D) = 21 bits
Compression ratio: 21 / 96 ≈ 22% of the original size!
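The arithmetic above can be checked with a short counting sketch; encoded_bits() is an illustrative name, not part of the assignment:

```c
/* Sketch: count symbol frequencies in the example string and total the
   encoded bits, using the code lengths derived above (A=1, B=2, C=3, D=3). */
static int encoded_bits(const char *s)
{
    int freq[256] = {0};
    for (const char *p = s; *p; p++)
        freq[(unsigned char)*p]++;

    return freq['A'] * 1 + freq['B'] * 2 + freq['C'] * 3 + freq['D'] * 3;
}
```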
The Huffman Tree
Huffman coding uses a binary tree to determine the bit code for each symbol.
Tree Structure
From huff.h and described in README lines 272-278:
Leaf vs Internal Nodes
- Leaf nodes: Store a symbol value and its frequency, have no children
- Internal nodes: Store sum of children’s frequencies, symbol = -1
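A minimal node layout consistent with these two cases might look like the following sketch (the real definition lives in huff.h; the field names here are assumptions):

```c
#include <stddef.h>

/* Hypothetical node layout: leaves carry a symbol, internal nodes carry -1. */
typedef struct node {
    int sym;              /* byte value 0-255 for leaves, -1 for internal nodes */
    unsigned long freq;   /* leaf: symbol count; internal: sum of children */
    struct node *left;    /* 0-branch child (NULL for leaves) */
    struct node *right;   /* 1-branch child (NULL for leaves) */
} node_t;

static int is_leaf(const node_t *n)
{
    return n->left == NULL && n->right == NULL;
}
```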
Encoding from the Tree
To find the code for a symbol:
- Start at the root
- Traverse to the leaf containing the symbol
- Record each branch: left = 0, right = 1
- The sequence of 0’s and 1’s is the Huffman code
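These steps amount to a depth-first walk that accumulates path bits. A sketch, assuming a simple node struct rather than the project's actual types:

```c
#include <stddef.h>

typedef struct hnode {
    int sym;                  /* -1 for internal nodes */
    struct hnode *left, *right;
} hnode;

/* Walk the tree, recording the branch bits taken so far; at each leaf,
   store the accumulated code and its length for that symbol. */
static void walk(const hnode *n, unsigned code, int len,
                 unsigned codes[256], int lens[256])
{
    if (n->left == NULL && n->right == NULL) {   /* leaf: record the code */
        codes[n->sym] = code;
        lens[n->sym] = len;
        return;
    }
    walk(n->left,  (code << 1) | 0, len + 1, codes, lens);  /* left  = 0 */
    walk(n->right, (code << 1) | 1, len + 1, codes, lens);  /* right = 1 */
}
```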
Building a Huffman Tree
Build tree bottom-up
Repeatedly:
- Find the two nodes with lowest frequencies
- Create a parent node with frequency = sum of children
- Remove the two nodes, add the parent
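The loop above can be sketched with a simple O(n^2) minimum scan (a real implementation would typically use a min-heap; all names here are assumptions):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct bnode {
    int sym;
    unsigned long freq;
    struct bnode *left, *right;
} bnode;

/* Repeatedly merge the two lowest-frequency nodes in the pool until one
   node (the root) remains. Internal nodes are heap-allocated and not freed
   in this sketch. */
static bnode *build_tree(bnode **pool, int n)
{
    while (n > 1) {
        /* find indices a, b of the two lowest-frequency nodes */
        int a = 0, b = 1;
        if (pool[b]->freq < pool[a]->freq) { int t = a; a = b; b = t; }
        for (int i = 2; i < n; i++) {
            if (pool[i]->freq < pool[a]->freq)      { b = a; a = i; }
            else if (pool[i]->freq < pool[b]->freq) { b = i; }
        }
        /* create a parent whose frequency is the sum of its children */
        bnode *parent = malloc(sizeof *parent);
        parent->sym = -1;
        parent->freq = pool[a]->freq + pool[b]->freq;
        parent->left = pool[a];
        parent->right = pool[b];
        /* remove both children from the pool, add the parent */
        if (a > b) { int t = a; a = b; b = t; }
        pool[a] = parent;
        pool[b] = pool[--n];
    }
    return pool[0];
}
```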
Example Construction
For "AAAAAABBBCCD":
The exact tree structure can vary based on tie-breaking rules when frequencies are equal. The assignment requires a canonical Huffman code to ensure consistency.
Canonical Huffman Codes
Canonical Huffman codes ensure that trees can be reconstructed consistently from just the code lengths.
Properties
- Same lengths: Symbols with the same code length are in lexicographical order
- Consecutive codes: Codes of the same length are numerically consecutive
- Prefix-free: No code is a prefix of another
Why Canonical?
Canonical codes allow efficient serialization:
- Instead of storing the entire tree structure
- Store only the code lengths for each symbol
- Receiver can reconstruct the exact tree from lengths
A rule is added so that a Huffman tree can always be constructed the same way: all byte values with the same encoded length (i.e., at the same level of the tree) must appear in lexicographical (alphabetical) order.
Encoding with Huffman
Building Code Tables
Once you have a Huffman tree, generate code tables with the build_codes() helper function:
Encoding Data
To encode a symbol:
- Look up its code and length in the tables
- Write the code bits to the output stream (MSB first for Huffman codes)
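A sketch of MSB-first code emission, assuming DEFLATE-style packing where bits fill each output byte starting at its least significant bit (the bitwriter struct is an assumption, not the project's API):

```c
/* Hypothetical bit writer: buf must be zero-initialized by the caller. */
typedef struct {
    unsigned char *buf;
    unsigned long bitpos;   /* next absolute bit index into buf */
} bitwriter;

static void write_code(bitwriter *w, unsigned code, int len)
{
    /* emit the most significant bit of the Huffman code first */
    for (int i = len - 1; i >= 0; i--) {
        if ((code >> i) & 1u)
            w->buf[w->bitpos >> 3] |= (unsigned char)(1u << (w->bitpos & 7));
        w->bitpos++;
    }
}
```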
Decoding with Huffman
Tree-Based Decoding
To decode:
- Start at the root of the tree
- Read one bit from the input stream
- Go left (0) or right (1)
- If you reach a leaf, output the symbol and return to root
- Repeat until all data is decoded
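The loop above, sketched with the input bits pre-expanded into a 0/1 array for clarity (names and types are assumptions):

```c
#include <stddef.h>

typedef struct dnode {
    int sym;  /* -1 for internal nodes */
    struct dnode *left, *right;
} dnode;

/* Consume one bit per step; emit a symbol at each leaf and restart at
   the root. Returns the number of symbols decoded. */
static int decode(const dnode *root, const unsigned char *bits, int nbits,
                  int *out, int maxout)
{
    int n = 0;
    const dnode *cur = root;
    for (int i = 0; i < nbits; i++) {
        cur = bits[i] ? cur->right : cur->left;   /* 0 = left, 1 = right */
        if (cur->left == NULL && cur->right == NULL) {
            if (n < maxout) out[n] = cur->sym;
            n++;
            cur = root;                            /* back to root for next symbol */
        }
    }
    return n;
}
```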
Table-Based Decoding
For efficiency, use a decoder table instead of traversing the tree: the build_decoder() function creates tables for O(1) symbol lookup.
Reconstructing Trees from Code Lengths
This is a crucial part of decompression. Given only code lengths, reconstruct the Huffman tree.
Algorithm
From README lines 305-308:
Implementation
The canonical_codes() helper function:
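The standard RFC 1951 procedure for this reconstruction can be sketched as follows (the function name canonical_codes matches the README's suggestion, but this body is an assumption, not the project's implementation):

```c
#define MAX_BITS 15

/* Assign canonical codes from code lengths alone, per RFC 1951 section 3.2.2:
   count codes per length, compute the smallest code for each length, then
   hand out consecutive codes in symbol order within each length. */
static void canonical_codes(const int lens[], int n, unsigned codes[])
{
    int bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};

    /* 1) count how many codes exist at each length */
    for (int i = 0; i < n; i++)
        bl_count[lens[i]]++;
    bl_count[0] = 0;   /* length 0 means "symbol unused" */

    /* 2) numerically smallest code for each length */
    unsigned code = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* 3) consecutive codes, in symbol order, within each length */
    for (int i = 0; i < n; i++)
        if (lens[i] != 0)
            codes[i] = next_code[lens[i]]++;
}
```

Run on RFC 1951's own example lengths {3,3,3,3,3,2,4,4} for symbols A-H, this yields the codes the spec lists.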
Static vs Dynamic Huffman
The DEFLATE format supports two types of Huffman coding:
Static Huffman (BTYPE = 01)
Uses predefined codes specified in RFC 1951. From README lines 467-477: the fixed 5-bit codes run from 00000 through 11101 (codes 0-29).
Advantages:
- No tree data to transmit
- Simple to implement
- Fast to decode
Disadvantages:
- Not optimized for specific data
- May not compress as well as dynamic
Dynamic Huffman (BTYPE = 10)
Builds custom Huffman trees optimized for the specific data being compressed. The trees are transmitted as part of the compressed data:
- Build optimal literal/length tree
- Build optimal distance tree
- Extract code lengths from both trees
- Build a third tree to compress the code lengths (meta-compression!)
- Transmit the code length tree and the compressed code lengths
Advantages:
- Optimized for actual data
- Better compression ratio
Disadvantages:
- More complex
- Tree data adds overhead
- Only worthwhile for larger blocks
Code Length Encoding
Dynamic Huffman blocks use a clever trick: they compress the code lengths themselves!
Code Length Alphabet
From README lines 490-500:
Example
Code lengths: [8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 5]
Compressed: [8, 16+2 (copy 8 five more times), 17+1 (four zeros), 5]
Here +N is the extra-bits value: code 16 repeats the previous length 3-6 times and code 17 emits 3-10 zeros, so 16+2 means 3+2 = 5 copies and 17+1 means 3+1 = 4 zeros.
This run-length encoding reduces the amount of data needed to describe the tree.
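One possible sketch of this run-length step, emitting (symbol, extra-bits) pairs per RFC 1951's code-length alphabet (the function name and output shape are assumptions for illustration):

```c
/* Code-length alphabet: 16 = copy previous length 3-6 times (extra = count-3),
   17 = 3-10 zeros (extra = count-3), 18 = 11-138 zeros (extra = count-11).
   Returns the number of (symbol, extra) pairs written. */
static int rle_lengths(const int lens[], int n, int syms[], int extra[])
{
    int out = 0, i = 0;
    while (i < n) {
        int run = 1;
        while (i + run < n && lens[i + run] == lens[i])
            run++;
        if (lens[i] == 0 && run >= 3) {
            if (run > 138) run = 138;                      /* cap one code-18 run */
            if (run <= 10) { syms[out] = 17; extra[out++] = run - 3; }
            else           { syms[out] = 18; extra[out++] = run - 11; }
        } else if (run >= 4) {
            /* emit the length once, then copy it with code 16 */
            syms[out] = lens[i]; extra[out++] = 0;
            int rem = run - 1;
            while (rem >= 3) {
                int take = rem > 6 ? 6 : rem;
                syms[out] = 16; extra[out++] = take - 3;
                rem -= take;
            }
            for (; rem > 0; rem--) { syms[out] = lens[i]; extra[out++] = 0; }
        } else {
            for (int k = 0; k < run; k++) { syms[out] = lens[i]; extra[out++] = 0; }
        }
        i += run;
    }
    return out;
}
```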
Code Length Order
Code lengths are transmitted in a specific order (most common lengths first). From README line 486:
Helper Functions
From README lines 311-319, the following helper functions are recommended:
Tree Building
Code Generation
Encoding Helpers
Decoding Helpers
Memory Management
Bit-Level Operations
Bit Packing Order
From README lines 384-395:
Byte Layout
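Assuming this packing, where bit 0 of the stream is the least significant bit of the first byte, a minimal LSB-first bit reader might look like the following sketch (the bitreader struct is an assumption):

```c
/* Hypothetical bit reader over a byte buffer; bitpos is an absolute bit
   index where bit 0 is the LSB of buf[0]. */
typedef struct {
    const unsigned char *buf;
    unsigned long bitpos;
} bitreader;

static unsigned read_bits(bitreader *r, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++) {
        unsigned bit = (r->buf[r->bitpos >> 3] >> (r->bitpos & 7)) & 1u;
        v |= bit << i;    /* first bit read becomes the LSB of the result */
        r->bitpos++;
    }
    return v;
}
```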
Reading Bits
Bits are read from LSB to MSB within each byte:
Writing Bits
Similarly, bits are written from LSB to MSB:
End-of-Block Symbol
All Huffman blocks end with symbol 256 (a symbol value, not a byte value, since bytes only go up to 255). From README line 398: the way to distinguish between a length-offset tuple and a literal byte is that literal bytes have a value less than 256.
Symbol meanings:
- 0-255: Literal bytes
- 256: End of block
- 257-285: Length codes (for LZ77 matches)
Example Encoding Flow
- Encoding
- Decoding
Summary
Key Huffman coding concepts:
- Variable-length codes based on symbol frequency
- Frequent symbols get short codes, rare symbols get long codes
- Codes are prefix-free (no code is a prefix of another)
- Binary tree represents the codes
- Canonical codes allow reconstruction from lengths only
- Static Huffman uses predefined codes
- Dynamic Huffman builds optimized trees
- Code lengths are compressed using run-length encoding
- Symbol 256 marks end-of-block
- Bit packing order matters (LSB to MSB for data, MSB first for codes)
Next Steps
- Huffman Implementation - Implementing encoding and decoding
- GZIP Format - File format and member structure
- DEFLATE - Combining LZ77 and Huffman for compression
- INFLATE - Decompression process