Huffman Implementation

Overview

This page covers the implementation of Huffman coding in huff.c. Huffman coding is the second compression layer in DEFLATE and the first decompression layer in INFLATE.

Required Functions

You must implement in huff.c:

Encoding Functions

unsigned char* huffman_encode_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len, 
    unsigned char *returned_btype)

static unsigned char* encode_dynamic_huffman_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len)

static unsigned char* encode_fixed_huffman_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len)

Decoding Function

unsigned char* huffman_decode(
    const unsigned char* data, 
    size_t enc_len, 
    unsigned int btype_val, 
    unsigned long *bits_read, 
    size_t* out_len, 
    const unsigned char* history, 
    size_t history_len)

Data Structures

Huffman Tree Node

From README line 272-278:

typedef struct huff_node {
    struct huff_node* left;   // Left child (0 bit)
    struct huff_node* right;  // Right child (1 bit)
    int symbol;               // Symbol value (-1 for internal nodes)
    unsigned long freq;       // Frequency count
} huff_node_t;

Decoder Structure

For efficient decoding:

typedef struct {
    int* symbols;    // Symbol lookup table
    int* counts;     // Count of codes per length
    int* offsets;    // Offset into symbols for each length
} huff_decoder_t;

Encoding Implementation

Main Encoding Function

unsigned char* huffman_encode_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len, 
    unsigned char *returned_btype)

Parameters:

tokens - Array of LZ77 tokens to encode
num_tokens - Number of tokens
bits_written - Output: number of bits written
out_len - Output: number of bytes allocated
returned_btype - Output: block type used (1=static, 2=dynamic)

Returns:

Malloc’d buffer containing Huffman-encoded data
NULL on error

Algorithm:

Validate inputs

Check for NULL pointers, invalid token data

Try both encodings

Encode with both static and dynamic Huffman

Choose best encoding

Compare sizes and use the smaller one

Return result

Set returned_btype and return the chosen encoding

Pseudocode:

unsigned char* huffman_encode_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len, 
    unsigned char *returned_btype) {
    
    // Validate inputs
    if (!tokens || !bits_written || !out_len || !returned_btype) {
        return NULL;
    }
    
    // Try static Huffman
    unsigned long static_bits = 0;
    size_t static_len = 0;
    unsigned char* static_result = encode_fixed_huffman_tokens(
        tokens, num_tokens, &static_bits, &static_len);
    
    // Try dynamic Huffman
    unsigned long dynamic_bits = 0;
    size_t dynamic_len = 0;
    unsigned char* dynamic_result = encode_dynamic_huffman_tokens(
        tokens, num_tokens, &dynamic_bits, &dynamic_len);
    
    // Choose the better option
    if (dynamic_bits < static_bits && dynamic_result) {
        *bits_written = dynamic_bits;
        *out_len = dynamic_len;
        *returned_btype = BT_DYNAMIC;  // 2
        free(static_result);
        return dynamic_result;
    } else {
        *bits_written = static_bits;
        *out_len = static_len;
        *returned_btype = BT_STATIC;   // 1
        free(dynamic_result);
        return static_result;
    }
}

Static Huffman Encoding

static unsigned char* encode_fixed_huffman_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len)

Algorithm:

Set up predefined codes

Use the fixed codes from RFC 1951

Allocate output buffer

Estimate size and allocate buffer

Encode each token

Convert LZ77 tokens to Huffman codes

Write end-of-block

Write symbol 256

Fixed Code Tables: From README lines 467-477 and our_zlib.h:37-48:

// Literal/length codes
// 0-143: 8 bits, codes 00110000 - 10111111
// 144-255: 9 bits, codes 110010000 - 111111111  
// 256-279: 7 bits, codes 0000000 - 0010111
// 280-287: 8 bits, codes 11000000 - 11000111

// Distance codes (all 5 bits)
// 0-29: codes 00000 - 11101

Implementation:

static unsigned char* encode_fixed_huffman_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len) {
    
    // Set up fixed code tables
    unsigned long lit_codes[286];
    unsigned char lit_lens[286];
    
    for (int i = 0; i <= 143; i++) {
        lit_codes[i] = 0b00110000 + i;
        lit_lens[i] = 8;
    }
    for (int i = 144; i <= 255; i++) {
        lit_codes[i] = 0b110010000 + (i - 144);
        lit_lens[i] = 9;
    }
    for (int i = 256; i <= 279; i++) {
        lit_codes[i] = 0b0000000 + (i - 256);
        lit_lens[i] = 7;
    }
    for (int i = 280; i <= 287; i++) {
        lit_codes[i] = 0b11000000 + (i - 280);
        lit_lens[i] = 8;
    }
    
    // Distance codes (all 5 bits)
    unsigned long dist_codes[30];
    unsigned char dist_lens[30];
    for (int i = 0; i < 30; i++) {
        dist_codes[i] = i;
        dist_lens[i] = 5;
    }
    
    // Allocate output buffer
    size_t max_bits = num_tokens * 16;  // Conservative estimate
    size_t buf_size = (max_bits + 7) / 8;
    unsigned char* output = calloc(buf_size, 1);
    unsigned long bit_pos = 0;
    
    // Encode each token
    for (size_t i = 0; i < num_tokens; i++) {
        if (tokens[i].is_literal) {
            // Write literal code
            write_huffman_code(output, &bit_pos, 
                lit_codes[tokens[i].literal], 
                lit_lens[tokens[i].literal]);
        } else {
            // Convert length to code
            unsigned int extra_bits;
            int len_code = length_to_code(tokens[i].length, &extra_bits);
            write_huffman_code(output, &bit_pos, 
                lit_codes[len_code], lit_lens[len_code]);
            // Write extra bits for length
            write_bits_lsb(output, &bit_pos, extra_bits, num_extra_bits);
            
            // Convert distance to code
            int dist_code = distance_to_code(tokens[i].distance, &extra_bits);
            write_huffman_code(output, &bit_pos, 
                dist_codes[dist_code], dist_lens[dist_code]);
            // Write extra bits for distance  
            write_bits_lsb(output, &bit_pos, extra_bits, num_extra_bits);
        }
    }
    
    // Write end-of-block (256)
    write_huffman_code(output, &bit_pos, lit_codes[256], lit_lens[256]);
    
    *bits_written = bit_pos;
    *out_len = (bit_pos + 7) / 8;
    return output;
}

Dynamic Huffman Encoding

static unsigned char* encode_dynamic_huffman_tokens(
    const lz_token_t* tokens, 
    size_t num_tokens, 
    unsigned long* bits_written, 
    size_t* out_len)

Algorithm:

Count frequencies

Count frequencies of all literal/length symbols (0-285) and distance symbols (0-29)

Build Huffman trees

Build optimal trees for literals/lengths and distances

Extract code lengths

Get the bit length for each symbol’s code

Build code length tree

Compress the code lengths using run-length encoding and build a third tree

Write header

Write HLIT, HDIST, HCLEN values

Write code length tree

Transmit the code lengths for the code length alphabet

Write compressed code lengths

Encode literal/length and distance code lengths using the code length tree

Write data

Encode all tokens using the optimal Huffman codes

This is the most complex function in the assignment. See the flowcharts in the README (lines 551-726) for visual guidance.

Decoding Implementation

Main Decoding Function

unsigned char* huffman_decode(
    const unsigned char* data, 
    size_t enc_len, 
    unsigned int btype_val, 
    unsigned long *bits_read, 
    size_t* out_len, 
    const unsigned char* history, 
    size_t history_len)

Parameters:

data - Compressed data to decode
enc_len - Length of compressed data
btype_val - Block type (0=uncompressed, 1=static, 2=dynamic)
bits_read - Output: number of bits consumed
out_len - Output: length of decompressed data
history - Previously decompressed data (for cross-block references)
history_len - Length of history buffer

Returns:

Malloc’d buffer containing decompressed data
NULL on error

This function must also perform LZ77 decompression as it decodes Huffman symbols.

Algorithm:

Validate inputs

Check for NULL pointers, invalid btype_val

Handle uncompressed blocks

If btype_val == 0, copy data directly (see GZIP format)

Set up Huffman trees

If btype_val == 1: use fixed codes
If btype_val == 2: read and build dynamic trees
If btype_val == 3: return error (reserved)

Decode symbols

Read Huffman codes and decode symbols until end-of-block (256)

Process symbols

Literals (0-255): append to output
End-of-block (256): stop
Lengths (257-285): read distance and perform LZ77 copy

Return result

Set out_len and return decompressed buffer

Pseudocode:

unsigned char* huffman_decode(
    const unsigned char* data, 
    size_t enc_len, 
    unsigned int btype_val, 
    unsigned long *bits_read, 
    size_t* out_len, 
    const unsigned char* history, 
    size_t history_len) {
    
    // Validate
    if (!data || !bits_read || !out_len) return NULL;
    if (btype_val == 3) return NULL;  // Reserved block type
    
    unsigned long bit_pos = 0;
    
    // Handle uncompressed blocks
    if (btype_val == BT_NO_COMPRESSION) {
        // Skip to byte boundary
        bit_pos = (bit_pos + 7) & ~7;
        
        // Read LEN and NLEN
        unsigned short len = read_16_lsb(data, &bit_pos);
        unsigned short nlen = read_16_lsb(data, &bit_pos);
        
        // Verify one's complement
        if ((len ^ nlen) != 0xFFFF) return NULL;
        
        // Copy data
        unsigned char* output = malloc(len);
        memcpy(output, data + bit_pos/8, len);
        bit_pos += len * 8;
        
        *bits_read = bit_pos;
        *out_len = len;
        return output;
    }
    
    // Set up decoder
    huff_decoder_t lit_decoder, dist_decoder;
    
    if (btype_val == BT_STATIC) {
        // Use fixed codes
        setup_fixed_decoders(&lit_decoder, &dist_decoder);
    } else {
        // Read dynamic trees
        if (!read_dynamic_trees(data, &bit_pos, &lit_decoder, &dist_decoder)) {
            return NULL;
        }
    }
    
    // Decode data
    size_t output_capacity = 65536;  // Start with reasonable size
    unsigned char* output = malloc(output_capacity);
    size_t output_pos = 0;
    
    while (1) {
        // Decode next symbol
        int symbol = decode_symbol(&lit_decoder, data, &bit_pos);
        
        if (symbol < 0) {
            // Decode error
            free(output);
            return NULL;
        } else if (symbol < 256) {
            // Literal byte
            if (output_pos >= output_capacity) {
                output_capacity *= 2;
                output = realloc(output, output_capacity);
            }
            output[output_pos++] = (unsigned char)symbol;
        } else if (symbol == 256) {
            // End of block
            break;
        } else if (symbol <= 285) {
            // Length/distance pair
            int length = decode_length_value(symbol, data, &bit_pos);
            int dist_code = decode_symbol(&dist_decoder, data, &bit_pos);
            
            if (dist_code == 30 || dist_code == 31) {
                // Invalid distance codes
                free(output);
                return NULL;
            }
            
            int distance = decode_distance_value(dist_code, data, &bit_pos);
            
            // Perform LZ77 copy
            for (int i = 0; i < length; i++) {
                if (output_pos >= output_capacity) {
                    output_capacity *= 2;
                    output = realloc(output, output_capacity);
                }
                
                // Handle cross-block references
                if (distance <= output_pos) {
                    output[output_pos] = output[output_pos - distance];
                } else if (history) {
                    size_t hist_pos = history_len - (distance - output_pos);
                    output[output_pos] = history[hist_pos];
                } else {
                    free(output);
                    return NULL;  // Reference beyond available data
                }
                output_pos++;
            }
        }
    }
    
    *bits_read = bit_pos;
    *out_len = output_pos;
    return output;
}

Reading Dynamic Trees

From README lines 729-869, here’s the process:

Read HLIT (5 bits)

Number of literal/length codes - 257 (actual count = HLIT + 257)

Read HDIST (5 bits)

Number of distance codes - 1 (actual count = HDIST + 1)

Read HCLEN (4 bits)

Number of code length codes - 4 (actual count = HCLEN + 4)

Read code length alphabet

Read (HCLEN + 4) × 3 bits in the order: 16,17,18,0,8,7,9,6,10,5,11,4,12,3,13,2,14,1,15

Build code length tree

Construct Huffman tree for code lengths (0-18)

Decode literal/length code lengths

Read (HLIT + 257) code length values using the code length tree

Decode distance code lengths

Read (HDIST + 1) code length values using the code length tree

Build main trees

Construct literal/length and distance Huffman trees from the code lengths

Handling Code Length Symbols

When decoding code lengths, symbols have special meanings:

0-15: Literal code length value
16: Repeat previous code length 3-6 times (read 2 extra bits)
17: Repeat code length 0 for 3-10 times (read 3 extra bits)
18: Repeat code length 0 for 11-138 times (read 7 extra bits)

for (int i = 0; i < num_codes; ) {
    int symbol = decode_symbol(&codelength_decoder, data, &bit_pos);
    
    if (symbol <= 15) {
        code_lengths[i++] = symbol;
    } else if (symbol == 16) {
        int repeat = 3 + read_bits(data, &bit_pos, 2);
        int prev = code_lengths[i - 1];
        for (int j = 0; j < repeat; j++) {
            code_lengths[i++] = prev;
        }
    } else if (symbol == 17) {
        int repeat = 3 + read_bits(data, &bit_pos, 3);
        for (int j = 0; j < repeat; j++) {
            code_lengths[i++] = 0;
        }
    } else if (symbol == 18) {
        int repeat = 11 + read_bits(data, &bit_pos, 7);
        for (int j = 0; j < repeat; j++) {
            code_lengths[i++] = 0;
        }
    }
}

Helper Functions

Implement these helper functions for cleaner code:

Tree Building

static huff_node_t* build_huffman_tree_from_freq(
    unsigned int *frequencies, 
    int num_symbols, 
    unsigned long *codes, 
    unsigned char *lens)

Use a priority queue to build the tree bottom-up (see queue.h).

Code Generation

static void build_codes(
    const huff_node_t* node, 
    unsigned long code, 
    int len, 
    unsigned long* codes, 
    unsigned char* lens)

Recursively traverse tree to generate codes.

static void canonical_codes(
    unsigned char* lens, 
    unsigned long* codes, 
    unsigned int range)

Generate canonical codes from lengths only.

Decoder Building

static void build_decoder(
    huff_decoder_t* dec, 
    unsigned char* lens, 
    int num_symbols)

Create lookup tables for fast decoding.

Symbol Decoding

static int decode_symbol(
    huff_decoder_t* dec, 
    const unsigned char* data, 
    unsigned long* bits_read)

Decode one symbol from the bit stream.

Length/Distance Conversion

static int length_to_code(unsigned int length, unsigned int *extra_val)
static int distance_to_code(unsigned int distance, unsigned int *extra_val)

Convert values to codes and extra bits (use the tables from the README).

Memory Management

static void free_tree(huff_node_t* n)

Recursively free tree nodes.

Error Handling

Return NULL if:

Any required pointer is NULL
btype_val is 3 (reserved)
bits_read is NULL (required output)
Invalid Huffman codes encountered
Distance codes 30 or 31
Memory allocation fails
Data ends prematurely
Invalid code length sequences

Testing

Test Static Huffman

Create simple LZ77 tokens
Encode with static Huffman
Decode and verify

Test Dynamic Huffman

Create tokens with non-uniform frequency
Encode with dynamic Huffman
Verify dynamic is smaller than static
Decode and verify

Test Code Length Encoding

Create code lengths with repetitions
Verify symbols 16, 17, 18 are used
Decode and verify lengths match

Test LZ77 Integration

Create tokens with length/distance pairs
Encode and decode
Verify LZ77 decompression works
Test wrap-around case (distance < length)

Memory Management

All encode and decode functions return malloc’d buffers. The caller must free them.

size_t out_len;
unsigned char* result = huffman_decode(data, len, btype, &bits, &out_len, NULL, 0);
if (!result) {
    // Handle error
}

// Use result...

free(result);  // Don't forget!

Also free intermediate structures:

Huffman trees (use free_tree())
Decoder tables
Temporary buffers

Summary

Key implementation points:

huffman_encode_tokens() tries both static and dynamic, chooses best
Static Huffman uses predefined codes from RFC 1951
Dynamic Huffman builds optimal trees and encodes them
huffman_decode() handles all block types and performs LZ77 decompression
Code lengths are compressed using run-length encoding (symbols 16, 17, 18)
Symbol 256 marks end-of-block
Distance codes 30 and 31 are invalid
Return NULL on all errors
Free all allocated memory

Next Steps

GZIP Format - File format and member headers
DEFLATE - Complete compression pipeline
INFLATE - Complete decompression pipeline

Assignment

Part 1: CLI & Arguments

Part 2: LZ77 Compression

Part 3: Huffman Coding

Part 4: DEFLATE/INFLATE

API Reference

Overview

Required Functions

Encoding Functions

Decoding Function

Data Structures

Huffman Tree Node

Decoder Structure

Encoding Implementation

Main Encoding Function

Static Huffman Encoding

Dynamic Huffman Encoding

Decoding Implementation

Main Decoding Function

Reading Dynamic Trees

Handling Code Length Symbols

Helper Functions

Tree Building

Code Generation

Decoder Building

Symbol Decoding

Length/Distance Conversion

Memory Management

Error Handling

Testing

Test Static Huffman

Test Dynamic Huffman

Test Code Length Encoding

Test LZ77 Integration

Memory Management

Summary

Next Steps

Build docs developers (and LLMs) love

Assignment

Part 1: CLI & Arguments

Part 2: LZ77 Compression

Part 3: Huffman Coding

Part 4: DEFLATE/INFLATE

API Reference

​Overview

​Required Functions

​Encoding Functions

​Decoding Function

​Data Structures

​Huffman Tree Node

​Decoder Structure

​Encoding Implementation

​Main Encoding Function

​Static Huffman Encoding

​Dynamic Huffman Encoding

​Decoding Implementation

​Main Decoding Function

​Reading Dynamic Trees

​Handling Code Length Symbols

​Helper Functions

​Tree Building

​Code Generation

​Decoder Building

​Symbol Decoding

​Length/Distance Conversion

​Memory Management

​Error Handling

​Testing

​Test Static Huffman

​Test Dynamic Huffman

​Test Code Length Encoding

​Test LZ77 Integration

​Memory Management

​Summary

​Next Steps

Build docs developers (and LLMs) love

Overview

Required Functions

Encoding Functions

Decoding Function

Data Structures

Huffman Tree Node

Decoder Structure

Encoding Implementation

Main Encoding Function

Static Huffman Encoding

Dynamic Huffman Encoding

Decoding Implementation

Main Decoding Function

Reading Dynamic Trees

Handling Code Length Symbols

Helper Functions

Tree Building

Code Generation

Decoder Building

Symbol Decoding

Length/Distance Conversion

Memory Management

Error Handling

Testing

Test Static Huffman

Test Dynamic Huffman

Test Code Length Encoding

Test LZ77 Integration

Memory Management

Summary

Next Steps