Overview
GZIP (GNU zip) is a file format for lossless data compression. A GZIP file consists of one or more members, each containing compressed data along with metadata about that data.
File Structure
A GZIP file is a sequence of members. Each member consists of a header, the compressed data, and an 8-byte footer.
GZIP Magic Number
All GZIP members start with the same 2-byte identifier (see our_zlib.h:65): the first two bytes of every member are 0x1f 0x8b.
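
A minimal sketch of a magic-number check on an in-memory buffer. The macro and function names below are hypothetical and chosen for illustration; the project's own definition is the one at our_zlib.h:65.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the project's own definitions live in our_zlib.h. */
#define GZ_MAGIC_ID1 0x1f
#define GZ_MAGIC_ID2 0x8b

/* Returns 1 if buf starts with the GZIP magic number, 0 otherwise. */
int has_gzip_magic(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < 2) {
        return 0;   /* too short to hold both identification bytes */
    }
    return buf[0] == GZ_MAGIC_ID1 && buf[1] == GZ_MAGIC_ID2;
}
```

Checking the length first keeps the function safe on truncated input, which is worth doing before any further header parsing.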
Member Header Structure
The GZIP member header contains metadata about the compressed data.
Required Fields
These fields are present in every member:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 1 | ID1 | Identification byte 1 (0x1f) |
| 1 | 1 | ID2 | Identification byte 2 (0x8b) |
| 2 | 1 | CM | Compression Method |
| 3 | 1 | FLG | Flags |
| 4 | 4 | MTIME | Modification time (Unix timestamp) |
| 8 | 1 | XFL | Extra flags |
| 9 | 1 | OS | Operating system |
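
The table maps directly onto byte offsets, so the required fields can be decoded from a 10-byte buffer. The sketch below assumes a hypothetical `gz_fixed_header` struct and function name; it is not the layout declared in our_zlib.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration; not the layout in our_zlib.h. */
typedef struct {
    uint8_t  cm;
    uint8_t  flg;
    uint32_t mtime;
    uint8_t  xfl;
    uint8_t  os;
} gz_fixed_header;

/* Decodes the 10 required bytes. Returns 0 on success, -1 on error. */
int decode_fixed_header(const uint8_t *buf, size_t len, gz_fixed_header *out)
{
    if (buf == NULL || out == NULL || len < 10) {
        return -1;
    }
    if (buf[0] != 0x1f || buf[1] != 0x8b) {
        return -1;                      /* ID1/ID2 mismatch: not a GZIP member */
    }
    out->cm  = buf[2];
    out->flg = buf[3];
    /* MTIME occupies offsets 4..7, least significant byte first. */
    out->mtime = (uint32_t)buf[4]
               | ((uint32_t)buf[5] << 8)
               | ((uint32_t)buf[6] << 16)
               | ((uint32_t)buf[7] << 24);
    out->xfl = buf[8];
    out->os  = buf[9];
    return 0;
}
```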
Compression Method (CM)
From README line 330:
Compression Method (1 byte): Either 0 (no compression) or 8 (standard deflate compression algorithm)
- 0: No compression (raw data)
- 8: DEFLATE compression (standard)
Flags (FLG)
From our_zlib.h:31-35:
- FTEXT (bit 0): File is probably ASCII text (hint only)
- FHCRC (bit 1): 16-bit CRC of header is present
- FEXTRA (bit 2): Extra field is present
- FNAME (bit 3): Original file name is present
- FCOMMENT (bit 4): Comment is present
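
The bit positions above translate into masks as follows. The `F_EXTRA`, `F_NAME`, `F_COMMENT`, and `F_HCRC` names follow the identifiers used elsewhere in this document, but the values here are re-derived from the bit numbers rather than copied from our_zlib.h; `F_TEXT` and `flag_is_set()` are assumed names.

```c
#include <stdint.h>

/* Masks derived from the bit positions listed above.
 * F_TEXT and flag_is_set() are assumed names for illustration. */
#define F_TEXT    (1u << 0)
#define F_HCRC    (1u << 1)
#define F_EXTRA   (1u << 2)
#define F_NAME    (1u << 3)
#define F_COMMENT (1u << 4)

/* Returns nonzero if any bit of mask is set in the FLG byte. */
int flag_is_set(uint8_t flg, unsigned mask)
{
    return (flg & mask) != 0;
}
```

For example, a FLG byte of 0x0a has bits 1 and 3 set, so the member carries a header CRC and an original file name.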
Modification Time (MTIME)
A 32-bit Unix timestamp (seconds since January 1, 1970 00:00:00 GMT), stored in little-endian format. From README line 332:
All multi-byte integers are stored in little-endian format.
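
Little-endian decoding can be done with shifts so that it works regardless of the host's byte order. The helper names below are hypothetical; the project's own helpers may be named and shaped differently.

```c
#include <stdint.h>

/* Hypothetical helper names; the project's own helpers may differ. */

/* Reads a 16-bit little-endian value starting at p. */
uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Reads a 32-bit little-endian value starting at p. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Building the value byte by byte avoids casting the buffer to an integer pointer, which would depend on host endianness and alignment.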
Extra Flags (XFL)
Flags for the compression algorithm:
- 2: Maximum compression (slowest algorithm)
- 4: Fastest compression
- 0: Normal compression
Operating System (OS)
Indicates the filesystem where compression took place:
- 0: FAT filesystem (MS-DOS, OS/2, NT/Win32)
- 1: Amiga
- 2: VMS
- 3: Unix
- 4: VM/CMS
- 5: Atari TOS
- 6: HPFS filesystem (OS/2, NT)
- 7: Macintosh
- 8: Z-System
- 9: CP/M
- 10: TOPS-20
- 11: NTFS filesystem (NT)
- 12: QDOS
- 13: Acorn RISCOS
- 255: Unknown
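
If the OS byte needs to be shown to a user, the list above can be turned into a lookup table. `gz_os_name()` is a hypothetical helper, not something the project is known to provide; codes outside the listed range are reported as unknown.

```c
#include <stdint.h>

/* Hypothetical helper: maps the OS byte to the names listed above. */
const char *gz_os_name(uint8_t os)
{
    static const char *const names[] = {
        "FAT filesystem (MS-DOS, OS/2, NT/Win32)",
        "Amiga", "VMS", "Unix", "VM/CMS", "Atari TOS",
        "HPFS filesystem (OS/2, NT)",
        "Macintosh", "Z-System", "CP/M", "TOPS-20",
        "NTFS filesystem (NT)",
        "QDOS", "Acorn RISCOS"
    };
    if (os < sizeof names / sizeof names[0]) {
        return names[os];
    }
    return "Unknown";   /* 255 and any unassigned code */
}
```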
Optional Header Fields
These fields appear after the 10-byte required header, in the following order (if present):
Extra Field (FEXTRA)
If FLG & F_EXTRA is set:
| Size | Field | Description |
|---|---|---|
| 2 | XLEN | Length of extra field (little-endian) |
| XLEN | Extra data | Extra field data |
Original File Name (FNAME)
If FLG & F_NAME is set:
A zero-terminated string containing the original file name.
Comment (FCOMMENT)
If FLG & F_COMMENT is set:
A zero-terminated string containing a comment.
Header CRC (FHCRC)
If FLG & F_HCRC is set:
A 16-bit CRC (little-endian) computed over all header bytes from ID1 through the end of the optional fields (but not including the HCRC itself).
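
Because each optional field is either length-prefixed or zero-terminated, the total header length can be found by walking the fields in order. The sketch below does this on an in-memory buffer; `gz_header_length()` and `skip_cstring()` are hypothetical names, and the mask values are derived from the FLG bit positions rather than taken from our_zlib.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask values derived from the FLG bit positions. */
#define F_HCRC    0x02
#define F_EXTRA   0x04
#define F_NAME    0x08
#define F_COMMENT 0x10

/* Advances *pos past a zero-terminated string.
 * Returns 0, or -1 if the buffer ends before a terminator is found. */
static int skip_cstring(const uint8_t *buf, size_t len, size_t *pos)
{
    while (*pos < len && buf[*pos] != 0) {
        (*pos)++;
    }
    if (*pos >= len) {
        return -1;
    }
    (*pos)++;                 /* step over the terminating zero byte */
    return 0;
}

/* Returns the offset of the compressed data within buf, or -1 on error. */
long gz_header_length(const uint8_t *buf, size_t len)
{
    size_t pos = 10;          /* optional fields start after the fixed header */

    if (buf == NULL || len < 10) {
        return -1;
    }
    uint8_t flg = buf[3];

    if (flg & F_EXTRA) {
        if (len - pos < 2) return -1;
        size_t xlen = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (len - pos < xlen) return -1;
        pos += xlen;
    }
    if ((flg & F_NAME) && skip_cstring(buf, len, &pos) != 0) return -1;
    if ((flg & F_COMMENT) && skip_cstring(buf, len, &pos) != 0) return -1;
    if (flg & F_HCRC) {
        if (len - pos < 2) return -1;
        pos += 2;
    }
    return (long)pos;
}
```

Every step checks the remaining length before reading, so a truncated or malformed header produces -1 instead of an out-of-bounds read.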
Compressed Data
After the header comes the compressed data, consisting of one or more blocks. See DEFLATE and INFLATE for details on the compressed data format.
Member Footer
After the compressed data, each member has an 8-byte footer:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | CRC32 | CRC-32 of uncompressed data (little-endian) |
| 4 | 4 | ISIZE | Size of uncompressed data modulo 2^32 (little-endian) |
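
Decoding the footer is two little-endian reads. The `gz_footer` struct and `decode_footer()` function are hypothetical names used only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct holding the two footer fields. */
typedef struct {
    uint32_t crc32;
    uint32_t isize;
} gz_footer;

/* Reads one little-endian 32-bit value. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decodes the 8-byte footer. Returns 0 on success, -1 on error. */
int decode_footer(const uint8_t *buf, size_t len, gz_footer *out)
{
    if (buf == NULL || out == NULL || len < 8) {
        return -1;
    }
    out->crc32 = le32(buf);       /* offset 0 */
    out->isize = le32(buf + 4);   /* offset 4 */
    return 0;
}
```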
CRC32
From README line 332:
CRC (4 bytes): A cyclic redundancy check computed over the uncompressed data, represented as a little-endian unsigned 32-bit integer.
Use the get_crc() function from crc.h:8 to compute it.
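
For reference, the CRC-32 used by GZIP can be computed bit by bit as sketched below. This is a stand-in for illustration only: `crc32_sketch()` is a hypothetical name, it is not the project's get_crc(), and the signature of get_crc() is not assumed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
 * Slow but compact; table-driven versions trade memory for speed. */
uint32_t crc32_sketch(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;           /* initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;             /* final inversion */
}
```

The standard check value for this CRC is 0xCBF43926 for the nine ASCII bytes "123456789", which is a convenient sanity test for any implementation.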
ISIZE
From README line 333:
ISIZE (4 bytes): size of the data before it was compressed
This is the size modulo 2^32, so for inputs of 4 GiB or larger the value wraps around.
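
The wraparound is simply a truncation to the low 32 bits. A small sketch, with `gz_isize_field()` as a hypothetical helper name:

```c
#include <stdint.h>

/* Returns the value stored in ISIZE for a given uncompressed size. */
uint32_t gz_isize_field(uint64_t uncompressed_size)
{
    return (uint32_t)(uncompressed_size & 0xFFFFFFFFu);  /* size mod 2^32 */
}
```

A 5 GiB input (5368709120 bytes) is therefore recorded as 1073741824, the same ISIZE a 1 GiB input would produce, so ISIZE alone cannot confirm the true size of large members.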
Data Structure
From our_zlib.h:50-63:
The extra, name, and comment fields are pointers that should be set to NULL if the corresponding flag is not set. Remember to free these when done!
Parsing a Member
From our_zlib.h:66-67:
parse_member()
From README lines 513-525:
Parameters:
- file - Pointer to a file with file pointer at the start of member data
- header - Pointer to output structure where parsed data will be written
Returns:
- 0 on success
- -1 on error (NULL pointers, failure to read from file)
Pseudocode:
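
As a concrete counterpart, here is a minimal C sketch of the header-reading portion only. It assumes a hypothetical `gz_member_sketch` struct and the function names `parse_member_sketch()` and `read_cstring()`; the real struct and prototypes are the ones in our_zlib.h, and the real parse_member() may do more than this.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical struct; the real definition is in our_zlib.h. */
typedef struct {
    uint8_t  cm, flg, xfl, os;
    uint32_t mtime;
    uint16_t xlen, hcrc;
    uint8_t *extra;    /* NULL unless FEXTRA is set */
    char    *name;     /* NULL unless FNAME is set */
    char    *comment;  /* NULL unless FCOMMENT is set */
} gz_member_sketch;

/* Reads a zero-terminated string of unknown length; caller frees it. */
static char *read_cstring(FILE *file)
{
    size_t cap = 32, len = 0;
    char *s = malloc(cap);
    int c;

    while (s != NULL && (c = fgetc(file)) != EOF) {
        s[len++] = (char)c;
        if (c == 0) {
            return s;              /* terminator stored, string complete */
        }
        if (len == cap) {          /* grow before the next write */
            char *grown = realloc(s, cap *= 2);
            if (grown == NULL) {
                free(s);
            }
            s = grown;
        }
    }
    free(s);                       /* EOF before terminator, or out of memory */
    return NULL;
}

/* Returns 0 on success, -1 on error. On error the caller should still
 * free any non-NULL pointers left in *h. */
int parse_member_sketch(FILE *file, gz_member_sketch *h)
{
    uint8_t b[10], t[2];

    if (file == NULL || h == NULL) return -1;
    h->extra = NULL;
    h->name = h->comment = NULL;
    h->xlen = h->hcrc = 0;

    if (fread(b, 1, 10, file) != 10) return -1;
    if (b[0] != 0x1f || b[1] != 0x8b) return -1;
    h->cm = b[2];
    h->flg = b[3];
    h->mtime = (uint32_t)b[4] | ((uint32_t)b[5] << 8)
             | ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24);
    h->xfl = b[8];
    h->os = b[9];

    if (h->flg & 0x04) {                               /* FEXTRA */
        if (fread(t, 1, 2, file) != 2) return -1;
        h->xlen = (uint16_t)(t[0] | (t[1] << 8));
        h->extra = malloc(h->xlen ? h->xlen : 1);
        if (h->extra == NULL) return -1;
        if (fread(h->extra, 1, h->xlen, file) != h->xlen) return -1;
    }
    if ((h->flg & 0x08) && (h->name = read_cstring(file)) == NULL) return -1;
    if ((h->flg & 0x10) && (h->comment = read_cstring(file)) == NULL) return -1;
    if (h->flg & 0x02) {                               /* FHCRC */
        if (fread(t, 1, 2, file) != 2) return -1;
        h->hcrc = (uint16_t)(t[0] | (t[1] << 8));
    }
    return 0;   /* file position is now at the start of compressed data */
}
```

Initializing the three pointers to NULL before any read means the caller can always free them unconditionally, whether or not parsing succeeded.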
skip_gz_header_to_compressed_data()
Works like parse_member(), but stops after reading the header, leaving the file pointer positioned at the start of compressed data.
This is used in decompression mode (see main.c:152).
Multi-Member Files
A GZIP file can contain multiple members concatenated together. Each member is independent and can be decompressed separately. From main.c:86-99, the member summary loop shows how to handle multiple members:
Member Summary Output
From README lines 197-203 and global.h:26-32:
If the member has a name field, use that as the label. Otherwise, use the member index (0, 1, 2, …). The comment is only printed if present.
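
The label rule can be sketched as a small helper. `member_label()` is a hypothetical function; the actual output should go through the formatting macros in global.h, whose definitions are not reproduced here.

```c
#include <stdio.h>

/* Writes the member's label into out: its name if present, else its index.
 * Returns the value of snprintf (length the full label needs). */
int member_label(const char *name, unsigned index, char *out, size_t out_size)
{
    if (name != NULL) {
        return snprintf(out, out_size, "%s", name);
    }
    return snprintf(out, out_size, "%u", index);
}
```

Passing the name through a "%s" format rather than using it as the format string keeps file names containing '%' from being interpreted as conversions.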
CRC Validation
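To validate the CRC, recompute the CRC-32 of the decompressed data and compare it with the CRC32 field in the footer.
If a member has an invalid CRC or Header CRC (HCRC), or if parsing fails, an error message is printed to standard error.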
Endianness
Helper functions for reading little-endian values:
Memory Management
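
Releasing the header's heap-allocated fields can be wrapped in one helper so that callers cannot forget a field. The struct below is a hypothetical subset containing only the owned pointers, and `free_owned_fields()` is an assumed name; the real struct is the one in our_zlib.h.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical subset of the header struct: only the owned pointers. */
typedef struct {
    uint8_t *extra;
    char    *name;
    char    *comment;
} gz_owned_fields;

/* Frees every owned field and resets it to NULL, so calling this
 * twice on the same struct is harmless. */
void free_owned_fields(gz_owned_fields *h)
{
    if (h == NULL) {
        return;
    }
    free(h->extra);    /* free(NULL) is a no-op, so unset fields are fine */
    free(h->name);
    free(h->comment);
    h->extra = NULL;
    h->name = NULL;
    h->comment = NULL;
}
```

This relies on the convention stated earlier that fields whose flag is not set are NULL.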
Implementation Files
GZIP member parsing is implemented in zlib.c.
See also:
- main.c:82-100 - Member summary mode implementation
- main.c:141-187 - Decompression mode implementation
- main.c:102-139 - Compression mode implementation
Summary
Key points about GZIP format:
- Magic number: 0x1f 0x8b (always first 2 bytes)
- Required header: 10 bytes (ID, CM, FLG, MTIME, XFL, OS)
- Optional fields: EXTRA, NAME, COMMENT, HCRC (based on FLG bits)
- Compressed data follows header
- Footer: 8 bytes (CRC32, ISIZE)
- All multi-byte integers are little-endian
- Files can contain multiple members
- Use provided macros for output formatting
- Free allocated strings in header
Next Steps
- DEFLATE - Compressing data into GZIP format
- INFLATE - Decompressing GZIP data
- CLI Arguments - Member summary mode usage