GGUF File Format
GGUF (GPT-Generated Unified Format) is the binary file format used by llama.cpp to store and distribute quantized language models. It’s designed specifically for efficient loading and inference of large language models.
Overview
GGUF files are self-contained binary files that include:
- Model weights (tensors) in various quantized formats
- Metadata as key-value pairs
- Tensor descriptors with shape and type information
- Optional alignment for efficient memory access
GGUF is currently at version 3, succeeding the earlier GGML-era file formats. It provides better extensibility and metadata support.
File Structure
A GGUF file follows this precise structure:
1. Header Section
The magic bytes "GGUF", the format version (currently 3), the tensor count, and the metadata KV count.
2. Key-Value Metadata
For each KV pair:
- Key (string): Metadata identifier
- Value type (gguf_type): Data type enum
- Value data: Binary representation
- For array values:
  - Array element type (gguf_type)
  - Number of elements (uint64_t)
  - Binary data for each element
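The KV layout above can be sketched with Python's struct module. This is illustrative of the on-disk byte layout only (not the llama.cpp C API), assuming the v3 conventions that strings are a uint64 length followed by UTF-8 bytes and that gguf_type 4 is the uint32 value type:

```python
import struct

# Pack one GGUF metadata KV pair with a uint32 value, little-endian,
# following the layout listed above: key string, value type enum, value data.
def pack_kv_u32(key: str, value: int) -> bytes:
    key_bytes = key.encode("utf-8")
    out = struct.pack("<Q", len(key_bytes)) + key_bytes  # key (string)
    out += struct.pack("<I", 4)                          # value type (gguf_type)
    out += struct.pack("<I", value)                      # value data
    return out

blob = pack_kv_u32("general.alignment", 32)
```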
Common metadata keys include general.architecture, general.name, general.alignment, and model-specific hyperparameters like layer counts and dimensions.
3. Tensor Descriptors
For each tensor:
- Tensor name (string): e.g., "token_embd.weight"
- Number of dimensions (uint32_t)
- Dimension sizes (int64_t array): Shape of the tensor
- Data type (ggml_type): Quantization format
- Data offset (uint64_t): Position in the data blob
4. Tensor Data Blob
The actual tensor data, stored contiguously with optional alignment padding.
The default alignment is 32 bytes (GGUF_DEFAULT_ALIGNMENT), but it can be customized via the general.alignment metadata key.
Data Types
GGUF supports multiple data types for metadata values: 8-, 16-, 32-, and 64-bit signed and unsigned integers, 32- and 64-bit floats, booleans, strings, and arrays of any of these types.
Tensor Quantization Types
GGUF files can store tensors in various quantization formats:- Floating Point: F32, F16, BF16
- K-Quants: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K
- I-Quants: IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M, IQ4_XS, IQ4_NL
- Legacy: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
- Experimental: TQ1_0, TQ2_0, MXFP4
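As a worked size example, consider the legacy Q4_0 format. Assuming llama.cpp's block layout (each block covers 32 weights and stores one fp16 scale plus 32 packed 4-bit quants, i.e. 18 bytes per block), the storage cost works out to 4.5 bits per weight:

```python
# Rough size math for Q4_0, assuming an 18-byte block covering 32 weights
# (2-byte fp16 scale + 16 bytes of packed 4-bit quants).
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 2 + 16

def q4_0_tensor_bytes(n_weights: int) -> int:
    assert n_weights % BLOCK_WEIGHTS == 0  # tensors are stored in whole blocks
    return n_weights // BLOCK_WEIGHTS * BLOCK_BYTES

bits_per_weight = BLOCK_BYTES * 8 / BLOCK_WEIGHTS  # 4.5 bits per weight
size = q4_0_tensor_bytes(4096 * 4096)              # e.g. one 4096x4096 matrix
```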
Working with GGUF Files
Loading a GGUF File
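The original code sample for this section is not preserved here; as a stand-in, loading starts with the fixed header, which can be parsed directly. This sketch assumes the v3 layout described above (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 KV count, all little-endian) and is not the llama.cpp C API:

```python
import struct

# Parse the fixed-size GGUF v3 header from the start of a file's bytes.
def parse_header(data: bytes):
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Fabricated example header: version 3, 291 tensors, 24 KV pairs.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
```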
Reading Metadata
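The original code sample for this section is not preserved here; the byte-level decoding of a string-valued KV pair can be sketched instead, assuming the v3 conventions (strings are a uint64 length followed by UTF-8 bytes; gguf_type 8 is the string value type):

```python
import struct

def read_string(data: bytes, pos: int):
    # GGUF string: uint64 length followed by UTF-8 bytes.
    (n,) = struct.unpack_from("<Q", data, pos)
    pos += 8
    return data[pos:pos + n].decode("utf-8"), pos + n

def read_string_kv(data: bytes, pos: int = 0):
    key, pos = read_string(data, pos)
    (vtype,) = struct.unpack_from("<I", data, pos)
    assert vtype == 8, "this sketch only handles string values"
    value, pos = read_string(data, pos + 4)
    return key, value, pos

# Fabricated example bytes for the pair general.architecture = "llama":
sample = (struct.pack("<Q", 20) + b"general.architecture"
          + struct.pack("<I", 8)
          + struct.pack("<Q", 5) + b"llama")
```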
Accessing Tensors
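The original code sample for this section is not preserved here; what can be sketched is decoding one tensor descriptor, following the field order listed in the File Structure section (name string, uint32 dimension count, int64 dimension sizes, uint32 ggml_type, uint64 offset into the data blob). The example descriptor bytes are fabricated:

```python
import struct

def read_tensor_desc(data: bytes, pos: int = 0):
    (n,) = struct.unpack_from("<Q", data, pos); pos += 8
    name = data[pos:pos + n].decode("utf-8"); pos += n
    (n_dims,) = struct.unpack_from("<I", data, pos); pos += 4
    dims = list(struct.unpack_from(f"<{n_dims}q", data, pos)); pos += 8 * n_dims
    dtype, offset = struct.unpack_from("<IQ", data, pos); pos += 4 + 8
    return name, dims, dtype, offset, pos

# Fabricated descriptor: a 4096 x 32000 F32 tensor (ggml_type 0) at offset 0.
desc = (struct.pack("<Q", 17) + b"token_embd.weight"
        + struct.pack("<I", 2) + struct.pack("<2q", 4096, 32000)
        + struct.pack("<IQ", 0, 0))
```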
Writing GGUF Files
There are three ways to write GGUF files:
Method 1: Single Pass Write
Method 2: Metadata First, Then Data
Method 3: Reserve Space, Write Data, Write Metadata
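The original code samples for these methods are not preserved here. A single-pass write (Method 1) can at least be sketched at the byte level: header, metadata, tensor descriptors, then the aligned data blob, appended in order. All names and values below are illustrative; this is not the C API in gguf.h, which provides functions such as gguf_init_empty() and gguf_write_to_file():

```python
import struct

GGUF_DEFAULT_ALIGNMENT = 32

def pad_to(n: int, align: int = GGUF_DEFAULT_ALIGNMENT) -> int:
    # Round n up to the next multiple of align.
    return (n + align - 1) // align * align

def write_minimal_gguf(tensor_name: str, shape: tuple, raw_f32: bytes) -> bytes:
    # Single-pass write: header, one KV pair, one tensor descriptor,
    # alignment padding, then the tensor data blob.
    def s(txt: str) -> bytes:  # GGUF string: uint64 length + UTF-8 bytes
        b = txt.encode("utf-8")
        return struct.pack("<Q", len(b)) + b
    out = struct.pack("<4sIQQ", b"GGUF", 3, 1, 1)   # header: 1 tensor, 1 KV pair
    out += s("general.architecture") + struct.pack("<I", 8) + s("llama")
    out += s(tensor_name) + struct.pack("<I", len(shape))
    out += struct.pack(f"<{len(shape)}q", *shape)
    out += struct.pack("<IQ", 0, 0)                 # ggml_type F32, blob offset 0
    out += b"\x00" * (pad_to(len(out)) - len(out))  # pad to blob alignment
    return out + raw_f32
```

The padding step is where general.alignment matters: the data blob must start at a multiple of the alignment, so tensor offsets stay aligned for memory-mapped access.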
Tools for Working with GGUF
gguf-parser
Review and inspect GGUF files, estimate memory usage
GGUF-my-repo
Convert and quantize models to GGUF format on Hugging Face
GGUF Editor
Edit GGUF metadata in your browser
llama-quantize
Convert and quantize GGUF files locally
Common Metadata Keys
Key metadata found in GGUF files:
- general.architecture: Model architecture (llama, falcon, gpt2, etc.)
- general.name: Model name
- general.file_type: Quantization type
- general.alignment: Data alignment in bytes
- {arch}.context_length: Maximum context length
- {arch}.embedding_length: Embedding dimension
- {arch}.block_count: Number of transformer layers
- {arch}.attention.head_count: Number of attention heads
- tokenizer.ggml.model: Tokenizer type (llama, gpt2, etc.)
Reference
For the complete GGUF specification and API reference, see the specification document and the gguf.h header in the ggml repository.
Module Maintainer: Johannes Gäßler (@JohannesGaessler, [email protected])

