Overview
The conversion process transforms model weights and metadata from Hugging Face format (or other formats) into the GGUF format used by llama.cpp.
When to convert:
- You have a model in PyTorch (.bin, .pt) or SafeTensors (.safetensors) format
- You want to use a model from Hugging Face that isn’t available in GGUF
- You’ve fine-tuned a model and need to convert it for inference
When not to convert:
- The model is already available in GGUF format on Hugging Face - in that case you can simply use the pre-converted version
Quick Start
The main conversion script is convert_hf_to_gguf.py:
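A minimal invocation looks like this (the model directory and output file name are illustrative placeholders):

```shell
# Convert a Hugging Face model directory to GGUF (f16 output by default)
python convert_hf_to_gguf.py /path/to/model-directory \
    --outfile model-f16.gguf
```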
Step-by-Step Conversion Process
Obtain the Model
First, download the model in its original format from Hugging Face or another source. You should see files like:
- config.json
- tokenizer.json / tokenizer.model
- model-*.safetensors or pytorch_model-*.bin
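One way to download is with the Hugging Face Hub CLI (the repository name below is an illustrative placeholder, not from the original):

```shell
pip install -U "huggingface_hub[cli]"
# Download the full model repository into a local directory
huggingface-cli download org-name/model-name --local-dir ./model-directory
```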
Install Dependencies
Install the required Python packages. Key dependencies:
- torch - PyTorch, for loading model weights
- transformers - Hugging Face transformers library
- numpy - numerical operations
- gguf - GGUF format library
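From the root of a llama.cpp checkout, the dependencies can be installed with:

```shell
# Installs torch, transformers, numpy, gguf, and other conversion dependencies
pip install -r requirements.txt
```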
Run Conversion
Convert the model to GGUF format. The script will:
- Load the model configuration
- Read model weights
- Convert tensors to GGUF format
- Save the output file
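For example, assuming the model was downloaded to ./model-directory:

```shell
python convert_hf_to_gguf.py ./model-directory \
    --outfile model-f16.gguf \
    --outtype f16
```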
Conversion Script Reference
convert_hf_to_gguf.py
The primary conversion script for Hugging Face models.
Common Options
Output Types
- f16 (default): 16-bit floating point - good balance of size and quality
- f32: 32-bit floating point - full precision, largest file
- bf16: BFloat16 - alternative 16-bit format, same size as f16
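The output type is selected with the --outtype option, for example:

```shell
# Full-precision output (largest file)
python convert_hf_to_gguf.py ./model-directory --outtype f32 --outfile model-f32.gguf
# BFloat16 output (same size as f16)
python convert_hf_to_gguf.py ./model-directory --outtype bf16 --outfile model-bf16.gguf
```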
Other Conversion Scripts
convert_lora_to_gguf.py
Convert LoRA (Low-Rank Adaptation) adapters to GGUF format. Useful for fine-tuned models using the LoRA technique. See the GGUF-my-LoRA space for online conversion.
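A sketch of a typical invocation (paths are illustrative); the --base flag points at the original base model the adapter was trained on:

```shell
python convert_lora_to_gguf.py /path/to/lora-adapter \
    --base /path/to/base-model \
    --outfile lora-adapter.gguf
```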
convert_llama_ggml_to_gguf.py
Convert the old GGML format to the current GGUF format. Only needed for very old llama.cpp models from before the GGUF format was introduced.
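A sketch, assuming the script's --input/--output flags (flag names may differ between versions; check --help):

```shell
# Convert a pre-GGUF GGML file to GGUF (file names are illustrative)
python convert_llama_ggml_to_gguf.py --input old-model.ggml.bin --output new-model.gguf
```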
Supported Model Architectures
The conversion script automatically detects the model architecture from config.json. See Supported Models for the full list of supported architectures.
Advanced Conversion
Converting from ModelScope
Models from ModelScope can be converted the same way.
Vocabulary-Only Conversion
For testing tokenizers or when you only need the vocabulary:
Custom Metadata
Embed custom metadata during conversion; the embedded fields can be inspected later with llama-cli --model-info.
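The two options above might be used like this (--vocab-only is a standard flag; --metadata, which takes a JSON file of overrides, is assumed to be available in recent versions; file names are illustrative):

```shell
# Vocabulary-only conversion (small output, useful for tokenizer testing)
python convert_hf_to_gguf.py ./model-directory --vocab-only --outfile vocab.gguf

# Embed custom metadata overrides from a JSON file
python convert_hf_to_gguf.py ./model-directory --metadata metadata.json --outfile model-f16.gguf
```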
Online Conversion Tools
If you prefer not to set up a local environment, use these Hugging Face spaces.
GGUF-my-repo
GGUF-my-repo - official converter and quantizer. Features:
- Convert any Hugging Face model to GGUF
- Automatically quantize to multiple formats
- No local setup required
- Results published to your Hugging Face account
To use it:
- Visit the space
- Enter the model repository name
- Select quantization options
- Click “Submit”
- Download the resulting GGUF files
The space is synced from llama.cpp main branch every 6 hours, so it uses recent conversion code.
GGUF-my-LoRA
GGUF-my-LoRA - convert LoRA adapters. A specialized tool for converting LoRA fine-tuned models. See the discussion for details.
Troubleshooting
ModuleNotFoundError: No module named 'torch'
Solution:
Install the Python requirements:
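From the llama.cpp directory:

```shell
# Installs torch and the other conversion dependencies
pip install -r requirements.txt
```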
Model architecture not recognized
Symptoms: the conversion script fails with an error reporting that the model architecture is unknown or unsupported.
Solutions:
- Check if your model architecture is supported in Supported Models
- Update llama.cpp to the latest version
- If it’s a new architecture, it may not be supported yet
Out of memory during conversion
Solution:
The conversion process loads the entire model into memory. For large models (70B+):
- Use a machine with sufficient RAM (at least 2x the model size)
- Close other applications
- Consider using the GGUF-my-repo online tool instead
Conversion is very slow
This is normal for large models. Expected times:
- 7B model: 2-5 minutes
- 13B model: 5-10 minutes
- 70B model: 30-60 minutes
TypeError or tensor shape errors
Solution:
Ensure you have the latest version of llama.cpp. Model formats change, and older conversion scripts may not work with newer models.
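For example, from the llama.cpp directory:

```shell
# Pull the latest conversion code and refresh Python dependencies
git pull
pip install -r requirements.txt --upgrade
```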
After Conversion
Once you have a GGUF file, you can:
- Use it directly if the F16 size is acceptable
- Quantize it to reduce size (recommended) - see Quantizing Models for details
- Share it on Hugging Face for others to use
Example: Complete Workflow
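Here’s a complete example converting and using a model, as a sketch (the repository name and file names are illustrative placeholders):

```shell
# 1. Download the original model
huggingface-cli download org-name/model-name --local-dir ./model-directory

# 2. Convert to GGUF (f16)
python convert_hf_to_gguf.py ./model-directory --outfile model-f16.gguf

# 3. Quantize to reduce size (Q4_K_M is a common choice)
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# 4. Run inference on the quantized model
./llama-cli -m model-q4_k_m.gguf -p "Hello, world"
```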
Next Steps
- Learn about Quantizing Models to reduce model size
- See Supported Models for architecture compatibility
- Read about Obtaining Models to find pre-converted GGUF files

