llama.cpp requires models to be in GGUF format. If you have a model in PyTorch, SafeTensors, or another format, you’ll need to convert it first.

Overview

The conversion process transforms model weights and metadata from Hugging Face format (or other formats) into the GGUF format used by llama.cpp.
When to convert:
  • You have a model in PyTorch (.bin, .pt) or SafeTensors (.safetensors) format
  • You want to use a model from Hugging Face that isn’t available in GGUF
  • You’ve fine-tuned a model and need to convert it for inference
When to skip:
  • The model is already available in GGUF format on Hugging Face
  • You can use a pre-converted version

Quick Start

The main conversion script is convert_hf_to_gguf.py:
# Install Python dependencies
python3 -m pip install -r requirements.txt

# Convert a Hugging Face model
python3 convert_hf_to_gguf.py /path/to/model/

# Output will be: /path/to/model/ggml-model-f16.gguf

Step-by-Step Conversion Process

Step 1: Obtain the Model

First, download the model in its original format from Hugging Face or another source.
# Using git LFS
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B

# Or use huggingface-cli
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./models/llama-3.1-8b
You should see files like:
  • config.json
  • tokenizer.json / tokenizer.model
  • model-*.safetensors or pytorch_model-*.bin
Step 2: Install Dependencies

Install the required Python packages:
cd llama.cpp
python3 -m pip install -r requirements.txt
Key dependencies:
  • torch - PyTorch for loading model weights
  • transformers - Hugging Face transformers library
  • numpy - Numerical operations
  • gguf - GGUF format library
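If conversion later fails with an import error, you can quickly confirm which of these packages are actually importable. A small sketch (`missing_packages` is an illustrative helper; the package list mirrors the dependencies above):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Core converter dependencies, per requirements.txt
required = ["torch", "transformers", "numpy", "gguf"]
```

Calling `missing_packages(required)` should return an empty list on a correctly set-up environment.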
Step 3: Run Conversion

Convert the model to GGUF format:
python3 convert_hf_to_gguf.py ./models/llama-3.1-8b/
The script will:
  1. Load the model configuration
  2. Read model weights
  3. Convert tensors to GGUF format
  4. Save the output file
This may take several minutes depending on model size.
Step 4: Verify Conversion

Test the converted model:
./llama-cli -m ./models/llama-3.1-8b/ggml-model-f16.gguf -p "Hello" -n 20
If the model generates coherent text, the conversion was successful.

Conversion Script Reference

convert_hf_to_gguf.py

The primary conversion script for Hugging Face models.
python3 convert_hf_to_gguf.py [options] <model_directory>

Positional arguments:
  model_directory       Path to the model directory (contains config.json)

Options:
  --vocab-only          Extract only the vocabulary/tokenizer
  --outfile FILE        Output file path (default: ggml-model-f16.gguf)
  --outtype TYPE        Output data type: f32, f16, bf16 (default: f16)
  --bigendian           Use big-endian format (default: little-endian)
  --model-name NAME     Model name to embed in metadata
  --verbose             Increase verbosity
  --help                Show help message
Output types:
  • f16 (default): 16-bit floating point - good balance of size and quality
  • f32: 32-bit floating point - full precision, largest file
  • bf16: BFloat16 - alternative 16-bit format, same size as f16
For most users, f16 is the best choice as it maintains quality while reducing file size by ~50% compared to f32.
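The size difference follows directly from bytes per weight. A rough back-of-the-envelope estimate (the helper name is illustrative; tokenizer and metadata overhead is ignored as it is small relative to the weights):

```python
def estimate_file_size_gb(n_params: float, outtype: str = "f16") -> float:
    """Rough GGUF size estimate (GiB) from parameter count and output dtype."""
    bytes_per_weight = {"f32": 4.0, "f16": 2.0, "bf16": 2.0}[outtype]
    return n_params * bytes_per_weight / 1024**3
```

For an 8B-parameter model this gives roughly 15 GiB at f16 and twice that at f32.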

Other Conversion Scripts

convert_lora_to_gguf.py

Convert LoRA (Low-Rank Adaptation) adapters to GGUF format:
python3 convert_lora_to_gguf.py ./path/to/lora/
Useful for fine-tuned models using the LoRA technique. See the GGUF-my-LoRA space for online conversion.
convert_llama_ggml_to_gguf.py

Convert models in the old GGML format to the current GGUF format:
python3 convert_llama_ggml_to_gguf.py ./old-model.ggml
Only needed for very old llama.cpp models from before the GGUF format was introduced.

Supported Model Architectures

The conversion script automatically detects the model architecture from config.json. Supported architectures include:
- LLaMA (meta-llama/Llama-*)
- LLaMA 2 (meta-llama/Llama-2-*)
- LLaMA 3 (meta-llama/Llama-3-*)
- Code Llama variants
For a complete list, see Supported Models.

Advanced Conversion

Converting from ModelScope

Models from ModelScope can be converted the same way:
# Download from ModelScope
modelscope download --model <model_id> --local_dir ./models/model-name

# Convert as normal
python3 convert_hf_to_gguf.py ./models/model-name/

Vocabulary-Only Conversion

For testing tokenizers or when you only need vocabulary:
python3 convert_hf_to_gguf.py ./model/ --vocab-only --outfile vocab.gguf
This creates a much smaller file containing only the tokenizer information.

Custom Metadata

Embed custom metadata during conversion:
python3 convert_hf_to_gguf.py ./model/ --model-name "My Custom Model v1.2"
The embedded metadata is printed when the model is loaded, and can also be inspected with the gguf-dump tool shipped with the gguf Python package.

Online Conversion Tools

If you prefer not to set up a local environment, use these Hugging Face spaces:
GGUF-my-repo - Official converter and quantizer

Features:
  • Convert any Hugging Face model to GGUF
  • Automatically quantize to multiple formats
  • No local setup required
  • Results published to your Hugging Face account
How to use:
  1. Visit the space
  2. Enter the model repository name
  3. Select quantization options
  4. Click “Submit”
  5. Download the resulting GGUF files
The space is synced from llama.cpp main branch every 6 hours, so it uses recent conversion code.
GGUF-my-LoRA - Convert LoRA adapters

Specialized tool for converting LoRA fine-tuned models. See the discussion for details.

Troubleshooting

Missing Python dependencies

Solution: Install the requirements:
python3 -m pip install -r requirements.txt
Unsupported model architecture

Symptoms:
Error: Unknown model architecture
Solutions:
  1. Check if your model architecture is supported in Supported Models
  2. Update llama.cpp to the latest version
  3. If it’s a new architecture, it may not be supported yet
For adding new model support, see HOWTO-add-model.md.
Out of memory during conversion

Solution: The conversion process loads the entire model into memory. For large models (70B+):
  • Use a machine with sufficient RAM (at least 2x the model size)
  • Close other applications
  • Consider using the GGUF-my-repo online tool instead
Conversion takes a long time

This is normal for large models. Expected times:
  • 7B model: 2-5 minutes
  • 13B model: 5-10 minutes
  • 70B model: 30-60 minutes
The script shows progress as it processes tensors.
Conversion fails on a newer model

Solution: Ensure you have the latest version of llama.cpp:
git pull origin master
Model formats change, and older conversion scripts may not work with newer models.

After Conversion

Once you have a GGUF file, you can:
  1. Use it directly if the F16 size is acceptable:
    ./llama-cli -m model.gguf
    
  2. Quantize it to reduce size (recommended):
    ./llama-quantize model.gguf model-q4.gguf Q4_K_M
    
    See Quantizing Models for details.
  3. Share it on Hugging Face for others to use
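To gauge what quantization buys you, estimate the quantized size from bits per weight. This sketch treats Q4_K_M as roughly 4.8 bits per weight, which is an approximate figure (actual ratios vary by model); the helper name is illustrative:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float = 4.8) -> float:
    """Approximate quantized GGUF size (GiB) from parameter count."""
    return n_params * bits_per_weight / 8 / 1024**3
```

Against f16's 16 bits per weight, a ~4.8-bit quantization shrinks an 8B model from roughly 15 GiB to under 5 GiB, about a 70% reduction.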

Example: Complete Workflow

Here’s a complete example converting and using a model:
# 1. Download the model
huggingface-cli download meta-llama/Llama-3.1-8B \
  --local-dir ./models/llama-3.1-8b

# 2. Install dependencies
cd llama.cpp
python3 -m pip install -r requirements.txt

# 3. Convert to GGUF
python3 convert_hf_to_gguf.py ../models/llama-3.1-8b/

# 4. Test the model
./llama-cli -m ../models/llama-3.1-8b/ggml-model-f16.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 100

Next Steps