Quick Start: Direct Download with -hf Flag
The easiest way to use models is with the `-hf` flag, which automatically downloads models from Hugging Face:
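A minimal invocation, assuming llama.cpp is built and `llama-cli` is on your PATH (the repository name is one of the examples used later on this page):

```shell
# First run downloads the model from Hugging Face and caches it;
# later runs reuse the cached file.
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```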
The `-hf` flag downloads models to your local cache and reuses them on subsequent runs. You don’t need to download the model manually.
Finding Models on Hugging Face
Hugging Face hosts thousands of GGUF-format models compatible with llama.cpp.
Browse GGUF Models
Visit the GGUF models page to see trending models.
Popular collections:
- Trending GGUF models
- LLaMA models
- Official llama.cpp models: ggml-org
Choose a Model
Select a model based on your needs:
- Size: Smaller models (1B-7B) run on consumer hardware, larger models (70B+) need more resources
- Quantization: Lower quantization (Q4, Q5) = smaller file, higher (Q8, F16) = better quality
- Task: Some models are optimized for chat, coding, or specific domains
Example Model Repositories
Recommended GGUF Repositories
- ggml-org/gemma-3-1b-it-GGUF - Small, efficient instruction model
- TheBloke repositories - Large collection of quantized models (community contributor)
- Original model authors - Many official models now include GGUF files
- bartowski - Another popular GGUF converter with many models
Manual Download from Hugging Face
If you prefer to download models manually:
Navigate to Model Repository
Go to the model’s Hugging Face page (e.g., https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF).
Browse Files
Click on the “Files and versions” tab to see all available files.
Look for `.gguf` files - these are ready to use with llama.cpp.
Download GGUF File
Download the specific quantization you want:
- `*-Q4_K_M.gguf` - Good balance (recommended for most users)
- `*-Q5_K_M.gguf` - Higher quality
- `*-Q8_0.gguf` - Near-original quality
- `*-f16.gguf` - Full precision
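As one way to do this from the command line, the sketch below uses `huggingface-cli` (part of the `huggingface_hub` package); the exact filename is an assumption, so check the repository’s file list first:

```shell
# Fetch only the Q4_K_M quantization into ./models
huggingface-cli download ggml-org/gemma-3-1b-it-GGUF \
  gemma-3-1b-it-Q4_K_M.gguf --local-dir models
```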
Alternative Model Sources
ModelScope (China)
For users in China, or those who prefer ModelScope, you can switch the model endpoint:
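A run against ModelScope might look like the sketch below (the endpoint URL is illustrative; adjust it to the mirror you use):

```shell
# Download via ModelScope instead of the default Hugging Face endpoint
MODEL_ENDPOINT=https://www.modelscope.cn/ \
  llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```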
The MODEL_ENDPOINT environment variable tells llama.cpp where to download models from. By default, it uses Hugging Face.
Other Sources
You can also obtain GGUF models from:
- Ollama Library: Use ollama-dl to download Ollama models for use with llama.cpp
- Direct conversions: Convert your own models using the conversion tools
- Research institutions: Some organizations host their own model repositories
Pre-quantized vs Original Models
Pre-quantized GGUF Models (Recommended)
Pros:
- Ready to use immediately
- No conversion needed
- Multiple quantization levels available
- Smaller download size
Cons:
- May not have the exact quantization you want
- Depends on someone else doing the quantization
Original Format Models
Pros:
- You control the quantization process
- Can use importance matrix for better quality
- Can experiment with different quantization levels
Cons:
- Requires conversion step
- Larger initial download
- Takes time to quantize
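The convert-then-quantize flow can be sketched as follows, using the `convert_hf_to_gguf.py` script and `llama-quantize` tool that ship with llama.cpp (paths and output filenames are illustrative):

```shell
# 1. Convert an original (e.g., safetensors) model directory to GGUF at F16
python convert_hf_to_gguf.py path/to/original-model --outfile model-f16.gguf

# 2. Quantize the F16 GGUF down to Q4_K_M
llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```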
Verifying Model Downloads
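One lightweight check, sketched below, relies on the fact that every valid GGUF file starts with the 4-byte magic `GGUF`; a truncated or partial download usually fails it. Also compare the file size against the repository listing and, where the repository publishes one, the SHA-256 checksum.

```shell
# Print OK if the file begins with the GGUF magic bytes, BAD otherwise.
check_gguf() {
  if [ "$(head -c 4 "$1")" = "GGUF" ]; then
    echo "OK: $1"
  else
    echo "BAD: $1"
  fi
}

# Usage: check_gguf models/gemma-3-1b-it-Q4_K_M.gguf
```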
Model Storage Locations
When using the `-hf` flag, models are cached locally. You can also load a model from an explicit path with the `-m` flag:
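By default the cache lives under the user cache directory (the path below is an assumption; it can vary by platform and environment configuration):

```shell
# Inspect the download cache (location may differ on your system)
ls ~/.cache/llama.cpp/

# Load a model from an explicit path instead of the cache
llama-cli -m models/gemma-3-1b-it-Q4_K_M.gguf
```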
Model Selection Guidelines
By Hardware
| Hardware | Recommended Model Size | Quantization |
|---|---|---|
| Mobile/Low-end | 1B-3B | Q4_K_M, Q4_0 |
| Laptop/Desktop (8-16GB RAM) | 7B-13B | Q4_K_M, Q5_K_M |
| High-end Desktop (32GB+ RAM) | 13B-34B | Q5_K_M, Q6_K |
| Workstation/Server (64GB+ RAM) | 70B+ | Q4_K_M, Q5_K_M |
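To sanity-check these recommendations against your hardware, a rough rule of thumb is file size ≈ parameters × bits-per-weight ÷ 8, plus overhead for the context. The bits-per-weight figure below is an approximation for Q4_K_M:

```shell
# Rough GGUF file-size estimate: params (billions) * bits-per-weight / 8 -> GB
awk 'BEGIN { params_b = 7; bpw = 5; printf "%.1f GB\n", params_b * bpw / 8 }'
# prints: 4.4 GB
```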
By Use Case
Chat/Assistant
- LLaMA 3 Instruct variants
- Mistral Instruct
- Gemma IT models
- Qwen Chat models
Look for -Instruct, -IT, or -Chat in the name.
Code Generation
- StarCoder models
- Granite Code models
- CodeLlama variants
- Qwen-Coder
Multilingual
- Bloom
- Qwen (strong Chinese support)
- mGPT
- Aya models
Next Steps
Once you have obtained a model:
- If it’s already in GGUF format, you can use it directly
- If it’s in another format, see Converting Models
- To reduce size further, see Quantizing Models
- Check Supported Models for compatibility information
Troubleshooting
Download fails with -hf flag
Possible causes:
- Network connectivity issues
- Invalid repository name
- Repository is private
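A couple of quick diagnostics (the repository URL is illustrative):

```shell
# Can we reach Hugging Face at all?
curl -sI https://huggingface.co | head -n 1

# Does the repository exist and is it public?
# A 404 or 401 suggests a bad name or a gated/private repo.
curl -s -o /dev/null -w "%{http_code}\n" \
  https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF
```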
Model file is corrupted
Symptoms:
- Load errors
- Unexpected behavior
- Crashes
Out of memory when loading
Solution:
Use a smaller model or lower quantization (e.g., Q4_K_M instead of Q8_0).