There are several ways to obtain models for use with llama.cpp. All models must be in GGUF format to work with llama.cpp.

Quick Start: Direct Download with -hf Flag

The easiest way to use models is with the -hf flag, which automatically downloads models from Hugging Face:
# Download and run a model directly
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Specify a particular quantization
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF:Q4_K_M
The -hf flag downloads models to your local cache and reuses them on subsequent runs. You don’t need to download the model manually.

Finding Models on Hugging Face

Hugging Face hosts thousands of GGUF-format models compatible with llama.cpp:
1. Browse GGUF Models

Visit the GGUF models page to see trending models.
2. Choose a Model

Select a model based on your needs:
  • Size: Smaller models (1B-7B) run on consumer hardware, larger models (70B+) need more resources
  • Quantization: Lower quantization (Q4, Q5) = smaller file, higher (Q8, F16) = better quality
  • Task: Some models are optimized for chat, coding, or specific domains
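The size/quality trade-off above can be made concrete with a back-of-the-envelope estimate: file size is roughly parameter count times bits per weight. The bits-per-weight figures below are approximate averages for each scheme, not exact llama.cpp values:

```python
# Rough GGUF file-size estimate: parameters x bits-per-weight / 8.
# Bits-per-weight values are approximations for each quantization scheme.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Approximate model file size in GB for a given quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8

# A 7B model: roughly 4 GB at Q4_K_M versus 14 GB at full F16 precision.
print(round(estimate_size_gb(7, "Q4_K_M"), 1))  # ~4.2
print(round(estimate_size_gb(7, "F16"), 1))     # 14.0
```

This is why a Q4_K_M download is often a quarter the size of the F16 original with only a modest quality loss.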
3. Use with llama.cpp

Once you’ve found a model, use it with the -hf flag:
llama-cli -hf <username>/<repository>

Manual Download from Hugging Face

If you prefer to download models manually:
1. Navigate to Model Repository

Go to the model’s Hugging Face page (e.g., https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF).
2. Browse Files

Click the “Files and versions” tab to see all available files. Look for .gguf files; these are ready to use with llama.cpp.
3. Download GGUF File

Download the specific quantization you want:
  • *-Q4_K_M.gguf - Good balance (recommended for most users)
  • *-Q5_K_M.gguf - Higher quality
  • *-Q8_0.gguf - Near-original quality
  • *-f16.gguf - Full precision
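The quantization is encoded in the filename suffix, so selecting the right file from a repository listing can be automated. A small sketch, using hypothetical filenames modeled on the naming convention above:

```python
def pick_quant(filenames, preferred=("Q4_K_M", "Q5_K_M", "Q8_0")):
    """Return the first file matching a preferred quantization suffix."""
    for quant in preferred:
        for name in filenames:
            if name.endswith(f"-{quant}.gguf"):
                return name
    return None

# Hypothetical listing from a repository's "Files and versions" tab:
files = [
    "gemma-3-1b-it-Q8_0.gguf",
    "gemma-3-1b-it-Q4_K_M.gguf",
    "gemma-3-1b-it-f16.gguf",
]
print(pick_quant(files))  # gemma-3-1b-it-Q4_K_M.gguf
```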
4. Run Locally

Use the downloaded file with llama.cpp:
llama-cli -m /path/to/model.gguf

Alternative Model Sources

ModelScope (China)

If you are in China or prefer ModelScope, you can switch the download endpoint:
# Set environment variable to use ModelScope
export MODEL_ENDPOINT=https://www.modelscope.cn/

# Then use -hf flag as normal
llama-cli -hf username/model-name
The MODEL_ENDPOINT environment variable tells llama.cpp where to download models from. By default, it uses Hugging Face.

Other Sources

You can also obtain GGUF models from:
  • Ollama Library: Use ollama-dl to download Ollama models for use with llama.cpp
  • Direct conversions: Convert your own models using the conversion tools
  • Research institutions: Some organizations host their own model repositories

Pre-quantized vs Original Models

Original models (download the original weights and convert/quantize them yourself):

Pros:
  • You control the quantization process
  • Can use an importance matrix for better quality
  • Can experiment with different quantization levels
Cons:
  • Requires a conversion step
  • Larger initial download
  • Takes time to quantize
Use when: You need specific quantization settings or want maximum control over quality.

Verifying Model Downloads

1. Check File Size

Ensure the downloaded file size matches the expected size on the model page.
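File size is a coarse check; Hugging Face also shows a SHA-256 hash on each file's page, which you can compare against a local hash. A minimal sketch that streams the file so large GGUF models don't need to fit in memory:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 of a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the hash shown on the model's file page:
# print(sha256_of("model.gguf"))
```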
2. Test the Model

Run a simple test to verify the model works:
llama-cli -m model.gguf -p "Hello, how are you?" -n 50
3. Check Output

Verify the model generates coherent text appropriate to its training.

Model Storage Locations

When using the -hf flag, models are cached locally:
# Default Hugging Face cache location
~/.cache/huggingface/hub/

# Models are stored with their repository structure
~/.cache/huggingface/hub/models--<username>--<repo>/
You can also manually place models anywhere and reference them with the -m flag:
llama-cli -m /custom/path/to/model.gguf
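The cache directory name is derived mechanically from the repository id, so a cached model can be located without searching. A sketch of the naming rule shown above:

```python
from pathlib import Path

def cached_repo_dir(repo_id: str,
                    cache_root: str = "~/.cache/huggingface/hub") -> Path:
    """Map a 'username/repo' id to its Hugging Face hub cache directory."""
    user, repo = repo_id.split("/")
    return Path(cache_root).expanduser() / f"models--{user}--{repo}"

print(cached_repo_dir("ggml-org/gemma-3-1b-it-GGUF"))
```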

Model Selection Guidelines

By Hardware

Hardware                          Recommended Model Size   Quantization
Mobile/Low-end                    1B-3B                    Q4_K_M, Q4_0
Laptop/Desktop (8-16GB RAM)       7B-13B                   Q4_K_M, Q5_K_M
High-end Desktop (32GB+ RAM)      13B-34B                  Q5_K_M, Q6_K
Workstation/Server (64GB+ RAM)    70B+                     Q4_K_M, Q5_K_M
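The table above can be sanity-checked with a rough memory estimate: the quantized weights plus some working overhead (KV cache, buffers) must fit in RAM. The bits-per-weight and overhead figures here are approximations, not llama.cpp's exact accounting:

```python
# Approximate average bits per weight for common quantization schemes.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def fits_in_ram(params_billions: float, quant: str, ram_gb: float,
                overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights + fixed overhead must fit in RAM."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb <= ram_gb

# A 13B model at Q5_K_M needs roughly 9.3 GB of weights, so 16 GB works;
# a 70B model at Q4_K_M (~42 GB of weights) does not.
print(fits_in_ram(13, "Q5_K_M", 16))  # True
print(fits_in_ram(70, "Q4_K_M", 16))  # False
```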

By Use Case

Chat and general assistance:
  • LLaMA 3 Instruct variants
  • Mistral Instruct
  • Gemma IT models
  • Qwen Chat models
Look for models with -Instruct, -IT, or -Chat in the name.

Coding:
  • StarCoder models
  • Granite Code models
  • CodeLlama variants
  • Qwen-Coder
These are specifically trained on code and perform better for programming tasks.

Multilingual:
  • Bloom
  • Qwen (strong Chinese support)
  • mGPT
  • Aya models
Choose based on your target languages.

Next Steps

Once you have obtained a model:
  1. If it’s already in GGUF format, you can use it directly
  2. If it’s in another format, see Converting Models
  3. To reduce size further, see Quantizing Models
  4. Check Supported Models for compatibility information

Troubleshooting

Download fails. Possible causes:
  • Network connectivity issues
  • Invalid repository name
  • Repository is private
Solution: Try a manual download from the Hugging Face website, or check that the repository exists and is public.

Corrupted or incomplete model file. Symptoms:
  • Load errors
  • Unexpected behavior
  • Crashes
Solution: Re-download the model and check that the file size matches the expected size.

Out of memory.
Solution: Use a smaller model or a lower quantization (e.g., Q4_K_M instead of Q8_0).