## Syntax
## Description

Download a GGUF model from HuggingFace for use with llama.cpp. The command accepts multiple input formats:

- HuggingFace repo (e.g., `bartowski/Llama-3.1-8B-Instruct-GGUF`)
- Search query (e.g., `llama 8b`)
- Known model name (e.g., `llama-3.1-8b-instruct`)
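The page does not say how the three input formats are told apart. A plausible heuristic, written here purely as an illustration (the function name, the known-model table, and the dispatch order are all assumptions, not llmfit's actual logic), is:

```python
def classify_input(arg: str, known_models: set[str]) -> str:
    """Guess which input format a model argument uses.

    Hypothetical heuristic for illustration only; llmfit's real
    dispatch logic may differ.
    """
    if "/" in arg:                   # "owner/repo" looks like a HuggingFace repo
        return "repo"
    if arg.lower() in known_models:  # exact match against a known-model table
        return "known-model"
    return "search-query"            # anything else is treated as a search

# The three examples from the description above:
print(classify_input("bartowski/Llama-3.1-8B-Instruct-GGUF", {"llama-3.1-8b-instruct"}))  # repo
print(classify_input("llama 8b", {"llama-3.1-8b-instruct"}))                              # search-query
print(classify_input("llama-3.1-8b-instruct", {"llama-3.1-8b-instruct"}))                 # known-model
```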
## Arguments

- Model to download. Can be a HuggingFace repo, search query, or known model name.
## Options

- Specific GGUF quantization to download (e.g., `Q4_K_M`, `Q8_0`). If omitted, selects the best quantization that fits your hardware based on available VRAM/RAM.
- Maximum memory budget in GB for quantization selection. Useful for constraining downloads to models that fit within a specific memory limit.
- `--list` — List available GGUF files in the repository without downloading. Useful for exploring quantization options before committing to a download.
- Override GPU VRAM size (e.g., `32G`, `32000M`, `1.5T`). Global flag for hardware detection override.
- Cap context length used for memory estimation (tokens). Global flag.
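The VRAM override accepts human-readable sizes. A minimal parser sketch, under the assumption that `M`/`G`/`T` denote decimal megabytes/gigabytes/terabytes and a bare number means gigabytes (the page does not spell out these semantics):

```python
def parse_size_gb(value: str) -> float:
    """Parse a size string like '32G', '32000M', or '1.5T' into gigabytes.

    Assumed semantics for illustration: M/G/T suffixes are decimal
    megabytes/gigabytes/terabytes; a bare number is taken as GB.
    """
    units = {"M": 1e-3, "G": 1.0, "T": 1e3}
    value = value.strip().upper()
    if value and value[-1] in units:
        return float(value[:-1]) * units[value[-1]]
    return float(value)

print(parse_size_gb("32G"))     # 32.0
print(parse_size_gb("32000M"))  # 32.0
print(parse_size_gb("1.5T"))    # 1500.0
```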
## Usage Examples

- Download with automatic quantization selection
- Download specific quantization
- List available quantizations
- Download with memory budget
- Download with VRAM override
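The exact command lines for these scenarios were lost from this page. As a hedged sketch only: the subcommand name (`download` here) and every flag except `--list` are guesses, and the real spellings may differ.

```shell
# Hypothetical invocations -- the "download" subcommand and the
# --quant/--budget/--vram flag names are assumptions, not confirmed
# llmfit syntax; only --list appears elsewhere on this page.

# Download with automatic quantization selection
llmfit download bartowski/Llama-3.1-8B-Instruct-GGUF

# Download a specific quantization (hypothetical --quant flag)
llmfit download bartowski/Llama-3.1-8B-Instruct-GGUF --quant Q4_K_M

# List available quantizations without downloading
llmfit download bartowski/Llama-3.1-8B-Instruct-GGUF --list

# Constrain selection to a memory budget (hypothetical --budget flag)
llmfit download llama-3.1-8b-instruct --budget 8

# Override detected VRAM (hypothetical --vram flag)
llmfit download llama-3.1-8b-instruct --vram 32G
```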
## Example Output
## Notes

- Downloads are cached in `~/.cache/llama.cpp/` to avoid re-downloading
- If `llama-cli` or `llama-server` is not installed, llmfit will still download the model but won’t be able to run it via the `run` command
- The `--list` flag is useful for exploring available quantizations before committing to a large download
- Quantization selection considers both VRAM (for GPU inference) and RAM (for CPU fallback)
## Related Commands

- `llmfit run` - Run a downloaded GGUF model
- `llmfit hf-search` - Search HuggingFace for GGUF models
- `llmfit fit` - Find models that fit your hardware
