Installation
Install llama.cpp
Choose your preferred installation method:
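For example, on macOS or Linux you can install via Homebrew (one of several options; prebuilt binaries are also published on the project's GitHub releases page):

```shell
# Install llama.cpp via Homebrew (macOS/Linux).
# This provides the llama-cli and llama-server binaries.
brew install llama.cpp
```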
For GPU acceleration, custom builds, or other installation options, see the Installation Guide.
Verify installation
Check that llama.cpp is installed correctly; you should see version information in the output.
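One way to check, assuming the binaries are on your PATH:

```shell
# Print build/version information for the installed binary.
llama-cli --version
```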
Download a model
llama.cpp works with models in GGUF format. You can download pre-quantized models directly from Hugging Face.

GGUF is the native format for llama.cpp. Many popular models are available pre-converted on Hugging Face.

Option 1: Download automatically during inference

llama.cpp can download models directly from Hugging Face when you run inference.

Option 2: Download manually

Visit Hugging Face's GGUF models and download your preferred model. Look for files with the `.gguf` extension, typically with quantization levels like Q4_0 or Q8_0.

Run your first inference
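As a concrete starting point, the `-hf` flag combines the automatic-download route with running inference in one step (the repository name below is just an illustrative example):

```shell
# Download (and cache) a GGUF model from Hugging Face, then run it.
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```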
Now you’re ready to run inference! Here are the two main ways to use llama.cpp:
- Interactive CLI
- OpenAI-Compatible Server
Start an interactive conversation with the model. The CLI enters conversation mode automatically for chat-tuned models; type your messages and press Enter to interact.
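A minimal invocation might look like this (the model path is a placeholder for your downloaded GGUF file):

```shell
# Start an interactive chat session with a local model.
llama-cli -m ./models/model.gguf
```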
Common Use Cases
Text Generation
Generate creative content, complete prompts, or continue text:
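A typical one-shot generation might look like this (model path and prompt are placeholders):

```shell
# Complete a prompt, limiting output to 128 new tokens with -n.
llama-cli -m ./models/model.gguf -p "Once upon a time" -n 128
```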
Conversation Mode
Chat interactively with AI models:
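For example, conversation mode can be forced explicitly with the `-cnv` flag (chat-tuned models usually enter it automatically):

```shell
# Start an interactive conversation, applying the model's chat template.
llama-cli -m ./models/model.gguf -cnv
```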
API Server
Host models as an OpenAI-compatible API:
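A sketch of serving a model and querying it with the OpenAI-style chat endpoint (port and model path are placeholders):

```shell
# Start an OpenAI-compatible HTTP server on port 8080.
llama-server -m ./models/model.gguf --port 8080

# In another terminal: query the chat completions endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```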
JSON Output
Constrain output to valid JSON:
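One way to do this is the `--json-schema` flag, which constrains sampling so the output conforms to a JSON Schema (the schema below is a minimal example):

```shell
# Force the output to be a valid JSON object.
llama-cli -m ./models/model.gguf \
  -p "Describe a cat as JSON." \
  --json-schema '{"type": "object"}'
```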
GPU Acceleration
For significantly faster inference, enable GPU acceleration:
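For example (requires a GPU-enabled build; the model path is a placeholder):

```shell
# Offload all model layers to the GPU with -ngl 99.
llama-cli -m ./models/model.gguf -ngl 99
```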
The `-ngl` (or `--n-gpu-layers`) flag specifies how many model layers to offload to the GPU. Using `-ngl 99` typically offloads all layers for most models.

Performance Tips
Choose the right quantization
Model quantization affects both speed and quality:
- Q4_0: Fast, smallest size, lower quality
- Q5_1: Balanced speed and quality
- Q8_0: Slower, higher quality, larger size
Adjust context size
The context window affects memory usage:
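For example (context size in tokens; larger windows use more memory):

```shell
# Set the context window to 4096 tokens with -c.
llama-cli -m ./models/model.gguf -c 4096
```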
Use multiple threads
Specify the number of CPU threads to use:
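For example:

```shell
# Use 8 CPU threads for generation with -t.
llama-cli -m ./models/model.gguf -t 8
```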
Enable parallel requests (server)
Handle multiple users simultaneously:
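A sketch of a server configured for parallel slots (note that the context size is shared across slots, so it is raised here as well):

```shell
# Serve up to 4 concurrent requests with -np; each slot gets 8192/4 tokens of context.
llama-server -m ./models/model.gguf -np 4 -c 8192
```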
Next Steps
- Installation: Learn about all installation methods, including building from source with GPU support
- CLI Reference: Explore all available command-line options and flags
- Server API: Set up and use the OpenAI-compatible HTTP server
- Model Conversion: Convert and quantize your own models to GGUF format
Troubleshooting
Model fails to load
Issue: Error loading model file

Solutions:
- Verify the model file exists and path is correct
- Check file permissions
- Ensure the model is in GGUF format (not PyTorch or safetensors)
- Re-download the model in case the file is corrupted
Out of memory errors
Issue: Not enough RAM/VRAM to load the model

Solutions:
- Use a smaller model or lower quantization (e.g., Q4_0 instead of Q8_0)
- Reduce context size with `-c 512`
- For GPU: reduce the number of offloaded layers with a lower `-ngl` value
- Close other applications to free up memory
Slow inference speed
Issue: Generation is too slow

Solutions:
- Build with GPU support and use `-ngl 99`
- Use a smaller model
- Use a lower quantization (e.g., Q4_0)
- Increase CPU threads with `-t`
- Check that no other heavy processes are running
GPU not being used
Issue: GPU acceleration not working despite using `-ngl`

Solutions:
- Verify llama.cpp was built with GPU support (check the build output)
- Check that GPU drivers are installed and up to date
- Try the `--device` flag to explicitly select a GPU
- Run `llama-cli --list-devices` to see available devices

