Get up and running with llama.cpp quickly. This guide walks you through installation, downloading a model, and running your first inference.

Installation

Step 1: Install llama.cpp

Choose your preferred installation method. On macOS or Linux, the quickest route is Homebrew:
brew install llama.cpp
For GPU acceleration, custom builds, or other installation options, see the Installation Guide.

Step 2: Verify installation

Check that llama.cpp is installed correctly:
llama-cli --version
You should see version information displayed in the output.

Step 3: Download a model

llama.cpp works with models in GGUF, its native file format. Many popular models are available pre-converted and pre-quantized on Hugging Face.
Option 1: Download automatically during inference
llama.cpp can download models directly from Hugging Face when you run inference:
# llama.cpp will download the model automatically
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
Option 2: Download manually
Visit Hugging Face’s GGUF models page and download your preferred model. Look for files with the .gguf extension, typically named with a quantization level like Q4_0 or Q8_0.

Step 4: Run your first inference

Now you’re ready to run inference. Start an interactive conversation with the model:
# Using a downloaded model
llama-cli -m path/to/model.gguf

# Or download from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
The CLI will enter conversation mode automatically for chat-tuned models. Type your messages and press Enter to interact.
Example conversation:
> hi, who are you?
Hi there! I'm your helpful assistant! I'm an AI-powered chatbot 
designed to assist and provide information to users like you.

> what is 1+1?
Easy peasy! The answer to 1+1 is... 2!

Common Use Cases

Text Generation

Generate creative content, complete prompts, or continue text:
llama-cli -m model.gguf -p "Once upon a time" -n 256

Conversation Mode

Chat interactively with AI models:
llama-cli -m model.gguf -cnv

API Server

Host models as an OpenAI-compatible API:
llama-server -m model.gguf --port 8080
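Once the server is running, any OpenAI-compatible client can talk to it. As a sketch, assuming the server is listening locally on port 8080 and using the standard chat-completions endpoint, a request might look like:

```shell
# Hypothetical request to a locally running llama-server
# (assumes the /v1/chat/completions endpoint on port 8080).
PAYLOAD='{"messages": [{"role": "user", "content": "What is 1+1?"}]}'
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

The response follows the OpenAI chat-completions schema, so existing client libraries can usually be pointed at the local server by overriding their base URL.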

JSON Output

Constrain output to valid JSON:
llama-cli -m model.gguf \
  --grammar-file grammars/json.gbnf \
  -p "Request: schedule a call at 8pm; Command:"
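Grammar files use llama.cpp's GBNF notation. As an illustration only (the bundled grammars/json.gbnf is more complete), a minimal grammar that forces the output to be a one-key JSON object might look like:

```
# Minimal illustrative GBNF grammar: output must be {"answer": "..."}
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 ]* "\""
ws     ::= [ \t\n]*
```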

GPU Acceleration

For significantly faster inference, enable GPU acceleration:
# Use the --n-gpu-layers flag to offload layers to GPU
llama-cli -m model.gguf -ngl 99
The -ngl (or --n-gpu-layers) flag specifies how many model layers to offload to the GPU. Using -ngl 99 typically offloads all layers for most models.
GPU acceleration requires building llama.cpp from source with the appropriate backend enabled. Pre-built binaries from package managers typically only include CPU support. See the Installation Guide for details.

Performance Tips

Model quantization affects both speed and quality:
  • Q4_0: Fast, smallest size, lower quality
  • Q5_1: Balanced speed and quality
  • Q8_0: Slower, higher quality, larger size
Start with Q4_0 for testing, then move to a higher-precision quantization such as Q8_0 if the quality isn’t sufficient.
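The file-size impact of quantization is easy to estimate: a quantized model takes roughly parameter count × bits per weight / 8 bytes. As a rough sketch (the ~4.5 and ~8.5 bits per weight are approximations that include each block's scale factor, and the 1-billion-parameter model is a hypothetical example):

```shell
# Rough GGUF size estimate: bytes ≈ parameter_count × bits_per_weight / 8.
# Approximate bits per weight including per-block scales: Q4_0 ≈ 4.5, Q8_0 ≈ 8.5.
for entry in "Q4_0 4.5" "Q8_0 8.5"; do
  set -- $entry
  awk -v name="$1" -v bits="$2" \
    'BEGIN { printf "%s: ~%.2f GB for a 1B-parameter model\n", name, 1e9 * bits / 8 / 1e9 }'
done
```

Real files come out somewhat larger because embeddings and certain tensors are stored at higher precision, but the estimate is good enough for judging whether a given quantization will fit in your RAM.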
The context window affects memory usage:
# Increase the context size for longer conversations
# (a larger context uses more memory)
llama-cli -m model.gguf -c 4096
Specify the number of CPU threads to use:
# Use 8 threads for faster CPU inference
llama-cli -m model.gguf -t 8
Handle multiple users simultaneously:
# Support 4 concurrent requests, each with 4096 max context
llama-server -m model.gguf -c 16384 -np 4

Next Steps

Installation

Learn about all installation methods, including building from source with GPU support

CLI Reference

Explore all available command-line options and flags

Server API

Set up and use the OpenAI-compatible HTTP server

Model Conversion

Convert and quantize your own models to GGUF format

Troubleshooting

Issue: Error loading model file
Solutions:
  • Verify the model file exists and path is correct
  • Check file permissions
  • Ensure the model is in GGUF format (not PyTorch or safetensors)
  • Re-download the model in case the file is corrupted
Issue: Not enough RAM/VRAM to load the model
Solutions:
  • Use a smaller model or lower quantization (e.g., Q4_0 instead of Q8_0)
  • Reduce context size with -c 512
  • For GPU: Reduce layers offloaded with a lower -ngl value
  • Close other applications to free up memory
Issue: Generation is too slow
Solutions:
  • Build with GPU support and use -ngl 99
  • Use a smaller model
  • Use a lower quantization (Q4_0)
  • Increase CPU threads with -t
  • Check that no other heavy processes are running
Issue: GPU acceleration not working despite using -ngl
Solutions:
  • Verify llama.cpp was built with GPU support (check the build output)
  • Check that GPU drivers are installed and up to date
  • Use the --device flag to explicitly select a GPU
  • Run llama-cli --list-devices to see available devices