Get up and running with llama.cpp quickly. This guide walks you through installation, downloading a model, and running your first inference.

Installation

Step 1: Install llama.cpp

Choose your preferred installation method. On macOS or Linux, the quickest route is Homebrew:
brew install llama.cpp
For GPU acceleration, custom builds, or other installation options, see the Installation Guide.

Step 2: Verify installation

Check that llama.cpp is installed correctly:
llama-cli --version
You should see version information displayed in the output.

Step 3: Download a model

llama.cpp works with models in GGUF, its native file format. Many popular models are available pre-converted and pre-quantized on Hugging Face.
Option 1: Download automatically during inference
llama.cpp can download models directly from Hugging Face when you run inference:
# llama.cpp will download the model automatically
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
Option 2: Download manually
Visit Hugging Face’s GGUF models page and download your preferred model. Look for files with the .gguf extension, typically named with a quantization level like Q4_0 or Q8_0.

Step 4: Run your first inference

Now you’re ready to run inference. Start an interactive conversation with the model:
# Using a downloaded model
llama-cli -m path/to/model.gguf

# Or download from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
The CLI will enter conversation mode automatically for chat-tuned models. Type your messages and press Enter to interact.
Example conversation:
> hi, who are you?
Hi there! I'm your helpful assistant! I'm an AI-powered chatbot 
designed to assist and provide information to users like you.

> what is 1+1?
Easy peasy! The answer to 1+1 is... 2!

Common Use Cases

Text Generation

Generate creative content, complete prompts, or continue text:
llama-cli -m model.gguf -p "Once upon a time" -n 256

Conversation Mode

Chat interactively with AI models:
llama-cli -m model.gguf -cnv

API Server

Host models as an OpenAI-compatible API:
llama-server -m model.gguf --port 8080
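Once the server is running, any OpenAI-compatible client can talk to it. As a sketch, assuming the server is listening locally on port 8080 and using the standard chat-completions endpoint, a request might look like:

```shell
# Hypothetical request to a locally running llama-server
# (assumes the /v1/chat/completions endpoint on port 8080).
PAYLOAD='{"messages": [{"role": "user", "content": "What is 1+1?"}]}'
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

The response follows the OpenAI chat-completions schema, so existing client libraries can usually be pointed at the local server by overriding their base URL.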

JSON Output

Constrain output to valid JSON:
llama-cli -m model.gguf \
  --grammar-file grammars/json.gbnf \
  -p "Request: schedule a call at 8pm; Command:"
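Grammar files use llama.cpp's GBNF notation. As an illustration only (the bundled grammars/json.gbnf is more complete), a minimal grammar that forces the output to be a one-key JSON object might look like:

```
# Minimal illustrative GBNF grammar: output must be {"answer": "..."}
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 ]* "\""
ws     ::= [ \t\n]*
```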

GPU Acceleration

For significantly faster inference, enable GPU acceleration:
# Use the --n-gpu-layers flag to offload layers to GPU
llama-cli -m model.gguf -ngl 99
The -ngl (or --n-gpu-layers) flag specifies how many model layers to offload to the GPU. Using -ngl 99 typically offloads all layers for most models.
GPU acceleration requires building llama.cpp from source with the appropriate backend enabled. Pre-built binaries from package managers typically only include CPU support. See the Installation Guide for details.

Performance Tips

Model quantization affects both speed and quality:
  • Q4_0: Fast, smallest size, lower quality
  • Q5_1: Balanced speed and quality
  • Q8_0: Slower, higher quality, larger size
Start with Q4_0 for testing, then move to a higher-precision quantization such as Q8_0 if the quality isn’t sufficient.
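The file-size impact of quantization is easy to estimate: a quantized model takes roughly parameter count × bits per weight / 8 bytes. As a rough sketch (the ~4.5 and ~8.5 bits per weight are approximations that include each block's scale factor, and the 1-billion-parameter model is a hypothetical example):

```shell
# Rough GGUF size estimate: bytes ≈ parameter_count × bits_per_weight / 8.
# Approximate bits per weight including per-block scales: Q4_0 ≈ 4.5, Q8_0 ≈ 8.5.
for entry in "Q4_0 4.5" "Q8_0 8.5"; do
  set -- $entry
  awk -v name="$1" -v bits="$2" \
    'BEGIN { printf "%s: ~%.2f GB for a 1B-parameter model\n", name, 1e9 * bits / 8 / 1e9 }'
done
```

Real files come out somewhat larger because embeddings and certain tensors are stored at higher precision, but the estimate is good enough for judging whether a given quantization will fit in your RAM.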
The context window affects memory usage:
# Increase the context size for longer conversations
# (a larger context uses more memory)
llama-cli -m model.gguf -c 4096
Specify the number of CPU threads to use:
# Use 8 threads for faster CPU inference
llama-cli -m model.gguf -t 8
Handle multiple users simultaneously:
# Support 4 concurrent requests, each with 4096 max context
llama-server -m model.gguf -c 16384 -np 4

Next Steps

Installation

Learn about all installation methods, including building from source with GPU support

CLI Reference

Explore all available command-line options and flags

Server API

Set up and use the OpenAI-compatible HTTP server

Model Conversion

Convert and quantize your own models to GGUF format

Troubleshooting

Issue: Error loading model file
Solutions:
  • Verify the model file exists and path is correct
  • Check file permissions
  • Ensure the model is in GGUF format (not PyTorch or safetensors)
  • Re-download the model in case the file is corrupted
Issue: Not enough RAM/VRAM to load the model
Solutions:
  • Use a smaller model or lower quantization (e.g., Q4_0 instead of Q8_0)
  • Reduce context size with -c 512
  • For GPU: Reduce layers offloaded with a lower -ngl value
  • Close other applications to free up memory
Issue: Generation is too slow
Solutions:
  • Build with GPU support and use -ngl 99
  • Use a smaller model
  • Use a lower quantization (Q4_0)
  • Increase CPU threads with -t
  • Check that no other heavy processes are running
Issue: GPU acceleration not working despite using -ngl
Solutions:
  • Verify llama.cpp was built with GPU support (check the build output)
  • Check that GPU drivers are installed and up to date
  • Use the --device flag to explicitly select a GPU
  • Run llama-cli --list-devices to see available devices