llama.cpp can be installed through a package manager for quick setup, or built from source for GPU acceleration and custom configurations. Package-manager builds work out of the box but typically include only CPU support.

Homebrew

macOS and Linux
brew install llama.cpp
Automatically updated with new releases.

Winget

Windows
winget install llama.cpp
Automatically updated with new releases.

MacPorts

macOS
sudo port install llama.cpp

Nix

macOS and Linux
# Flake-enabled
nix profile install nixpkgs#llama-cpp

# Traditional
nix-env --file '<nixpkgs>' --install --attr llama-cpp
Package manager installations are ideal for getting started quickly. For GPU acceleration, you’ll need to build from source.

Docker Images

Pre-built Docker images are published with and without GPU support. The three base (CPU) variants are:
# Full image: includes CLI, completion, and conversion tools
docker pull ghcr.io/ggml-org/llama.cpp:full

# Light image: CLI and completion only
docker pull ghcr.io/ggml-org/llama.cpp:light

# Server image: HTTP server only
docker pull ghcr.io/ggml-org/llama.cpp:server
Example usage:
# Run inference
docker run -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:light \
  -m /models/model.gguf -p "Hello world"

# Start server
docker run -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/model.gguf --port 8080 --host 0.0.0.0

Pre-Built Binaries

Download pre-compiled binaries directly from GitHub:
  1. Visit the releases page
  2. Download the appropriate binary for your platform
  3. Extract and add to your PATH
Pre-built binaries may not include all GPU backends. For full GPU support, build from source.
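The steps above can be sketched for a Linux/macOS shell as follows. The archive name and install directory are placeholders; pick the actual asset for your platform from the releases page:

```shell
# 1. Download the asset for your platform from the releases page, e.g.:
#    curl -LO https://github.com/ggml-org/llama.cpp/releases/latest/download/<asset>.zip
# 2. Extract it somewhere stable (the directory name here is a placeholder):
mkdir -p "$HOME/llama.cpp-bin"
#    unzip <asset>.zip -d "$HOME/llama.cpp-bin"
# 3. Add that directory to PATH for the current shell:
export PATH="$HOME/llama.cpp-bin:$PATH"
echo "$PATH"
```

Add the `export PATH=...` line to your shell profile (e.g. `~/.bashrc`) to make it permanent.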

Building from Source

Building from source enables GPU acceleration and custom configurations.
Step 1: Get the source code

Clone the repository:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Step 2: Choose your build configuration

Select the appropriate build for your hardware:
Basic CPU build with no dependencies:
cmake -B build
cmake --build build --config Release
For faster compilation:
# Use multiple jobs
cmake --build build --config Release -j 8

# Or use Ninja generator
cmake -B build -G Ninja
cmake --build build --config Release
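Rather than hard-coding a job count, you can derive it from the machine's core count. A portable sketch (`nproc` is Linux, `sysctl -n hw.ncpu` is macOS):

```shell
# Detect available CPU cores; fall back to 4 if neither tool exists
JOBS=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
echo "building with $JOBS jobs"
# Then pass it to the build:
# cmake --build build --config Release -j "$JOBS"
```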
Enable BLAS acceleration for faster prompt processing:
# Install OpenBLAS first
# Ubuntu/Debian: sudo apt-get install libopenblas-dev
# Fedora: sudo dnf install openblas-devel
# macOS: brew install openblas

cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
Step 3: Install (optional)

Install the binaries to your system:
# Install to /usr/local/bin
sudo cmake --install build

# Or install to a custom prefix
cmake --install build --prefix ~/.local
Or use directly from the build directory:
./build/bin/llama-cli -m model.gguf
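If you install to a custom prefix such as ~/.local, make sure its bin directory is on your PATH; otherwise the shell won't find the binaries:

```shell
# Binaries from `cmake --install build --prefix ~/.local` land in ~/.local/bin
export PATH="$HOME/.local/bin:$PATH"
# Show where the shell will now resolve llama-cli from (empty until installed)
command -v llama-cli || echo "llama-cli not on PATH yet"
```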
Step 4: Verify GPU support

Check that GPU acceleration is working:
# List available devices
./build/bin/llama-cli --list-devices

# Run with GPU
./build/bin/llama-cli -m model.gguf -ngl 99
You should see output indicating GPU layers are loaded:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: offloading 32 layers to GPU

Platform-Specific Instructions

Windows

Prerequisites:
  • Visual Studio 2022 with C++ development tools
  • CMake (included with VS)
Build:
# Open Developer Command Prompt for VS 2022
cmake -B build
cmake --build build --config Release
For ARM64 Windows:
cmake --preset arm64-windows-llvm-release -DGGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
Android

Building for Android requires the NDK. See the Android build guide for complete instructions. Quick example with OpenCL:
# Run from an empty build directory inside the cloned repo (note the `..`)
cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON
ninja
For portable binaries without shared library dependencies:
cmake -B build \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_POSITION_INDEPENDENT_CODE=ON
cmake --build build --config Release
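You can confirm the result with `ldd` on Linux (or `otool -L` on macOS); a static build lists few or no shared dependencies. Shown here on /bin/sh as a stand-in, since the real target is the binary you just built:

```shell
# List dynamic dependencies; substitute ./build/bin/llama-cli after building
ldd /bin/sh 2>/dev/null || otool -L /bin/sh
```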
Debug builds with single-config generators (Make, Ninja):
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
Multi-config generators (Visual Studio, Xcode):
cmake -B build -G "Xcode"
cmake --build build --config Debug

Advanced Build Options

ccache for faster rebuilds:
# Install ccache
# Ubuntu/Debian: sudo apt-get install ccache
# macOS: brew install ccache

# CMake will detect and use it automatically
cmake -B build
cmake --build build --config Release
Intel oneMKL for better CPU performance:
source /opt/intel/oneapi/setvars.sh
cmake -B build \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=Intel10_64lp \
  -DCMAKE_C_COMPILER=icx \
  -DCMAKE_CXX_COMPILER=icpx \
  -DGGML_NATIVE=ON
cmake --build build --config Release

Verifying Installation

Test your installation:
# Check version
llama-cli --version

# List available compute devices
llama-cli --list-devices

# Run a simple test
llama-cli -m path/to/model.gguf -p "Test prompt" -n 10
Expected output should show:
  • Version information
  • Detected backends (CUDA, Metal, etc.)
  • Model loading messages
  • Generated text
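The first check, whether the binary is resolvable at all, can be scripted; a purely illustrative snippet:

```shell
# Report whether llama-cli is on PATH before running the deeper checks
if command -v llama-cli >/dev/null 2>&1; then
  echo "llama-cli found at $(command -v llama-cli)"
else
  echo "llama-cli not found on PATH"
fi
```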

Troubleshooting

Issue: Could NOT find CUDA
Solutions:
# Verify CUDA is installed
nvcc --version

# Set CUDA path explicitly
export CUDA_PATH=/usr/local/cuda
cmake -B build -DGGML_CUDA=ON
Issue: cannot find ROCm device library
Solution:
# Find the directory containing oclc_abi_version_400.bc
find $HIP_PATH -name "oclc_abi_version_400.bc"

# Set HIP_DEVICE_LIB_PATH
export HIP_DEVICE_LIB_PATH=/path/to/found/directory
cmake -B build -DGGML_HIP=ON
Issue: CMake can’t find Vulkan
Solutions:
# Make sure to source the setup script
source /path/to/vulkan-sdk/setup-env.sh

# Verify Vulkan is available
vulkaninfo

# Rebuild
cmake -B build -DGGML_VULKAN=ON
Issue: Builds are slow
Solutions:
# Use parallel jobs
cmake --build build --config Release -j $(nproc)

# Or use Ninja (faster than Make)
cmake -B build -G Ninja
cmake --build build

# Install ccache
sudo apt-get install ccache  # Ubuntu/Debian
brew install ccache          # macOS
Issue: GPU detected but errors during inference
Solutions:
  • Update GPU drivers to the latest version
  • Try reducing layers offloaded: -ngl 30 instead of -ngl 99
  • Check VRAM usage: ensure model fits in available memory
  • For CUDA: try setting CUDA_VISIBLE_DEVICES=0
  • For ROCm: try setting HIP_VISIBLE_DEVICES=0
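Restricting visibility to a single device is just an environment variable; a sketch (the commented llama-cli invocation is illustrative):

```shell
# Expose only the first GPU to the process
export CUDA_VISIBLE_DEVICES=0   # ROCm equivalent: export HIP_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# Then run inference as usual, e.g.:
# ./build/bin/llama-cli -m model.gguf -ngl 30
```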

Next Steps

Quick Start

Learn how to run your first inference and use common features

CLI Reference

Explore all available command-line options

Build Documentation

Detailed build instructions and advanced configurations

Docker Guide

Complete Docker setup and usage guide