llama.cpp can be installed through a package manager for quick setup, or built from source for GPU acceleration and custom configurations. Package-manager builds work out of the box but typically include only CPU support.

Homebrew

macOS and Linux
brew install llama.cpp
Automatically updated with new releases.

Winget

Windows
winget install llama.cpp
Automatically updated with new releases.

MacPorts

macOS
sudo port install llama.cpp

Nix

macOS and Linux
# Flake-enabled
nix profile install nixpkgs#llama-cpp

# Traditional
nix-env --file '<nixpkgs>' --install --attr llama-cpp
Package manager installations are ideal for getting started quickly. For GPU acceleration, you’ll need to build from source.

Docker Images

Pre-built Docker images are published with and without GPU support. The three base (CPU) variants are:
# Full image: includes CLI, completion, and conversion tools
docker pull ghcr.io/ggml-org/llama.cpp:full

# Light image: CLI and completion only
docker pull ghcr.io/ggml-org/llama.cpp:light

# Server image: HTTP server only
docker pull ghcr.io/ggml-org/llama.cpp:server
Example usage:
# Run inference
docker run -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:light \
  -m /models/model.gguf -p "Hello world"

# Start server
docker run -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/model.gguf --port 8080 --host 0.0.0.0

Pre-Built Binaries

Download pre-compiled binaries directly from GitHub:
  1. Visit the releases page
  2. Download the appropriate binary for your platform
  3. Extract and add to your PATH
Pre-built binaries may not include all GPU backends. For full GPU support, build from source.
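The steps above can be sketched for a Linux/macOS shell as follows. The archive name and install directory are placeholders; pick the actual asset for your platform from the releases page:

```shell
# 1. Download the asset for your platform from the releases page, e.g.:
#    curl -LO https://github.com/ggml-org/llama.cpp/releases/latest/download/<asset>.zip
# 2. Extract it somewhere stable (the directory name here is a placeholder):
mkdir -p "$HOME/llama.cpp-bin"
#    unzip <asset>.zip -d "$HOME/llama.cpp-bin"
# 3. Add that directory to PATH for the current shell:
export PATH="$HOME/llama.cpp-bin:$PATH"
echo "$PATH"
```

Add the `export PATH=...` line to your shell profile (e.g. `~/.bashrc`) to make it permanent.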

Building from Source

Building from source enables GPU acceleration and custom configurations.
Step 1: Get the source code

Clone the repository:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Step 2: Choose your build configuration

Select the appropriate build for your hardware:
Basic CPU build with no dependencies:
cmake -B build
cmake --build build --config Release
For faster compilation:
# Use multiple jobs
cmake --build build --config Release -j 8

# Or use Ninja generator
cmake -B build -G Ninja
cmake --build build --config Release
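Rather than hard-coding a job count, you can derive it from the machine's core count. A portable sketch (`nproc` is Linux, `sysctl -n hw.ncpu` is macOS):

```shell
# Detect available CPU cores; fall back to 4 if neither tool exists
JOBS=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
echo "building with $JOBS jobs"
# Then pass it to the build:
# cmake --build build --config Release -j "$JOBS"
```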
Enable BLAS acceleration for faster prompt processing:
# Install OpenBLAS first
# Ubuntu/Debian: sudo apt-get install libopenblas-dev
# Fedora: sudo dnf install openblas-devel
# macOS: brew install openblas

cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
Step 3: Install (optional)

Install the binaries to your system:
# Install to /usr/local/bin
sudo cmake --install build

# Or install to a custom prefix
cmake --install build --prefix ~/.local
Or use directly from the build directory:
./build/bin/llama-cli -m model.gguf
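If you install to a custom prefix such as ~/.local, make sure its bin directory is on your PATH; otherwise the shell won't find the binaries:

```shell
# Binaries from `cmake --install build --prefix ~/.local` land in ~/.local/bin
export PATH="$HOME/.local/bin:$PATH"
# Show where the shell will now resolve llama-cli from (empty until installed)
command -v llama-cli || echo "llama-cli not on PATH yet"
```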
Step 4: Verify GPU support

Check that GPU acceleration is working:
# List available devices
./build/bin/llama-cli --list-devices

# Run with GPU
./build/bin/llama-cli -m model.gguf -ngl 99
You should see output indicating GPU layers are loaded:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: offloading 32 layers to GPU

Platform-Specific Instructions

Windows

Prerequisites:
  • Visual Studio 2022 with C++ development tools
  • CMake (included with VS)
Build:
# Open Developer Command Prompt for VS 2022
cmake -B build
cmake --build build --config Release
For ARM64 Windows:
cmake --preset arm64-windows-llvm-release -DGGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
Android

Building for Android requires the NDK. See the Android build guide for complete instructions. Quick example with OpenCL:
# Run from an empty build directory inside the cloned repo (note the `..`)
cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON
ninja
For portable binaries without shared library dependencies:
cmake -B build \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_POSITION_INDEPENDENT_CODE=ON
cmake --build build --config Release
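You can confirm the result with `ldd` on Linux (or `otool -L` on macOS); a static build lists few or no shared dependencies. Shown here on /bin/sh as a stand-in, since the real target is the binary you just built:

```shell
# List dynamic dependencies; substitute ./build/bin/llama-cli after building
ldd /bin/sh 2>/dev/null || otool -L /bin/sh
```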
Debug builds with single-config generators (Make, Ninja):
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
Multi-config generators (Visual Studio, Xcode):
cmake -B build -G "Xcode"
cmake --build build --config Debug

Advanced Build Options

ccache for faster rebuilds:
# Install ccache
# Ubuntu/Debian: sudo apt-get install ccache
# macOS: brew install ccache

# CMake will detect and use it automatically
cmake -B build
cmake --build build --config Release
Intel oneMKL for better CPU performance:
source /opt/intel/oneapi/setvars.sh
cmake -B build \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=Intel10_64lp \
  -DCMAKE_C_COMPILER=icx \
  -DCMAKE_CXX_COMPILER=icpx \
  -DGGML_NATIVE=ON
cmake --build build --config Release

Verifying Installation

Test your installation:
# Check version
llama-cli --version

# List available compute devices
llama-cli --list-devices

# Run a simple test
llama-cli -m path/to/model.gguf -p "Test prompt" -n 10
Expected output should show:
  • Version information
  • Detected backends (CUDA, Metal, etc.)
  • Model loading messages
  • Generated text
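The first check, whether the binary is resolvable at all, can be scripted; a purely illustrative snippet:

```shell
# Report whether llama-cli is on PATH before running the deeper checks
if command -v llama-cli >/dev/null 2>&1; then
  echo "llama-cli found at $(command -v llama-cli)"
else
  echo "llama-cli not found on PATH"
fi
```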

Troubleshooting

Issue: Could NOT find CUDA
Solutions:
# Verify CUDA is installed
nvcc --version

# Set CUDA path explicitly
export CUDA_PATH=/usr/local/cuda
cmake -B build -DGGML_CUDA=ON
Issue: cannot find ROCm device library
Solution:
# Find the directory containing oclc_abi_version_400.bc
find $HIP_PATH -name "oclc_abi_version_400.bc"

# Set HIP_DEVICE_LIB_PATH
export HIP_DEVICE_LIB_PATH=/path/to/found/directory
cmake -B build -DGGML_HIP=ON
Issue: CMake can’t find Vulkan
Solutions:
# Make sure to source the setup script
source /path/to/vulkan-sdk/setup-env.sh

# Verify Vulkan is available
vulkaninfo

# Rebuild
cmake -B build -DGGML_VULKAN=ON
Issue: Builds are slow
Solutions:
# Use parallel jobs
cmake --build build --config Release -j $(nproc)

# Or use Ninja (faster than Make)
cmake -B build -G Ninja
cmake --build build

# Install ccache
sudo apt-get install ccache  # Ubuntu/Debian
brew install ccache          # macOS
Issue: GPU detected but errors during inference
Solutions:
  • Update GPU drivers to the latest version
  • Try reducing layers offloaded: -ngl 30 instead of -ngl 99
  • Check VRAM usage: ensure model fits in available memory
  • For CUDA: try setting CUDA_VISIBLE_DEVICES=0
  • For ROCm: try setting HIP_VISIBLE_DEVICES=0
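Restricting visibility to a single device is just an environment variable; a sketch (the commented llama-cli invocation is illustrative):

```shell
# Expose only the first GPU to the process
export CUDA_VISIBLE_DEVICES=0   # ROCm equivalent: export HIP_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# Then run inference as usual, e.g.:
# ./build/bin/llama-cli -m model.gguf -ngl 30
```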

Next Steps

Quick Start

Learn how to run your first inference and use common features

CLI Reference

Explore all available command-line options

Build Documentation

Detailed build instructions and advanced configurations

Docker Guide

Complete Docker setup and usage guide