llama.cpp can be installed via package managers for quick setup, or built from source for GPU acceleration and custom configurations.
Package Managers (Recommended for CPU)
The easiest way to install llama.cpp is through your system’s package manager. These pre-built binaries work out-of-the-box but typically only include CPU support.
Homebrew (macOS and Linux), automatically updated with new releases:
brew install llama.cpp
Winget (Windows), automatically updated with new releases:
winget install llama.cpp
Nix (macOS and Linux):
# Flake-enabled
nix profile install nixpkgs#llama-cpp
# Traditional
nix-env --file '<nixpkgs>' --install --attr llama-cpp
Package manager installations are ideal for getting started quickly. For GPU acceleration, you’ll need to build from source.
Docker Images
Pre-built Docker images are available with and without GPU support:
Three image variants are available:
# Full image: includes CLI, completion, and conversion tools
docker pull ghcr.io/ggml-org/llama.cpp:full
# Light image: CLI and completion only
docker pull ghcr.io/ggml-org/llama.cpp:light
# Server image: HTTP server only
docker pull ghcr.io/ggml-org/llama.cpp:server
Example usage:
# Run inference
docker run -v /path/to/models:/models \
ghcr.io/ggml-org/llama.cpp:light \
-m /models/model.gguf -p "Hello world"
# Start server
docker run -v /path/to/models:/models -p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server \
-m /models/model.gguf --port 8080 --host 0.0.0.0
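Once the server container is running, you can exercise it from the host; /health and /completion are endpoints of llama-server's HTTP API, and the port matches the -p 8080:8080 mapping above.

```shell
# Liveness check against the running container
curl http://localhost:8080/health
# Minimal completion request
curl http://localhost:8080/completion \
  -d '{"prompt": "Hello", "n_predict": 16}'
```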
NVIDIA GPU support:
# Pull CUDA-enabled image
docker pull ghcr.io/ggml-org/llama.cpp:full-cuda
docker pull ghcr.io/ggml-org/llama.cpp:light-cuda
docker pull ghcr.io/ggml-org/llama.cpp:server-cuda
Example usage:
# Requires nvidia-container-toolkit
docker run --gpus all -v /path/to/models:/models \
ghcr.io/ggml-org/llama.cpp:light-cuda \
-m /models/model.gguf -ngl 99
AMD GPU support:
# Pull ROCm-enabled image
docker pull ghcr.io/ggml-org/llama.cpp:full-rocm
docker pull ghcr.io/ggml-org/llama.cpp:light-rocm
docker pull ghcr.io/ggml-org/llama.cpp:server-rocm
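The ROCm images are run much like the CUDA ones, except the GPU is exposed by passing device nodes through instead of --gpus. A sketch, assuming a standard ROCm host install:

```shell
# Requires the ROCm stack on the host; /dev/kfd and /dev/dri expose the GPU
docker run --device /dev/kfd --device /dev/dri \
  -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:light-rocm \
  -m /models/model.gguf -ngl 99
```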
Additional GPU backends:
# Intel GPU (SYCL)
docker pull ghcr.io/ggml-org/llama.cpp:full-intel
# Moore Threads GPU (MUSA)
docker pull ghcr.io/ggml-org/llama.cpp:full-musa
# Vulkan
docker pull ghcr.io/ggml-org/llama.cpp:full-vulkan
Pre-Built Binaries
Download pre-compiled binaries directly from GitHub:
Visit the releases page
Download the appropriate binary for your platform
Extract and add to your PATH
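A quick way to confirm the extracted binaries are actually visible on your PATH (a minimal sketch; check_tool is a hypothetical helper, not part of llama.cpp):

```shell
# Hypothetical helper: report whether a command resolves on PATH
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

check_tool llama-cli
check_tool llama-server
```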
Pre-built binaries may not include all GPU backends. For full GPU support, build from source.
Building from Source
Building from source enables GPU acceleration and custom configurations.
Get the source code
Clone the repository:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Choose your build configuration
Select the appropriate build for your hardware:
CPU Only
NVIDIA GPU (CUDA)
AMD GPU (ROCm)
Apple Silicon (Metal)
Intel GPU (SYCL)
Vulkan
Basic CPU build with no dependencies:
cmake -B build
cmake --build build --config Release
For faster compilation:
# Use multiple jobs
cmake --build build --config Release -j 8
# Or use Ninja generator
cmake -B build -G Ninja
cmake --build build --config Release
Optional: OpenBLAS for better CPU performance
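One way to enable it, assuming OpenBLAS is installed through your package manager; GGML_BLAS and GGML_BLAS_VENDOR are ggml's BLAS backend switches:

```shell
# Install OpenBLAS first (Ubuntu/Debian shown; package names vary by distro)
sudo apt-get install libopenblas-dev
# Point ggml's BLAS backend at OpenBLAS
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
```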
Build with CUDA support for NVIDIA GPUs.
Prerequisites:
CUDA Toolkit installed
NVIDIA GPU with compute capability 3.5 or higher
Build:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
For faster compilation targeting specific GPUs:
# Check your GPU's compute capability:
# RTX 4090: 8.9
# RTX 3080: 8.6
# RTX 2080: 7.5
cmake -B build -DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="86;89"
cmake --build build --config Release
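The architecture value is just the compute capability with the dot removed. A small sketch of that mapping (cc_to_arch is a hypothetical helper, and the nvidia-smi compute_cap query needs a reasonably recent driver):

```shell
# Hypothetical helper: "8.6" -> "86", the form CMAKE_CUDA_ARCHITECTURES expects
cc_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

cc_to_arch "8.6"   # prints 86

# On a machine with a driver that supports the compute_cap query:
# cmake -B build -DGGML_CUDA=ON \
#   -DCMAKE_CUDA_ARCHITECTURES="$(cc_to_arch "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1)")"
```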
If you have multiple CUDA installations:
cmake -B build -DGGML_CUDA=ON \
-DCMAKE_CUDA_COMPILER=/opt/cuda-11.7/bin/nvcc \
-DCMAKE_INSTALL_RPATH="/opt/cuda-11.7/lib64;\$ORIGIN" \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON
cmake --build build --config Release
Optimize for multi-GPU setups
For better performance with multiple GPUs:
# Set at runtime
CUDA_SCALE_LAUNCH_QUEUES=4x ./build/bin/llama-cli -m model.gguf -ngl 99
This increases CUDA’s command buffer size for better pipeline parallelism.
Build with ROCm/HIP support for AMD GPUs.
Prerequisites: a working ROCm installation.
Build for Linux:
# Automatic GPU detection
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 16
Or target a specific GPU:
# For RX 7900 XT/XTX (gfx1100)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1100
cmake --build build --config Release
Find your GPU architecture
# Get your GPU's architecture
rocminfo | grep gfx | head -1 | awk '{print $2}'
# Common architectures:
# gfx1030 = RX 6000 series
# gfx1100 = RX 7900 XT/XTX
# gfx90a = MI200 series
Build for Windows:
# In x64 Native Tools Command Prompt for VS
set PATH=%HIP_PATH%\bin;%PATH%
cmake -B build -G Ninja -DGGML_HIP=ON \
-DGPU_TARGETS=gfx1100 \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_BUILD_TYPE=Release
cmake --build build
Metal is enabled by default on macOS:
cmake -B build
cmake --build build --config Release
Metal provides excellent performance on M1/M2/M3 chips.
Build with SYCL for Intel Data Center, Flex, Arc, and integrated GPUs. See the SYCL backend documentation for detailed instructions.
Cross-platform GPU support via Vulkan.
Prerequisites: the Vulkan SDK, or your distribution’s Vulkan development packages.
Linux:
# Using system packages
sudo apt-get install libvulkan-dev glslc # Ubuntu/Debian
# Or source the Vulkan SDK
source /path/to/vulkan-sdk/setup-env.sh
# Build
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
Windows (w64devkit):
# Copy Vulkan dependencies
SDK_VERSION=1.3.283.0
cp /VulkanSDK/$SDK_VERSION/Bin/glslc.exe $W64DEVKIT_HOME/bin/
cp /VulkanSDK/$SDK_VERSION/Lib/vulkan-1.lib \
$W64DEVKIT_HOME/x86_64-w64-mingw32/lib/
# Build
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
macOS:
# Install the Vulkan SDK with KosmicKrisp
source /path/to/vulkan-sdk/setup-env.sh
# Use KosmicKrisp for better performance
export VK_ICD_FILENAMES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
# Build (disable Metal)
cmake -B build -DGGML_VULKAN=1 -DGGML_METAL=OFF
cmake --build build --config Release
Install (optional)
Install the binaries to your system:
# Install to /usr/local/bin
sudo cmake --install build
# Or specify custom prefix
cmake --install build --prefix ~/.local
Or use the binaries directly from the build directory:
./build/bin/llama-cli -m model.gguf
Verify GPU support
Check that GPU acceleration is working:
# List available devices
./build/bin/llama-cli --list-devices
# Run with GPU
./build/bin/llama-cli -m model.gguf -ngl 99
You should see output indicating GPU layers are loaded:
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: offloading 32 layers to GPU
Windows with Visual Studio
Prerequisites:
Visual Studio 2022 with C++ development tools
CMake (included with VS)
Build:
# Open Developer Command Prompt for VS 2022
cmake -B build
cmake --build build --config Release
For ARM64 Windows:
cmake --preset arm64-windows-llvm-release -DGGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
Building for Android requires the NDK. See the Android build guide for complete instructions.
Quick example with OpenCL (run from a separate build directory):
cmake .. -G Ninja \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-28 \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_OPENCL=ON
ninja
For portable binaries without shared library dependencies:
cmake -B build \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON
cmake --build build --config Release
Debug builds:
Single-config generators (Make, Ninja):
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
Multi-config generators (Visual Studio, Xcode):
cmake -B build -G "Xcode"
cmake --build build --config Debug
Advanced Build Options
Verifying Installation
Test your installation:
# Check version
llama-cli --version
# List available compute devices
llama-cli --list-devices
# Run a simple test
llama-cli -m path/to/model.gguf -p "Test prompt" -n 10
Expected output should show:
Version information
Detected backends (CUDA, Metal, etc.)
Model loading messages
Generated text
Troubleshooting
Issue: Could NOT find CUDA
Solutions:
# Verify CUDA is installed
nvcc --version
# Set CUDA path explicitly
export CUDA_PATH=/usr/local/cuda
cmake -B build -DGGML_CUDA=ON
Issue: cannot find ROCm device library
Solution:
# Find the directory containing oclc_abi_version_400.bc
find $HIP_PATH -name "oclc_abi_version_400.bc"
# Set HIP_DEVICE_LIB_PATH
export HIP_DEVICE_LIB_PATH=/path/to/found/directory
cmake -B build -DGGML_HIP=ON
Issue: CMake can’t find Vulkan
Solutions:
# Make sure to source the setup script
source /path/to/vulkan-sdk/setup-env.sh
# Verify Vulkan is available
vulkaninfo
# Rebuild
cmake -B build -DGGML_VULKAN=1
Issue: Slow compilation
Solutions:
# Use parallel jobs
cmake --build build --config Release -j $(nproc)
# Or use Ninja (faster than Make)
cmake -B build -G Ninja
cmake --build build
# Install ccache
sudo apt-get install ccache # Ubuntu/Debian
brew install ccache # macOS
Issue: GPU detected but errors during inference
Solutions:
Update GPU drivers to the latest version
Try reducing layers offloaded: -ngl 30 instead of -ngl 99
Check VRAM usage: ensure model fits in available memory
For CUDA: try setting CUDA_VISIBLE_DEVICES=0
For ROCm: try setting HIP_VISIBLE_DEVICES=0
Next Steps
Quick Start Learn how to run your first inference and use common features
CLI Reference Explore all available command-line options
Build Documentation Detailed build instructions and advanced configurations
Docker Guide Complete Docker setup and usage guide