Getting the Code
Clone the repository from GitHub:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
CPU Build
Configure the build
Use CMake to configure the build directory: cmake -B build
Build the project
Compile with CMake: cmake --build build --config Release
For faster compilation, add -j to run multiple jobs in parallel: cmake --build build --config Release -j 8
Run the binary
After building, binaries are located in build/bin/: ./build/bin/llama-cli --help
Debug Builds
For debug builds, the process differs based on your generator:
Single-config generators (Unix Makefiles)
Multi-config generators (Visual Studio, Xcode)
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
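The commands above apply to single-config generators. For multi-config generators, the build type is instead selected at build time with --config. A sketch, assuming the Xcode generator is available:

```shell
# Multi-config generators ignore CMAKE_BUILD_TYPE;
# choose Debug at build time instead.
cmake -B build -G Xcode
cmake --build build --config Debug
```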
Static Builds
To build static libraries instead of shared:
cmake -B build -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release
ccache: Install ccache for faster repeated compilation
Parallel builds: Use the -j flag with the number of CPU cores
Generators: Use the Ninja generator for automatic parallelization: cmake -B build -G Ninja
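The core count for -j can be detected rather than hard-coded. A minimal sketch, assuming a Linux or macOS shell:

```shell
# nproc exists on Linux; sysctl -n hw.ncpu is the macOS fallback.
JOBS=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
echo "Using $JOBS parallel jobs"
```

Then pass the result to the build step, e.g. cmake --build build --config Release -j "$JOBS".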
On macOS, Metal is enabled by default for GPU acceleration.
cmake -B build
cmake --build build --config Release
Metal makes computations run on the GPU. To disable Metal at compile time: cmake -B build -DGGML_METAL=OFF
At runtime, you can disable GPU inference with:
./build/bin/llama-cli -m model.gguf --n-gpu-layers 0
CUDA Build (NVIDIA GPU)
For NVIDIA GPU acceleration, ensure you have the CUDA toolkit installed.
Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Verify CUDA is working
Run with GPU layers: ./build/bin/llama-cli -m model.gguf -ngl 99
Non-Native CUDA Builds
By default, llama.cpp compiles only for the GPU architectures detected in your system. For a build covering all CUDA GPU architectures:
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build build --config Release
This results in a larger binary and longer compilation time, but the binary will run on any CUDA GPU.
Override Compute Capability
If nvcc cannot detect your GPU, explicitly specify architectures:
Find your GPU's compute capability
Check NVIDIA’s CUDA GPUs page for your GPU’s compute capability. Examples:
GeForce RTX 4090: 8.9
GeForce RTX 3080 Ti: 8.6
GeForce RTX 3070: 8.6
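The value passed to CMAKE_CUDA_ARCHITECTURES is each compute capability with the dot removed, joined by semicolons. A small sketch of the conversion:

```shell
# 8.6 and 8.9 become the "86;89" string used below
# with -DCMAKE_CUDA_ARCHITECTURES=
caps="8.6 8.9"
archs=$(echo "$caps" | tr -d '.' | tr ' ' ';')
echo "$archs"   # prints 86;89
```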
Build with specific architectures
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
cmake --build build --config Release
CUDA Runtime Variables
Control CUDA behavior with environment variables:
# Hide specific devices
CUDA_VISIBLE_DEVICES="-0" ./build/bin/llama-server -m model.gguf
# Increase command buffer for multi-GPU setups
CUDA_SCALE_LAUNCH_QUEUES=4x ./build/bin/llama-cli -m model.gguf
# Enable unified memory (allows RAM fallback)
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m model.gguf
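These variables can be combined. A sketch that pins the server to the second GPU and allows spill-over into system RAM (model.gguf is a placeholder path):

```shell
# Expose only device 1 and let CUDA fall back to system RAM
# when VRAM runs out.
CUDA_VISIBLE_DEVICES=1 \
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
./build/bin/llama-server -m model.gguf -ngl 99
```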
HIP Build (AMD GPU)
For AMD GPU acceleration using ROCm/HIP:
Build with HIP support
For a gfx1030-compatible AMD GPU: HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
GPU_TARGETS is optional. Omitting it will build for all GPUs in the current system.
Find your GPU architecture
Find your GPU version: rocminfo | grep gfx | head -1 | awk '{print $2}'
Match it against LLVM’s processor list. For example, gfx1035 maps to gfx1030.
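The rocminfo pipeline above can be checked against a captured line of its output; the sample text here stands in for a real system:

```shell
# The gfx ID is the second whitespace-separated field
# of the matching "Name:" line.
sample="  Name:                    gfx1035"
arch=$(echo "$sample" | grep gfx | head -1 | awk '{print $2}')
echo "$arch"   # prints gfx1035
```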
Windows HIP Build
Using x64 Native Tools Command Prompt for VS:
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B build -G Ninja -DGPU_TARGETS=gfx1100 -DGGML_HIP=ON \
-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_BUILD_TYPE=Release
cmake --build build
Vulkan Build
Vulkan provides cross-platform GPU acceleration.
Windows
w64devkit
Git Bash MINGW64
MSYS2
Copy Vulkan dependencies
Launch w64devkit.exe and run: SDK_VERSION=1.3.283.0
cp /VulkanSDK/$SDK_VERSION/Bin/glslc.exe $W64DEVKIT_HOME/bin/
cp /VulkanSDK/$SDK_VERSION/Lib/vulkan-1.lib $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/
cp -r /VulkanSDK/$SDK_VERSION/Include/* $W64DEVKIT_HOME/x86_64-w64-mingw32/include/
Build
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Build
Right-click in llama.cpp directory, select “Open Git Bash Here”: cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Install dependencies
Install MSYS2 and run in UCRT terminal: pacman -S git \
mingw-w64-ucrt-x86_64-gcc \
mingw-w64-ucrt-x86_64-cmake \
mingw-w64-ucrt-x86_64-vulkan-devel \
mingw-w64-ucrt-x86_64-shaderc
Build
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Linux
System packages
LunarG SDK
Install required dependencies: sudo apt-get install libvulkan-dev glslc
Then verify Vulkan is working and build: vulkaninfo # Verify Vulkan is working
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
Source environment
source /path/to/vulkan-sdk/setup-env.sh
You must source this file in every terminal session where you build or run llama.cpp with Vulkan.
Build
vulkaninfo # Verify setup
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
macOS
Set environment variables
source /path/to/vulkan-sdk/setup-env.sh
For KosmicKrisp (better performance): export VK_ICD_FILENAMES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
export VK_DRIVER_FILES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
Build
cmake -B build -DGGML_VULKAN=1 -DGGML_METAL=OFF
cmake --build build --config Release
BLAS Build
BLAS support can improve prompt processing performance for batch sizes > 32.
Accelerate Framework (macOS)
Enabled by default on macOS. Just build normally:
cmake -B build
cmake --build build --config Release
OpenBLAS (Linux)
Install OpenBLAS and build:
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
Intel oneMKL
Manual installation
Docker image
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp \
-DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
cmake --build build --config Release
SYCL Build (Intel GPU)
For Intel GPU support (Data Center Max, Flex, Arc series):
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release
See the SYCL backend documentation for detailed information.
Windows
Install Visual Studio 2022
Install Visual Studio 2022 Community Edition . Select these components:
Workload : Desktop development with C++
Components :
C++ CMake Tools for Windows
Git for Windows
C++ Clang Compiler for Windows
MS-Build Support for LLVM-Toolset (clang)
Use Developer Command Prompt
Always use a Developer Command Prompt or PowerShell for VS2022.
Build
For Windows on ARM (WoA): cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
For x64 with Ninja and clang: cmake --preset x64-windows-llvm-release
cmake --build build-x64-windows-llvm-release
Android
See the Android build documentation for detailed instructions.
Additional Backends
CANN (Ascend NPU)
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
cmake --build build --config release
ZenDNN (AMD EPYC CPUs)
# Automatic build (first time takes 5-10 minutes)
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release
# With custom ZenDNN installation
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/zendnn/install
cmake --build build --config Release
Arm KleidiAI
Optimized kernels for Arm CPUs:
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
For SME support, set GGML_KLEIDIAI_SME=1 at runtime.
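For example, enabling the SME kernels for a single run (model path is a placeholder):

```shell
# Run-time toggle; no rebuild needed.
GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m model.gguf -p "hello"
```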
OpenCL (Adreno GPU)
See the OpenCL backend documentation for Android and Windows ARM64 build instructions.
Multi-Backend Builds
You can build with multiple backends simultaneously:
cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release
At runtime, specify which device to use:
# List available devices
./build/bin/llama-cli --list-devices
# Use specific device
./build/bin/llama-cli -m model.gguf --device cuda:0
# Disable GPU entirely
./build/bin/llama-cli -m model.gguf --device none
Dynamic Backend Loading
Build backends as dynamic libraries for portability:
cmake -B build -DGGML_BACKEND_DL=ON
cmake --build build --config Release
This allows using the same binary on different machines with different GPUs.
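With GGML_BACKEND_DL=ON, each backend is compiled into its own shared library that ggml loads at run time. On Linux these typically appear next to the binaries (the exact file names depend on the backends you enabled):

```shell
# List the backend libraries produced by the build.
ls build/bin/libggml-*.so
# e.g. libggml-cpu.so  libggml-cuda.so  libggml-vulkan.so
```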
HTTPS/TLS Support
For HTTPS features, install OpenSSL development libraries:
Debian/Ubuntu: sudo apt-get install libssl-dev
Fedora/RHEL/Rocky/Alma: sudo dnf install openssl-devel
Arch/Manjaro: sudo pacman -S openssl
If not installed, llama.cpp will build and run without SSL support.