Getting the Code

Clone the repository from GitHub:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

CPU Build

1. Configure the build

Use CMake to configure the build directory:
cmake -B build

2. Build the project

Compile with CMake:
cmake --build build --config Release
For faster compilation, add -j to run multiple jobs in parallel:
cmake --build build --config Release -j 8
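Rather than hard-coding the job count, it can be derived from the machine. A small sketch (assumes `nproc` from GNU coreutils, with `sysctl` as a macOS fallback):

```shell
# pick a parallel job count from the CPU core count; fall back to 4 if neither tool exists
JOBS=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
echo "building with $JOBS parallel jobs"
# then: cmake --build build --config Release -j "$JOBS"
```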

3. Run the binary

After building, binaries are located in build/bin/:
./build/bin/llama-cli --help
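A quick sanity check that the build actually produced the tools (a sketch; only `llama-cli` and `llama-server` are checked here, though other binaries also land in `build/bin/`):

```shell
# report whether the expected binaries exist and are executable
for bin in llama-cli llama-server; do
  if [ -x "build/bin/$bin" ]; then
    echo "$bin: ok"
  else
    echo "$bin: not built"
  fi
done
```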

Debug Builds

For debug builds, the flags depend on your CMake generator. With a single-config generator (Make, Ninja), set the build type at configure time:
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
With a multi-config generator (Visual Studio, Xcode), select the configuration at build time instead:
cmake -B build -G Xcode
cmake --build build --config Debug

Static Builds

To build static libraries instead of shared:
cmake -B build -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release

Performance Tips

  • ccache: Install ccache for faster repeated compilation
  • Parallel builds: Use -j flag with the number of CPU cores
  • Generators: Use Ninja generator for automatic parallelization: cmake -B build -G Ninja
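Combining these tips, a fast-iteration configure might look like this (a sketch, assuming ccache and Ninja are installed; `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` are standard CMake cache variables):

```shell
# route compiler invocations through ccache and let Ninja pick the parallelism
cmake -B build -G Ninja \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
cmake --build build --config Release
```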

Metal Build (macOS)

On macOS, Metal is enabled by default for GPU acceleration.
cmake -B build
cmake --build build --config Release
Metal offloads computation to the GPU. To disable the Metal backend at compile time:
cmake -B build -DGGML_METAL=OFF
At runtime, you can disable GPU inference with:
./build/bin/llama-cli -m model.gguf --n-gpu-layers 0

CUDA Build (NVIDIA GPU)

For NVIDIA GPU acceleration, ensure you have the CUDA toolkit installed.

1. Install CUDA toolkit

Download from the NVIDIA developer site and follow installation instructions for your platform.

2. Build with CUDA support

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

3. Verify CUDA is working

Run with GPU layers:
./build/bin/llama-cli -m model.gguf -ngl 99

Non-Native CUDA Builds

By default, llama.cpp compiles CUDA code only for the GPUs detected in your system. For a single build covering all supported CUDA GPUs:
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build build --config Release
This results in a larger binary and longer compilation time, but the binary will run on any CUDA GPU.

Override Compute Capability

If nvcc cannot detect your GPU, explicitly specify architectures:

1. Find your GPU's compute capability

Check NVIDIA’s CUDA GPUs page for your GPU’s compute capability. Examples:
  • GeForce RTX 4090: 8.9
  • GeForce RTX 3080 Ti: 8.6
  • GeForce RTX 3070: 8.6
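The value passed to CMake is simply the compute capability with the dot removed (e.g. 8.6 becomes 86). A tiny sketch of the conversion, with an illustrative `cap` value:

```shell
# "8.6" -> "86": strip the dot to form a CMAKE_CUDA_ARCHITECTURES entry
cap="8.6"
arch=$(printf '%s' "$cap" | tr -d '.')
echo "$arch"   # → 86
```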

2. Build with specific architectures

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
cmake --build build --config Release

CUDA Runtime Variables

Control CUDA behavior with environment variables:
# Select which devices are visible (comma-separated indices)
CUDA_VISIBLE_DEVICES="0" ./build/bin/llama-server -m model.gguf

# Increase command buffer for multi-GPU setups
CUDA_SCALE_LAUNCH_QUEUES=4x ./build/bin/llama-cli -m model.gguf

# Enable unified memory (allows RAM fallback)
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m model.gguf

HIP Build (AMD GPU)

For AMD GPU acceleration using ROCm/HIP:

1. Install ROCm

Install ROCm from your Linux distro’s package manager or from the ROCm Quick Start guide.

2. Build with HIP support

For a gfx1030-compatible AMD GPU:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
GPU_TARGETS is optional. Omitting it will build for all GPUs in the current system.

3. Find your GPU architecture

Query the architecture your driver reports:
rocminfo | grep gfx | head -1 | awk '{print $2}'
Match with LLVM’s processor list. For example, gfx1035 maps to gfx1030.
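Fed a sample `rocminfo`-style line (hypothetical output, for illustration only), the pipeline above extracts the gfx token:

```shell
# simulate one line of rocminfo output and pull out the architecture name
printf '  Name:                    gfx1035\n' | grep gfx | head -1 | awk '{print $2}'
# → gfx1035
```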

Windows HIP Build

Using x64 Native Tools Command Prompt for VS:
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B build -G Ninja -DGPU_TARGETS=gfx1100 -DGGML_HIP=ON ^
    -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ^
    -DCMAKE_BUILD_TYPE=Release
cmake --build build

Vulkan Build

Vulkan provides cross-platform GPU acceleration.

Windows

1. Install dependencies

  1. Download and extract w64devkit
  2. Install the Vulkan SDK

2. Copy Vulkan dependencies

Launch w64devkit.exe and run:
SDK_VERSION=1.3.283.0
cp /VulkanSDK/$SDK_VERSION/Bin/glslc.exe $W64DEVKIT_HOME/bin/
cp /VulkanSDK/$SDK_VERSION/Lib/vulkan-1.lib $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/
cp -r /VulkanSDK/$SDK_VERSION/Include/* $W64DEVKIT_HOME/x86_64-w64-mingw32/include/

3. Build

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Linux

Install required dependencies:
Debian/Ubuntu
sudo apt-get install libvulkan-dev glslc
Then build:
vulkaninfo  # Verify Vulkan is working
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release

macOS

1. Install Vulkan SDK

Follow the Getting Started with the MacOS Vulkan SDK guide.
Check the “KosmicKrisp” box during installation for better performance.

2. Set environment variables

source /path/to/vulkan-sdk/setup-env.sh
For KosmicKrisp (better performance):
export VK_ICD_FILENAMES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
export VK_DRIVER_FILES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json

3. Build

cmake -B build -DGGML_VULKAN=1 -DGGML_METAL=OFF
cmake --build build --config Release

BLAS Build

BLAS support can improve prompt processing performance for batch sizes > 32.

Accelerate Framework (macOS)

Enabled by default on macOS. Just build normally:
cmake -B build
cmake --build build --config Release

OpenBLAS (Linux)

Install OpenBLAS and build:
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release

Intel oneMKL

source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp \
    -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
cmake --build build --config Release

SYCL Build (Intel GPU)

For Intel GPU support (Data Center Max, Flex, Arc series):
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release
See the SYCL backend documentation for detailed information.

Platform-Specific Builds

Windows

1. Install Visual Studio 2022

Install Visual Studio 2022 Community Edition. Select these components:
  • Workload: Desktop development with C++
  • Components:
    • C++ CMake Tools for Windows
    • Git for Windows
    • C++ Clang Compiler for Windows
    • MS-Build Support for LLVM-Toolset (clang)

2. Use Developer Command Prompt

Always use a Developer Command Prompt or PowerShell for VS2022.

3. Build

For Windows on ARM (WoA):
cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
For x64 with Ninja and clang:
cmake --preset x64-windows-llvm-release
cmake --build build-x64-windows-llvm-release

Android

See the Android build documentation for detailed instructions.

Additional Backends

CANN (Ascend NPU)

cmake -B build -DGGML_CANN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

ZenDNN (AMD EPYC CPUs)

# Automatic build (first time takes 5-10 minutes)
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release

# With custom ZenDNN installation
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/zendnn/install
cmake --build build --config Release

Arm KleidiAI

Optimized kernels for Arm CPUs:
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
For SME support, set GGML_KLEIDIAI_SME=1 at runtime.

OpenCL (Adreno GPU)

See the OpenCL backend documentation for Android and Windows ARM64 build instructions.

Multi-Backend Builds

You can build with multiple backends simultaneously:
cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release
At runtime, specify which device to use:
# List available devices
./build/bin/llama-cli --list-devices

# Use specific device
./build/bin/llama-cli -m model.gguf --device cuda:0

# Disable GPU entirely
./build/bin/llama-cli -m model.gguf --device none

Dynamic Backend Loading

Build backends as dynamic libraries for portability:
cmake -B build -DGGML_BACKEND_DL=ON
cmake --build build --config Release
This allows using the same binary on different machines with different GPUs.

HTTPS/TLS Support

For HTTPS features, install OpenSSL development libraries:
sudo apt-get install libssl-dev
If not installed, llama.cpp will build and run without SSL support.
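To check ahead of time whether the development files are visible, something like this works (a sketch; assumes `pkg-config` is installed):

```shell
# report whether OpenSSL development files are discoverable
if pkg-config --exists openssl 2>/dev/null; then
    echo "openssl: found ($(pkg-config --modversion openssl))"
else
    echo "openssl: not found (llama.cpp will build without SSL support)"
fi
```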