Getting the Code
Clone the repository from GitHub:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
CPU Build
Configure the build
Use CMake to configure the build directory: cmake -B build
Build the project
Compile with CMake: cmake --build build --config Release
For faster compilation, add -j to run multiple jobs in parallel: cmake --build build --config Release -j 8
Run the binary
After building, binaries are located in build/bin/: ./build/bin/llama-cli --help
Debug Builds
For debug builds, the process differs based on your generator:
Single-config generators (Unix Makefiles)
Multi-config generators (Visual Studio, Xcode)
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
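The commands above apply to single-config generators. For multi-config generators, the build type is instead selected at build time with --config. A sketch, assuming the Xcode generator is available:

```shell
# Multi-config generators ignore CMAKE_BUILD_TYPE;
# choose Debug at build time instead.
cmake -B build -G Xcode
cmake --build build --config Debug
```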
Static Builds
To build static libraries instead of shared:
cmake -B build -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release
ccache: Install ccache for faster repeated compilation
Parallel builds: Use the -j flag with the number of CPU cores
Generators: Use the Ninja generator for automatic parallelization: cmake -B build -G Ninja
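The core count for -j can be detected rather than hard-coded. A minimal sketch, assuming a Linux or macOS shell:

```shell
# nproc exists on Linux; sysctl -n hw.ncpu is the macOS fallback.
JOBS=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
echo "Using $JOBS parallel jobs"
```

Then pass the result to the build step, e.g. cmake --build build --config Release -j "$JOBS".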
On macOS, Metal is enabled by default for GPU acceleration.
cmake -B build
cmake --build build --config Release
Metal makes computations run on the GPU. To disable Metal at compile time: cmake -B build -DGGML_METAL=OFF
At runtime, you can disable GPU inference with:
./build/bin/llama-cli -m model.gguf --n-gpu-layers 0
CUDA Build (NVIDIA GPU)
For NVIDIA GPU acceleration, ensure you have the CUDA toolkit installed.
Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Verify CUDA is working
Run with GPU layers: ./build/bin/llama-cli -m model.gguf -ngl 99
Non-Native CUDA Builds
By default, llama.cpp compiles only for the GPU architectures detected in your system. For a build covering all CUDA GPU architectures:
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
cmake --build build --config Release
This results in a larger binary and longer compilation time, but the binary will run on any CUDA GPU.
Override Compute Capability
If nvcc cannot detect your GPU, explicitly specify architectures:
Find your GPU's compute capability
Check NVIDIA’s CUDA GPUs page for your GPU’s compute capability. Examples:
GeForce RTX 4090: 8.9
GeForce RTX 3080 Ti: 8.6
GeForce RTX 3070: 8.6
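The value passed to CMAKE_CUDA_ARCHITECTURES is each compute capability with the dot removed, joined by semicolons. A small sketch of the conversion:

```shell
# 8.6 and 8.9 become the "86;89" string used below
# with -DCMAKE_CUDA_ARCHITECTURES=
caps="8.6 8.9"
archs=$(echo "$caps" | tr -d '.' | tr ' ' ';')
echo "$archs"   # prints 86;89
```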
Build with specific architectures
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
cmake --build build --config Release
CUDA Runtime Variables
Control CUDA behavior with environment variables:
# Hide specific devices
CUDA_VISIBLE_DEVICES="-0" ./build/bin/llama-server -m model.gguf
# Increase command buffer for multi-GPU setups
CUDA_SCALE_LAUNCH_QUEUES=4x ./build/bin/llama-cli -m model.gguf
# Enable unified memory (allows RAM fallback)
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m model.gguf
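These variables can be combined. A sketch that pins the server to the second GPU and allows spill-over into system RAM (model.gguf is a placeholder path):

```shell
# Expose only device 1 and let CUDA fall back to system RAM
# when VRAM runs out.
CUDA_VISIBLE_DEVICES=1 \
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
./build/bin/llama-server -m model.gguf -ngl 99
```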
HIP Build (AMD GPU)
For AMD GPU acceleration using ROCm/HIP:
Build with HIP support
For a gfx1030-compatible AMD GPU: HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
GPU_TARGETS is optional. Omitting it will build for all GPUs in the current system.
Find your GPU architecture
Find your GPU version: rocminfo | grep gfx | head -1 | awk '{print $2}'
Match it against LLVM’s processor list. For example, gfx1035 maps to gfx1030.
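The rocminfo pipeline above can be checked against a captured line of its output; the sample text here stands in for a real system:

```shell
# The gfx ID is the second whitespace-separated field
# of the matching "Name:" line.
sample="  Name:                    gfx1035"
arch=$(echo "$sample" | grep gfx | head -1 | awk '{print $2}')
echo "$arch"   # prints gfx1035
```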
Windows HIP Build
Using x64 Native Tools Command Prompt for VS:
set PATH=%HIP_PATH%\bin;%PATH%
cmake -S . -B build -G Ninja -DGPU_TARGETS=gfx1100 -DGGML_HIP=ON \
-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_BUILD_TYPE=Release
cmake --build build
Vulkan Build
Vulkan provides cross-platform GPU acceleration.
Windows
w64devkit
Git Bash MINGW64
MSYS2
Copy Vulkan dependencies
Launch w64devkit.exe and run: SDK_VERSION=1.3.283.0
cp /VulkanSDK/$SDK_VERSION/Bin/glslc.exe $W64DEVKIT_HOME/bin/
cp /VulkanSDK/$SDK_VERSION/Lib/vulkan-1.lib $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/
cp -r /VulkanSDK/$SDK_VERSION/Include/* $W64DEVKIT_HOME/x86_64-w64-mingw32/include/
Build
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Build
Right-click in llama.cpp directory, select “Open Git Bash Here”: cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Install dependencies
Install MSYS2 and run in UCRT terminal: pacman -S git \
mingw-w64-ucrt-x86_64-gcc \
mingw-w64-ucrt-x86_64-cmake \
mingw-w64-ucrt-x86_64-vulkan-devel \
mingw-w64-ucrt-x86_64-shaderc
Build
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
Linux
System packages
LunarG SDK
Install required dependencies: sudo apt-get install libvulkan-dev glslc
Then verify Vulkan is working and build: vulkaninfo # Verify Vulkan is working
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
Source environment
source /path/to/vulkan-sdk/setup-env.sh
You must source this file in every terminal session where you build or run llama.cpp with Vulkan.
Build
vulkaninfo # Verify setup
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
macOS
Set environment variables
source /path/to/vulkan-sdk/setup-env.sh
For KosmicKrisp (better performance): export VK_ICD_FILENAMES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
export VK_DRIVER_FILES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
Build
cmake -B build -DGGML_VULKAN=1 -DGGML_METAL=OFF
cmake --build build --config Release
BLAS Build
BLAS support can improve prompt processing performance for batch sizes > 32.
Accelerate Framework (macOS)
Enabled by default on macOS. Just build normally:
cmake -B build
cmake --build build --config Release
OpenBLAS (Linux)
Install OpenBLAS and build:
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
Intel oneMKL
Manual installation
Docker image
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp \
-DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
cmake --build build --config Release
SYCL Build (Intel GPU)
For Intel GPU support (Data Center Max, Flex, Arc series):
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release
See the SYCL backend documentation for detailed information.
Windows
Install Visual Studio 2022
Install Visual Studio 2022 Community Edition . Select these components:
Workload : Desktop development with C++
Components :
C++ CMake Tools for Windows
Git for Windows
C++ Clang Compiler for Windows
MS-Build Support for LLVM-Toolset (clang)
Use Developer Command Prompt
Always use a Developer Command Prompt or PowerShell for VS2022.
Build
For Windows on ARM (WoA): cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
For x64 with Ninja and clang: cmake --preset x64-windows-llvm-release
cmake --build build-x64-windows-llvm-release
Android
See the Android build documentation for detailed instructions.
Additional Backends
CANN (Ascend NPU)
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
cmake --build build --config release
ZenDNN (AMD EPYC CPUs)
# Automatic build (first time takes 5-10 minutes)
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release
# With custom ZenDNN installation
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/zendnn/install
cmake --build build --config Release
Arm KleidiAI
Optimized kernels for Arm CPUs:
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
For SME support, set GGML_KLEIDIAI_SME=1 at runtime.
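For example, enabling the SME kernels for a single run (model path is a placeholder):

```shell
# Run-time toggle; no rebuild needed.
GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m model.gguf -p "hello"
```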
OpenCL (Adreno GPU)
See the OpenCL backend documentation for Android and Windows ARM64 build instructions.
Multi-Backend Builds
You can build with multiple backends simultaneously:
cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release
At runtime, specify which device to use:
# List available devices
./build/bin/llama-cli --list-devices
# Use specific device
./build/bin/llama-cli -m model.gguf --device cuda:0
# Disable GPU entirely
./build/bin/llama-cli -m model.gguf --device none
Dynamic Backend Loading
Build backends as dynamic libraries for portability:
cmake -B build -DGGML_BACKEND_DL=ON
cmake --build build --config Release
This allows using the same binary on different machines with different GPUs.
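With GGML_BACKEND_DL=ON, each backend is compiled into its own shared library that ggml loads at run time. On Linux these typically appear next to the binaries (the exact file names depend on the backends you enabled):

```shell
# List the backend libraries produced by the build.
ls build/bin/libggml-*.so
# e.g. libggml-cpu.so  libggml-cuda.so  libggml-vulkan.so
```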
HTTPS/TLS Support
For HTTPS features, install OpenSSL development libraries:
Debian/Ubuntu: sudo apt-get install libssl-dev
Fedora/RHEL/Rocky/Alma: sudo dnf install openssl-devel
Arch/Manjaro: sudo pacman -S openssl
If not installed, llama.cpp will build and run without SSL support.