
Compute Backends

llama.cpp supports a wide variety of compute backends to accelerate inference across different hardware platforms. Each backend is optimized for specific hardware architectures.

Supported Backends

| Backend | Target Devices                   | Documentation             |
| ------- | -------------------------------- | ------------------------- |
| Metal   | Apple Silicon (M1/M2/M3/M4)      | Build Guide               |
| CUDA    | NVIDIA GPUs                      | Build Guide               |
| HIP     | AMD GPUs                         | Build Guide               |
| Vulkan  | Cross-platform GPUs              | Build Guide               |
| SYCL    | Intel & NVIDIA GPUs              |                           |
| CANN    | Ascend NPUs (Huawei)             |                           |
| MUSA    | Moore Threads GPUs               | Build Guide               |
| OpenCL  | Adreno GPUs (Qualcomm)           |                           |
| BLAS    | CPU acceleration (all platforms) | Build Guide               |
| BLIS    | CPU acceleration (AMD optimized) |                           |
| ZenDNN  | AMD CPUs                         |                           |
| zDNN    | IBM Z & LinuxONE                 |                           |
| RPC     | Remote compute                   | RPC Documentation         |
| VirtGPU | VirtGPU API                      |                           |
| Hexagon | Snapdragon DSP                   |                           |
| WebGPU  | Browser/WASM                     | Build Guide (In Progress) |

Quick Selection Guide

Apple Silicon Macs

Metal (default): automatically enabled on macOS. Uses the GPU and Neural Engine for maximum performance.
# Disable if needed
cmake -B build -DGGML_METAL=OFF

NVIDIA GPUs

CUDA: best performance on NVIDIA GPUs. Requires the CUDA Toolkit.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

AMD GPUs

HIP (Linux) or Vulkan (cross-platform): HIP provides the best performance on Linux; Vulkan works everywhere.
# HIP
cmake -B build -DGGML_HIP=ON

# Vulkan
cmake -B build -DGGML_VULKAN=ON

Intel GPUs

SYCL: optimized for Intel Data Center Max, Flex, Arc, and integrated GPUs.
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release

CPU Only

BLAS or AVX2/AVX512: use OpenBLAS, Intel MKL, or native CPU features.
# OpenBLAS
cmake -B build -DGGML_BLAS=ON

# Native optimizations (auto-detected)
cmake -B build

Mobile/Embedded

OpenCL (Qualcomm) or Hexagon (Snapdragon DSP): optimized for mobile SoCs and embedded devices.

Backend Details

Metal (Apple Silicon)

Default backend on macOS - automatically enabled during build.
Features:
  • Native GPU acceleration on M1/M2/M3/M4 chips
  • Utilizes Apple Neural Engine when available
  • Optimized via ARM NEON and Accelerate framework
  • Zero configuration required
When to use:
  • Any Mac with Apple Silicon
  • Provides best performance on macOS
Disable Metal:
cmake -B build -DGGML_METAL=OFF
llama-cli -m model.gguf --n-gpu-layers 0  # Runtime disable

CUDA (NVIDIA GPUs)

Features:
  • Custom CUDA kernels optimized for LLM inference
  • Supports all NVIDIA GPUs with compute capability ≥ 3.5
  • Multi-GPU support with layer splitting
  • Hybrid CPU+GPU inference for models larger than VRAM
Requirements:
  • NVIDIA GPU with compute capability ≥ 3.5
  • CUDA Toolkit installed
Build:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Usage:
# Offload all layers to GPU
llama-cli -m model.gguf -ngl 99

# Offload 32 layers to GPU
llama-cli -m model.gguf -ngl 32

# Use specific GPU
llama-cli -m model.gguf -ngl 99 --main-gpu 0

# Split across multiple GPUs
llama-cli -m model.gguf -ngl 99 --tensor-split 3,1
The -ngl (n-gpu-layers) parameter controls how many transformer layers run on GPU. Use -ngl 99 or -ngl -1 to offload all layers.
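As a rough illustration of the arithmetic behind picking an -ngl value, here is a hedged Python sketch. The function name max_gpu_layers, the per-layer size, and the overhead figure are all invented for illustration; real per-layer memory depends on the model architecture, quantization, and context size.

```python
def max_gpu_layers(vram_gb: float, layer_size_gb: float,
                   total_layers: int, overhead_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    All sizes are illustrative assumptions; measure real usage with
    your own model and tooling (e.g. nvidia-smi) before relying on it.
    """
    usable = vram_gb - overhead_gb  # reserve headroom for KV cache, buffers
    if usable <= 0:
        return 0
    return min(total_layers, int(usable // layer_size_gb))

# Hypothetical 32-layer model at ~0.25 GB per quantized layer, 8 GB card
print(max_gpu_layers(vram_gb=8, layer_size_gb=0.25, total_layers=32))  # → 28
```

The result is the number you would pass as -ngl; if it equals the model's total layer count, use -ngl 99 to offload everything.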

HIP (AMD GPUs)

Features:
  • AMD GPU acceleration using HIP runtime
  • Similar performance to CUDA on AMD hardware
  • Supports Radeon RX and Instinct series
Requirements:
  • AMD GPU with ROCm support
  • ROCm installed (Linux only)
Build:
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release
When to use:
  • AMD GPUs on Linux
  • Best performance for AMD hardware

Vulkan (Cross-platform GPU)

Features:
  • Cross-platform GPU acceleration
  • Works on NVIDIA, AMD, Intel, and mobile GPUs
  • Supported on Windows, Linux, macOS, and Android
  • Good fallback when native backends aren’t available
Build:
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
When to use:
  • AMD GPUs on Windows
  • Multi-vendor GPU systems
  • When CUDA/Metal/HIP aren’t available
  • Android devices

SYCL (Intel & NVIDIA GPUs)

Features:
  • Unified programming model for heterogeneous computing
  • Optimized for Intel Data Center Max, Flex, and Arc GPUs
  • Also supports NVIDIA GPUs
  • CPU+GPU hybrid inference
Supported Hardware:
  • Intel Data Center GPU Max Series
  • Intel Data Center GPU Flex Series
  • Intel Arc Graphics (A-Series)
  • Intel integrated GPUs
  • NVIDIA GPUs (via CUDA backend)
Build:
# Intel GPUs
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release

# NVIDIA GPUs via SYCL
cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=NVIDIA
cmake --build build --config Release
See the backend-specific documentation for detailed setup instructions.

BLAS (CPU Acceleration)

Features:
  • Accelerates matrix operations on CPU
  • Improves prompt processing with large batch sizes
  • Multiple implementations available
Implementations:
  • OpenBLAS: open-source, cross-platform BLAS.
  • Intel oneMKL: optimized for Intel CPUs; enables AVX-VNNI instructions on Intel CPUs without AVX-512.
  • Apple Accelerate: automatically enabled on macOS; provides BLAS via Apple’s Accelerate framework.
Example build with Intel oneMKL:
# Source Intel environment
source /opt/intel/oneapi/setvars.sh

# Build
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release
BLAS acceleration primarily improves prompt processing speed. It has minimal effect on token generation speed.

CPU Native Optimizations

Even without BLAS, llama.cpp includes extensive CPU optimizations.
x86_64 CPUs:
  • AVX, AVX2, AVX512 support (auto-detected)
  • AMX (Advanced Matrix Extensions) for Intel Sapphire Rapids
  • FMA (Fused Multiply-Add)
ARM CPUs:
  • NEON SIMD instructions
  • SVE (Scalable Vector Extension)
  • ARM v8.2+ fp16 support
RISC-V CPUs:
  • RVV (RISC-V Vector Extension)
  • ZVFH, ZFH (half-precision floating point)
  • ZICBOP (cache block operations)
  • ZIHINTPAUSE (pause hint)
Build with native optimizations:
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release
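To check which of these feature sets a CPU advertises, one option on Linux is to inspect the flags line of /proc/cpuinfo. The helper below (detect_simd, a hypothetical name, not part of llama.cpp) simply scans such a flags string:

```python
def detect_simd(flags: str) -> list:
    """Return the SIMD feature sets present in a CPU flags string,
    e.g. the 'flags' line of /proc/cpuinfo on Linux."""
    known = ["avx512f", "avx2", "avx", "fma", "neon", "sve"]
    present = set(flags.lower().split())
    return [f for f in known if f in present]

# Sample flags string for illustration
sample = "fpu vme avx avx2 fma sse4_2"
print(detect_simd(sample))  # → ['avx2', 'avx', 'fma']
```

With -DGGML_NATIVE=ON the build system detects and uses these features automatically, so this kind of check is only needed for debugging unexpected performance.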

Hybrid CPU+GPU Inference

For models larger than your GPU VRAM:
# Model has 80 layers but the GPU only has VRAM for 60:
# the first 60 layers run on the GPU, the remaining 20 on the CPU
llama-cli -m model.gguf -ngl 60
llama.cpp seamlessly handles models that don’t fit entirely in VRAM by offloading remaining layers to system RAM.

Multi-GPU Configuration

Layer Split Mode

Distribute model layers across multiple GPUs:
# Split evenly across all GPUs
llama-cli -m model.gguf -ngl 99

# Use GPU 1 as primary
llama-cli -m model.gguf -ngl 99 --main-gpu 1

# Custom split ratio (GPU 0 gets 3x more than GPU 1)
llama-cli -m model.gguf -ngl 99 --tensor-split 3,1
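For intuition, the ratio-to-layer-count arithmetic behind --tensor-split can be sketched in Python. split_layers is a hypothetical helper; llama.cpp's actual assignment policy may differ in how it rounds and where it places the remainder.

```python
def split_layers(total_layers: int, ratios: list) -> list:
    """Distribute layers across GPUs proportionally to --tensor-split
    ratios (illustrative sketch, not llama.cpp's real algorithm)."""
    total = sum(ratios)
    counts = [int(total_layers * r / total) for r in ratios]
    counts[0] += total_layers - sum(counts)  # give any remainder to GPU 0
    return counts

# 80 layers split 3:1 across two GPUs
print(split_layers(80, [3, 1]))  # → [60, 20]
```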

Tensor Parallel Mode

Split individual layers across GPUs (when supported):
llama-cli -m model.gguf -ngl 99 --split-mode row
Tensor parallelism requires backend support and may not be available for all model architectures.

Choosing the Right Backend

For Maximum Performance

  1. Apple Silicon: Metal (default)
  2. NVIDIA GPU: CUDA
  3. AMD GPU Linux: HIP
  4. AMD GPU Windows: Vulkan
  5. Intel GPU: SYCL
  6. CPU only: BLAS (OpenBLAS or Intel MKL)

For Maximum Compatibility

  1. Cross-platform GPU: Vulkan
  2. CPU: Built-in optimizations (AVX2/AVX512/NEON)

For Mobile/Embedded

  1. Qualcomm Snapdragon: OpenCL or Hexagon DSP
  2. Android: Vulkan
  3. ARM devices: NEON optimizations (automatic)

Performance Tips

Quantization

Use Q4_K_M or Q5_K_M quantization to fit larger models in GPU VRAM. See the Quantization Guide.
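To estimate how much a quantization level helps, one can approximate file size from average bits per weight. The bits-per-weight figures below are rough ballpark assumptions for illustration, not exact values for any specific GGUF quantization:

```python
def model_size_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate model size in GB from parameter count (in billions)
    and average bits per weight."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Rough comparison for a 7B-parameter model (illustrative bpw values)
for name, bpw in [("F16", 16.0), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{model_size_gb(7, bpw):.1f} GB")
```

This is why a quantized 7B model fits comfortably on an 8 GB card while the F16 version does not.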

Batch Size

Larger batch sizes better utilize GPU parallelism.
llama-cli -m model.gguf -ngl 99 -b 512

Flash Attention

Enable flash attention for faster inference (GPU only).
llama-cli -m model.gguf -fa

Context Size

Reduce context size if running out of VRAM.
llama-cli -m model.gguf -c 2048
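The VRAM saved by a smaller context comes largely from the KV cache, whose size grows linearly with context length. Below is a hedged estimate assuming an F16 cache and a Llama-7B-like shape; exact usage varies by architecture, attention variant, and cache quantization.

```python
def kv_cache_gb(n_layers: int, n_ctx: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size: keys plus values (factor of 2) for every
    layer and every context position, at F16 (2 bytes) by default."""
    elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1e9

# Illustrative shape: 32 layers, 32 KV heads, head dimension 128
print(f"{kv_cache_gb(32, 4096, 32, 128):.2f} GB at 4096 context")
print(f"{kv_cache_gb(32, 2048, 32, 128):.2f} GB at 2048 context")
```

Halving the context halves the cache, which is often enough to fit a few more layers on the GPU.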

Troubleshooting

Out of Memory

  1. Reduce -ngl to offload fewer layers
  2. Use more aggressive quantization (Q3_K_M or Q4_K_M)
  3. Reduce context size with -c
  4. Enable memory mapping with --mmap

Slow Performance

  1. Verify the GPU is actually being used (check nvidia-smi or Activity Monitor)
  2. Increase -ngl to offload more layers to the GPU
  3. Increase batch size with -b
  4. Check that GPU drivers are up to date
  5. Use llama-bench to measure performance

Build Failures

  1. Ensure the SDK/toolkit is properly installed (CUDA Toolkit, ROCm, etc.)
  2. Check environment variables (CUDA_PATH, ROCM_PATH)
  3. Update CMake to the latest version
  4. See the backend-specific documentation for detailed setup

Benchmarking

Measure performance on your hardware:
# Benchmark model
llama-bench -m model.gguf

# Output:
# | model         | size    | params | backend    | ngl | test  | t/s      |
# | ------------- | ------- | ------ | ---------- | --- | ----- | -------- |
# | llama 7B Q4_0 | 3.56 GB | 6.74 B | CUDA       | 99  | pp512 | 1234.56  |
# | llama 7B Q4_0 | 3.56 GB | 6.74 B | CUDA       | 99  | tg128 | 45.67    |
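If you want to post-process results programmatically, the markdown-style table that llama-bench prints can be parsed with a short script. This sketch assumes the column layout shown above (test name in column 6, tokens per second in column 7):

```python
def parse_bench(output: str) -> dict:
    """Parse llama-bench's markdown-style table into
    {test_name: tokens_per_second}."""
    results = {}
    for line in output.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip separator rows made only of dashes, and too-short lines
        if len(cells) >= 7 and not set(cells[0]) <= {"-", " "}:
            try:
                results[cells[5]] = float(cells[6])
            except ValueError:
                continue  # header row: 't/s' is not a number
    return results

sample = """\
| model         | size    | params | backend | ngl | test  | t/s     |
| ------------- | ------- | ------ | ------- | --- | ----- | ------- |
| llama 7B Q4_0 | 3.56 GB | 6.74 B | CUDA    | 99  | pp512 | 1234.56 |
| llama 7B Q4_0 | 3.56 GB | 6.74 B | CUDA    | 99  | tg128 | 45.67   |"""
print(parse_bench(sample))  # → {'pp512': 1234.56, 'tg128': 45.67}
```

Here pp512 is prompt processing throughput and tg128 is token generation throughput, the two numbers most worth tracking when comparing backends.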

Further Reading