# Compute Backends

llama.cpp supports a wide variety of compute backends to accelerate inference across different hardware platforms. Each backend is optimized for specific hardware architectures.

## Supported Backends
| Backend | Target Devices | Documentation |
|---|---|---|
| Metal | Apple Silicon (M1/M2/M3/M4) | Build Guide |
| CUDA | NVIDIA GPUs | Build Guide |
| HIP | AMD GPUs | Build Guide |
| Vulkan | Cross-platform GPUs | Build Guide |
| SYCL | Intel GPUs & NVIDIA GPUs | |
| CANN | Ascend NPUs (Huawei) | |
| MUSA | Moore Threads GPUs | Build Guide |
| OpenCL | Adreno GPUs (Qualcomm) | |
| BLAS | CPU acceleration (all platforms) | Build Guide |
| BLIS | CPU acceleration (AMD optimized) | |
| ZenDNN | AMD CPUs | |
| zDNN | IBM Z & LinuxONE | |
| RPC | Remote compute | RPC Documentation |
| VirtGPU | VirtGPU API | |
| Hexagon | Snapdragon DSP | |
| WebGPU | Browser/WASM | Build Guide (In Progress) |
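Backends are selected at build time via CMake options. As a sketch (the `GGML_*` flag names are the current convention; older llama.cpp releases used `LLAMA_*` names, so check the build guide for your version):

```shell
# Metal is enabled automatically on macOS; a plain build suffices:
cmake -B build
cmake --build build --config Release

# Other backends are opted into with a GGML_* flag, for example:
cmake -B build -DGGML_CUDA=ON      # NVIDIA CUDA
cmake -B build -DGGML_VULKAN=ON    # cross-platform Vulkan
cmake -B build -DGGML_HIP=ON       # AMD HIP/ROCm
```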
## Quick Selection Guide

### Apple Silicon Macs

**Metal** (default). Automatically enabled on macOS. Uses the GPU for maximum performance.

### NVIDIA GPUs

**CUDA**. Best performance on NVIDIA GPUs. Requires the CUDA Toolkit.

### AMD GPUs

**HIP** (Linux) or **Vulkan** (cross-platform). HIP provides the best performance on Linux; Vulkan works everywhere.

### Intel GPUs

**SYCL**. Optimized for Intel Data Center Max, Flex, Arc, and integrated GPUs.

### CPU Only

**BLAS** or **AVX2/AVX512**. Use OpenBLAS, Intel oneMKL, or native CPU features.

### Mobile/Embedded

**OpenCL** (Qualcomm) or **Hexagon** (Snapdragon DSP). Optimized for mobile SoCs and embedded devices.
## Backend Details
### Metal (Apple Silicon)

Default backend on macOS, automatically enabled during build.

**Features:**
- Native GPU acceleration on M1/M2/M3/M4 chips
- Unified memory lets the GPU access the full system RAM
- CPU paths optimized via ARM NEON and the Accelerate framework
- Zero configuration required

**Requirements:**
- Any Mac with Apple Silicon

Provides the best performance on macOS.
### CUDA (NVIDIA GPUs)

**Features:**
- Custom CUDA kernels optimized for LLM inference
- Supports all NVIDIA GPUs with compute capability ≥ 3.5
- Multi-GPU support with layer splitting
- Hybrid CPU+GPU inference for models larger than VRAM

**Requirements:**
- NVIDIA GPU with CUDA support
- CUDA Toolkit installed
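As a quick sketch (the model path is a placeholder), a CUDA build is typically run with all layers offloaded to the GPU:

```shell
# Assumes a CUDA-enabled build and a local GGUF model
./build/bin/llama-cli -m ./models/model.gguf -ngl 99 -p "Hello"
```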
The `-ngl` (`--n-gpu-layers`) parameter controls how many transformer layers run on the GPU. Use `-ngl 99` or `-ngl -1` to offload all layers.

### HIP (AMD GPUs)
**Features:**
- AMD GPU acceleration using the HIP runtime
- Performance similar to CUDA on comparable AMD hardware
- Supports Radeon RX and Instinct series

**Requirements:**
- AMD GPU with ROCm support
- ROCm installed (Linux only)

**Best for:**
- AMD GPUs on Linux
- Maximum performance on AMD hardware
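Building the HIP backend can be sketched as follows (the `GGML_HIP` flag follows the current llama.cpp convention; the `gfx1030` target is an example, so substitute your GPU's architecture):

```shell
# Assumes ROCm is installed and hipconfig is on PATH
HIPCXX="$(hipconfig -l)/clang" \
  cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```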
### Vulkan (Cross-platform GPU)

**Features:**
- Cross-platform GPU acceleration
- Works on NVIDIA, AMD, Intel, and mobile GPUs
- Supported on Windows, Linux, macOS, and Android
- Good fallback when native backends aren’t available

**Best for:**
- AMD GPUs on Windows
- Multi-vendor GPU systems
- When CUDA/Metal/HIP aren’t available
- Android devices
### SYCL (Intel & NVIDIA GPUs)

**Features:**
- Unified programming model for heterogeneous computing
- Optimized for Intel Data Center Max, Flex, and Arc GPUs
- Also supports NVIDIA GPUs
- CPU+GPU hybrid inference

**Supported hardware:**
- Intel Data Center GPU Max Series
- Intel Data Center GPU Flex Series
- Intel Arc Graphics (A-Series)
- Intel integrated GPUs
- NVIDIA GPUs (via the CUDA backend)
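A SYCL build with the Intel oneAPI toolchain is sketched below (the paths assume a default oneAPI installation):

```shell
# Load the oneAPI environment, then configure with the icx/icpx compilers
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release
```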
### BLAS (CPU Acceleration)

**Features:**
- Accelerates matrix operations on the CPU
- Improves prompt processing with large batch sizes
- Multiple implementations available

**Implementations:**
- **OpenBLAS** (recommended for most users)
- **Intel oneMKL** (best for Intel CPUs)
- **Apple Accelerate** (macOS default): automatically enabled on macOS, providing BLAS via Apple’s Accelerate framework
BLAS acceleration primarily improves prompt processing speed. It has minimal effect on token generation speed.
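An OpenBLAS-backed build can be sketched as follows (the `GGML_BLAS_VENDOR` values follow CMake's `BLA_VENDOR` conventions; verify against your llama.cpp version):

```shell
# Assumes OpenBLAS development headers are installed (e.g. libopenblas-dev)
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
```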
## CPU Native Optimizations

Even without BLAS, llama.cpp includes extensive CPU optimizations:

**x86_64 CPUs:**
- AVX, AVX2, AVX512 support (auto-detected)
- AMX (Advanced Matrix Extensions) on Intel Sapphire Rapids
- FMA (Fused Multiply-Add)

**ARM CPUs:**
- NEON SIMD instructions
- SVE (Scalable Vector Extension)
- ARMv8.2+ fp16 support

**RISC-V CPUs:**
- RVV (RISC-V Vector Extension)
- ZVFH, ZFH (half-precision floating point)
- ZICBOP (cache block operations)
- ZIHINTPAUSE (pause hint)
## Hybrid CPU+GPU Inference

For models larger than your GPU VRAM, llama.cpp seamlessly handles the overflow by offloading the remaining layers to system RAM.
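As an illustrative sketch (model path hypothetical), partial offload is simply a smaller `-ngl` value:

```shell
# Offload 20 layers to the GPU; the remaining layers run on the CPU from system RAM
./build/bin/llama-cli -m ./models/large-model.gguf -ngl 20 -p "Hello"
```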
## Multi-GPU Configuration

### Layer Split Mode

Distribute model layers across multiple GPUs (`--split-mode layer`, the default).

### Tensor Parallel Mode

Split individual layers across GPUs, where supported (`--split-mode row`).

## Choosing the Right Backend
### For Maximum Performance
- Apple Silicon: Metal (default)
- NVIDIA GPU: CUDA
- AMD GPU Linux: HIP
- AMD GPU Windows: Vulkan
- Intel GPU: SYCL
- CPU only: BLAS (OpenBLAS or Intel MKL)
### For Maximum Compatibility
- Cross-platform GPU: Vulkan
- CPU: Built-in optimizations (AVX2/AVX512/NEON)
### For Mobile/Embedded
- Qualcomm Snapdragon: OpenCL or Hexagon DSP
- Android: Vulkan
- ARM devices: NEON optimizations (automatic)
## Performance Tips

### Quantization

Use Q4_K_M or Q5_K_M quantization to fit larger models in GPU VRAM. See the Quantization Guide.

### Batch Size

Larger batch sizes better utilize GPU parallelism.

### Flash Attention

Enable flash attention for faster inference (GPU only).

### Context Size

Reduce context size if running out of VRAM.
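The tips above map onto `llama-cli` flags roughly as follows (a sketch; on recent builds `-fa` may take an `on`/`off` argument, and the model path is a placeholder):

```shell
# -ngl 99: offload all layers; -fa: flash attention;
# -b: batch size for prompt processing; -c: context length
./build/bin/llama-cli -m ./models/model.gguf -ngl 99 -fa -b 2048 -c 4096
```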
## Troubleshooting
### Out of Memory Errors

- Reduce `-ngl` to offload fewer layers
- Use more aggressive quantization (Q3_K_M or Q4_K_M)
- Reduce context size with `-c`
- Keep memory mapping enabled (the default; avoid `--no-mmap`)
### Slow Performance

- Verify the GPU is actually being used (check `nvidia-smi` or Activity Monitor)
- Increase `-ngl` to offload more layers to the GPU
- Increase batch size with `-b`
- Check that GPU drivers are up to date
- Use `llama-bench` to measure performance
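As a sketch, `llama-bench` reports prompt-processing and generation speed, and accepts comma-separated parameter lists for side-by-side comparison (model path hypothetical):

```shell
# Compare CPU-only against full GPU offload
./build/bin/llama-bench -m ./models/model.gguf -ngl 0,99
```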
### Build Failures

- Ensure the SDK/toolkit is properly installed (CUDA Toolkit, ROCm, etc.)
- Check environment variables (`CUDA_PATH`, `ROCM_PATH`)
- Update CMake to the latest version
- See the backend-specific documentation for detailed setup

