Welcome to llama.cpp
llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, both locally and in the cloud.
Get Started
Run your first LLM in minutes
API Reference
Explore the C/C++ API
Model Support
Browse 100+ supported models
Build Guide
Compile from source
Why llama.cpp?
Minimal Dependencies
Plain C/C++ implementation with no external dependencies
Multi-Platform
Runs on CPU, GPU, and NPU across x86, ARM, and mobile platforms
Quantization
1.5-bit to 8-bit integer quantization for reduced memory and faster inference
Hardware Acceleration
Optimized kernels for Metal, CUDA, Vulkan, SYCL, and more
Key Features
- Apple Silicon First-Class Support — Optimized via ARM NEON, Accelerate, and Metal frameworks
- x86 Acceleration — AVX, AVX2, AVX512, and AMX support
- GPU Support — Custom CUDA kernels for NVIDIA, HIP for AMD, MUSA for Moore Threads
- OpenAI-Compatible Server — Drop-in replacement with a /v1/chat/completions endpoint
- Multimodal Models — Support for vision models like LLaVA and Qwen2-VL
- Speculative Decoding — Parallel decoding for improved throughput
- CPU+GPU Hybrid Inference — Partially accelerate models larger than VRAM capacity
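As a sketch of the OpenAI-compatible server mentioned above (the model path is a placeholder; verify flags against your build):

```shell
# Start the HTTP server with a local GGUF model (listens on port 8080).
./llama-server -m ./models/model.Q4_K_M.gguf --port 8080

# In another shell, query the OpenAI-compatible chat endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```

Because the endpoint mirrors the OpenAI API shape, existing OpenAI client libraries can typically be pointed at the local server by changing only the base URL.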
Quick Example
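A minimal one-shot run with the llama-cli tool (the model path is a placeholder for a GGUF file you have downloaded):

```shell
# Generate up to 128 tokens from a single prompt.
./llama-cli -m ./models/model.Q4_K_M.gguf \
    -p "Explain quantization in one sentence." \
    -n 128
```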
Supported Models
llama.cpp supports 100+ model architectures, including:
- LLaMA — LLaMA, LLaMA 2, LLaMA 3, and fine-tunes
- Mistral — Mistral 7B, Mixtral MoE, and variants
- Gemma — Google Gemma and Gemma 2 models
- Qwen — Qwen and Qwen2 series
- Phi — Microsoft Phi models and PhiMoE
- Vision Models — LLaVA, Qwen2-VL, MiniCPM, and more
Performance
llama.cpp is optimized for inference across diverse hardware:
| Backend | Target Platform |
|---|---|
| Metal | Apple Silicon (M1/M2/M3/M4) |
| CUDA | NVIDIA GPUs |
| HIP | AMD GPUs |
| Vulkan | Cross-platform GPU |
| SYCL | Intel and NVIDIA GPUs |
| CANN | Ascend NPU |
| OpenCL | Adreno GPU |
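Backends in the table above are selected at build time via CMake options. A sketch for a CUDA-enabled build (option names follow the project's ggml-prefixed convention; check the build guide for your version):

```shell
# Configure with the CUDA backend enabled, then compile in Release mode.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```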
Community & Ecosystem
llama.cpp powers a vibrant ecosystem of tools, UIs, and language bindings:
- Language Bindings — Python, Node.js, Rust, Go, C#, Java, Swift, and more
- UIs — LM Studio, ollama, text-generation-webui, Jan, and dozens more
- Tools — Model conversion, quantization, benchmarking, and evaluation
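For example, the bundled quantization tool converts a full-precision GGUF model to a smaller integer format (file names here are placeholders):

```shell
# Quantize an F16 GGUF model down to 4-bit (Q4_K_M).
./llama-quantize ./models/model-f16.gguf ./models/model-Q4_K_M.gguf Q4_K_M
```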
Next Steps
Quickstart Guide
Run your first inference in 5 minutes
Installation
Install via package manager or build from source
Core Concepts
Understand GGUF format and quantization
REST API
Use the OpenAI-compatible HTTP server

