Welcome to llama.cpp

llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, both locally and in the cloud.

Get Started

Run your first LLM in minutes

API Reference

Explore the C/C++ API

Model Support

Browse 100+ supported models

Build Guide

Compile from source

Why llama.cpp?

Minimal Dependencies

Plain C/C++ implementation with no external dependencies

Multi-Platform

Runs on CPU, GPU, and NPU across x86, ARM, and mobile platforms

Quantization

1.5-bit to 8-bit integer quantization for reduced memory and faster inference

Hardware Acceleration

Optimized kernels for Metal, CUDA, Vulkan, SYCL, and more
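The quantization workflow above can be sketched with the bundled llama-quantize tool. The filenames here are placeholders; Q4_K_M is one of the available quantization types:

```shell
# Convert a full-precision GGUF model to 4-bit weights,
# reducing memory use at a small cost in quality
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

Lower-bit types trade more quality for smaller size; run llama-quantize with no arguments to list every supported type.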

Key Features

  • Apple Silicon First-Class Support — Optimized via ARM NEON, Accelerate, and Metal frameworks
  • x86 Acceleration — AVX, AVX2, AVX512, and AMX support
  • GPU Support — Custom CUDA kernels for NVIDIA, HIP for AMD, MUSA for Moore Threads
  • OpenAI-Compatible Server — Drop-in replacement with /v1/chat/completions endpoint
  • Multimodal Models — Support for vision models like LLaVA and Qwen2-VL
  • Speculative Decoding — Parallel decoding for improved throughput
  • CPU+GPU Hybrid Inference — Partially accelerate models larger than VRAM capacity
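The server and hybrid-inference features above can be combined in one session. A minimal sketch, assuming a local GGUF file (the filename and port are placeholders):

```shell
# Start the server; -ngl 99 offloads as many layers as fit onto the GPU,
# and any remaining layers run on the CPU (hybrid inference)
llama-server -m my_model.gguf --port 8080 -ngl 99

# From another terminal, query the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ]
  }'
```

Because the endpoint mirrors the OpenAI Chat Completions API, existing OpenAI client libraries can be pointed at the local server by changing only the base URL.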

Quick Example

# Download and run a model from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Or use a local GGUF file
llama-cli -m my_model.gguf -p "Hello, world!"

Supported Models

llama.cpp supports 100+ model architectures including:
  • LLaMA — LLaMA, LLaMA 2, LLaMA 3, and fine-tunes
  • Mistral — Mistral 7B, Mixtral MoE, and variants
  • Gemma — Google Gemma and Gemma 2 models
  • Qwen — Qwen and Qwen2 series
  • Phi — Microsoft Phi models and PhiMoE
  • Vision Models — LLaVA, Qwen2-VL, MiniCPM, and more
View full model support →

Performance

llama.cpp is optimized for inference across diverse hardware:
Backend   Target Platform
-------   ---------------
Metal     Apple Silicon (M1/M2/M3/M4)
CUDA      NVIDIA GPUs
HIP       AMD GPUs
Vulkan    Cross-platform GPU
SYCL      Intel and NVIDIA GPUs
CANN      Ascend NPU
OpenCL    Adreno GPUs
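Each backend in the table above is enabled at build time via a CMake flag. A minimal sketch for the CUDA backend, assuming the CUDA toolkit is installed (other backends such as Metal or Vulkan follow the same pattern with their own GGML_* flags):

```shell
# Configure the build with the CUDA backend enabled, then compile
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

See the Build Guide for the full list of backend flags and per-platform prerequisites.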

Community & Ecosystem

llama.cpp powers a vibrant ecosystem of tools, UIs, and language bindings:
  • Language Bindings — Python, Node.js, Rust, Go, C#, Java, Swift, and more
  • UIs — LM Studio, ollama, text-generation-webui, Jan, and dozens more
  • Tools — Model conversion, quantization, benchmarking, and evaluation

Next Steps

Quickstart Guide

Run your first inference in 5 minutes

Installation

Install via package manager or build from source

Core Concepts

Understand GGUF format and quantization

REST API

Use the OpenAI-compatible HTTP server