Welcome to llama.cpp

llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, both locally and in the cloud.

Get Started

Run your first LLM in minutes

API Reference

Explore the C/C++ API

Model Support

Browse 100+ supported models

Build Guide

Compile from source

Why llama.cpp?

Minimal Dependencies

Plain C/C++ implementation with no external dependencies

Multi-Platform

Runs on CPU, GPU, and NPU across x86, ARM, and mobile platforms

Quantization

1.5-bit to 8-bit integer quantization for reduced memory and faster inference

Hardware Acceleration

Optimized kernels for Metal, CUDA, Vulkan, SYCL, and more
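The quantization workflow above can be sketched with the bundled llama-quantize tool. The filenames here are placeholders; Q4_K_M is one of the available quantization types:

```shell
# Convert a full-precision GGUF model to 4-bit weights,
# reducing memory use at a small cost in quality
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

Lower-bit types trade more quality for smaller size; run llama-quantize with no arguments to list every supported type.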

Key Features

  • Apple Silicon First-Class Support — Optimized via ARM NEON, Accelerate, and Metal frameworks
  • x86 Acceleration — AVX, AVX2, AVX512, and AMX support
  • GPU Support — Custom CUDA kernels for NVIDIA, HIP for AMD, MUSA for Moore Threads
  • OpenAI-Compatible Server — Drop-in replacement with /v1/chat/completions endpoint
  • Multimodal Models — Support for vision models like LLaVA and Qwen2-VL
  • Speculative Decoding — Parallel decoding for improved throughput
  • CPU+GPU Hybrid Inference — Partially accelerate models larger than VRAM capacity
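The server and hybrid-inference features above can be combined in one session. A minimal sketch, assuming a local GGUF file (the filename and port are placeholders):

```shell
# Start the server; -ngl 99 offloads as many layers as fit onto the GPU,
# and any remaining layers run on the CPU (hybrid inference)
llama-server -m my_model.gguf --port 8080 -ngl 99

# From another terminal, query the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ]
  }'
```

Because the endpoint mirrors the OpenAI Chat Completions API, existing OpenAI client libraries can be pointed at the local server by changing only the base URL.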

Quick Example

# Download and run a model from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Or use a local GGUF file
llama-cli -m my_model.gguf -p "Hello, world!"

Supported Models

llama.cpp supports 100+ model architectures including:
  • LLaMA — LLaMA, LLaMA 2, LLaMA 3, and fine-tunes
  • Mistral — Mistral 7B, Mixtral MoE, and variants
  • Gemma — Google Gemma and Gemma 2 models
  • Qwen — Qwen and Qwen2 series
  • Phi — Microsoft Phi models and PhiMoE
  • Vision Models — LLaVA, Qwen2-VL, MiniCPM, and more
View full model support →

Performance

llama.cpp is optimized for inference across diverse hardware:
Backend   Target Platform
-------   ---------------
Metal     Apple Silicon (M1/M2/M3/M4)
CUDA      NVIDIA GPUs
HIP       AMD GPUs
Vulkan    Cross-platform GPU
SYCL      Intel and NVIDIA GPUs
CANN      Ascend NPU
OpenCL    Adreno GPUs
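Each backend in the table above is enabled at build time via a CMake flag. A minimal sketch for the CUDA backend, assuming the CUDA toolkit is installed (other backends such as Metal or Vulkan follow the same pattern with their own GGML_* flags):

```shell
# Configure the build with the CUDA backend enabled, then compile
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

See the Build Guide for the full list of backend flags and per-platform prerequisites.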

Community & Ecosystem

llama.cpp powers a vibrant ecosystem of tools, UIs, and language bindings:
  • Language Bindings — Python, Node.js, Rust, Go, C#, Java, Swift, and more
  • UIs — LM Studio, ollama, text-generation-webui, Jan, and dozens more
  • Tools — Model conversion, quantization, benchmarking, and evaluation

Next Steps

Quickstart Guide

Run your first inference in 5 minutes

Installation

Install via package manager or build from source

Core Concepts

Understand GGUF format and quantization

REST API

Use the OpenAI-compatible HTTP server