Open Source · MIT License
Tensor library for machine learning
ggml is a low-level C library for tensor operations and machine learning inference. It powers llama.cpp, whisper.cpp, and many other high-performance ML runtimes — with zero third-party dependencies.
Quickstart
Build from source and run your first ggml computation graph in minutes.
Core Concepts
Understand tensors, computation graphs, and automatic differentiation.
Backends
Run on CPU, CUDA, Metal, Vulkan, OpenCL, SYCL, WebGPU, or remotely via RPC.
API Reference
Full C API reference for context, tensors, operations, and graph execution.
Why ggml?
ggml was built to enable efficient machine learning inference on consumer hardware. It is the foundation of llama.cpp and whisper.cpp, enabling billions of people to run large language models locally.
No dependencies
Pure C/C++ with zero third-party library requirements. Integrate into any project.
Zero runtime allocs
Memory is pre-allocated at initialization. No heap allocations during computation.
Integer quantization
Q2 through Q8 quantization formats for dramatically reduced model size and memory use.
Auto-differentiation
Define computation graphs once; compute forward and backward passes automatically.
Multi-backend
Transparent dispatch to CPU, CUDA, Metal, Vulkan, and more via the backend scheduler.
GGUF format
Efficient binary file format for storing and loading quantized models with metadata.
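The "zero runtime allocs" design above comes from ggml's arena allocator: a single memory block is reserved when a context is created, and every tensor and piece of graph bookkeeping is carved out of it. A minimal sketch (the 16 MiB arena size is illustrative, not a recommendation):

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // One fixed arena, allocated up front. All tensors created through
    // this context live inside it, so no heap allocations happen later.
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024, // illustrative 16 MiB arena
        .mem_buffer = NULL,             // NULL: let ggml allocate the arena once
        .no_alloc   = false,            // also reserve space for tensor data
    };
    struct ggml_context * ctx = ggml_init(params);
    if (ctx == NULL) {
        fprintf(stderr, "ggml_init failed\n");
        return 1;
    }

    // ... create tensors and graphs against ctx ...

    ggml_free(ctx); // releases the whole arena at once
    return 0;
}
```

Setting `no_alloc = true` instead records tensor shapes without reserving data buffers, which is how backends place tensor data in device memory.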
Key capabilities
- Inference
- Training
- Quantization
ggml defines a computation graph API where tensor operations are recorded, then executed in bulk. This enables:
- Efficient memory reuse across forward passes
- Hardware-accelerated execution through pluggable backends
- Integer quantization for reduced memory bandwidth
- Flash attention and other fused operations for transformer models
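The record-then-execute flow above can be sketched as follows. This is a minimal example assuming ggml is built and on the include/link path; recent versions split the CPU compute helpers into a separate header, so details may vary by version:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Define inputs and record an op -- nothing is computed yet.
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    // Build the computation graph from the result tensor.
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // Fill the inputs, then execute the whole graph in bulk.
    for (int i = 0; i < 4; i++) {
        ggml_set_f32_1d(a, i, (float) i);
        ggml_set_f32_1d(b, i, 1.0f);
    }
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    printf("c[0] = %f\n", ggml_get_f32_1d(c, 0));
    ggml_free(ctx);
    return 0;
}
```

Because the graph is a data structure rather than eagerly executed code, the same recorded graph can be re-run across forward passes, reusing the same memory.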
Hardware support
ggml runs on a wide range of hardware through its pluggable backend system:

| Backend | Devices | Notes |
|---|---|---|
| CPU | x86, ARM, RISC-V, PowerPC | SIMD via AVX/AVX2/AVX-512/NEON/SVE |
| CUDA | NVIDIA GPUs | Requires CUDA toolkit |
| Metal | Apple Silicon, AMD GPUs (macOS) | Best performance on Apple devices |
| Vulkan | Cross-vendor GPU support | Linux, Windows, Android |
| OpenCL | AMD, Intel, Qualcomm GPUs | |
| SYCL | Intel GPUs, oneAPI | |
| WebGPU | Browser / WASM | |
| RPC | Remote devices | Distribute computation across machines |
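Backends are selected through the ggml-backend API rather than compile-time switches. A hedged sketch of the CPU backend (function names are from ggml's `ggml-backend.h`; the API is still evolving, so check the headers of your ggml version):

```c
#include "ggml.h"
#include "ggml-backend.h"

// Sketch: compute a previously built graph `gf` on the CPU backend.
// `ctx` is a ggml context created with no_alloc = true, so tensor
// data is placed in a backend buffer instead of the context arena.
void run_on_cpu(struct ggml_context * ctx, struct ggml_cgraph * gf) {
    ggml_backend_t backend = ggml_backend_cpu_init();

    // Allocate buffers for all tensors in ctx on this backend.
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    ggml_backend_graph_compute(backend, gf);

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
}
```

Swapping in CUDA, Metal, or another backend changes only the init call; the graph itself is backend-agnostic, and the backend scheduler can split one graph across several devices.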
Used by
ggml powers some of the most popular open-source ML projects:

- llama.cpp — Run LLaMA, Mistral, Qwen, and hundreds of other LLMs locally
- whisper.cpp — Real-time speech recognition
- stable-diffusion.cpp — Stable Diffusion image generation
ggml is under active development. Much of the ongoing development happens in the llama.cpp and whisper.cpp repositories before being upstreamed.
