Open Source · MIT License

Tensor library for machine learning

ggml is a low-level C library for tensor operations and machine learning inference. It powers llama.cpp, whisper.cpp, and many other high-performance ML runtimes — with zero third-party dependencies.

Quickstart

Build from source and run your first ggml computation graph in minutes.

Core Concepts

Understand tensors, computation graphs, and automatic differentiation.

Backends

Run on CPU, CUDA, Metal, Vulkan, OpenCL, SYCL, WebGPU, or remotely via RPC.

API Reference

Full C API reference for context, tensors, operations, and graph execution.

Why ggml?

ggml was built to enable efficient machine learning inference on consumer hardware. It is the foundation of llama.cpp and whisper.cpp, making it possible to run large language models locally on everyday devices.

No dependencies

Pure C/C++ with zero third-party library requirements. Integrate into any project.

Zero runtime allocs

Memory is pre-allocated at initialization. No heap allocations during computation.

Integer quantization

Q2 through Q8 quantization formats for dramatically reduced model size and memory use.

Auto-differentiation

Define computation graphs once; compute forward and backward passes automatically.

Multi-backend

Transparent dispatch to CPU, CUDA, Metal, Vulkan, and more via the backend scheduler.

GGUF format

Efficient binary file format for storing and loading quantized models with metadata.

Key capabilities

ggml defines a computation graph API where tensor operations are recorded, then executed in bulk. This enables:
  • Efficient memory reuse across forward passes
  • Hardware-accelerated execution through pluggable backends
  • Integer quantization for reduced memory bandwidth
  • Flash attention and other fused operations for transformer models
// ggml.h + ggml-cpu.h
struct ggml_init_params params = {
    .mem_size   = 256*1024*1024,
    .mem_buffer = NULL,
};
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, c);
ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);  // from ggml-cpu.h

ggml_free(ctx);

All tensors and the graph above live inside the 256 MiB context buffer; a single `ggml_free` releases everything at once.

Hardware support

ggml runs on a wide range of hardware through its pluggable backend system:
| Backend | Devices | Notes |
| --- | --- | --- |
| CPU | x86, ARM, RISC-V, PowerPC | SIMD via AVX/AVX2/AVX-512/NEON/SVE |
| CUDA | NVIDIA GPUs | Requires CUDA toolkit |
| Metal | Apple Silicon, AMD GPUs (macOS) | Best performance on Apple devices |
| Vulkan | Cross-vendor GPU support | Linux, Windows, Android |
| OpenCL | AMD, Intel, Qualcomm GPUs | |
| SYCL | Intel GPUs, oneAPI | |
| WebGPU | Browser / WASM | |
| RPC | Remote devices | Distribute computation across machines |
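Backends are selected at build time via CMake options. A typical build enabling one backend might look like the following sketch (the repository URL and the `GGML_<BACKEND>` flag convention are taken from the ggml project; check the repo's README for the exact flags your version supports):

```shell
# Clone and build ggml with the CUDA backend enabled.
# Other backends follow the same pattern, e.g. -DGGML_METAL=ON or
# -DGGML_VULKAN=ON; a CPU-only build needs no extra flags.
git clone https://github.com/ggml-org/ggml
cd ggml
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```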

Used by

ggml powers some of the most popular open-source ML projects, including llama.cpp and whisper.cpp.

ggml is under active development. Much of the ongoing work happens in the llama.cpp and whisper.cpp repositories before being upstreamed.
