Open Source · MIT License

Tensor library for machine learning

ggml is a low-level C library for tensor operations and machine learning inference. It powers llama.cpp, whisper.cpp, and many other high-performance ML runtimes — with zero third-party dependencies.

Quickstart

Build from source and run your first ggml computation graph in minutes.

Core Concepts

Understand tensors, computation graphs, and automatic differentiation.

Backends

Run on CPU, CUDA, Metal, Vulkan, OpenCL, SYCL, WebGPU, or remotely via RPC.

API Reference

Full C API reference for context, tensors, operations, and graph execution.

Why ggml?

ggml was built to enable efficient machine learning inference on consumer hardware. It is the foundation of llama.cpp and whisper.cpp, making it possible to run large language models locally on everyday devices.

No dependencies

Pure C/C++ with zero third-party library requirements. Integrate into any project.

Zero runtime allocs

Memory is pre-allocated at initialization. No heap allocations during computation.

Integer quantization

Q2 through Q8 quantization formats for dramatically reduced model size and memory use.

Auto-differentiation

Define computation graphs once; compute forward and backward passes automatically.

Multi-backend

Transparent dispatch to CPU, CUDA, Metal, Vulkan, and more via the backend scheduler.

GGUF format

Efficient binary file format for storing and loading quantized models with metadata.

Key capabilities

ggml defines a computation graph API where tensor operations are recorded, then executed in bulk. This enables:
  • Efficient memory reuse across forward passes
  • Hardware-accelerated execution through pluggable backends
  • Integer quantization for reduced memory bandwidth
  • Flash attention and other fused operations for transformer models
// ggml.h + ggml-cpu.h
struct ggml_init_params params = {
    .mem_size   = 256*1024*1024,
    .mem_buffer = NULL,
};
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, c);
ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);  // from ggml-cpu.h

ggml_free(ctx);

All tensors and the graph above live inside the 256 MiB context buffer; a single `ggml_free` releases everything at once.

Hardware support

ggml runs on a wide range of hardware through its pluggable backend system:
| Backend | Devices | Notes |
| --- | --- | --- |
| CPU | x86, ARM, RISC-V, PowerPC | SIMD via AVX/AVX2/AVX-512/NEON/SVE |
| CUDA | NVIDIA GPUs | Requires CUDA toolkit |
| Metal | Apple Silicon, AMD GPUs (macOS) | Best performance on Apple devices |
| Vulkan | Cross-vendor GPU support | Linux, Windows, Android |
| OpenCL | AMD, Intel, Qualcomm GPUs | |
| SYCL | Intel GPUs, oneAPI | |
| WebGPU | Browser / WASM | |
| RPC | Remote devices | Distribute computation across machines |
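Backends are selected at build time via CMake options. A typical build enabling one backend might look like the following sketch (the repository URL and the `GGML_<BACKEND>` flag convention are taken from the ggml project; check the repo's README for the exact flags your version supports):

```shell
# Clone and build ggml with the CUDA backend enabled.
# Other backends follow the same pattern, e.g. -DGGML_METAL=ON or
# -DGGML_VULKAN=ON; a CPU-only build needs no extra flags.
git clone https://github.com/ggml-org/ggml
cd ggml
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```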

Used by

ggml powers some of the most popular open-source ML projects, including llama.cpp and whisper.cpp.

ggml is under active development. Much of the ongoing work happens in the llama.cpp and whisper.cpp repositories before being upstreamed.
