The CPU backend is ggml’s built-in execution target. It requires no external dependencies, works on every supported platform, and is always available as a fallback when no GPU backend is present.

Initialization

#include <stdio.h>
#include "ggml-cpu.h"

ggml_backend_t backend = ggml_backend_cpu_init();
if (!backend) {
    fprintf(stderr, "failed to initialize CPU backend\n");
    return 1;
}
You can also use the generic backend selector, which returns the CPU backend when no GPU is found:
// Returns the best GPU, or CPU if none is available
ggml_backend_t backend = ggml_backend_init_best();

// Always returns the CPU backend
ggml_backend_t cpu = ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, NULL);
Call ggml_backend_load_all() before using ggml_backend_init_best() or ggml_backend_init_by_type() so that all available backends, including dynamically loaded ones, are registered.
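The selection flow above can be sketched end to end (a minimal sketch; real code would build and compute graphs where the comment indicates):

```c
#include <stdio.h>
#include "ggml-backend.h"

int main(void) {
    // Register all available backends before querying the registry
    ggml_backend_load_all();

    // Prefer the best GPU backend; falls back to CPU if none is found
    ggml_backend_t backend = ggml_backend_init_best();
    if (!backend) {
        fprintf(stderr, "no backend available\n");
        return 1;
    }

    // ... allocate tensors and compute graphs on `backend` ...

    ggml_backend_free(backend);
    return 0;
}
```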

Thread configuration

The CPU backend parallelises operations across threads. You control the thread count after initialization:
// Set the number of threads for graph compute
ggml_backend_cpu_set_n_threads(backend, 8);
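A sketch of how the thread count interacts with graph compute: it applies to subsequent computations and can be changed between them (assumes a backend and a built graph already exist):

```c
#include "ggml-cpu.h"
#include "ggml-backend.h"

// Sketch: run the same graph with different thread counts
void compute_twice(ggml_backend_t backend, struct ggml_cgraph * graph) {
    ggml_backend_cpu_set_n_threads(backend, 8);  // 8 worker threads
    ggml_backend_graph_compute(backend, graph);

    ggml_backend_cpu_set_n_threads(backend, 1);  // single-threaded
    ggml_backend_graph_compute(backend, graph);
}
```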

Custom thread pool

For finer control — including thread affinity and NUMA-awareness — create a ggml_threadpool and attach it:
#include "ggml-cpu.h"

struct ggml_threadpool_params tp_params = ggml_threadpool_params_default(8);
struct ggml_threadpool * pool = ggml_threadpool_new(&tp_params);

ggml_backend_cpu_set_threadpool(backend, pool);

// When done:
ggml_threadpool_free(pool);
Thread pool management functions:
Function                               Description
ggml_threadpool_new(params)            Create a thread pool with the given parameters
ggml_threadpool_free(pool)             Destroy the thread pool
ggml_threadpool_get_n_threads(pool)    Query the thread count
ggml_threadpool_pause(pool)            Suspend worker threads
ggml_threadpool_resume(pool)           Resume suspended threads
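Pause and resume are useful for keeping workers suspended between bursts of compute. A sketch (the `paused` field of ggml_threadpool_params is taken from ggml's header; verify against your version):

```c
#include "ggml-cpu.h"
#include "ggml-backend.h"

// Sketch: start the pool suspended and wake it only around compute
void burst_compute(ggml_backend_t backend, struct ggml_cgraph * graph) {
    struct ggml_threadpool_params tpp = ggml_threadpool_params_default(8);
    tpp.paused = true;  // workers start suspended (assumed field name)

    struct ggml_threadpool * pool = ggml_threadpool_new(&tpp);
    ggml_backend_cpu_set_threadpool(backend, pool);

    ggml_threadpool_resume(pool);                // wake workers
    ggml_backend_graph_compute(backend, graph);
    ggml_threadpool_pause(pool);                 // suspend between batches

    ggml_threadpool_free(pool);
}
```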

NUMA support

On systems with multiple NUMA nodes, initialise ggml’s NUMA support before creating backends:
// Choose a strategy appropriate for your system
ggml_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);
Strategy                         Description
GGML_NUMA_STRATEGY_DISABLED      No NUMA awareness (default)
GGML_NUMA_STRATEGY_DISTRIBUTE    Distribute threads across nodes
GGML_NUMA_STRATEGY_ISOLATE       Pin all threads to one node
GGML_NUMA_STRATEGY_NUMACTL       Honour numactl binding from the shell
GGML_NUMA_STRATEGY_MIRROR        Mirror allocation across nodes

SIMD optimisations

ggml detects CPU features at runtime and selects the most capable implementation for each operation. You can query which extensions are available:
// Each returns 1 if the CPU supports the extension, 0 otherwise
ggml_cpu_has_avx()          // AVX
ggml_cpu_has_avx2()         // AVX2
ggml_cpu_has_avx512()       // AVX-512F
ggml_cpu_has_avx512_vnni()  // AVX-512 VNNI
ggml_cpu_has_avx512_bf16()  // AVX-512 BF16
ggml_cpu_has_avx_vnni()     // AVX-VNNI
ggml_cpu_has_fma()          // FMA3
ggml_cpu_has_f16c()         // F16C (CVT16)
ggml_cpu_has_amx_int8()     // Intel AMX INT8
ggml_cpu_has_bmi2()         // BMI2
You do not need to call these functions to get SIMD acceleration — ggml selects the best path automatically. Use them only if you need to log or assert specific capabilities.
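If you do want to log capabilities at startup, the checks can be collected in one place; a small sketch:

```c
#include <stdio.h>
#include "ggml-cpu.h"

// Print the detected x86 extensions, e.g. once at startup
static void log_cpu_features(void) {
    printf("AVX     : %d\n", ggml_cpu_has_avx());
    printf("AVX2    : %d\n", ggml_cpu_has_avx2());
    printf("AVX512F : %d\n", ggml_cpu_has_avx512());
    printf("FMA     : %d\n", ggml_cpu_has_fma());
    printf("F16C    : %d\n", ggml_cpu_has_f16c());
}
```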

Abort callback

You can register a callback that the CPU backend will call periodically during graph compute. Return true to abort execution:
// Flag set elsewhere (e.g. from another thread or a signal handler)
static volatile bool should_cancel = false;

bool my_abort(void * data) {
    (void) data;          // user data passed to ggml_backend_cpu_set_abort_callback
    return should_cancel; // return true to stop computation
}

ggml_backend_cpu_set_abort_callback(backend, my_abort, NULL);

Reference implementations

For debugging or correctness testing, force the backend to use unoptimised scalar code:
ggml_backend_cpu_set_use_ref(backend, true);

Build configuration

The CPU backend is compiled into ggml unconditionally. No additional CMake flags are required. SIMD paths are enabled automatically when the target compiler supports them.
cmake -B build
cmake --build build
To target a specific architecture on x86:
# Enable AVX2 and FMA explicitly
target_compile_options(ggml PRIVATE -mavx2 -mfma)
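ggml's build system also exposes options for these paths, which avoids patching compile flags directly (option names as found in ggml's CMakeLists; treat them as version-dependent):

```shell
# Build for a fixed x86 feature set instead of the host CPU
cmake -B build -DGGML_NATIVE=OFF -DGGML_AVX2=ON -DGGML_FMA=ON -DGGML_F16C=ON
cmake --build build
```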

API summary

Function                                                  Description
ggml_backend_cpu_init()                                   Create a CPU backend instance
ggml_backend_is_cpu(backend)                              Check whether a backend is the CPU backend
ggml_backend_cpu_set_n_threads(backend, n)                Set the thread count
ggml_backend_cpu_set_threadpool(backend, pool)            Attach a custom thread pool
ggml_backend_cpu_set_abort_callback(backend, cb, data)    Register an abort callback
ggml_backend_cpu_set_use_ref(backend, use_ref)            Force reference (scalar) implementations
ggml_backend_cpu_reg()                                    Return the CPU backend registry entry
