The backend API provides a hardware-agnostic abstraction for executing GGML computation graphs. Backends represent specific hardware devices (CPU, GPU, etc.) and manage memory buffers, tensor data transfer, and graph execution.

Type definitions

typedef struct ggml_backend_buffer_type * ggml_backend_buffer_type_t;
typedef struct ggml_backend_buffer      * ggml_backend_buffer_t;
typedef struct ggml_backend_event       * ggml_backend_event_t;
typedef struct ggml_backend             * ggml_backend_t;
typedef void                            * ggml_backend_graph_plan_t;
typedef struct ggml_backend_reg         * ggml_backend_reg_t;
typedef struct ggml_backend_device      * ggml_backend_dev_t;

Buffer usage enum

enum ggml_backend_buffer_usage {
    GGML_BACKEND_BUFFER_USAGE_ANY     = 0,
    GGML_BACKEND_BUFFER_USAGE_WEIGHTS = 1,
    GGML_BACKEND_BUFFER_USAGE_COMPUTE = 2,
};
Value
Description
GGML_BACKEND_BUFFER_USAGE_ANY: General-purpose buffer with no special semantics.
GGML_BACKEND_BUFFER_USAGE_WEIGHTS: Buffer holds model weights; operations on tensors in this buffer are preferentially scheduled onto the same backend.
GGML_BACKEND_BUFFER_USAGE_COMPUTE: Buffer used for intermediate computation tensors.
Set weight buffers to GGML_BACKEND_BUFFER_USAGE_WEIGHTS before creating a scheduler. This lets the scheduler co-locate operations with the weights and reduce cross-device copies.

Buffer type API

A buffer type (ggml_backend_buffer_type_t) describes how a backend allocates and manages memory. You use it to create concrete buffers.
Returns the human-readable name of a buffer type.
const char * ggml_backend_buft_name(ggml_backend_buffer_type_t buft);
buft
ggml_backend_buffer_type_t
required
The buffer type to query.
Returns a null-terminated string. The caller must not free it.
Allocates a new backend buffer of the given size.
ggml_backend_buffer_t ggml_backend_buft_alloc_buffer(
    ggml_backend_buffer_type_t buft,
    size_t size);
buft
ggml_backend_buffer_type_t
required
The buffer type that defines where the memory is allocated.
size
size_t
required
Size of the buffer in bytes.
Returns a new buffer, or NULL on failure. Free with ggml_backend_buffer_free.
Returns the required memory alignment for this buffer type in bytes.
size_t ggml_backend_buft_get_alignment(ggml_backend_buffer_type_t buft);
buft
ggml_backend_buffer_type_t
required
The buffer type to query.
Returns true if the buffer type is accessible directly from the host CPU.
bool ggml_backend_buft_is_host(ggml_backend_buffer_type_t buft);
buft
ggml_backend_buffer_type_t
required
The buffer type to query.

Buffer API

A buffer (ggml_backend_buffer_t) is a concrete allocation of device memory. Tensors are assigned into buffers before being used in graph computation.
Returns the human-readable name of a buffer.
const char * ggml_backend_buffer_name(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.
Frees a backend buffer and releases all memory it holds.
void ggml_backend_buffer_free(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to free. Passing NULL is safe.
Returns a raw pointer to the start of the buffer’s memory region.
void * ggml_backend_buffer_get_base(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.
Returns NULL for device buffers not directly accessible from the host.
Returns the total size of the buffer in bytes.
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.
Fills the entire buffer with a constant byte value.
void ggml_backend_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value);
buffer
ggml_backend_buffer_t
required
The buffer to clear.
value
uint8_t
required
The byte value written to every position in the buffer.
Returns true if the buffer is in host-accessible memory.
bool ggml_backend_buffer_is_host(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.

Backend (stream) API

A backend (ggml_backend_t) represents a compute stream on a device. Most programs create one backend per device and use it throughout the session.
Returns a globally unique identifier for the backend instance.
ggml_guid_t ggml_backend_guid(ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to query.
Returns the human-readable name of the backend.
const char * ggml_backend_name(ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to query.
Destroys the backend and releases its resources.
void ggml_backend_free(ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to free.
Returns the default buffer type for this backend. Use this type when allocating buffers without a specific device preference.
ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(
    ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to query.
Allocates a buffer of the given size using the backend’s default buffer type.
ggml_backend_buffer_t ggml_backend_alloc_buffer(
    ggml_backend_t backend,
    size_t         size);
backend
ggml_backend_t
required
The backend that owns the allocation.
size
size_t
required
Size in bytes.
Equivalent to calling ggml_backend_buft_alloc_buffer with ggml_backend_get_default_buffer_type(backend).

Tensor operations

Copies data from a host buffer into a tensor (synchronous).
void ggml_backend_tensor_set(
    struct ggml_tensor * tensor,
    const void         * data,
    size_t               offset,
    size_t               size);
tensor
struct ggml_tensor *
required
Destination tensor.
data
const void *
required
Source data in host memory.
offset
size_t
required
Byte offset into tensor->data at which to start writing.
size
size_t
required
Number of bytes to copy.
Copies data from a tensor into a host buffer (synchronous).
void ggml_backend_tensor_get(
    const struct ggml_tensor * tensor,
    void                     * data,
    size_t                     offset,
    size_t                     size);
tensor
const struct ggml_tensor *
required
Source tensor.
data
void *
required
Destination buffer in host memory.
offset
size_t
required
Byte offset into tensor->data at which to start reading.
size
size_t
required
Number of bytes to copy.
Copies tensor data between two backends. Either or both may be device backends.
void ggml_backend_tensor_copy(
    struct ggml_tensor * src,
    struct ggml_tensor * dst);
src
struct ggml_tensor *
required
Source tensor (can reside on any backend).
dst
struct ggml_tensor *
required
Destination tensor (can reside on any backend).
The source and destination shapes and types must match.

Graph computation

Creates a reusable execution plan for a computation graph. Plans can be executed multiple times without re-analyzing the graph structure.
ggml_backend_graph_plan_t ggml_backend_graph_plan_create(
    ggml_backend_t       backend,
    struct ggml_cgraph * cgraph);
backend
ggml_backend_t
required
The backend that will execute the plan.
cgraph
struct ggml_cgraph *
required
The computation graph to plan.
Free the plan with ggml_backend_graph_plan_free when done.
Executes a previously created graph plan.
enum ggml_status ggml_backend_graph_plan_compute(
    ggml_backend_t            backend,
    ggml_backend_graph_plan_t plan);
backend
ggml_backend_t
required
The backend that owns the plan.
plan
ggml_backend_graph_plan_t
required
The plan to execute.
Returns GGML_STATUS_SUCCESS on success.
Executes a computation graph directly, without a pre-created plan.
enum ggml_status ggml_backend_graph_compute(
    ggml_backend_t       backend,
    struct ggml_cgraph * cgraph);
backend
ggml_backend_t
required
The backend to run the graph on.
cgraph
struct ggml_cgraph *
required
The computation graph to execute.
Returns GGML_STATUS_SUCCESS on success. For repeated execution of the same graph topology, prefer creating a plan with ggml_backend_graph_plan_create.

Synchronization

Creates a new synchronization event on the given device.
ggml_backend_event_t ggml_backend_event_new(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device that will record and wait on the event.
Free with ggml_backend_event_free.
Destroys a synchronization event.
void ggml_backend_event_free(ggml_backend_event_t event);
event
ggml_backend_event_t
required
The event to destroy.
Blocks the calling thread until the event has been recorded and all preceding operations on its backend have completed.
void ggml_backend_event_synchronize(ggml_backend_event_t event);
event
ggml_backend_event_t
required
The event to wait on.

Device API

A device (ggml_backend_dev_t) represents a physical or logical hardware unit. Multiple backend streams can be created from a single device.

Device type enum

enum ggml_backend_dev_type {
    GGML_BACKEND_DEVICE_TYPE_CPU,   // CPU device using system memory
    GGML_BACKEND_DEVICE_TYPE_GPU,   // GPU device using dedicated memory
    GGML_BACKEND_DEVICE_TYPE_IGPU,  // integrated GPU using host memory
    GGML_BACKEND_DEVICE_TYPE_ACCEL, // accelerator (e.g. BLAS, AMX)
};
Returns the short name of the device (e.g. "CUDA0").
const char * ggml_backend_dev_name(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device to query.
Returns a longer human-readable description of the device (e.g. "NVIDIA GeForce RTX 4090").
const char * ggml_backend_dev_description(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device to query.
Queries free and total memory available on the device.
void ggml_backend_dev_memory(
    ggml_backend_dev_t device,
    size_t           * free,
    size_t           * total);
device
ggml_backend_dev_t
required
The device to query.
free
size_t *
required
Output: free memory in bytes.
total
size_t *
required
Output: total memory in bytes.
Returns the type of the device.
enum ggml_backend_dev_type ggml_backend_dev_type(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device to query.

Backend scheduler

The scheduler (ggml_backend_sched_t) enables transparent multi-device execution. It partitions the computation graph, assigns operations to the most suitable backend, and handles buffer allocation and inter-device tensor copies automatically.
typedef struct ggml_backend_sched * ggml_backend_sched_t;
Backends with a lower index in the array passed to ggml_backend_sched_new have higher scheduling priority.

Example usage

// Mark weight buffers so the scheduler prefers to run ops on the same backend
ggml_backend_buffer_set_usage(buf_weights, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

// Create a scheduler with GPU first, CPU as fallback
ggml_backend_t backends[]   = { backend_gpu, backend_cpu };
ggml_backend_sched_t sched  = ggml_backend_sched_new(
    backends, NULL, 2, GGML_DEFAULT_GRAPH_SIZE, false, true);

// Reserve buffers using a worst-case graph (optional but recommended)
ggml_backend_sched_reserve(sched, build_graph(sched, max_batch));

// Run the graph — allocation happens automatically on first call
struct ggml_cgraph * graph = build_graph(sched);
for (int i = 0; i < 10; ++i) {
    ggml_backend_sched_graph_compute(sched, graph);
}

ggml_backend_sched_free(sched);
Creates a new backend scheduler.
ggml_backend_sched_t ggml_backend_sched_new(
    ggml_backend_t             * backends,
    ggml_backend_buffer_type_t * bufts,
    int                          n_backends,
    size_t                       graph_size,
    bool                         parallel,
    bool                         op_offload);
backends
ggml_backend_t *
required
Array of backends to use. Index 0 has the highest priority.
bufts
ggml_backend_buffer_type_t *
Optional array of buffer types (one per backend). Pass NULL to use each backend’s default buffer type.
n_backends
int
required
Number of backends in the array.
graph_size
size_t
required
Maximum number of nodes expected in a computation graph. Use GGML_DEFAULT_GRAPH_SIZE if unsure.
parallel
bool
required
Whether to allow concurrent execution across backends.
op_offload
bool
required
Whether to offload supported operations to non-CPU backends automatically.
Free with ggml_backend_sched_free.
Allocates (if needed) and executes the computation graph across all scheduled backends.
enum ggml_status ggml_backend_sched_graph_compute(
    ggml_backend_sched_t sched,
    struct ggml_cgraph * graph);
sched
ggml_backend_sched_t
required
The scheduler.
graph
struct ggml_cgraph *
required
The computation graph to execute.
Returns GGML_STATUS_SUCCESS on success. On the first call, buffers are allocated automatically.
Destroys the scheduler and releases all associated resources.
void ggml_backend_sched_free(ggml_backend_sched_t sched);
sched
ggml_backend_sched_t
required
The scheduler to free.

Backend registry

The registry tracks all loaded backends and their devices. Use these functions to enumerate available hardware and load dynamic backend plugins.
Loads a backend from a dynamic library file and registers it.
ggml_backend_reg_t ggml_backend_load(const char * path);
path
const char *
required
File system path to the shared library (e.g. "libggml-cuda.so").
Returns the registration handle, or NULL on failure. Unload with ggml_backend_unload.
Discovers and loads all known backend shared libraries from the default search path.
void ggml_backend_load_all(void);
Call this once at startup if you want automatic hardware discovery without manually specifying backend paths.
Returns the total number of registered devices across all loaded backends.
size_t ggml_backend_dev_count(void);
Returns the device at the given index in the global device list.
ggml_backend_dev_t ggml_backend_dev_get(size_t index);
index
size_t
required
Zero-based device index. Must be less than ggml_backend_dev_count().
