The backend API provides a hardware-agnostic abstraction for executing GGML computation graphs. Backends represent specific hardware devices (CPU, GPU, etc.) and manage memory buffers, tensor data transfer, and graph execution.

Type definitions

typedef struct ggml_backend_buffer_type * ggml_backend_buffer_type_t;
typedef struct ggml_backend_buffer      * ggml_backend_buffer_t;
typedef struct ggml_backend_event       * ggml_backend_event_t;
typedef struct ggml_backend             * ggml_backend_t;
typedef void                            * ggml_backend_graph_plan_t;
typedef struct ggml_backend_reg         * ggml_backend_reg_t;
typedef struct ggml_backend_device      * ggml_backend_dev_t;

Buffer usage enum

enum ggml_backend_buffer_usage {
    GGML_BACKEND_BUFFER_USAGE_ANY     = 0,
    GGML_BACKEND_BUFFER_USAGE_WEIGHTS = 1,
    GGML_BACKEND_BUFFER_USAGE_COMPUTE = 2,
};
Value
Description
GGML_BACKEND_BUFFER_USAGE_ANY: General-purpose buffer with no special semantics.
GGML_BACKEND_BUFFER_USAGE_WEIGHTS: Buffer holds model weights; operations on tensors in this buffer are preferentially scheduled onto the same backend.
GGML_BACKEND_BUFFER_USAGE_COMPUTE: Buffer used for intermediate computation tensors.
Set weight buffers to GGML_BACKEND_BUFFER_USAGE_WEIGHTS before creating a scheduler. This lets the scheduler co-locate operations with the weights and reduce cross-device copies.

Buffer type API

A buffer type (ggml_backend_buffer_type_t) describes how a backend allocates and manages memory. You use it to create concrete buffers.
Returns the human-readable name of a buffer type.
const char * ggml_backend_buft_name(ggml_backend_buffer_type_t buft);
buft
ggml_backend_buffer_type_t
required
The buffer type to query.
Returns a null-terminated string. The caller must not free it.
Allocates a new backend buffer of the given size.
ggml_backend_buffer_t ggml_backend_buft_alloc_buffer(
    ggml_backend_buffer_type_t buft,
    size_t size);
buft
ggml_backend_buffer_type_t
required
The buffer type that defines where the memory is allocated.
size
size_t
required
Size of the buffer in bytes.
Returns a new buffer, or NULL on failure. Free with ggml_backend_buffer_free.
Returns the required memory alignment for this buffer type in bytes.
size_t ggml_backend_buft_get_alignment(ggml_backend_buffer_type_t buft);
buft
ggml_backend_buffer_type_t
required
The buffer type to query.
Returns true if the buffer type is accessible directly from the host CPU.
bool ggml_backend_buft_is_host(ggml_backend_buffer_type_t buft);
buft
ggml_backend_buffer_type_t
required
The buffer type to query.

Buffer API

A buffer (ggml_backend_buffer_t) is a concrete allocation of device memory. Tensors are assigned into buffers before being used in graph computation.
Returns the human-readable name of a buffer.
const char * ggml_backend_buffer_name(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.
Frees a backend buffer and releases all memory it holds.
void ggml_backend_buffer_free(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to free. Passing NULL is safe.
Returns a raw pointer to the start of the buffer’s memory region.
void * ggml_backend_buffer_get_base(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.
Returns NULL for device buffers not directly accessible from the host.
Returns the total size of the buffer in bytes.
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.
Fills the entire buffer with a constant byte value.
void ggml_backend_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value);
buffer
ggml_backend_buffer_t
required
The buffer to clear.
value
uint8_t
required
The byte value written to every position in the buffer.
Returns true if the buffer is in host-accessible memory.
bool ggml_backend_buffer_is_host(ggml_backend_buffer_t buffer);
buffer
ggml_backend_buffer_t
required
The buffer to query.

Backend (stream) API

A backend (ggml_backend_t) represents a compute stream on a device. Most programs create one backend per device and use it throughout the session.
Returns a globally unique identifier for the backend instance.
ggml_guid_t ggml_backend_guid(ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to query.
Returns the human-readable name of the backend.
const char * ggml_backend_name(ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to query.
Destroys the backend and releases its resources.
void ggml_backend_free(ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to free.
Returns the default buffer type for this backend. Use this type when allocating buffers without a specific device preference.
ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(
    ggml_backend_t backend);
backend
ggml_backend_t
required
The backend to query.
Allocates a buffer of the given size using the backend’s default buffer type.
ggml_backend_buffer_t ggml_backend_alloc_buffer(
    ggml_backend_t backend,
    size_t         size);
backend
ggml_backend_t
required
The backend that owns the allocation.
size
size_t
required
Size in bytes.
Equivalent to calling ggml_backend_buft_alloc_buffer with ggml_backend_get_default_buffer_type(backend).

Tensor operations

Copies data from a host buffer into a tensor (synchronous).
void ggml_backend_tensor_set(
    struct ggml_tensor * tensor,
    const void         * data,
    size_t               offset,
    size_t               size);
tensor
struct ggml_tensor *
required
Destination tensor.
data
const void *
required
Source data in host memory.
offset
size_t
required
Byte offset into tensor->data at which to start writing.
size
size_t
required
Number of bytes to copy.
Copies data from a tensor into a host buffer (synchronous).
void ggml_backend_tensor_get(
    const struct ggml_tensor * tensor,
    void                     * data,
    size_t                     offset,
    size_t                     size);
tensor
const struct ggml_tensor *
required
Source tensor.
data
void *
required
Destination buffer in host memory.
offset
size_t
required
Byte offset into tensor->data at which to start reading.
size
size_t
required
Number of bytes to copy.
Copies tensor data between two backends. Either or both may be device backends.
void ggml_backend_tensor_copy(
    struct ggml_tensor * src,
    struct ggml_tensor * dst);
src
struct ggml_tensor *
required
Source tensor (can reside on any backend).
dst
struct ggml_tensor *
required
Destination tensor (can reside on any backend).
The source and destination shapes and types must match.

Graph computation

Creates a reusable execution plan for a computation graph. Plans can be executed multiple times without re-analyzing the graph structure.
ggml_backend_graph_plan_t ggml_backend_graph_plan_create(
    ggml_backend_t       backend,
    struct ggml_cgraph * cgraph);
backend
ggml_backend_t
required
The backend that will execute the plan.
cgraph
struct ggml_cgraph *
required
The computation graph to plan.
Free the plan with ggml_backend_graph_plan_free when done.
Executes a previously created graph plan.
enum ggml_status ggml_backend_graph_plan_compute(
    ggml_backend_t            backend,
    ggml_backend_graph_plan_t plan);
backend
ggml_backend_t
required
The backend that owns the plan.
plan
ggml_backend_graph_plan_t
required
The plan to execute.
Returns GGML_STATUS_SUCCESS on success.
Executes a computation graph directly, without a pre-created plan.
enum ggml_status ggml_backend_graph_compute(
    ggml_backend_t       backend,
    struct ggml_cgraph * cgraph);
backend
ggml_backend_t
required
The backend to run the graph on.
cgraph
struct ggml_cgraph *
required
The computation graph to execute.
Returns GGML_STATUS_SUCCESS on success. For repeated execution of the same graph topology, prefer creating a plan with ggml_backend_graph_plan_create.

Synchronization

Creates a new synchronization event on the given device.
ggml_backend_event_t ggml_backend_event_new(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device that will record and wait on the event.
Free with ggml_backend_event_free.
Destroys a synchronization event.
void ggml_backend_event_free(ggml_backend_event_t event);
event
ggml_backend_event_t
required
The event to destroy.
Blocks the calling thread until the event has been recorded and all preceding operations on its backend have completed.
void ggml_backend_event_synchronize(ggml_backend_event_t event);
event
ggml_backend_event_t
required
The event to wait on.

Device API

A device (ggml_backend_dev_t) represents a physical or logical hardware unit. Multiple backend streams can be created from a single device.

Device type enum

enum ggml_backend_dev_type {
    GGML_BACKEND_DEVICE_TYPE_CPU,   // CPU device using system memory
    GGML_BACKEND_DEVICE_TYPE_GPU,   // GPU device using dedicated memory
    GGML_BACKEND_DEVICE_TYPE_IGPU,  // integrated GPU using host memory
    GGML_BACKEND_DEVICE_TYPE_ACCEL, // accelerator (e.g. BLAS, AMX)
};
Returns the short name of the device (e.g. "CUDA0").
const char * ggml_backend_dev_name(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device to query.
Returns a longer human-readable description of the device (e.g. "NVIDIA GeForce RTX 4090").
const char * ggml_backend_dev_description(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device to query.
Queries free and total memory available on the device.
void ggml_backend_dev_memory(
    ggml_backend_dev_t device,
    size_t           * free,
    size_t           * total);
device
ggml_backend_dev_t
required
The device to query.
free
size_t *
required
Output: free memory in bytes.
total
size_t *
required
Output: total memory in bytes.
Returns the type of the device.
enum ggml_backend_dev_type ggml_backend_dev_type(ggml_backend_dev_t device);
device
ggml_backend_dev_t
required
The device to query.

Backend scheduler

The scheduler (ggml_backend_sched_t) enables transparent multi-device execution. It partitions the computation graph, assigns operations to the most suitable backend, and handles buffer allocation and inter-device tensor copies automatically.
typedef struct ggml_backend_sched * ggml_backend_sched_t;
Backends with a lower index in the array passed to ggml_backend_sched_new have higher scheduling priority.

Example usage

// Mark weight buffers so the scheduler prefers to run ops on the same backend
ggml_backend_buffer_set_usage(buf_weights, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

// Create a scheduler with GPU first, CPU as fallback
ggml_backend_t backends[]   = { backend_gpu, backend_cpu };
ggml_backend_sched_t sched  = ggml_backend_sched_new(
    backends, NULL, 2, GGML_DEFAULT_GRAPH_SIZE, false, true);

// Reserve buffers using a worst-case graph (optional but recommended)
ggml_backend_sched_reserve(sched, build_graph(sched, max_batch));

// Run the graph — allocation happens automatically on first call
struct ggml_cgraph * graph = build_graph(sched);
for (int i = 0; i < 10; ++i) {
    ggml_backend_sched_graph_compute(sched, graph);
}

ggml_backend_sched_free(sched);
Creates a new backend scheduler.
ggml_backend_sched_t ggml_backend_sched_new(
    ggml_backend_t             * backends,
    ggml_backend_buffer_type_t * bufts,
    int                          n_backends,
    size_t                       graph_size,
    bool                         parallel,
    bool                         op_offload);
backends
ggml_backend_t *
required
Array of backends to use. Index 0 has the highest priority.
bufts
ggml_backend_buffer_type_t *
Optional array of buffer types (one per backend). Pass NULL to use each backend’s default buffer type.
n_backends
int
required
Number of backends in the array.
graph_size
size_t
required
Maximum number of nodes expected in a computation graph. Use GGML_DEFAULT_GRAPH_SIZE if unsure.
parallel
bool
required
Whether to allow concurrent execution across backends.
op_offload
bool
required
Whether to offload supported operations to non-CPU backends automatically.
Free with ggml_backend_sched_free.
Allocates (if needed) and executes the computation graph across all scheduled backends.
enum ggml_status ggml_backend_sched_graph_compute(
    ggml_backend_sched_t sched,
    struct ggml_cgraph * graph);
sched
ggml_backend_sched_t
required
The scheduler.
graph
struct ggml_cgraph *
required
The computation graph to execute.
Returns GGML_STATUS_SUCCESS on success. On the first call, buffers are allocated automatically.
Destroys the scheduler and releases all associated resources.
void ggml_backend_sched_free(ggml_backend_sched_t sched);
sched
ggml_backend_sched_t
required
The scheduler to free.

Backend registry

The registry tracks all loaded backends and their devices. Use these functions to enumerate available hardware and load dynamic backend plugins.
Loads a backend from a dynamic library file and registers it.
ggml_backend_reg_t ggml_backend_load(const char * path);
path
const char *
required
File system path to the shared library (e.g. "libggml-cuda.so").
Returns the registration handle, or NULL on failure. Unload with ggml_backend_unload.
Discovers and loads all known backend shared libraries from the default search path.
void ggml_backend_load_all(void);
Call this once at startup if you want automatic hardware discovery without manually specifying backend paths.
Returns the total number of registered devices across all loaded backends.
size_t ggml_backend_dev_count(void);
Returns the device at the given index in the global device list.
ggml_backend_dev_t ggml_backend_dev_get(size_t index);
index
size_t
required
Zero-based device index. Must be less than ggml_backend_dev_count().
