The backend API provides a hardware-agnostic abstraction for executing GGML computation graphs. Backends represent specific hardware devices (CPU, GPU, etc.) and manage memory buffers, tensor data transfer, and graph execution.
Type definitions
typedef struct ggml_backend_buffer_type * ggml_backend_buffer_type_t;
typedef struct ggml_backend_buffer * ggml_backend_buffer_t;
typedef struct ggml_backend_event * ggml_backend_event_t;
typedef struct ggml_backend * ggml_backend_t;
typedef void * ggml_backend_graph_plan_t;
typedef struct ggml_backend_reg * ggml_backend_reg_t;
typedef struct ggml_backend_device * ggml_backend_dev_t;
Buffer usage enum
enum ggml_backend_buffer_usage {
    GGML_BACKEND_BUFFER_USAGE_ANY     = 0,
    GGML_BACKEND_BUFFER_USAGE_WEIGHTS = 1,
    GGML_BACKEND_BUFFER_USAGE_COMPUTE = 2,
};
GGML_BACKEND_BUFFER_USAGE_ANY: General-purpose buffer with no special semantics.
GGML_BACKEND_BUFFER_USAGE_WEIGHTS: Buffer holds model weights; operations on tensors in this buffer are preferably scheduled to the same backend.
GGML_BACKEND_BUFFER_USAGE_COMPUTE: Buffer used for intermediate computation tensors.
Set weight buffers to GGML_BACKEND_BUFFER_USAGE_WEIGHTS before creating a scheduler. This lets the scheduler co-locate operations with the weights and reduce cross-device copies.
Buffer type API
A buffer type (ggml_backend_buffer_type_t) describes how a backend allocates and manages memory. You use it to create concrete buffers.
ggml_backend_buft_name
Returns the human-readable name of a buffer type.
const char * ggml_backend_buft_name(ggml_backend_buffer_type_t buft);
buft (ggml_backend_buffer_type_t, required): The buffer type to query.
Returns a null-terminated string. The caller must not free it.
ggml_backend_buft_alloc_buffer
Allocates a new backend buffer of the given size.
ggml_backend_buffer_t ggml_backend_buft_alloc_buffer(
    ggml_backend_buffer_type_t buft,
    size_t size);
buft (ggml_backend_buffer_type_t, required): The buffer type that defines where the memory is allocated.
size (size_t, required): Size of the buffer in bytes.
Returns a new buffer, or NULL on failure. Free with ggml_backend_buffer_free.
ggml_backend_buft_get_alignment
Returns the required memory alignment for this buffer type in bytes.
size_t ggml_backend_buft_get_alignment(ggml_backend_buffer_type_t buft);
buft (ggml_backend_buffer_type_t, required): The buffer type to query.
ggml_backend_buft_is_host
Returns true if the buffer type is accessible directly from the host CPU.
bool ggml_backend_buft_is_host(ggml_backend_buffer_type_t buft);
buft (ggml_backend_buffer_type_t, required): The buffer type to query.
Buffer API
A buffer (ggml_backend_buffer_t) is a concrete allocation of device memory. Tensors are allocated within buffers before being used in graph computation.
ggml_backend_buffer_name
Returns the human-readable name of a buffer.
const char * ggml_backend_buffer_name(ggml_backend_buffer_t buffer);
buffer (ggml_backend_buffer_t, required): The buffer to query.
ggml_backend_buffer_free
Frees a backend buffer and releases all memory it holds.
void ggml_backend_buffer_free(ggml_backend_buffer_t buffer);
buffer (ggml_backend_buffer_t, required): The buffer to free. Passing NULL is safe.
ggml_backend_buffer_get_base
Returns a raw pointer to the start of the buffer's memory region.
void * ggml_backend_buffer_get_base(ggml_backend_buffer_t buffer);
buffer (ggml_backend_buffer_t, required): The buffer to query.
Returns NULL for device buffers not directly accessible from the host.
ggml_backend_buffer_get_size
Returns the total size of the buffer in bytes.
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer);
buffer (ggml_backend_buffer_t, required): The buffer to query.
ggml_backend_buffer_clear
Fills the entire buffer with a constant byte value.
void ggml_backend_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value);
buffer (ggml_backend_buffer_t, required): The buffer to clear.
value (uint8_t, required): The byte value written to every byte of the buffer.
ggml_backend_buffer_is_host
Returns true if the buffer is in host-accessible memory.
bool ggml_backend_buffer_is_host(ggml_backend_buffer_t buffer);
buffer (ggml_backend_buffer_t, required): The buffer to query.
Backend (stream) API
A backend (ggml_backend_t) represents a compute stream on a device. Most programs create one backend per device and use it throughout the session.
ggml_backend_guid
Returns a globally unique identifier for the backend instance.
ggml_guid_t ggml_backend_guid(ggml_backend_t backend);
ggml_backend_name
Returns the human-readable name of the backend.
const char * ggml_backend_name(ggml_backend_t backend);
ggml_backend_free
Destroys the backend and releases its resources.
void ggml_backend_free(ggml_backend_t backend);
ggml_backend_get_default_buffer_type
Returns the default buffer type for this backend. Use this type when allocating buffers without a specific device preference.
ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(ggml_backend_t backend);
ggml_backend_alloc_buffer
Allocates a buffer of the given size using the backend's default buffer type.
ggml_backend_buffer_t ggml_backend_alloc_buffer(
    ggml_backend_t backend,
    size_t size);
backend (ggml_backend_t, required): The backend that owns the allocation.
size (size_t, required): Size of the buffer in bytes.
Equivalent to calling ggml_backend_buft_alloc_buffer with ggml_backend_get_default_buffer_type(backend).
Tensor operations
ggml_backend_tensor_set
Copies data from a host buffer into a tensor (synchronous).
void ggml_backend_tensor_set(
    struct ggml_tensor * tensor,
    const void * data,
    size_t offset,
    size_t size);
tensor (struct ggml_tensor *, required): Destination tensor.
data (const void *, required): Source data in host memory.
offset (size_t, required): Byte offset into tensor->data at which to start writing.
size (size_t, required): Number of bytes to copy.
ggml_backend_tensor_get
Copies data from a tensor into a host buffer (synchronous).
void ggml_backend_tensor_get(
    const struct ggml_tensor * tensor,
    void * data,
    size_t offset,
    size_t size);
tensor (const struct ggml_tensor *, required): Source tensor.
data (void *, required): Destination buffer in host memory.
offset (size_t, required): Byte offset into tensor->data at which to start reading.
size (size_t, required): Number of bytes to copy.
ggml_backend_tensor_copy
Copies tensor data between two backends. Either or both may be device backends.
void ggml_backend_tensor_copy(
    struct ggml_tensor * src,
    struct ggml_tensor * dst);
src (struct ggml_tensor *, required): Source tensor (can reside on any backend).
dst (struct ggml_tensor *, required): Destination tensor (can reside on any backend).
The source and destination shapes and types must match.
Graph computation
ggml_backend_graph_plan_create
Creates a reusable execution plan for a computation graph. Plans can be executed multiple times without re-analyzing the graph structure.
ggml_backend_graph_plan_t ggml_backend_graph_plan_create(
    ggml_backend_t backend,
    struct ggml_cgraph * cgraph);
backend (ggml_backend_t, required): The backend that will execute the plan.
cgraph (struct ggml_cgraph *, required): The computation graph to plan.
Free the plan with ggml_backend_graph_plan_free when done.
ggml_backend_graph_plan_compute
Executes a previously created graph plan.
enum ggml_status ggml_backend_graph_plan_compute(
    ggml_backend_t backend,
    ggml_backend_graph_plan_t plan);
backend (ggml_backend_t, required): The backend that owns the plan.
plan (ggml_backend_graph_plan_t, required): The plan to execute.
Returns GGML_STATUS_SUCCESS on success.
ggml_backend_graph_compute
Executes a computation graph directly, without a pre-created plan.
enum ggml_status ggml_backend_graph_compute(
    ggml_backend_t backend,
    struct ggml_cgraph * cgraph);
backend (ggml_backend_t, required): The backend to run the graph on.
cgraph (struct ggml_cgraph *, required): The computation graph to execute.
Returns GGML_STATUS_SUCCESS on success. For repeated execution of the same graph topology, prefer creating a plan with ggml_backend_graph_plan_create.
Synchronization
ggml_backend_event_new
Creates a new synchronization event on the given device.
ggml_backend_event_t ggml_backend_event_new(ggml_backend_dev_t device);
device (ggml_backend_dev_t, required): The device that will record and wait on the event.
Free with ggml_backend_event_free.
ggml_backend_event_free
Destroys a synchronization event.
void ggml_backend_event_free(ggml_backend_event_t event);
event (ggml_backend_event_t, required): The event to destroy.
ggml_backend_event_synchronize
Blocks the calling thread until the event has been recorded and all preceding operations on its backend have completed.
void ggml_backend_event_synchronize(ggml_backend_event_t event);
event (ggml_backend_event_t, required): The event to wait on.
Device API
A device (ggml_backend_dev_t) represents a physical or logical hardware unit. Multiple backend streams can be created from a single device.
Device type enum
enum ggml_backend_dev_type {
    GGML_BACKEND_DEVICE_TYPE_CPU,   // CPU device using system memory
    GGML_BACKEND_DEVICE_TYPE_GPU,   // GPU device using dedicated memory
    GGML_BACKEND_DEVICE_TYPE_IGPU,  // integrated GPU using host memory
    GGML_BACKEND_DEVICE_TYPE_ACCEL, // accelerator (e.g. BLAS, AMX)
};
ggml_backend_dev_name
Returns the short name of the device (e.g. "CUDA0").
const char * ggml_backend_dev_name(ggml_backend_dev_t device);
device (ggml_backend_dev_t, required): The device to query.
ggml_backend_dev_description
Returns a longer human-readable description of the device (e.g. "NVIDIA GeForce RTX 4090").
const char * ggml_backend_dev_description(ggml_backend_dev_t device);
device (ggml_backend_dev_t, required): The device to query.
ggml_backend_dev_memory
Queries free and total memory available on the device.
void ggml_backend_dev_memory(
    ggml_backend_dev_t device,
    size_t * free,
    size_t * total);
device (ggml_backend_dev_t, required): The device to query.
free (size_t *, required): Output: free memory in bytes.
total (size_t *, required): Output: total memory in bytes.
ggml_backend_dev_type
Returns the type of the device.
enum ggml_backend_dev_type ggml_backend_dev_type(ggml_backend_dev_t device);
device (ggml_backend_dev_t, required): The device to query.
Backend scheduler
The scheduler (ggml_backend_sched_t) enables transparent multi-device execution. It partitions the computation graph, assigns operations to the most suitable backend, and handles buffer allocation and inter-device tensor copies automatically.
typedef struct ggml_backend_sched * ggml_backend_sched_t;
Backends with a lower index in the array passed to ggml_backend_sched_new have higher scheduling priority.
Example usage
// Mark weight buffers so the scheduler prefers to run ops on the same backend
ggml_backend_buffer_set_usage(buf_weights, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

// Create a scheduler with GPU first, CPU as fallback
ggml_backend_t backends[] = { backend_gpu, backend_cpu };
ggml_backend_sched_t sched = ggml_backend_sched_new(
    backends, NULL, 2, GGML_DEFAULT_GRAPH_SIZE, false, true);

// Reserve buffers using a worst-case graph (optional but recommended)
ggml_backend_sched_reserve(sched, build_graph(sched, max_batch));

// Run the graph; allocation happens automatically on first call
struct ggml_cgraph * graph = build_graph(sched);
for (int i = 0; i < 10; ++i) {
    ggml_backend_sched_graph_compute(sched, graph);
}

ggml_backend_sched_free(sched);
ggml_backend_sched_new
Creates a new backend scheduler.
ggml_backend_sched_t ggml_backend_sched_new(
    ggml_backend_t * backends,
    ggml_backend_buffer_type_t * bufts,
    int n_backends,
    size_t graph_size,
    bool parallel,
    bool op_offload);
backends (ggml_backend_t *, required): Array of backends to use. Index 0 has the highest priority.
bufts (ggml_backend_buffer_type_t *, optional): Array of buffer types (one per backend). Pass NULL to use each backend's default buffer type.
n_backends (int, required): Number of backends in the array.
graph_size (size_t, required): Maximum number of nodes expected in a computation graph. Use GGML_DEFAULT_GRAPH_SIZE if unsure.
parallel (bool, required): Whether to allow concurrent execution across backends.
op_offload (bool, required): Whether to offload supported operations to non-CPU backends automatically.
Free with ggml_backend_sched_free.
ggml_backend_sched_graph_compute
Allocates (if needed) and executes the computation graph across all scheduled backends.
enum ggml_status ggml_backend_sched_graph_compute(
    ggml_backend_sched_t sched,
    struct ggml_cgraph * graph);
sched (ggml_backend_sched_t, required): The scheduler.
graph (struct ggml_cgraph *, required): The computation graph to execute.
Returns GGML_STATUS_SUCCESS on success. On the first call, buffers are allocated automatically.
ggml_backend_sched_free
Destroys the scheduler and releases all associated resources.
void ggml_backend_sched_free(ggml_backend_sched_t sched);
sched (ggml_backend_sched_t, required): The scheduler to free.
Backend registry
The registry tracks all loaded backends and their devices. Use these functions to enumerate available hardware and load dynamic backend plugins.
ggml_backend_load
Loads a backend from a dynamic library file and registers it.
ggml_backend_reg_t ggml_backend_load(const char * path);
path (const char *, required): File system path to the shared library (e.g. "libggml-cuda.so").
Returns the registration handle, or NULL on failure. Unload with ggml_backend_unload.
ggml_backend_load_all
Discovers and loads all known backend shared libraries from the default search path.
void ggml_backend_load_all(void);
Call this once at startup if you want automatic hardware discovery without manually specifying backend paths.
ggml_backend_dev_count
Returns the total number of registered devices across all loaded backends.
size_t ggml_backend_dev_count(void);
ggml_backend_dev_get
Returns the device at the given index in the global device list.
ggml_backend_dev_t ggml_backend_dev_get(size_t index);
index (size_t, required): Zero-based device index. Must be less than ggml_backend_dev_count().