- ggml_tallocr (tensor allocator): a simple linear allocator that places one tensor at a time into a pre-existing buffer.
- ggml_gallocr (graph allocator): an allocator that analyses a full computation graph, reuses intermediate memory where possible, and allocates all tensors in a single pass.
For most workloads, prefer ggml_gallocr. Use ggml_tallocr only when you need precise, manual control over individual tensor placement.
Tensor allocator (ggml_tallocr)
ggml_tallocr is a lightweight linear allocator backed by a single backend buffer.
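To make "linear allocator" concrete, here is a minimal sketch of the general technique in plain C. It is a toy model, not ggml's implementation: `toy_linear_alloc`, `toy_pad`, and `toy_linear_place` are hypothetical names, and the power-of-two alignment rule is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a linear allocator; not ggml's actual struct. */
struct toy_linear_alloc {
    size_t capacity;   /* total bytes in the backing buffer */
    size_t alignment;  /* placements are rounded up to this (assumed power of two) */
    size_t offset;     /* next free byte */
};

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t toy_pad(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Reserve `size` bytes. On success, store the placement offset and return 0.
 * Return -1 and leave the allocator unchanged if the buffer is exhausted. */
static int toy_linear_place(struct toy_linear_alloc *a, size_t size, size_t *out_offset) {
    size_t padded = toy_pad(size, a->alignment);
    if (padded > a->capacity - a->offset) {
        return -1;
    }
    *out_offset = a->offset;
    a->offset += padded;
    return 0;
}
```

Each placement only moves the offset forward, so nothing is ever freed or reused individually. That is why this style of allocator is cheap, and why it suits cases where you control the placement order yourself.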
ggml_tallocr_new
Creates a tensor allocator backed by an existing buffer.
Parameter: an already-allocated backend buffer. The allocator does not take ownership; you are still responsible for freeing the buffer.
Returns a value-type ggml_tallocr struct. No heap allocation is made by this call.
ggml_tallocr_alloc
Allocates space for a single tensor within the allocator’s buffer.
Parameters, in order:
- The allocator to use.
- The tensor whose data pointer will be set to the allocated region.
Returns GGML_STATUS_SUCCESS on success, or an error code if the buffer is exhausted.
Graph allocator (ggml_gallocr)
ggml_gallocr inspects the full computation graph, identifies tensors whose lifetimes do not overlap, and reuses memory between them. This significantly reduces peak memory usage compared to allocating each tensor independently.
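The lifetime-based reuse described above can be sketched with a small first-fit planner. This is a simplified illustration under stated assumptions, not ggml's algorithm: `toy_block` and `toy_plan` are hypothetical names, nodes are assumed to be evaluated in index order, and blocks are never split or merged.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 16

/* One region of the shared buffer; owner is the node using it, or -1 if free. */
struct toy_block {
    size_t offset;
    size_t size;
    int    owner;
};

/*
 * Plan offsets for n nodes evaluated in order 0..n-1.
 *   sizes[i]    : bytes needed by node i's result
 *   last_use[i] : index of the last node that reads node i's result
 *   offsets[i]  : receives the planned offset of node i
 * Returns the total buffer size the plan needs.
 */
static size_t toy_plan(const size_t *sizes, const int *last_use, int n, size_t *offsets) {
    struct toy_block blocks[TOY_MAX_NODES];
    int    n_blocks = 0;
    size_t total    = 0;

    assert(n <= TOY_MAX_NODES);

    for (int step = 0; step < n; step++) {
        /* Free every block whose owner is not read by this or any later node. */
        for (int b = 0; b < n_blocks; b++) {
            if (blocks[b].owner >= 0 && last_use[blocks[b].owner] < step) {
                blocks[b].owner = -1;
            }
        }
        /* First fit: reuse a free block that is large enough. */
        int chosen = -1;
        for (int b = 0; b < n_blocks; b++) {
            if (blocks[b].owner < 0 && blocks[b].size >= sizes[step]) {
                chosen = b;
                break;
            }
        }
        /* Otherwise grow the buffer with a new block at the end. */
        if (chosen < 0) {
            chosen = n_blocks++;
            blocks[chosen].offset = total;
            blocks[chosen].size   = sizes[step];
            total += sizes[step];
        }
        blocks[chosen].owner = step;
        offsets[step] = blocks[chosen].offset;
    }
    return total;
}
```

For a chain of four 1024-byte results where each result is read only by the next node, this plan needs 2048 bytes instead of the 4096 bytes that independent allocation would use. If every result must stay live until the end, no reuse is possible and the plan needs the full 4096 bytes.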
Special tensor flags
Two flags influence graph allocator behaviour:
- ggml_set_input(tensor): input tensors are placed at non-overlapping addresses at the start of the graph so they remain valid throughout execution.
- ggml_set_output(tensor): output tensors are never freed or overwritten, ensuring their data is readable after ggml_gallocr_alloc_graph returns.
Quick start
ggml_gallocr_new
Creates a graph allocator that uses a single buffer type for all tensors. Free it with ggml_gallocr_free.
Parameter: the buffer type to allocate from. Use ggml_backend_cpu_buffer_type() for CPU execution, or a device-specific type for GPU execution.
ggml_gallocr_new_n
ggml_gallocr_free
Frees the graph allocator and all buffers it owns.
Parameter: the allocator to free.
Reservation
Calling ggml_gallocr_reserve with a worst-case graph pre-sizes all internal buffers. This avoids reallocation during the hot path and gives you a stable buffer size measurement.
Reservation is optional for single-buffer allocators: ggml_gallocr_alloc_graph will reallocate automatically if the graph topology changes. For multi-buffer allocators, you must call ggml_gallocr_reserve_n before the topology changes, or ggml_gallocr_alloc_graph will return false.
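The benefit of reserving for the worst case up front can be illustrated with a toy growable buffer. This sketch models only buffer size, not graph topology, and does not use ggml: `toy_buffer`, `toy_buffer_ensure`, and `toy_buffer_free` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator-owned buffer; not ggml's actual type. */
struct toy_buffer {
    void  *data;
    size_t capacity;
    int    n_reallocs;  /* how many times the buffer had to grow */
};

/* Make sure the buffer holds at least `needed` bytes, growing it if required.
 * Returns 0 on success, -1 if the system allocation failed. */
static int toy_buffer_ensure(struct toy_buffer *buf, size_t needed) {
    if (needed <= buf->capacity) {
        return 0;  /* already large enough: no reallocation */
    }
    void *grown = realloc(buf->data, needed);
    if (grown == NULL) {
        return -1;
    }
    buf->data     = grown;
    buf->capacity = needed;
    buf->n_reallocs++;
    return 0;
}

/* Release the backing memory and reset the capacity. */
static void toy_buffer_free(struct toy_buffer *buf) {
    free(buf->data);
    buf->data     = NULL;
    buf->capacity = 0;
}
```

Feeding request sizes of 1000, 4000, 2000, and 8000 bytes to an empty buffer grows it three times. Calling `toy_buffer_ensure` once with the worst case (8000) first means the same sequence triggers no further growth, which is the effect reservation aims for on the hot path.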
ggml_gallocr_reserve
Pre-allocates internal buffers to fit the given graph without modifying any tensor data pointers.
Parameters, in order:
- The allocator to configure.
- A representative (ideally worst-case) computation graph.
Returns true on success. Returns false if the underlying buffer allocation failed.
ggml_gallocr_reserve_n
Like ggml_gallocr_reserve, but also specifies which buffer index each node and leaf tensor should be placed in.
Parameters, in order:
- The allocator to configure.
- The representative computation graph.
- Array of buffer indices (one per node in the graph). Element i selects the buffer that the i-th graph node is allocated from.
- Array of buffer indices (one per leaf tensor in the graph).
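As a rough illustration of how a per-node index array maps nodes to buffers, the sketch below sums node sizes per buffer. It is a toy model that ignores memory reuse, so the totals are upper bounds. `toy_sizes_per_buffer` is a hypothetical helper, not part of ggml.

```c
#include <assert.h>
#include <stddef.h>

/* Sum node sizes per buffer. buffer_ids[i] names the buffer for node i.
 * Returns 0 on success, -1 if any index is outside [0, n_buffers). */
static int toy_sizes_per_buffer(const size_t *node_sizes, const int *buffer_ids, int n_nodes,
                                size_t *buffer_sizes, int n_buffers) {
    for (int b = 0; b < n_buffers; b++) {
        buffer_sizes[b] = 0;
    }
    for (int i = 0; i < n_nodes; i++) {
        int id = buffer_ids[i];
        if (id < 0 || id >= n_buffers) {
            return -1;  /* index does not name a valid buffer */
        }
        buffer_sizes[id] += node_sizes[i];
    }
    return 0;
}
```

With node sizes {100, 200, 300, 400} and indices {0, 1, 0, 1}, buffer 0 receives 400 bytes and buffer 1 receives 600 bytes.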
Allocation and sizing
ggml_gallocr_alloc_graph
Allocates all tensors in the graph, reusing memory between tensors whose lifetimes do not overlap.
Parameters, in order:
- The allocator to use.
- The computation graph whose tensors will be allocated.
Returns true on success. For single-buffer allocators, the backing buffer is reallocated automatically if the graph topology changed since the last call. For multi-buffer allocators, the function returns false instead; call ggml_gallocr_reserve_n first.
ggml_gallocr_get_buffer_size
Utility functions
These helpers allocate all tensors in a ggml_context into a single backend buffer in one call. They are the simplest way to prepare model weights for inference.
ggml_backend_alloc_ctx_tensors_from_buft
Allocates all tensors in the context into a new buffer of the given type.
Parameters, in order:
- The context whose tensors should be allocated. The context must have been created with no_alloc = true.
- The buffer type to allocate from.
Returns the allocated buffer. The caller is responsible for freeing it with ggml_backend_buffer_free.
ggml_backend_alloc_ctx_tensors
Allocates all tensors in the context using the backend’s default buffer type.
Parameters, in order:
- The context whose tensors should be allocated.
- The backend whose default buffer type will be used.
Equivalent to ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_get_default_buffer_type(backend)).
When to use gallocr vs tallocr
| | ggml_gallocr | ggml_tallocr |
|---|---|---|
| Best for | Full computation graphs | Individual tensors |
| Memory reuse | Yes: tensors with non-overlapping lifetimes share memory | No: each tensor gets its own region |
| Usage | Call alloc_graph once per graph | Call alloc once per tensor |
| Multi-device | Yes (via new_n) | No |
| Overhead | Analyses graph topology | Minimal |
Use ggml_gallocr whenever you have a ggml_cgraph. Use ggml_tallocr for one-off allocations where you already have a buffer and want to place a single tensor at a known offset.