ggml provides two complementary allocators:
  • ggml_tallocr (tensor allocator) — a simple linear allocator that assigns a single tensor into a pre-existing buffer.
  • ggml_gallocr (graph allocator) — a smart allocator that analyses a full computation graph, reuses intermediate memory where possible, and allocates all tensors in a single pass.
For most use cases, prefer ggml_gallocr. Use ggml_tallocr only when you need precise, manual control over individual tensor placement.

Tensor allocator (ggml_tallocr)

ggml_tallocr is a lightweight linear allocator backed by a single backend buffer.
struct ggml_tallocr {
    ggml_backend_buffer_t buffer;    // backing buffer
    void                * base;      // base pointer of the buffer
    size_t                alignment; // alignment requirement
    size_t                offset;    // current allocation offset
};
Creates a tensor allocator backed by an existing buffer.
struct ggml_tallocr ggml_tallocr_new(ggml_backend_buffer_t buffer);
  • buffer (ggml_backend_buffer_t, required) — An already-allocated backend buffer. The allocator does not take ownership — you are still responsible for freeing the buffer.
Returns a value-type ggml_tallocr struct. No heap allocation is made by this call.
Allocates space for a single tensor within the allocator’s buffer.
enum ggml_status ggml_tallocr_alloc(
    struct ggml_tallocr * talloc,
    struct ggml_tensor  * tensor);
  • talloc (struct ggml_tallocr *, required) — The allocator to use.
  • tensor (struct ggml_tensor *, required) — The tensor whose data pointer will be set to the allocated region.
Returns GGML_STATUS_SUCCESS on success, or an error code if the buffer is exhausted.
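A hedged end-to-end sketch of manual placement follows; the 1 MiB buffer size is an arbitrary choice, and ctx is assumed to be a ggml_context created elsewhere with no_alloc = true:

```c
// Sketch: place a single tensor by hand (buffer size and ctx are assumptions).
ggml_backend_buffer_t buffer =
    ggml_backend_buft_alloc_buffer(ggml_backend_cpu_buffer_type(), 1024 * 1024);

struct ggml_tallocr talloc = ggml_tallocr_new(buffer);

// ctx must have been created with no_alloc = true, so the tensor has no data yet.
struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 256);

if (ggml_tallocr_alloc(&talloc, t) != GGML_STATUS_SUCCESS) {
    // buffer exhausted
}

// t->data now points into buffer. The allocator does not own the buffer:
// free it yourself with ggml_backend_buffer_free(buffer) when done.
```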

Graph allocator (ggml_gallocr)

ggml_gallocr inspects the full computation graph, identifies tensors whose lifetimes do not overlap, and reuses memory between them. This significantly reduces peak memory usage compared to allocating each tensor independently.
typedef struct ggml_gallocr * ggml_gallocr_t;

Special tensor flags

Two flags influence graph allocator behaviour:
  • ggml_set_input(tensor) — input tensors are placed at non-overlapping addresses at the start of the graph so they remain valid throughout execution.
  • ggml_set_output(tensor) — output tensors are never freed or overwritten, ensuring their data is readable after ggml_gallocr_alloc_graph returns.
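The flags are set while building the graph, before allocation. A minimal sketch (ctx, w, and n are assumed to exist):

```c
// Mark graph inputs and outputs before calling ggml_gallocr_alloc_graph.
struct ggml_tensor * inp = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
ggml_set_input(inp);   // placed at a stable address for the whole run

struct ggml_tensor * out = ggml_mul_mat(ctx, w, inp);
ggml_set_output(out);  // never reused; data readable after compute
```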

Quick start

// 1. Create a graph allocator for the CPU
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());

// 2. (Optional) Reserve with a worst-case graph to avoid reallocations later
ggml_gallocr_reserve(galloc, build_graph(max_batch));

// 3. Allocate a concrete graph
struct ggml_cgraph * graph = build_graph(batch);
ggml_gallocr_alloc_graph(galloc, graph);

printf("compute buffer: %zu bytes\n", ggml_gallocr_get_buffer_size(galloc, 0));

// 4. Execute
ggml_backend_graph_compute(backend, graph);

ggml_gallocr_free(galloc);
Creates a graph allocator that uses a single buffer type for all tensors.
ggml_gallocr_t ggml_gallocr_new(ggml_backend_buffer_type_t buft);
  • buft (ggml_backend_buffer_type_t, required) — The buffer type to allocate from. Use ggml_backend_cpu_buffer_type() for CPU execution, or a device-specific type for GPU execution.
Free with ggml_gallocr_free.
Creates a graph allocator that can use multiple buffer types simultaneously — useful for multi-device graphs.
ggml_gallocr_t ggml_gallocr_new_n(
    ggml_backend_buffer_type_t * bufts,
    int                          n_bufs);
  • bufts (ggml_backend_buffer_type_t *, required) — Array of buffer types, one per logical buffer region.
  • n_bufs (int, required) — Number of buffer types in the array.
Frees the graph allocator and all buffers it owns.
void ggml_gallocr_free(ggml_gallocr_t galloc);
  • galloc (ggml_gallocr_t, required) — The allocator to free.

Reservation

Calling ggml_gallocr_reserve with a worst-case graph pre-sizes all internal buffers. This avoids reallocation during the hot path and gives you a stable buffer size measurement.
Reservation is optional for single-buffer allocators: ggml_gallocr_alloc_graph will reallocate automatically if the graph topology changes. Multi-buffer allocators cannot reallocate automatically: you must call ggml_gallocr_reserve_n again whenever the graph topology changes, or ggml_gallocr_alloc_graph will return false.
Pre-allocates internal buffers to fit the given graph without modifying any tensor data pointers.
bool ggml_gallocr_reserve(
    ggml_gallocr_t       galloc,
    struct ggml_cgraph * graph);
  • galloc (ggml_gallocr_t, required) — The allocator to configure.
  • graph (struct ggml_cgraph *, required) — A representative (ideally worst-case) computation graph.
Returns true on success. Returns false if the underlying buffer allocation failed.
Like ggml_gallocr_reserve, but also specifies which buffer index each node and leaf tensor should be placed in.
bool ggml_gallocr_reserve_n(
    ggml_gallocr_t       galloc,
    struct ggml_cgraph * graph,
    const int          * node_buffer_ids,
    const int          * leaf_buffer_ids);
  • galloc (ggml_gallocr_t, required) — The allocator to configure.
  • graph (struct ggml_cgraph *, required) — The representative computation graph.
  • node_buffer_ids (const int *, required) — Array of buffer indices, one per node in the graph. Index i controls which buffer the i-th graph node is allocated from.
  • leaf_buffer_ids (const int *, required) — Array of buffer indices, one per leaf tensor in the graph.
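A sketch of a two-buffer split follows. build_graph is a hypothetical helper (as in the quick start), ggml_backend_cuda_buffer_type comes from the CUDA backend, and the direct n_nodes/n_leafs field accesses assume a ggml version where struct ggml_cgraph is public:

```c
// Sketch: everything on the GPU except the final node, kept on the host.
ggml_backend_buffer_type_t bufts[2] = {
    ggml_backend_cuda_buffer_type(0), // buffer 0: device memory
    ggml_backend_cpu_buffer_type(),   // buffer 1: host memory
};
ggml_gallocr_t galloc = ggml_gallocr_new_n(bufts, 2);

struct ggml_cgraph * graph = build_graph(max_batch); // hypothetical helper

// One buffer index per node and per leaf.
int * node_ids = calloc(graph->n_nodes, sizeof(int)); // all 0: GPU
int * leaf_ids = calloc(graph->n_leafs, sizeof(int)); // all 0: GPU
node_ids[graph->n_nodes - 1] = 1;                     // last node on CPU

ggml_gallocr_reserve_n(galloc, graph, node_ids, leaf_ids);
```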

Allocation and sizing

Allocates all tensors in the graph, reusing memory between tensors whose lifetimes do not overlap.
bool ggml_gallocr_alloc_graph(
    ggml_gallocr_t       galloc,
    struct ggml_cgraph * graph);
  • galloc (ggml_gallocr_t, required) — The allocator to use.
  • graph (struct ggml_cgraph *, required) — The computation graph whose tensors will be allocated.
Returns true on success. For single-buffer allocators, the backing buffer is reallocated automatically if the graph topology changed since the last call. For multi-buffer allocators, returns false instead — call ggml_gallocr_reserve_n first.
Returns the size of the backing buffer for a given buffer index after allocation.
size_t ggml_gallocr_get_buffer_size(
    ggml_gallocr_t galloc,
    int            buffer_id);
  • galloc (ggml_gallocr_t, required) — The allocator to query.
  • buffer_id (int, required) — Zero-based buffer index. For single-buffer allocators, always pass 0.
Returns the size in bytes, or 0 if no buffer has been allocated yet.

Utility functions

These helpers allocate all tensors in a ggml_context into a single backend buffer in one call. They are the simplest way to prepare model weights for inference.
Allocates all tensors in the context into a new buffer of the given type.
struct ggml_backend_buffer * ggml_backend_alloc_ctx_tensors_from_buft(
    struct ggml_context        * ctx,
    ggml_backend_buffer_type_t   buft);
  • ctx (struct ggml_context *, required) — The context whose tensors should be allocated. The context must have been created with no_alloc = true.
  • buft (ggml_backend_buffer_type_t, required) — The buffer type to allocate from.
Returns the allocated buffer. The caller is responsible for freeing it with ggml_backend_buffer_free.
Allocates all tensors in the context using the backend’s default buffer type.
struct ggml_backend_buffer * ggml_backend_alloc_ctx_tensors(
    struct ggml_context * ctx,
    ggml_backend_t        backend);
  • ctx (struct ggml_context *, required) — The context whose tensors should be allocated.
  • backend (ggml_backend_t, required) — The backend whose default buffer type will be used.
Equivalent to ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_get_default_buffer_type(backend)).
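A typical weight-loading sketch follows; the tensor shapes are arbitrary, and the mem_size only needs to cover tensor metadata because no_alloc = true keeps tensor data out of the context:

```c
// Sketch: allocate all weights of a tiny "model" in one call.
struct ggml_init_params params = {
    /*.mem_size   =*/ 8 * ggml_tensor_overhead(), // metadata only
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,                       // required
};
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 768);
struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 768);

ggml_backend_buffer_t buf =
    ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_cpu_buffer_type());

// w->data and b->data now point into buf. Free buf with
// ggml_backend_buffer_free(buf) (and ctx with ggml_free(ctx)) when done.
```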

When to use gallocr vs tallocr

               ggml_gallocr                         ggml_tallocr
Best for       Full computation graphs              Individual tensors
Memory reuse   Yes (non-overlapping lifetimes       No (each tensor gets its
               share memory)                        own region)
Usage          Call alloc_graph once per graph      Call alloc once per tensor
Multi-device   Yes (via new_n)                      No
Overhead       Analyses graph topology              Minimal
Use ggml_gallocr whenever you have a ggml_cgraph. Use ggml_tallocr for one-off allocations where you already have a buffer and want to place a single tensor at a known offset.
