RPC backend

The RPC backend lets a client machine offload tensor operations to one or more remote servers. Each server exposes a local ggml backend (CPU, CUDA, Metal, etc.) over a TCP socket. The client treats the remote device like any other backend: the scheduler, buffer types, and graph compute API are identical.

Use cases

Offload inference to a remote machine with a powerful GPU
Distribute a large model across multiple machines when it does not fit in the memory of a single node
Run heterogeneous clusters where different nodes have different hardware

Protocol version

The RPC protocol is versioned. Client and server must use compatible versions:

#define RPC_PROTO_MAJOR_VERSION  3
#define RPC_PROTO_MINOR_VERSION  6
#define RPC_PROTO_PATCH_VERSION  1

The patch version increments with each change to GGML_OP_COUNT. Keep client and server binaries in sync.
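What counts as "compatible" is not spelled out above, so the following is only a sketch of one conservative policy: major and minor must match exactly, and a patch mismatch is reported separately so the caller can decide whether to proceed. The `rpc_version` struct, the `rpc_version_check` enum, and `rpc_compare_versions` are hypothetical names introduced for illustration; they are not part of the ggml API.

```c
#include <stdint.h>

/* Hypothetical version triple; field names are illustrative only. */
typedef struct {
    uint8_t major;
    uint8_t minor;
    uint8_t patch;
} rpc_version;

typedef enum {
    RPC_VERSION_MATCH,         /* all three components are equal */
    RPC_VERSION_PATCH_DIFFERS, /* same major.minor, different patch */
    RPC_VERSION_INCOMPATIBLE   /* major or minor differs */
} rpc_version_check;

/* Assumed policy (not taken from the ggml sources): major and minor
 * must be equal; a patch mismatch is reported as its own result. */
static rpc_version_check rpc_compare_versions(rpc_version client, rpc_version server) {
    if (client.major != server.major || client.minor != server.minor) {
        return RPC_VERSION_INCOMPATIBLE;
    }
    if (client.patch != server.patch) {
        return RPC_VERSION_PATCH_DIFFERS;
    }
    return RPC_VERSION_MATCH;
}
```

Since the patch version tracks GGML_OP_COUNT, treating a patch mismatch as a signal to rebuild both sides is the safest reading of "keep client and server binaries in sync".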

Build

Enable the RPC backend on both the server and client machines:

cmake -B build -DGGML_RPC=ON
cmake --build build

Starting a server

A server hosts one or more local backends and listens on a TCP endpoint. Start one with ggml_backend_rpc_start_server:

#include "ggml-rpc.h"
#include "ggml-backend.h"

int main(void) {
    ggml_backend_load_all();

    // Expose a CUDA device over the network
    ggml_backend_dev_t dev = ggml_backend_dev_by_name("CUDA0");
    // Or expose the CPU:
    // ggml_backend_dev_t dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
    if (dev == NULL) {
        return 1;  // requested device not found
    }

    ggml_backend_dev_t devices[1] = { dev };

    // endpoint format: "host:port"
    ggml_backend_rpc_start_server(
        "0.0.0.0:50052",  // listen address
        NULL,              // cache directory (NULL = no cache)
        4,                 // number of CPU threads for server-side work
        1,                 // number of devices
        devices
    );
    // blocks until the server is stopped
    return 0;
}

The RPC server has no authentication. Only expose it on trusted networks or behind a firewall. Do not bind to a public interface in production without additional network security controls.
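Because the listen address decides who can reach the server, it can help to validate the "host:port" string before starting it. The sketch below is a standalone illustration that does not call into ggml: `parse_endpoint` and `binds_all_interfaces` are hypothetical helpers, and the check only recognises the IPv4 wildcard address `0.0.0.0`.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split "host:port" into its parts.
 * Returns false if the separator is missing, the host is empty or does
 * not fit into `host`, or the port is not a number in 1..65535. */
static bool parse_endpoint(const char *endpoint, char *host, size_t host_size, int *port) {
    const char *colon = strrchr(endpoint, ':');
    if (colon == NULL || colon == endpoint) {
        return false;
    }
    size_t host_len = (size_t)(colon - endpoint);
    if (host_len >= host_size) {
        return false;
    }
    char *end = NULL;
    long value = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || *end != '\0' || value < 1 || value > 65535) {
        return false;
    }
    memcpy(host, endpoint, host_len);
    host[host_len] = '\0';
    *port = (int)value;
    return true;
}

/* True when the host part is the IPv4 wildcard, i.e. the server would
 * listen on every IPv4 interface of the machine. */
static bool binds_all_interfaces(const char *host) {
    return strcmp(host, "0.0.0.0") == 0;
}
```

A launcher could refuse to start, or print a warning, when `binds_all_interfaces` returns true and no firewall is known to be in place; binding to `127.0.0.1` or a private interface address keeps the server off public networks.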

Connecting a client

On the client, initialise an RPC backend pointing at the server’s endpoint:

#include <stdio.h>

#include "ggml-rpc.h"

// Connect to device 0 on the remote server
ggml_backend_t rpc_backend = ggml_backend_rpc_init("192.168.1.10:50052", 0);
if (!rpc_backend) {
    fprintf(stderr, "failed to connect to RPC server\n");
    return 1;
}

You can also use the registry API to register a remote server and then use the standard device enumeration:

ggml_backend_reg_t reg = ggml_backend_rpc_add_server("192.168.1.10:50052");
// The server's devices are now visible via ggml_backend_dev_*

Querying remote memory

Before allocating buffers, check available memory on the remote device:

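// Named free_mem/total_mem to avoid shadowing the C library's free()
size_t free_mem = 0, total_mem = 0;
ggml_backend_rpc_get_device_memory("192.168.1.10:50052", 0, &free_mem, &total_mem);
printf("remote device: %.1f GB free of %.1f GB\n", free_mem / 1e9, total_mem / 1e9);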
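Once the free-memory figure is known, the remaining step is a size comparison. The following is a minimal sketch of such a check, assuming the caller wants to keep some headroom on the remote device; `fits_with_reserve` and its `reserve_bytes` parameter are hypothetical and not part of the ggml API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: decide whether `needed` bytes fit into
 * `free_bytes` while leaving `reserve_bytes` untouched (for example
 * for compute buffers). The first check avoids unsigned wrap-around
 * when the reserve is larger than the free memory. */
static bool fits_with_reserve(size_t free_bytes, size_t needed, size_t reserve_bytes) {
    if (free_bytes < reserve_bytes) {
        return false;
    }
    return needed <= free_bytes - reserve_bytes;
}
```

With the values reported by the memory query, a call such as `fits_with_reserve(free_mem, weights_size, reserve)` would gate the buffer allocation; how large the reserve should be depends on the model and is left to the caller.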

Multi-server setup

You can connect to several servers and use them together via the scheduler. The limit is GGML_RPC_MAX_SERVERS (16) connections per process.

ggml_backend_t rpc0 = ggml_backend_rpc_init("server-a:50052", 0);
ggml_backend_t rpc1 = ggml_backend_rpc_init("server-b:50052", 0);
ggml_backend_t cpu  = ggml_backend_cpu_init();

ggml_backend_t backends[3] = { rpc0, rpc1, cpu };
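ggml_backend_sched_t sched = ggml_backend_sched_new(
    backends,
    NULL,                     // buffer types (NULL = each backend's default)
    3,                        // number of backends
    GGML_DEFAULT_GRAPH_SIZE,  // maximum graph size
    false,                    // parallel
    true                      // op_offload
);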

The scheduler distributes graph nodes across all connected servers based on where the weights live and which operations each backend supports.
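Where the weights live is decided by the caller before the scheduler runs. One possible heuristic, which is an assumption of this sketch rather than something ggml prescribes, is to assign model layers to devices in proportion to their free memory. `split_layers` below is a hypothetical standalone helper; it uses cumulative rounding so the per-device counts always sum to the total number of layers.

```c
#include <stddef.h>

/* Hypothetical helper: distribute n_layers over n_devices in proportion
 * to free_bytes[i]. layers_out must have room for n_devices entries.
 * If every device reports zero free memory, all layers are assigned to
 * the last device (typically the local CPU fallback). */
static void split_layers(const size_t *free_bytes, size_t n_devices,
                         int n_layers, int *layers_out) {
    double total = 0.0;
    for (size_t i = 0; i < n_devices; i++) {
        total += (double)free_bytes[i];
    }

    double cumulative = 0.0;
    int assigned = 0;
    for (size_t i = 0; i < n_devices; i++) {
        cumulative += (double)free_bytes[i];
        int upto;
        if (i + 1 == n_devices) {
            upto = n_layers;  /* last device takes whatever remains */
        } else if (total > 0.0) {
            /* round the cumulative share to the nearest layer boundary */
            upto = (int)(cumulative / total * n_layers + 0.5);
        } else {
            upto = 0;
        }
        layers_out[i] = upto - assigned;
        assigned = upto;
    }
}
```

For two servers reporting 8 and 24 units of free memory and a 32-layer model, this yields 8 and 24 layers respectively. Real deployments may also want to account for per-layer size differences and network bandwidth, which this sketch ignores.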

Buffer type

To allocate tensors in the remote device’s memory, use the RPC buffer type:

ggml_backend_buffer_type_t buft =
    ggml_backend_rpc_buffer_type("192.168.1.10:50052", 0);
ggml_backend_buffer_t buf =
    ggml_backend_buft_alloc_buffer(buft, weights_size);

API summary

Function	Description
`ggml_backend_rpc_init(endpoint, device)`	Connect to a remote server and return a backend handle
`ggml_backend_is_rpc(backend)`	Check whether a backend is an RPC backend
`ggml_backend_rpc_buffer_type(endpoint, device)`	Buffer type for remote device memory
`ggml_backend_rpc_get_device_memory(endpoint, device, free, total)`	Query remote device memory
`ggml_backend_rpc_start_server(endpoint, cache_dir, n_threads, n_devices, devices)`	Start an RPC server (blocks)
`ggml_backend_rpc_reg()`	Return the RPC backend registry entry
`ggml_backend_rpc_add_server(endpoint)`	Register a remote server with the global device registry
