The llama.cpp server provides a fast, lightweight REST API for LLM inference. It implements OpenAI-compatible endpoints, allowing you to use existing OpenAI client libraries with llama.cpp models.

Features

  • OpenAI-compatible API: Drop-in replacement for OpenAI’s API endpoints
  • High Performance: Pure C/C++ implementation for maximum speed
  • GPU Acceleration: Support for CUDA, Metal, and other backends
  • Streaming Responses: Real-time token generation with Server-Sent Events
  • Multiple Models: Router mode for managing multiple models simultaneously
  • Multimodal Support: Vision and audio capabilities (experimental)
  • Function Calling: Tool use support for compatible models
  • Flexible Deployment: Docker, native binaries, or cloud platforms

Quick Start

Starting the Server

./llama-server -m models/7B/ggml-model.gguf -c 2048
The server will start on http://127.0.0.1:8080 by default.

Common Server Arguments

  • -m, --model (string, required): Path to the model file (GGUF format)
  • -c, --ctx-size (number, default 0): Size of the prompt context (0 = loaded from the model)
  • -n, --predict (number, default -1): Number of tokens to predict (-1 = infinite)
  • -ngl, --n-gpu-layers (string, default auto): Number of layers to store in VRAM (auto, all, or a specific number)
  • --host (string, default 127.0.0.1): IP address to bind to
  • --port (number, default 8080): Port to listen on
  • -np, --parallel (number, default -1): Number of parallel slots for concurrent requests (-1 = auto)
  • --api-key (string): API key for authentication (accepts a comma-separated list for multiple keys)
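Several of these flags are typically combined in a launch script. As a sketch, a small Python helper (illustrative only, not part of llama.cpp) can assemble the argument list from keyword options that mirror the long flag names:

```python
def build_server_command(model_path, **options):
    """Build a llama-server argv list from keyword options.

    Keyword names map to long flags: ctx_size -> --ctx-size, etc.
    This helper is illustrative, not part of llama.cpp itself.
    """
    cmd = ["llama-server", "--model", model_path]
    for key, value in options.items():
        cmd += ["--" + key.replace("_", "-"), str(value)]
    return cmd

cmd = build_server_command(
    "models/7B/ggml-model.gguf",
    ctx_size=4096,
    n_gpu_layers="all",
    port=8080,
)
# subprocess.Popen(cmd) would launch the server in the background.
print(" ".join(cmd))
# llama-server --model models/7B/ggml-model.gguf --ctx-size 4096 --n-gpu-layers all --port 8080
```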

Authentication

To enable API key authentication, start the server with the --api-key flag:
llama-server -m models/model.gguf --api-key sk-your-secret-key
Then include the key in the Authorization header:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-secret-key" \
  -d '{...}'
Without --api-key, the server runs in open mode. The health endpoint (/health) is always public regardless of authentication settings.
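Conceptually, the check is simple bearer-token matching against the configured key list. A sketch in Python (this mimics the documented behavior; it is not llama.cpp's actual implementation):

```python
def is_authorized(auth_header, api_keys):
    """Check an Authorization header against configured API keys.

    api_keys: the comma-separated value passed to --api-key.
    Sketch of the documented bearer-token check, not the server's code.
    """
    if not auth_header.startswith("Bearer "):
        return False
    token = auth_header[len("Bearer "):]
    return token in {k.strip() for k in api_keys.split(",")}

print(is_authorized("Bearer sk-key-1", "sk-key-1,sk-key-2"))  # True
print(is_authorized("Bearer wrong-key", "sk-key-1,sk-key-2"))  # False
```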

Using with OpenAI Client Libraries

The llama.cpp server is compatible with OpenAI’s client libraries:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"  # Use actual key if authentication enabled
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # ignored when serving a single model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message.content)
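For streaming, pass stream=True to the same call and iterate over the chunks. Under the hood the server emits Server-Sent Events: each line is "data: <json>" carrying a delta, and the stream ends with "data: [DONE]". A minimal parser for that wire format (the sample chunk lines below are illustrative, not captured output):

```python
import json

def extract_stream_text(sse_lines):
    """Collect content deltas from OpenAI-style SSE chunk lines."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(extract_stream_text(sample))  # Hello!
```

With the openai client library you never parse this yourself; iterating the response object with stream=True yields the same deltas as parsed chunk objects.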

Available Endpoints

OpenAI-Compatible Endpoints

  • POST /v1/chat/completions - Chat completions (streaming and non-streaming)
  • POST /v1/completions - Text completions
  • POST /v1/embeddings - Text embeddings
  • GET /v1/models - List available models

Native llama.cpp Endpoints

  • POST /completion - Native completion endpoint (not OAI-compatible)
  • POST /embedding - Native embeddings endpoint (not OAI-compatible)
  • POST /tokenize - Tokenize text
  • POST /detokenize - Convert tokens to text
  • GET /health - Health check endpoint
  • GET /props - Server properties and configuration
  • GET /slots - Monitor slot status and performance

Additional Features

  • POST /infill - Code infill (fill-in-the-middle) completion
  • POST /reranking - Document reranking
  • GET /metrics - Prometheus-compatible metrics (requires --metrics flag)

Model Configuration

Setting Model Alias

By default, the model ID is the file path. You can set a custom alias:
llama-server -m models/model.gguf --alias gpt-4o-mini
Then use it in API requests:
{
  "model": "gpt-4o-mini",
  "messages": [...]
}

Downloading Models from Hugging Face

llama-server -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
This automatically downloads the model and multimodal projector (if available).

Health Check

Check if the server is ready:
curl http://localhost:8080/health
Responses:
  • 200 OK with {"status": "ok"} - Server is ready
  • 503 Service Unavailable with error message - Model is still loading
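Because the server answers 503 until the model finishes loading, a deployment script can poll /health before sending traffic. A sketch using only the two documented status codes (urllib is in the standard library; note that urlopen raises HTTPError for a 503):

```python
import time
import urllib.error
import urllib.request

def interpret_health(status_code):
    """Map a /health status code to a readiness state."""
    if status_code == 200:
        return "ready"
    if status_code == 503:
        return "loading"
    return "error"

def wait_until_ready(url="http://localhost:8080/health", timeout=60.0):
    """Poll /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                status = resp.status
        except urllib.error.HTTPError as exc:
            status = exc.code  # 503 while the model is still loading
        except urllib.error.URLError:
            status = None  # server not accepting connections yet
        if status is not None and interpret_health(status) == "ready":
            return True
        time.sleep(1.0)
    return False
```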

Environment Variables

Many arguments can be configured via environment variables:
export LLAMA_ARG_MODEL=/path/to/model.gguf
export LLAMA_ARG_CTX_SIZE=4096
export LLAMA_ARG_N_GPU_LAYERS=99
export LLAMA_API_KEY=sk-your-key

llama-server

Error Handling

The server returns OpenAI-compatible error responses:
{
  "error": {
    "code": 401,
    "message": "Invalid API Key",
    "type": "authentication_error"
  }
}
Common error types:
  • authentication_error - Invalid or missing API key
  • invalid_request_error - Malformed request
  • unavailable_error - Server not ready (model loading)
  • not_supported_error - Feature not enabled (e.g., metrics endpoint)
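A client can turn this error envelope into typed exceptions. A minimal sketch (the exception class and helper are hypothetical, built around the error shape shown above):

```python
import json

class LlamaServerError(Exception):
    """Raised when the server returns an OpenAI-style error envelope."""

    def __init__(self, code, error_type, message):
        super().__init__(f"{error_type} ({code}): {message}")
        self.code = code
        self.error_type = error_type

def raise_for_error(body):
    """Parse a response body; raise LlamaServerError if it is an error."""
    data = json.loads(body)
    if "error" in data:
        err = data["error"]
        raise LlamaServerError(err["code"], err["type"], err["message"])
    return data

try:
    raise_for_error('{"error": {"code": 401, "message": "Invalid API Key", '
                    '"type": "authentication_error"}}')
except LlamaServerError as exc:
    print(exc.error_type)  # authentication_error
```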

Next Steps