The llama.cpp server provides a fast, lightweight REST API for LLM inference. It implements OpenAI-compatible endpoints, allowing you to use existing OpenAI client libraries with llama.cpp models.

Features

  • OpenAI-compatible API: Drop-in replacement for OpenAI’s API endpoints
  • High Performance: Pure C/C++ implementation for maximum speed
  • GPU Acceleration: Support for CUDA, Metal, and other backends
  • Streaming Responses: Real-time token generation with Server-Sent Events
  • Multiple Models: Router mode for managing multiple models simultaneously
  • Multimodal Support: Vision and audio capabilities (experimental)
  • Function Calling: Tool use support for compatible models
  • Flexible Deployment: Docker, native binaries, or cloud platforms

Quick Start

Starting the Server

./llama-server -m models/7B/ggml-model.gguf -c 2048
The server will start on http://127.0.0.1:8080 by default.

Common Server Arguments

  • -m, --model (string, required): Path to the model file (GGUF format)
  • -c, --ctx-size (number, default 0): Size of the prompt context (0 = loaded from the model)
  • -n, --predict (number, default -1): Number of tokens to predict (-1 = infinite)
  • -ngl, --n-gpu-layers (string, default auto): Number of layers to store in VRAM (auto, all, or a specific number)
  • --host (string, default 127.0.0.1): IP address to bind to
  • --port (number, default 8080): Port to listen on
  • -np, --parallel (number, default -1): Number of parallel slots for concurrent requests (-1 = auto)
  • --api-key (string): API key for authentication (accepts a comma-separated list for multiple keys)
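Several of these flags are typically combined in a launch script. As a sketch, a small Python helper (illustrative only, not part of llama.cpp) can assemble the argument list from keyword options that mirror the long flag names:

```python
def build_server_command(model_path, **options):
    """Build a llama-server argv list from keyword options.

    Keyword names map to long flags: ctx_size -> --ctx-size, etc.
    This helper is illustrative, not part of llama.cpp itself.
    """
    cmd = ["llama-server", "--model", model_path]
    for key, value in options.items():
        cmd += ["--" + key.replace("_", "-"), str(value)]
    return cmd

cmd = build_server_command(
    "models/7B/ggml-model.gguf",
    ctx_size=4096,
    n_gpu_layers="all",
    port=8080,
)
# subprocess.Popen(cmd) would launch the server in the background.
print(" ".join(cmd))
# llama-server --model models/7B/ggml-model.gguf --ctx-size 4096 --n-gpu-layers all --port 8080
```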

Authentication

To enable API key authentication, start the server with the --api-key flag:
llama-server -m models/model.gguf --api-key sk-your-secret-key
Then include the key in the Authorization header:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-secret-key" \
  -d '{...}'
Without --api-key, the server runs in open mode. The health endpoint (/health) is always public regardless of authentication settings.
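Conceptually, the check is simple bearer-token matching against the configured key list. A sketch in Python (this mimics the documented behavior; it is not llama.cpp's actual implementation):

```python
def is_authorized(auth_header, api_keys):
    """Check an Authorization header against configured API keys.

    api_keys: the comma-separated value passed to --api-key.
    Sketch of the documented bearer-token check, not the server's code.
    """
    if not auth_header.startswith("Bearer "):
        return False
    token = auth_header[len("Bearer "):]
    return token in {k.strip() for k in api_keys.split(",")}

print(is_authorized("Bearer sk-key-1", "sk-key-1,sk-key-2"))  # True
print(is_authorized("Bearer wrong-key", "sk-key-1,sk-key-2"))  # False
```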

Using with OpenAI Client Libraries

The llama.cpp server is compatible with OpenAI’s client libraries:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"  # Use actual key if authentication enabled
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # ignored when serving a single model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message.content)
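For streaming, pass stream=True to the same call and iterate over the chunks. Under the hood the server emits Server-Sent Events: each line is "data: <json>" carrying a delta, and the stream ends with "data: [DONE]". A minimal parser for that wire format (the sample chunk lines below are illustrative, not captured output):

```python
import json

def extract_stream_text(sse_lines):
    """Collect content deltas from OpenAI-style SSE chunk lines."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(extract_stream_text(sample))  # Hello!
```

With the openai client library you never parse this yourself; iterating the response object with stream=True yields the same deltas as parsed chunk objects.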

Available Endpoints

OpenAI-Compatible Endpoints

  • POST /v1/chat/completions - Chat completions (streaming and non-streaming)
  • POST /v1/completions - Text completions
  • POST /v1/embeddings - Text embeddings
  • GET /v1/models - List available models

Native llama.cpp Endpoints

  • POST /completion - Native completion endpoint (not OAI-compatible)
  • POST /embedding - Native embeddings endpoint (not OAI-compatible)
  • POST /tokenize - Tokenize text
  • POST /detokenize - Convert tokens to text
  • GET /health - Health check endpoint
  • GET /props - Server properties and configuration
  • GET /slots - Monitor slot status and performance

Additional Features

  • POST /infill - Code infill (fill-in-the-middle) completion
  • POST /reranking - Document reranking
  • GET /metrics - Prometheus-compatible metrics (requires --metrics flag)

Model Configuration

Setting Model Alias

By default, the model ID is the file path. You can set a custom alias:
llama-server -m models/model.gguf --alias gpt-4o-mini
Then use it in API requests:
{
  "model": "gpt-4o-mini",
  "messages": [...]
}

Downloading Models from Hugging Face

llama-server -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
This automatically downloads the model and multimodal projector (if available).

Health Check

Check if the server is ready:
curl http://localhost:8080/health
Responses:
  • 200 OK with {"status": "ok"} - Server is ready
  • 503 Service Unavailable with error message - Model is still loading
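Because the server answers 503 until the model finishes loading, a deployment script can poll /health before sending traffic. A sketch using only the two documented status codes (urllib is in the standard library; note that urlopen raises HTTPError for a 503):

```python
import time
import urllib.error
import urllib.request

def interpret_health(status_code):
    """Map a /health status code to a readiness state."""
    if status_code == 200:
        return "ready"
    if status_code == 503:
        return "loading"
    return "error"

def wait_until_ready(url="http://localhost:8080/health", timeout=60.0):
    """Poll /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                status = resp.status
        except urllib.error.HTTPError as exc:
            status = exc.code  # 503 while the model is still loading
        except urllib.error.URLError:
            status = None  # server not accepting connections yet
        if status is not None and interpret_health(status) == "ready":
            return True
        time.sleep(1.0)
    return False
```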

Environment Variables

Many arguments can be configured via environment variables:
export LLAMA_ARG_MODEL=/path/to/model.gguf
export LLAMA_ARG_CTX_SIZE=4096
export LLAMA_ARG_N_GPU_LAYERS=99
export LLAMA_API_KEY=sk-your-key

llama-server

Error Handling

The server returns OpenAI-compatible error responses:
{
  "error": {
    "code": 401,
    "message": "Invalid API Key",
    "type": "authentication_error"
  }
}
Common error types:
  • authentication_error - Invalid or missing API key
  • invalid_request_error - Malformed request
  • unavailable_error - Server not ready (model loading)
  • not_supported_error - Feature not enabled (e.g., metrics endpoint)
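A client can turn this error envelope into typed exceptions. A minimal sketch (the exception class and helper are hypothetical, built around the error shape shown above):

```python
import json

class LlamaServerError(Exception):
    """Raised when the server returns an OpenAI-style error envelope."""

    def __init__(self, code, error_type, message):
        super().__init__(f"{error_type} ({code}): {message}")
        self.code = code
        self.error_type = error_type

def raise_for_error(body):
    """Parse a response body; raise LlamaServerError if it is an error."""
    data = json.loads(body)
    if "error" in data:
        err = data["error"]
        raise LlamaServerError(err["code"], err["type"], err["message"])
    return data

try:
    raise_for_error('{"error": {"code": 401, "message": "Invalid API Key", '
                    '"type": "authentication_error"}}')
except LlamaServerError as exc:
    print(exc.error_type)  # authentication_error
```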

Next Steps