
Overview

llama-server is a fast, lightweight HTTP server for serving LLM models with an OpenAI-compatible API. Built on pure C/C++ with minimal dependencies, it provides enterprise-grade features like parallel decoding, continuous batching, and multi-user support.

Quick Start

llama-server -m model.gguf --port 8080
The server will start on http://localhost:8080 with a web UI accessible via browser.

Key Features

  • OpenAI API Compatible: Drop-in replacement for OpenAI chat completions and embeddings
  • Anthropic Messages API: Compatible with Claude API format
  • Parallel Decoding: Multi-user support with continuous batching
  • Multimodal: Process images and audio through API endpoints
  • Reranking: Built-in reranking endpoint for search applications
  • Function Calling: Tool use support for compatible models
  • Speculative Decoding: Accelerated generation with draft models
  • Web UI: Built-in interface for testing and debugging

Server Configuration

Basic Server Options

--host
string
default:"127.0.0.1"
IP address to bind to. Use 0.0.0.0 to allow external connections. Can also bind to a UNIX socket by ending the address with .sock. Environment: LLAMA_ARG_HOST
--port
integer
default:"8080"
Port to listen on. Environment: LLAMA_ARG_PORT
--path
string
Path to serve static files from. Environment: LLAMA_ARG_STATIC_PATH
--api-prefix
string
Prefix path the server serves from (without trailing slash). Environment: LLAMA_ARG_API_PREFIX

Model Loading

-m, --model
string
Path to the GGUF model file. Environment: LLAMA_ARG_MODEL
-hf, --hf-repo
string
Hugging Face repository in format <user>/<model>[:quant]. Automatically downloads mmproj for multimodal models unless disabled with --no-mmproj. Example: unsloth/phi-4-GGUF:q4_k_m. Environment: LLAMA_ARG_HF_REPO
-a, --alias
string
Model name aliases (comma-separated) to be used by the API. Environment: LLAMA_ARG_ALIAS

Parallel Processing

-np, --parallel
integer
default:"-1"
Number of parallel slots (concurrent requests). -1 means auto. Environment: LLAMA_ARG_N_PARALLEL
-c, --ctx-size
integer
default:"0"
Size of the prompt context. 0 loads the value from the model. For parallel requests, multiply by the number of slots. Example: -c 16384 -np 4 supports 4 concurrent requests with 4096 context each. Environment: LLAMA_ARG_CTX_SIZE
-cb, --cont-batching
boolean
default:"true"
Enable continuous batching (dynamic batching) for efficient parallel processing. Environment: LLAMA_ARG_CONT_BATCHING
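The context arithmetic above can be sketched as a small helper (illustrative only; llama-server performs this split internally when distributing the context across slots):

```python
def per_slot_ctx(ctx_size: int, n_parallel: int) -> int:
    """Context tokens available to each parallel slot.

    llama-server splits the total context (-c) evenly across
    the configured number of slots (-np).
    """
    if n_parallel <= 0:
        raise ValueError("n_parallel must be positive")
    return ctx_size // n_parallel

# -c 16384 -np 4 -> 4096 tokens of context per concurrent request
print(per_slot_ctx(16384, 4))  # 4096
```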

Authentication & Security

--api-key
string
API key for authentication. Multiple keys can be comma-separated. Environment: LLAMA_API_KEY
--api-key-file
string
Path to file containing API keys (one per line).
--ssl-key-file
string
Path to PEM-encoded SSL private key for HTTPS. Environment: LLAMA_ARG_SSL_KEY_FILE
--ssl-cert-file
string
Path to PEM-encoded SSL certificate for HTTPS. Environment: LLAMA_ARG_SSL_CERT_FILE

Usage Examples

Starting the Server

1. Basic startup

Start with default configuration:
llama-server -m model.gguf --port 8080
Access the web UI at http://localhost:8080

2. Multiple concurrent users

Support up to 4 concurrent requests:
# 4 slots × 4096 tokens = 16384 total context
llama-server -m model.gguf -c 16384 -np 4

3. Enable speculative decoding

Use a draft model for faster generation:
llama-server -m model.gguf -md draft.gguf

Docker Deployment

docker run -p 8080:8080 -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server \
  -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080

Docker Compose

services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080

API Endpoints

Health Check

GET /health or /v1/health Public endpoint (no API key required).
curl http://localhost:8080/health

Chat Completions (OpenAI Compatible)

POST /v1/chat/completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'

Completions (Non-OAI Format)

POST /completion Llama.cpp native completion endpoint with extended features.
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 128,
    "temperature": 0.8,
    "top_k": 40,
    "top_p": 0.95
  }'

Embeddings

POST /v1/embeddings Generate embeddings with embedding models:
llama-server -m embedding-model.gguf --embedding --pooling cls -ub 8192
cURL
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The quick brown fox jumps over the lazy dog.",
    "model": "text-embedding-ada-002"
  }'

Reranking

POST /reranking Rerank documents against a query for search applications. Start the server with a reranking model:
llama-server -m reranking-model.gguf --reranking
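A rerank request pairs a query with candidate documents. The sketch below builds such a payload; the field names follow the Jina/Cohere-style rerank schema that llama.cpp's endpoint mirrors, so treat them as an assumption and verify against your server version:

```python
import json

# Hypothetical rerank payload (field names assumed from the
# Jina/Cohere-style rerank API).
payload = {
    "model": "reranker",
    "query": "What is continuous batching?",
    "documents": [
        "Continuous batching schedules requests dynamically.",
        "GGUF is a model file format.",
    ],
}

body = json.dumps(payload)
print(body)
```

Send the resulting JSON with any HTTP client (e.g. `curl -X POST http://localhost:8080/reranking`); the response scores each document's relevance to the query.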

Advanced Configuration

Multimodal Support

Serve vision or audio models:
llama-server -m vision-model.gguf \
  --mmproj vision-projector.gguf \
  --image-max-tokens 1024
The /v1/chat/completions endpoint accepts images in base64 format:
{
  "model": "gpt-4-vision",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ]
}
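Encoding an image for the image_url field can be done as follows (a minimal sketch; any JPEG/PNG bytes work, and the MIME type should match the actual image format):

```python
import base64

def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL for the image_url field."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Build the multimodal "content" array from the example above
content = [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url",
     "image_url": {"url": image_data_url(b"\xff\xd8\xff")}},  # placeholder bytes
]
print(content[1]["image_url"]["url"][:23])  # data:image/jpeg;base64,
```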

Monitoring Endpoints

--metrics
boolean
default:"false"
Enable Prometheus-compatible metrics endpoint at /metrics. Environment: LLAMA_ARG_ENDPOINT_METRICS
--slots
boolean
default:"true"
Expose the slot monitoring endpoint for viewing active requests. Environment: LLAMA_ARG_ENDPOINT_SLOTS
--props
boolean
default:"false"
Enable the POST /props endpoint for changing global properties. Environment: LLAMA_ARG_ENDPOINT_PROPS
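The /metrics endpoint serves the standard Prometheus text exposition format, which is easy to scrape or parse directly. A minimal parser is sketched below; the sample metric names are illustrative, since the exact set depends on your llama-server version:

```python
# Minimal parser for the Prometheus text format served at /metrics.
# Sample metric names are illustrative, not an exhaustive list.
sample = """\
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 2
llamacpp:tokens_predicted_total 1234
"""

def parse_metrics(text: str) -> dict:
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        name, value = line.rsplit(None, 1)
        metrics[name] = float(value)
    return metrics

print(parse_metrics(sample))
```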

Grammar & JSON Schemas

Constrain all outputs with a grammar:
# Custom grammar
llama-server -m model.gguf --grammar-file grammar.gbnf

# JSON output
llama-server -m model.gguf --grammar-file grammars/json.gbnf
Clients can also specify grammars per-request in the API.
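For reference, a grammar file uses GBNF syntax. The toy grammar below (an illustrative sketch) constrains output to a single yes/no answer:

```
root ::= "yes" | "no"
```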

Caching & Performance

--cache-prompt
boolean
default:"true"
Enable prompt caching to reuse the KV cache from previous requests. Environment: LLAMA_ARG_CACHE_PROMPT
--cache-reuse
integer
default:"0"
Minimum chunk size to attempt reusing from the cache via KV shifting. Requires prompt caching to be enabled. Environment: LLAMA_ARG_CACHE_REUSE
-sps, --slot-prompt-similarity
float
default:"0.1"
How closely a request prompt must match a slot's prompt to reuse that slot. 0.0 disables this feature.

Router Mode

Serve multiple models simultaneously:
--models-dir
string
Directory containing models for the router server. Environment: LLAMA_ARG_MODELS_DIR
--models-max
integer
default:"4"
Maximum number of models to load simultaneously. 0 = unlimited. Environment: LLAMA_ARG_MODELS_MAX
--models-autoload
boolean
default:"true"
Automatically load models on demand. Environment: LLAMA_ARG_MODELS_AUTOLOAD
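With router mode, a client targets a specific model through the standard OpenAI-style "model" field of the request. The sketch below builds such a payload; the model name "my_model" is a placeholder (assumption: names correspond to the files in the models directory, which you can confirm via the server's model listing):

```python
import json

# Hypothetical chat payload targeting one model on a multi-model server.
payload = {
    "model": "my_model",  # placeholder name; must match a served model
    "messages": [{"role": "user", "content": "Hello!"}],
}
print(json.dumps(payload))
```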

Timeout & Throttling

-to, --timeout
integer
default:"600"
Server read/write timeout in seconds. Environment: LLAMA_ARG_TIMEOUT
--threads-http
integer
default:"-1"
Number of threads used to process HTTP requests. Environment: LLAMA_ARG_THREADS_HTTP
--sleep-idle-seconds
integer
default:"-1"
Seconds of idleness before server sleeps to save resources. -1 disables.

Web UI Configuration

--webui
boolean
default:"true"
Enable the built-in web interface. Environment: LLAMA_ARG_WEBUI
--webui-config
json
JSON configuration for WebUI defaults. Environment: LLAMA_ARG_WEBUI_CONFIG
--webui-config-file
string
Path to a JSON file with WebUI configuration. Environment: LLAMA_ARG_WEBUI_CONFIG_FILE

Environment Variables

Boolean options use these values:
  • Enabled: true, 1, on, enabled
  • Disabled: false, 0, off, disabled
  • Negation: LLAMA_ARG_NO_MMAP disables mmap regardless of value
Example:
export LLAMA_ARG_MODEL=/models/my_model.gguf
export LLAMA_ARG_CTX_SIZE=4096
export LLAMA_ARG_N_PARALLEL=2
export LLAMA_ARG_ENDPOINT_METRICS=1
export LLAMA_ARG_MMAP=true
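The truthy/falsy mapping above can be mirrored in a small helper (a sketch of the documented behavior, not llama.cpp's actual parser):

```python
TRUTHY = {"true", "1", "on", "enabled"}
FALSY = {"false", "0", "off", "disabled"}

def parse_bool_env(value: str) -> bool:
    """Interpret a boolean environment variable per the values listed above."""
    v = value.strip().lower()
    if v in TRUTHY:
        return True
    if v in FALSY:
        return False
    raise ValueError(f"unrecognized boolean value: {value!r}")

print(parse_bool_env("on"))  # True
```

Note that negation variables such as LLAMA_ARG_NO_MMAP are an exception: per the note above, their mere presence disables the feature regardless of the value.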


Performance Optimization

Best Practices
  • Use --cont-batching for multiple concurrent users
  • Enable --cache-prompt to reuse computation across similar requests
  • Set --cache-reuse for improved performance with shared prefixes
  • Enable flash attention with --flash-attn on when running on supported hardware
  • Adjust -np (parallel slots) based on your concurrency needs
  • Monitor with --metrics endpoint for production deployments

Building with SSL

To enable HTTPS support:
cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server
Then use with SSL certificates:
llama-server -m model.gguf \
  --ssl-key-file server-key.pem \
  --ssl-cert-file server-cert.pem \
  --port 8443

See Also