
Overview

llama-server is a fast, lightweight HTTP server for serving LLM models with an OpenAI-compatible API. Built on pure C/C++ with minimal dependencies, it provides enterprise-grade features like parallel decoding, continuous batching, and multi-user support.

Quick Start

llama-server -m model.gguf --port 8080
The server will start on http://localhost:8080 with a web UI accessible via browser.

Key Features

  • OpenAI API Compatible: Drop-in replacement for OpenAI chat completions and embeddings
  • Anthropic Messages API: Compatible with Claude API format
  • Parallel Decoding: Multi-user support with continuous batching
  • Multimodal: Process images and audio through API endpoints
  • Reranking: Built-in reranking endpoint for search applications
  • Function Calling: Tool use support for compatible models
  • Speculative Decoding: Accelerated generation with draft models
  • Web UI: Built-in interface for testing and debugging

Server Configuration

Basic Server Options

--host
string
default:"127.0.0.1"
IP address to bind to. Use 0.0.0.0 to allow external connections. Can also bind to a UNIX socket by ending the address with .sock. Environment: LLAMA_ARG_HOST
--port
integer
default:"8080"
Port to listen on. Environment: LLAMA_ARG_PORT
--path
string
Path to serve static files from. Environment: LLAMA_ARG_STATIC_PATH
--api-prefix
string
Prefix path the server serves from (without trailing slash). Environment: LLAMA_ARG_API_PREFIX

Model Loading

-m, --model
string
Path to the GGUF model file. Environment: LLAMA_ARG_MODEL
-hf, --hf-repo
string
Hugging Face repository in format <user>/<model>[:quant]. Automatically downloads mmproj for multimodal models unless disabled with --no-mmproj. Example: unsloth/phi-4-GGUF:q4_k_m. Environment: LLAMA_ARG_HF_REPO
-a, --alias
string
Model name aliases (comma-separated) to be used by the API. Environment: LLAMA_ARG_ALIAS

Parallel Processing

-np, --parallel
integer
default:"-1"
Number of parallel slots (concurrent requests). -1 means auto. Environment: LLAMA_ARG_N_PARALLEL
-c, --ctx-size
integer
default:"0"
Size of the prompt context. 0 loads the value from the model. For parallel requests, multiply by the number of slots. Example: -c 16384 -np 4 supports 4 concurrent requests with 4096 context each. Environment: LLAMA_ARG_CTX_SIZE
-cb, --cont-batching
boolean
default:"true"
Enable continuous batching (dynamic batching) for efficient parallel processing. Environment: LLAMA_ARG_CONT_BATCHING
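The context arithmetic above can be sketched as a small helper (illustrative only; llama-server performs this split internally when distributing the context across slots):

```python
def per_slot_ctx(ctx_size: int, n_parallel: int) -> int:
    """Context tokens available to each parallel slot.

    llama-server splits the total context (-c) evenly across
    the configured number of slots (-np).
    """
    if n_parallel <= 0:
        raise ValueError("n_parallel must be positive")
    return ctx_size // n_parallel

# -c 16384 -np 4 -> 4096 tokens of context per concurrent request
print(per_slot_ctx(16384, 4))  # 4096
```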

Authentication & Security

--api-key
string
API key for authentication. Multiple keys can be comma-separated. Environment: LLAMA_API_KEY
--api-key-file
string
Path to file containing API keys (one per line).
--ssl-key-file
string
Path to PEM-encoded SSL private key for HTTPS. Environment: LLAMA_ARG_SSL_KEY_FILE
--ssl-cert-file
string
Path to PEM-encoded SSL certificate for HTTPS. Environment: LLAMA_ARG_SSL_CERT_FILE

Usage Examples

Starting the Server

1. Basic startup

Start with default configuration:
llama-server -m model.gguf --port 8080
Access the web UI at http://localhost:8080

2. Multiple concurrent users

Support up to 4 concurrent requests:
# 4 slots × 4096 tokens = 16384 total context
llama-server -m model.gguf -c 16384 -np 4

3. Enable speculative decoding

Use a draft model for faster generation:
llama-server -m model.gguf -md draft.gguf

Docker Deployment

docker run -p 8080:8080 -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server \
  -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080

Docker Compose

services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080

API Endpoints

Health Check

GET /health or /v1/health Public endpoint (no API key required).
curl http://localhost:8080/health

Chat Completions (OpenAI Compatible)

POST /v1/chat/completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'

Completions (Non-OAI Format)

POST /completion Llama.cpp native completion endpoint with extended features.
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 128,
    "temperature": 0.8,
    "top_k": 40,
    "top_p": 0.95
  }'

Embeddings

POST /v1/embeddings Generate embeddings with embedding models:
llama-server -m embedding-model.gguf --embedding --pooling cls -ub 8192
cURL
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The quick brown fox jumps over the lazy dog.",
    "model": "text-embedding-ada-002"
  }'

Reranking

POST /reranking Rerank documents against a query for search applications. Start the server with a reranking model:
llama-server -m reranking-model.gguf --reranking
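A rerank request pairs a query with candidate documents. The sketch below builds such a payload; the field names follow the Jina/Cohere-style rerank schema that llama.cpp's endpoint mirrors, so treat them as an assumption and verify against your server version:

```python
import json

# Hypothetical rerank payload (field names assumed from the
# Jina/Cohere-style rerank API).
payload = {
    "model": "reranker",
    "query": "What is continuous batching?",
    "documents": [
        "Continuous batching schedules requests dynamically.",
        "GGUF is a model file format.",
    ],
}

body = json.dumps(payload)
print(body)
```

Send the resulting JSON with any HTTP client (e.g. `curl -X POST http://localhost:8080/reranking`); the response scores each document's relevance to the query.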

Advanced Configuration

Multimodal Support

Serve vision or audio models:
llama-server -m vision-model.gguf \
  --mmproj vision-projector.gguf \
  --image-max-tokens 1024
The /v1/chat/completions endpoint accepts images in base64 format:
{
  "model": "gpt-4-vision",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ]
}
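Encoding an image for the image_url field can be done as follows (a minimal sketch; any JPEG/PNG bytes work, and the MIME type should match the actual image format):

```python
import base64

def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL for the image_url field."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Build the multimodal "content" array from the example above
content = [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url",
     "image_url": {"url": image_data_url(b"\xff\xd8\xff")}},  # placeholder bytes
]
print(content[1]["image_url"]["url"][:23])  # data:image/jpeg;base64,
```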

Monitoring Endpoints

--metrics
boolean
default:"false"
Enable Prometheus-compatible metrics endpoint at /metrics. Environment: LLAMA_ARG_ENDPOINT_METRICS
--slots
boolean
default:"true"
Expose the slot monitoring endpoint for viewing active requests. Environment: LLAMA_ARG_ENDPOINT_SLOTS
--props
boolean
default:"false"
Enable the POST /props endpoint for changing global properties. Environment: LLAMA_ARG_ENDPOINT_PROPS
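The /metrics endpoint serves the standard Prometheus text exposition format, which is easy to scrape or parse directly. A minimal parser is sketched below; the sample metric names are illustrative, since the exact set depends on your llama-server version:

```python
# Minimal parser for the Prometheus text format served at /metrics.
# Sample metric names are illustrative, not an exhaustive list.
sample = """\
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 2
llamacpp:tokens_predicted_total 1234
"""

def parse_metrics(text: str) -> dict:
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        name, value = line.rsplit(None, 1)
        metrics[name] = float(value)
    return metrics

print(parse_metrics(sample))
```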

Grammar & JSON Schemas

Constrain all outputs with a grammar:
# Custom grammar
llama-server -m model.gguf --grammar-file grammar.gbnf

# JSON output
llama-server -m model.gguf --grammar-file grammars/json.gbnf
Clients can also specify grammars per-request in the API.
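For reference, a grammar file uses GBNF syntax. The toy grammar below (an illustrative sketch) constrains output to a single yes/no answer:

```
root ::= "yes" | "no"
```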

Caching & Performance

--cache-prompt
boolean
default:"true"
Enable prompt caching to reuse the KV cache from previous requests. Environment: LLAMA_ARG_CACHE_PROMPT
--cache-reuse
integer
default:"0"
Minimum chunk size to attempt reusing from the cache via KV shifting. Requires prompt caching to be enabled. Environment: LLAMA_ARG_CACHE_REUSE
-sps, --slot-prompt-similarity
float
default:"0.1"
How closely a request prompt must match a slot's prompt to reuse that slot. 0.0 disables this feature.

Router Mode

Serve multiple models simultaneously:
--models-dir
string
Directory containing models for the router server. Environment: LLAMA_ARG_MODELS_DIR
--models-max
integer
default:"4"
Maximum number of models to load simultaneously. 0 = unlimited. Environment: LLAMA_ARG_MODELS_MAX
--models-autoload
boolean
default:"true"
Automatically load models on demand. Environment: LLAMA_ARG_MODELS_AUTOLOAD
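With router mode, a client targets a specific model through the standard OpenAI-style "model" field of the request. The sketch below builds such a payload; the model name "my_model" is a placeholder (assumption: names correspond to the files in the models directory, which you can confirm via the server's model listing):

```python
import json

# Hypothetical chat payload targeting one model on a multi-model server.
payload = {
    "model": "my_model",  # placeholder name; must match a served model
    "messages": [{"role": "user", "content": "Hello!"}],
}
print(json.dumps(payload))
```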

Timeout & Throttling

-to, --timeout
integer
default:"600"
Server read/write timeout in seconds. Environment: LLAMA_ARG_TIMEOUT
--threads-http
integer
default:"-1"
Number of threads used to process HTTP requests. Environment: LLAMA_ARG_THREADS_HTTP
--sleep-idle-seconds
integer
default:"-1"
Seconds of idleness before server sleeps to save resources. -1 disables.

Web UI Configuration

--webui
boolean
default:"true"
Enable the built-in web interface. Environment: LLAMA_ARG_WEBUI
--webui-config
json
JSON configuration for WebUI defaults. Environment: LLAMA_ARG_WEBUI_CONFIG
--webui-config-file
string
Path to a JSON file with WebUI configuration. Environment: LLAMA_ARG_WEBUI_CONFIG_FILE

Environment Variables

Boolean options use these values:
  • Enabled: true, 1, on, enabled
  • Disabled: false, 0, off, disabled
  • Negation: LLAMA_ARG_NO_MMAP disables mmap regardless of value
Example:
export LLAMA_ARG_MODEL=/models/my_model.gguf
export LLAMA_ARG_CTX_SIZE=4096
export LLAMA_ARG_N_PARALLEL=2
export LLAMA_ARG_ENDPOINT_METRICS=1
export LLAMA_ARG_MMAP=true
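The truthy/falsy mapping above can be mirrored in a small helper (a sketch of the documented behavior, not llama.cpp's actual parser):

```python
TRUTHY = {"true", "1", "on", "enabled"}
FALSY = {"false", "0", "off", "disabled"}

def parse_bool_env(value: str) -> bool:
    """Interpret a boolean environment variable per the values listed above."""
    v = value.strip().lower()
    if v in TRUTHY:
        return True
    if v in FALSY:
        return False
    raise ValueError(f"unrecognized boolean value: {value!r}")

print(parse_bool_env("on"))  # True
```

Note that negation variables such as LLAMA_ARG_NO_MMAP are an exception: per the note above, their mere presence disables the feature regardless of the value.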


Performance Optimization

Best Practices
  • Use --cont-batching for multiple concurrent users
  • Enable --cache-prompt to reuse computation across similar requests
  • Set --cache-reuse for improved performance with shared prefixes
  • Enable flash attention with --flash-attn on when running on supported hardware
  • Adjust -np (parallel slots) based on your concurrency needs
  • Monitor with --metrics endpoint for production deployments

Building with SSL

To enable HTTPS support:
cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server
Then use with SSL certificates:
llama-server -m model.gguf \
  --ssl-key-file server-key.pem \
  --ssl-cert-file server-cert.pem \
  --port 8443

See Also