The llama-server provides a production-ready HTTP API server with OpenAI-compatible endpoints for chat completions, embeddings, and more.

Features

  • OpenAI-compatible /v1/chat/completions and /v1/embeddings endpoints
  • Anthropic Messages API compatibility
  • Parallel decoding with multi-user support
  • Continuous batching for optimal throughput
  • Multimodal support (vision and audio)
  • Web UI for interactive testing
  • Reranking endpoint
  • Function calling / tool use
  • Speculative decoding

Quick Start

1. Start the server

Launch llama-server with your model:
./llama-server -m models/model.gguf -c 2048
The server will listen on http://127.0.0.1:8080 by default.
2. Access the Web UI

Open your browser and navigate to:
http://127.0.0.1:8080
3. Test the API

Make a request to the chat completions endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
The "model" field is required by many OpenAI clients but is ignored in single-model mode; llama-server always answers with the model it has loaded.

Starting the Server

Basic Configuration

# Start with default settings
./llama-server -m models/model.gguf

# Custom host and port
./llama-server -m models/model.gguf --host 0.0.0.0 --port 8080

# With GPU acceleration
./llama-server -m models/model.gguf -ngl 99

Docker

# Basic Docker run
docker run -p 8080:8080 -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server \
  -m models/model.gguf -c 512 --host 0.0.0.0 --port 8080

# With CUDA GPU support
docker run -p 8080:8080 -v /path/to/models:/models --gpus all \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m models/model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99

Docker Compose

services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080
For boolean environment variables such as LLAMA_ARG_MMAP, use true/1/on/enabled to enable the option or false/0/off/disabled to disable it.

Chat Completions API

OpenAI-compatible endpoint at /v1/chat/completions.

Basic Request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
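
With stream: true the server emits OpenAI-style server-sent events, one "data: {json}" line per chunk, ending with "data: [DONE]". A minimal stdlib-only sketch of decoding those lines (the helper names are illustrative, not part of llama-server):

```python
import json

def parse_sse_line(line: str):
    """Decode one server-sent-events line from a streaming response.

    Returns the chunk as a dict, or None for blank keep-alive lines
    and the final "data: [DONE]" sentinel.
    """
    line = line.strip()
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    return json.loads(payload)

def delta_text(chunk: dict) -> str:
    """Pull the incremental text out of a chat-completion chunk."""
    return chunk["choices"][0]["delta"].get("content", "")
```

Feeding each line of the response body through parse_sse_line and concatenating delta_text over the non-None chunks reconstructs the full reply.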

With Temperature and Top-P

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "temperature": 0.9,
    "top_p": 0.95,
    "max_tokens": 200
  }'

JSON Mode

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Generate a user profile"}],
    "response_format": {"type": "json_object"}
  }'
Python Client

The same endpoint works with the OpenAI Python SDK by pointing base_url at the local server:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)

Completion API

llama.cpp's native endpoint at /completion for raw text completion (not OpenAI-compatible).
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 128,
    "temperature": 0.7,
    "top_p": 0.9
  }'

With Streaming

curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Once upon a time",
    "stream": true,
    "n_predict": 100
  }'

Multiple Prompts

curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": ["First prompt", "Second prompt", "Third prompt"],
    "n_predict": 50
  }'
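
The same requests can be issued from Python with only the standard library. A sketch (helper names are illustrative) that builds the JSON body and POSTs it to a llama-server assumed to be running on the default port:

```python
import json
import urllib.request

def build_completion_body(prompt, n_predict=128, **params) -> bytes:
    """Encode a /completion request body; extra sampling params pass through."""
    body = {"prompt": prompt, "n_predict": n_predict, **params}
    return json.dumps(body).encode("utf-8")

def complete(prompt, base_url="http://localhost:8080", **params) -> str:
    """POST to /completion on a running llama-server and return the text."""
    req = urllib.request.Request(
        base_url + "/completion",
        data=build_completion_body(prompt, **params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Any of the parameters shown in the curl examples (temperature, top_p, stream, ...) can be passed as keyword arguments, e.g. complete("Once upon a time", temperature=0.7).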

Server Configuration

Parallel Processing

# Support multiple simultaneous users (default: auto)
./llama-server -m model.gguf -np 4

# Enable continuous batching (default: enabled)
./llama-server -m model.gguf -cb

Context and Caching

# Set context size
./llama-server -m model.gguf -c 4096

# Enable prompt caching (default: enabled)
./llama-server -m model.gguf --cache-prompt

# Enable KV cache reuse via shifting
./llama-server -m model.gguf --cache-reuse 256

Batch Processing

# Configure batch sizes for throughput
./llama-server -m model.gguf -b 2048 -ub 512

Authentication

API Key

# Single API key
./llama-server -m model.gguf --api-key "your-secret-key"

# Multiple API keys (comma-separated)
./llama-server -m model.gguf --api-key "key1,key2,key3"

# Load keys from file
./llama-server -m model.gguf --api-key-file keys.txt

Making Authenticated Requests

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'

SSL/TLS Configuration

# Build with SSL support
cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server

# Run with SSL certificates
./llama-server -m model.gguf \
  --ssl-key-file server-key.pem \
  --ssl-cert-file server-cert.pem

Monitoring Endpoints

Health Check

# Check if server is ready
curl http://localhost:8080/health

# Response when ready:
# {"status": "ok"}

# Response when loading:
# {"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}
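In scripts it is often useful to block until the model has finished loading. A small stdlib-only sketch (function names are illustrative) that polls /health until it reports ok:

```python
import json
import time
import urllib.error
import urllib.request

def parse_health(body: bytes) -> bool:
    """Return True if a /health response body reports the model is loaded."""
    return json.loads(body).get("status") == "ok"

def wait_until_ready(base_url="http://localhost:8080", timeout=120.0) -> bool:
    """Poll /health once per second until ready or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health") as resp:
                if parse_health(resp.read()):
                    return True
        except urllib.error.URLError:
            pass  # server not up yet, or still loading (HTTP 503)
        time.sleep(1.0)
    return False
```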

Slots Monitoring

# View slot status (enabled by default)
curl http://localhost:8080/slots

Prometheus Metrics

# Enable metrics endpoint
./llama-server -m model.gguf --metrics

# Access metrics
curl http://localhost:8080/metrics

Properties

# Get server properties
curl http://localhost:8080/props

# Enable POST to modify properties
./llama-server -m model.gguf --props

Model Management

Model Aliases

# Set model aliases for API routing
./llama-server -m model.gguf -a "gpt-4,gpt-4-turbo"

Model Tags

# Add informational tags
./llama-server -m model.gguf --tags "instruct,chat,7b"

Router Server Mode

# Serve multiple models from a directory
./llama-server --models-dir ./models --models-max 4

# With model presets
./llama-server --models-dir ./models --models-preset presets.ini

# Disable autoload
./llama-server --models-dir ./models --no-models-autoload

Advanced Features

Sleeping on Idle

# Put server to sleep after idle period (saves resources)
./llama-server -m model.gguf --sleep-idle-seconds 300

Slot Prompt Similarity

# Reuse slots with similar prompts (0.0-1.0, 0.0 = disabled)
./llama-server -m model.gguf -sps 0.5

Context Checkpoints

# Set max checkpoints per slot for state saving
./llama-server -m model.gguf --ctx-checkpoints 8

Slot Persistence

# Save slot KV cache to disk
./llama-server -m model.gguf --slot-save-path ./cache

Static File Serving

# Serve static files from directory
./llama-server -m model.gguf --path ./public

# Disable Web UI
./llama-server -m model.gguf --no-webui

# Custom Web UI config
./llama-server -m model.gguf --webui-config-file config.json

Utility Endpoints

Tokenization

# Tokenize text
curl http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello world!"}'

# With token pieces
curl http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello world!", "with_pieces": true}'

Detokenization

# Convert tokens back to text
curl http://localhost:8080/detokenize \
  -H "Content-Type: application/json" \
  -d '{"tokens": [123, 456, 789]}'
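
Chaining the two endpoints gives a quick round-trip sanity check from Python. A stdlib-only sketch (helper names are illustrative; a llama-server is assumed on the default port):

```python
import json
import urllib.request

def encode_body(payload: dict) -> bytes:
    """Serialize a request payload as the UTF-8 JSON llama-server expects."""
    return json.dumps(payload).encode("utf-8")

def post_json(path: str, payload: dict, base="http://localhost:8080") -> dict:
    """POST a JSON payload to llama-server and decode the JSON reply."""
    req = urllib.request.Request(
        base + path,
        data=encode_body(payload),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def roundtrip(text: str) -> str:
    """Tokenize text, then detokenize the ids; should give back ~the input."""
    tokens = post_json("/tokenize", {"content": text})["tokens"]
    return post_json("/detokenize", {"tokens": tokens})["content"]
```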

Apply Chat Template

# Format messages using model's chat template
curl http://localhost:8080/apply-template \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

Code Infill

For code completion models with fill-in-the-middle (FIM) support:
curl http://localhost:8080/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "def fibonacci(n):\n    ",
    "input_suffix": "\n    return result",
    "prompt": "# Calculate fibonacci"
  }'

With Repository Context

curl http://localhost:8080/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "def process_data():\n    ",
    "input_suffix": "\n    return result",
    "input_extra": [
      {"filename": "utils.py", "text": "def helper(): ..."},
      {"filename": "config.py", "text": "API_KEY = ..."}
    ]
  }'

LoRA Adapters

Loading LoRA

# Load LoRA adapters at startup
./llama-server -m model.gguf --lora adapter1.gguf,adapter2.gguf

# With scaling
./llama-server -m model.gguf --lora-scaled "adapter1.gguf:0.5,adapter2.gguf:1.0"

# Load without applying (apply via API)
./llama-server -m model.gguf --lora adapter.gguf --lora-init-without-apply

Managing LoRA via API

# Apply LoRA to specific request
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Specialized task",
    "lora": [{"id": 0, "scale": 0.8}, {"id": 1, "scale": 1.2}]
  }'
Note that requests with different LoRA configurations won't be batched together, which may reduce throughput.

Performance Tuning

Threading

# Set threads for generation and batch processing
./llama-server -m model.gguf -t 8 -tb 16

# HTTP request threads
./llama-server -m model.gguf --threads-http 4

GPU Configuration

# Offload layers to GPU
./llama-server -m model.gguf -ngl 99

# Split across multiple GPUs
./llama-server -m model.gguf -ngl 99 -sm layer -ts 3,1

# Specify devices
./llama-server -m model.gguf -dev cuda:0,cuda:1

Memory Management

# Force model to stay in RAM
./llama-server -m model.gguf --mlock

# Disable memory mapping
./llama-server -m model.gguf --no-mmap

# Set cache RAM limit (MiB, -1 = unlimited)
./llama-server -m model.gguf -cram 8192

KV Cache Optimization

# Quantize KV cache
./llama-server -m model.gguf -ctk q8_0 -ctv q8_0

# Use unified KV buffer
./llama-server -m model.gguf -kvu

# Disable KV offload
./llama-server -m model.gguf -nkvo

Logging

# Set verbosity level (0-4)
./llama-server -m model.gguf -lv 3

# Enable verbose logging
./llama-server -m model.gguf -v

# Log to file
export LLAMA_LOG_FILE="server.log"
./llama-server -m model.gguf

# Enable timestamps
./llama-server -m model.gguf --log-timestamps

# Colored logs (auto/on/off)
./llama-server -m model.gguf --log-colors auto

Timeout Configuration

# Set read/write timeout (seconds)
./llama-server -m model.gguf --timeout 600

# Time limit for text generation (milliseconds)
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Generate",
    "t_max_predict_ms": 5000
  }'

Building from Source

# Build with CMake
cmake -B build
cmake --build build --config Release -t llama-server

# Binary location
./build/bin/llama-server

With SSL Support

cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server

Common Configurations

High-Throughput Server

./llama-server -m model.gguf \
  -c 4096 \
  -np 8 \
  -b 2048 \
  -ub 512 \
  -ngl 99 \
  -t 16 \
  --cache-prompt \
  --cache-reuse 256 \
  --host 0.0.0.0 \
  --port 8080

Low-Latency Server

./llama-server -m model.gguf \
  -c 2048 \
  -np 2 \
  -b 512 \
  -ub 256 \
  -ngl 99 \
  -fa on \
  --cache-prompt \
  --host 0.0.0.0 \
  --port 8080

Development Server

./llama-server -m model.gguf \
  -c 2048 \
  -ngl 99 \
  -v \
  --metrics \
  --props \
  --log-timestamps \
  --host 127.0.0.1 \
  --port 8080

See Also

  • CLI Tool: Command-line inference interface
  • Embeddings: Generate text embeddings
  • Multimodal: Vision and audio support
  • Speculative Decoding: Accelerate with draft models