The llama-server provides a production-ready HTTP API server with OpenAI-compatible endpoints for chat completions, embeddings, and more.
Features
OpenAI-compatible /v1/chat/completions and /v1/embeddings endpoints
Anthropic Messages API compatibility
Parallel decoding with multi-user support
Continuous batching for optimal throughput
Multimodal support (vision and audio)
Web UI for interactive testing
Reranking endpoint
Function calling / tool use
Speculative decoding
Quick Start
Start the server
Launch llama-server with your model: ./llama-server -m models/model.gguf -c 2048
The server will listen on http://127.0.0.1:8080 by default.
Access the Web UI
Open your browser and navigate to http://127.0.0.1:8080 (or the host and port you configured).
Test the API
Make a request to the chat completions endpoint: curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Starting the Server
Basic Configuration
# Start with default settings
./llama-server -m models/model.gguf
# Custom host and port
./llama-server -m models/model.gguf --host 0.0.0.0 --port 8080
# With GPU acceleration
./llama-server -m models/model.gguf -ngl 99
Docker
# Basic Docker run
docker run -p 8080:8080 -v /path/to/models:/models \
ghcr.io/ggml-org/llama.cpp:server \
-m models/model.gguf -c 512 --host 0.0.0.0 --port 8080
# With CUDA GPU support
docker run -p 8080:8080 -v /path/to/models:/models --gpus all \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m models/model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
Docker Compose
services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080
For boolean environment variables like LLAMA_ARG_MMAP, use one of true/1/on/enabled or false/0/off/disabled.
Chat Completions API
OpenAI-compatible endpoint at /v1/chat/completions.
Basic Request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
}'
Streaming
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": true
}'
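With "stream": true, the server sends server-sent events (SSE), one JSON chunk per data: line, terminated by data: [DONE]. A minimal sketch of reassembling the reply from such a stream (the chunk shape follows the OpenAI streaming format; the sample lines below are illustrative):

```python
import json

def sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE lines.

    Each event looks like: data: {"choices":[{"delta":{"content":"Hi"}}]}
    The stream ends with:   data: [DONE]
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example: reassemble a reply from a captured stream
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"1, "}}]}',
    'data: {"choices":[{"delta":{"content":"2"}}]}',
    'data: [DONE]',
]
text = "".join(sse_deltas(sample))
```

In a real client you would iterate over the HTTP response body line by line instead of a captured list.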
With Temperature and Top-P
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Write a poem"}],
"temperature": 0.9,
"top_p": 0.95,
"max_tokens": 200
}'
JSON Mode
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Generate a user profile"}],
"response_format": {"type": "json_object"}
}'
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
Completion API
The llama.cpp-native endpoint at /completion performs raw text completion; it is not OpenAI-compatible.
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Building a website can be done in 10 simple steps:",
"n_predict": 128,
"temperature": 0.7,
"top_p": 0.9
}'
With Streaming
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Once upon a time",
"stream": true,
"n_predict": 100
}'
Multiple Prompts
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": ["First prompt", "Second prompt", "Third prompt"],
"n_predict": 50
}'
Server Configuration
Parallel Processing
# Support multiple simultaneous users (default: auto)
./llama-server -m model.gguf -np 4
# Enable continuous batching (default: enabled)
./llama-server -m model.gguf -cb
Context and Caching
# Set context size
./llama-server -m model.gguf -c 4096
# Enable prompt caching (default: enabled)
./llama-server -m model.gguf --cache-prompt
# Enable KV cache reuse via shifting
./llama-server -m model.gguf --cache-reuse 256
Batch Processing
# Configure batch sizes for throughput
./llama-server -m model.gguf -b 2048 -ub 512
Authentication
API Key
# Single API key
./llama-server -m model.gguf --api-key "your-secret-key"
# Multiple API keys (comma-separated)
./llama-server -m model.gguf --api-key "key1,key2,key3"
# Load keys from file
./llama-server -m model.gguf --api-key-file keys.txt
Making Authenticated Requests
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'
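The same authenticated request can be built with only the Python standard library; a sketch (the "model" value is a placeholder, as the server serves whatever model it loaded):

```python
import json
import urllib.request

def chat_request(messages, api_key,
                 url="http://localhost:8080/v1/chat/completions"):
    """Build an authenticated chat-completions request (not yet sent)."""
    body = json.dumps({"model": "gpt-4", "messages": messages}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = chat_request([{"role": "user", "content": "Hello"}], "your-secret-key")
# Send with: urllib.request.urlopen(req)
```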
SSL/TLS Configuration
# Build with SSL support
cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server
# Run with SSL certificates
./llama-server -m model.gguf \
--ssl-key-file server-key.pem \
--ssl-cert-file server-cert.pem
Monitoring Endpoints
Health Check
# Check if server is ready
curl http://localhost:8080/health
# Response when ready:
# {"status": "ok"}
# Response when loading:
# {"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}
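Since the server returns 503 while the model is loading, startup scripts typically poll /health before sending traffic. A minimal sketch with an injectable check function (in practice, check would GET http://localhost:8080/health and return the parsed JSON body):

```python
import time

def wait_until_ready(check, retries=30, delay=1.0):
    """Poll a health check until it reports {"status": "ok"}."""
    for _ in range(retries):
        try:
            if check().get("status") == "ok":
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    return False
```

Keeping the check injectable makes the retry logic easy to test without a running server.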
Slots Monitoring
# View slot status (enabled by default)
curl http://localhost:8080/slots
Prometheus Metrics
# Enable metrics endpoint
./llama-server -m model.gguf --metrics
# Access metrics
curl http://localhost:8080/metrics
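The /metrics endpoint returns Prometheus exposition text. If you don't run a full Prometheus scraper, a small parser is enough for ad-hoc monitoring; a sketch (the metric names in the sample are illustrative of the llamacpp: prefix the server uses):

```python
def parse_metrics(text):
    """Parse Prometheus exposition text into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
llamacpp:tokens_predicted_total 128
"""
m = parse_metrics(sample)
```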
Properties
# Get server properties
curl http://localhost:8080/props
# Enable POST to modify properties
./llama-server -m model.gguf --props
Model Management
Model Aliases
# Set model aliases for API routing
./llama-server -m model.gguf -a "gpt-4,gpt-4-turbo"
# Add informational tags
./llama-server -m model.gguf --tags "instruct,chat,7b"
Router Server Mode
# Serve multiple models from a directory
./llama-server --models-dir ./models --models-max 4
# With model presets
./llama-server --models-dir ./models --models-preset presets.ini
# Disable autoload
./llama-server --models-dir ./models --no-models-autoload
Advanced Features
Sleeping on Idle
# Put server to sleep after idle period (saves resources)
./llama-server -m model.gguf --sleep-idle-seconds 300
Slot Prompt Similarity
# Reuse slots with similar prompts (0.0-1.0, 0.0 = disabled)
./llama-server -m model.gguf -sps 0.5
Context Checkpoints
# Set max checkpoints per slot for state saving
./llama-server -m model.gguf --ctx-checkpoints 8
Slot Persistence
# Save slot KV cache to disk
./llama-server -m model.gguf --slot-save-path ./cache
Static File Serving
# Serve static files from directory
./llama-server -m model.gguf --path ./public
# Disable Web UI
./llama-server -m model.gguf --no-webui
# Custom Web UI config
./llama-server -m model.gguf --webui-config-file config.json
Utility Endpoints
Tokenization
# Tokenize text
curl http://localhost:8080/tokenize \
-H "Content-Type: application/json" \
-d '{"content": "Hello world!"}'
# With token pieces
curl http://localhost:8080/tokenize \
-H "Content-Type: application/json" \
-d '{"content": "Hello world!", "with_pieces": true}'
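Without "with_pieces" the tokens array holds plain integers; with it, each entry is an object carrying the id and the text piece. A sketch that normalizes either shape to a flat id list (the response shapes and token ids below are assumptions for illustration):

```python
def token_ids(response):
    """Normalize a /tokenize response to a flat list of token ids."""
    return [
        tok["id"] if isinstance(tok, dict) else tok
        for tok in response["tokens"]
    ]

# Assumed response shapes for the two request variants:
plain = {"tokens": [9906, 1917, 0]}
pieces = {"tokens": [{"id": 9906, "piece": "Hello"},
                     {"id": 1917, "piece": " world"}]}
```

The flat list can then be sent back to /detokenize unchanged.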
Detokenization
# Convert tokens back to text
curl http://localhost:8080/detokenize \
-H "Content-Type: application/json" \
-d '{"tokens": [123, 456, 789]}'
Apply Chat Template
# Format messages using model's chat template
curl http://localhost:8080/apply-template \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"}
]
}'
Code Infill
For code completion models with fill-in-the-middle (FIM) support:
curl http://localhost:8080/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def fibonacci(n):\n ",
"input_suffix": "\n return result",
"prompt": "# Calculate fibonacci"
}'
With Repository Context
curl http://localhost:8080/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def process_data():\n ",
"input_suffix": "\n return result",
"input_extra": [
{"filename": "utils.py", "text": "def helper(): ..."},
{"filename": "config.py", "text": "API_KEY = ..."}
]
}'
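When the repository context comes from real files, the request body is usually assembled programmatically. A minimal sketch of building the /infill payload (function name and file contents are illustrative):

```python
import json

def build_infill(prefix, suffix, extra=None):
    """Assemble an /infill request body, optionally with repo context."""
    body = {"input_prefix": prefix, "input_suffix": suffix}
    if extra:
        body["input_extra"] = [
            {"filename": name, "text": text} for name, text in extra
        ]
    return json.dumps(body)

payload = build_infill(
    "def process_data():\n    ",
    "\n    return result",
    extra=[("utils.py", "def helper(): ...")],
)
```

The resulting string is what curl passes via -d in the examples above.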
LoRA Adapters
Loading LoRA
# Load LoRA adapters at startup
./llama-server -m model.gguf --lora adapter1.gguf,adapter2.gguf
# With scaling
./llama-server -m model.gguf --lora-scaled "adapter1.gguf:0.5,adapter2.gguf:1.0"
# Load without applying (apply via API)
./llama-server -m model.gguf --lora adapter.gguf --lora-init-without-apply
Managing LoRA via API
# Apply LoRA to specific request
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Specialized task",
"lora": [{"id": 0, "scale": 0.8}, {"id": 1, "scale": 1.2}]
}'
Requests with different LoRA configurations won't be batched together, which may reduce throughput.
Threading
# Set threads for generation and batch processing
./llama-server -m model.gguf -t 8 -tb 16
# HTTP request threads
./llama-server -m model.gguf --threads-http 4
GPU Configuration
# Offload layers to GPU
./llama-server -m model.gguf -ngl 99
# Split across multiple GPUs
./llama-server -m model.gguf -ngl 99 -sm layer -ts 3,1
# Specify devices
./llama-server -m model.gguf -dev cuda:0,cuda:1
Memory Management
# Force model to stay in RAM
./llama-server -m model.gguf --mlock
# Disable memory mapping
./llama-server -m model.gguf --no-mmap
# Set cache RAM limit (MiB, -1 = unlimited)
./llama-server -m model.gguf -cram 8192
KV Cache Optimization
# Quantize KV cache
./llama-server -m model.gguf -ctk q8_0 -ctv q8_0
# Use unified KV buffer
./llama-server -m model.gguf -kvu
# Disable KV offload
./llama-server -m model.gguf -nkvo
Logging
# Set verbosity level (0-4)
./llama-server -m model.gguf -lv 3
# Enable verbose logging
./llama-server -m model.gguf -v
# Log to file
export LLAMA_LOG_FILE="server.log"
./llama-server -m model.gguf
# Enable timestamps
./llama-server -m model.gguf --log-timestamps
# Colored logs (auto/on/off)
./llama-server -m model.gguf --log-colors auto
Timeout Configuration
# Set read/write timeout (seconds)
./llama-server -m model.gguf --timeout 600
# Time limit for text generation (milliseconds)
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Generate",
"t_max_predict_ms": 5000
}'
Building from Source
# Build with CMake
cmake -B build
cmake --build build --config Release -t llama-server
# Binary location
./build/bin/llama-server
With SSL Support
cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server
Common Configurations
High-Throughput Server
./llama-server -m model.gguf \
-c 4096 \
-np 8 \
-b 2048 \
-ub 512 \
-ngl 99 \
-t 16 \
--cache-prompt \
--cache-reuse 256 \
--host 0.0.0.0 \
--port 8080
Low-Latency Server
./llama-server -m model.gguf \
-c 2048 \
-np 2 \
-b 512 \
-ub 256 \
-ngl 99 \
-fa on \
--cache-prompt \
--host 0.0.0.0 \
--port 8080
Development Server
./llama-server -m model.gguf \
-c 2048 \
-ngl 99 \
-v \
--metrics \
--props \
--log-timestamps \
--host 127.0.0.1 \
--port 8080
See Also
CLI Tool Command-line inference interface
Embeddings Generate text embeddings
Multimodal Vision and audio support
Speculative Decoding Accelerate with draft models