The llama-server provides a production-ready HTTP API server with OpenAI-compatible endpoints for chat completions, embeddings, and more.
Features
OpenAI-compatible /v1/chat/completions and /v1/embeddings endpoints
Anthropic Messages API compatibility
Parallel decoding with multi-user support
Continuous batching for optimal throughput
Multimodal support (vision and audio)
Web UI for interactive testing
Reranking endpoint
Function calling / tool use
Speculative decoding
Quick Start
Start the server
Launch llama-server with your model: ./llama-server -m models/model.gguf -c 2048
The server will listen on http://127.0.0.1:8080 by default.
Access the Web UI
Open your browser and navigate to http://127.0.0.1:8080 (or the host and port you configured).
Test the API
Make a request to the chat completions endpoint: curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Starting the Server
Basic Configuration
# Start with default settings
./llama-server -m models/model.gguf
# Custom host and port
./llama-server -m models/model.gguf --host 0.0.0.0 --port 8080
# With GPU acceleration
./llama-server -m models/model.gguf -ngl 99
Docker
# Basic Docker run
docker run -p 8080:8080 -v /path/to/models:/models \
ghcr.io/ggml-org/llama.cpp:server \
-m models/model.gguf -c 512 --host 0.0.0.0 --port 8080
# With CUDA GPU support
docker run -p 8080:8080 -v /path/to/models:/models --gpus all \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m models/model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
Docker Compose
services:
  llamacpp-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1
      LLAMA_ARG_PORT: 8080
For boolean environment variables like LLAMA_ARG_MMAP, use one of true/1/on/enabled or false/0/off/disabled.
Chat Completions API
OpenAI-compatible endpoint at /v1/chat/completions.
Basic Request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
}'
Streaming
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": true
}'
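With "stream": true, the server sends server-sent events (SSE), one JSON chunk per data: line, terminated by data: [DONE]. A minimal sketch of reassembling the reply from such a stream (the chunk shape follows the OpenAI streaming format; the sample lines below are illustrative):

```python
import json

def sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE lines.

    Each event looks like: data: {"choices":[{"delta":{"content":"Hi"}}]}
    The stream ends with:   data: [DONE]
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example: reassemble a reply from a captured stream
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"1, "}}]}',
    'data: {"choices":[{"delta":{"content":"2"}}]}',
    'data: [DONE]',
]
text = "".join(sse_deltas(sample))
```

In a real client you would iterate over the HTTP response body line by line instead of a captured list.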
With Temperature and Top-P
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Write a poem"}],
"temperature": 0.9,
"top_p": 0.95,
"max_tokens": 200
}'
JSON Mode
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Generate a user profile"}],
"response_format": {"type": "json_object"}
}'
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
Completion API
The llama.cpp-native endpoint at /completion performs raw text completion; it is not OpenAI-compatible.
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Building a website can be done in 10 simple steps:",
"n_predict": 128,
"temperature": 0.7,
"top_p": 0.9
}'
With Streaming
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Once upon a time",
"stream": true,
"n_predict": 100
}'
Multiple Prompts
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": ["First prompt", "Second prompt", "Third prompt"],
"n_predict": 50
}'
Server Configuration
Parallel Processing
# Support multiple simultaneous users (default: auto)
./llama-server -m model.gguf -np 4
# Enable continuous batching (default: enabled)
./llama-server -m model.gguf -cb
Context and Caching
# Set context size
./llama-server -m model.gguf -c 4096
# Enable prompt caching (default: enabled)
./llama-server -m model.gguf --cache-prompt
# Enable KV cache reuse via shifting
./llama-server -m model.gguf --cache-reuse 256
Batch Processing
# Configure batch sizes for throughput
./llama-server -m model.gguf -b 2048 -ub 512
Authentication
API Key
# Single API key
./llama-server -m model.gguf --api-key "your-secret-key"
# Multiple API keys (comma-separated)
./llama-server -m model.gguf --api-key "key1,key2,key3"
# Load keys from file
./llama-server -m model.gguf --api-key-file keys.txt
Making Authenticated Requests
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'
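The same authenticated request can be built with only the Python standard library; a sketch (the "model" value is a placeholder, as the server serves whatever model it loaded):

```python
import json
import urllib.request

def chat_request(messages, api_key,
                 url="http://localhost:8080/v1/chat/completions"):
    """Build an authenticated chat-completions request (not yet sent)."""
    body = json.dumps({"model": "gpt-4", "messages": messages}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = chat_request([{"role": "user", "content": "Hello"}], "your-secret-key")
# Send with: urllib.request.urlopen(req)
```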
SSL/TLS Configuration
# Build with SSL support
cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server
# Run with SSL certificates
./llama-server -m model.gguf \
--ssl-key-file server-key.pem \
--ssl-cert-file server-cert.pem
Monitoring Endpoints
Health Check
# Check if server is ready
curl http://localhost:8080/health
# Response when ready:
# {"status": "ok"}
# Response when loading:
# {"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}
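Since the server returns 503 while the model is loading, startup scripts typically poll /health before sending traffic. A minimal sketch with an injectable check function (in practice, check would GET http://localhost:8080/health and return the parsed JSON body):

```python
import time

def wait_until_ready(check, retries=30, delay=1.0):
    """Poll a health check until it reports {"status": "ok"}."""
    for _ in range(retries):
        try:
            if check().get("status") == "ok":
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    return False
```

Keeping the check injectable makes the retry logic easy to test without a running server.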
Slots Monitoring
# View slot status (enabled by default)
curl http://localhost:8080/slots
Prometheus Metrics
# Enable metrics endpoint
./llama-server -m model.gguf --metrics
# Access metrics
curl http://localhost:8080/metrics
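The /metrics endpoint returns Prometheus exposition text. If you don't run a full Prometheus scraper, a small parser is enough for ad-hoc monitoring; a sketch (the metric names in the sample are illustrative of the llamacpp: prefix the server uses):

```python
def parse_metrics(text):
    """Parse Prometheus exposition text into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
llamacpp:tokens_predicted_total 128
"""
m = parse_metrics(sample)
```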
Properties
# Get server properties
curl http://localhost:8080/props
# Enable POST to modify properties
./llama-server -m model.gguf --props
Model Management
Model Aliases
# Set model aliases for API routing
./llama-server -m model.gguf -a "gpt-4,gpt-4-turbo"
# Add informational tags
./llama-server -m model.gguf --tags "instruct,chat,7b"
Router Server Mode
# Serve multiple models from a directory
./llama-server --models-dir ./models --models-max 4
# With model presets
./llama-server --models-dir ./models --models-preset presets.ini
# Disable autoload
./llama-server --models-dir ./models --no-models-autoload
Advanced Features
Sleeping on Idle
# Put server to sleep after idle period (saves resources)
./llama-server -m model.gguf --sleep-idle-seconds 300
Slot Prompt Similarity
# Reuse slots with similar prompts (0.0-1.0, 0.0 = disabled)
./llama-server -m model.gguf -sps 0.5
Context Checkpoints
# Set max checkpoints per slot for state saving
./llama-server -m model.gguf --ctx-checkpoints 8
Slot Persistence
# Save slot KV cache to disk
./llama-server -m model.gguf --slot-save-path ./cache
Static File Serving
# Serve static files from directory
./llama-server -m model.gguf --path ./public
# Disable Web UI
./llama-server -m model.gguf --no-webui
# Custom Web UI config
./llama-server -m model.gguf --webui-config-file config.json
Utility Endpoints
Tokenization
# Tokenize text
curl http://localhost:8080/tokenize \
-H "Content-Type: application/json" \
-d '{"content": "Hello world!"}'
# With token pieces
curl http://localhost:8080/tokenize \
-H "Content-Type: application/json" \
-d '{"content": "Hello world!", "with_pieces": true}'
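Without "with_pieces" the tokens array holds plain integers; with it, each entry is an object carrying the id and the text piece. A sketch that normalizes either shape to a flat id list (the response shapes and token ids below are assumptions for illustration):

```python
def token_ids(response):
    """Normalize a /tokenize response to a flat list of token ids."""
    return [
        tok["id"] if isinstance(tok, dict) else tok
        for tok in response["tokens"]
    ]

# Assumed response shapes for the two request variants:
plain = {"tokens": [9906, 1917, 0]}
pieces = {"tokens": [{"id": 9906, "piece": "Hello"},
                     {"id": 1917, "piece": " world"}]}
```

The flat list can then be sent back to /detokenize unchanged.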
Detokenization
# Convert tokens back to text
curl http://localhost:8080/detokenize \
-H "Content-Type: application/json" \
-d '{"tokens": [123, 456, 789]}'
Apply Chat Template
# Format messages using model's chat template
curl http://localhost:8080/apply-template \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"}
]
}'
Code Infill
For code completion models with fill-in-the-middle (FIM) support:
curl http://localhost:8080/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def fibonacci(n):\n ",
"input_suffix": "\n return result",
"prompt": "# Calculate fibonacci"
}'
With Repository Context
curl http://localhost:8080/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "def process_data():\n ",
"input_suffix": "\n return result",
"input_extra": [
{"filename": "utils.py", "text": "def helper(): ..."},
{"filename": "config.py", "text": "API_KEY = ..."}
]
}'
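When the repository context comes from real files, the request body is usually assembled programmatically. A minimal sketch of building the /infill payload (function name and file contents are illustrative):

```python
import json

def build_infill(prefix, suffix, extra=None):
    """Assemble an /infill request body, optionally with repo context."""
    body = {"input_prefix": prefix, "input_suffix": suffix}
    if extra:
        body["input_extra"] = [
            {"filename": name, "text": text} for name, text in extra
        ]
    return json.dumps(body)

payload = build_infill(
    "def process_data():\n    ",
    "\n    return result",
    extra=[("utils.py", "def helper(): ...")],
)
```

The resulting string is what curl passes via -d in the examples above.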
LoRA Adapters
Loading LoRA
# Load LoRA adapters at startup
./llama-server -m model.gguf --lora adapter1.gguf,adapter2.gguf
# With scaling
./llama-server -m model.gguf --lora-scaled "adapter1.gguf:0.5,adapter2.gguf:1.0"
# Load without applying (apply via API)
./llama-server -m model.gguf --lora adapter.gguf --lora-init-without-apply
Managing LoRA via API
# Apply LoRA to specific request
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Specialized task",
"lora": [{"id": 0, "scale": 0.8}, {"id": 1, "scale": 1.2}]
}'
Requests with different LoRA configurations won't be batched together, which may reduce throughput.
Threading
# Set threads for generation and batch processing
./llama-server -m model.gguf -t 8 -tb 16
# HTTP request threads
./llama-server -m model.gguf --threads-http 4
GPU Configuration
# Offload layers to GPU
./llama-server -m model.gguf -ngl 99
# Split across multiple GPUs
./llama-server -m model.gguf -ngl 99 -sm layer -ts 3,1
# Specify devices
./llama-server -m model.gguf -dev cuda:0,cuda:1
Memory Management
# Force model to stay in RAM
./llama-server -m model.gguf --mlock
# Disable memory mapping
./llama-server -m model.gguf --no-mmap
# Set cache RAM limit (MiB, -1 = unlimited)
./llama-server -m model.gguf -cram 8192
KV Cache Optimization
# Quantize KV cache
./llama-server -m model.gguf -ctk q8_0 -ctv q8_0
# Use unified KV buffer
./llama-server -m model.gguf -kvu
# Disable KV offload
./llama-server -m model.gguf -nkvo
Logging
# Set verbosity level (0-4)
./llama-server -m model.gguf -lv 3
# Enable verbose logging
./llama-server -m model.gguf -v
# Log to file
export LLAMA_LOG_FILE="server.log"
./llama-server -m model.gguf
# Enable timestamps
./llama-server -m model.gguf --log-timestamps
# Colored logs (auto/on/off)
./llama-server -m model.gguf --log-colors auto
Timeout Configuration
# Set read/write timeout (seconds)
./llama-server -m model.gguf --timeout 600
# Time limit for text generation (milliseconds)
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Generate",
"t_max_predict_ms": 5000
}'
Building from Source
# Build with CMake
cmake -B build
cmake --build build --config Release -t llama-server
# Binary location
./build/bin/llama-server
With SSL Support
cmake -B build -DLLAMA_OPENSSL=ON
cmake --build build --config Release -t llama-server
Common Configurations
High-Throughput Server
./llama-server -m model.gguf \
-c 4096 \
-np 8 \
-b 2048 \
-ub 512 \
-ngl 99 \
-t 16 \
--cache-prompt \
--cache-reuse 256 \
--host 0.0.0.0 \
--port 8080
Low-Latency Server
./llama-server -m model.gguf \
-c 2048 \
-np 2 \
-b 512 \
-ub 256 \
-ngl 99 \
-fa on \
--cache-prompt \
--host 0.0.0.0 \
--port 8080
Development Server
./llama-server -m model.gguf \
-c 2048 \
-ngl 99 \
-v \
--metrics \
--props \
--log-timestamps \
--host 127.0.0.1 \
--port 8080
See Also
CLI Tool Command-line inference interface
Embeddings Generate text embeddings
Multimodal Vision and audio support
Speculative Decoding Accelerate with draft models