# llama-server

## Overview
llama-server is a fast, lightweight HTTP server for serving large language models with an OpenAI-compatible API. Implemented in plain C/C++ with minimal dependencies, it provides enterprise-grade features such as parallel decoding, continuous batching, and multi-user support.
## Quick Start

By default, the server listens on http://localhost:8080, with a web UI accessible via browser.
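For example, a minimal start using a model downloaded from Hugging Face (the repository shown is the one used as an example in the model-loading options below):

```shell
# Downloads the model on first run, then serves it on the default port 8080
llama-server -hf unsloth/phi-4-GGUF:q4_k_m
```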
## Key Features
- OpenAI API Compatible: Drop-in replacement for OpenAI chat completions and embeddings
- Anthropic Messages API: Compatible with Claude API format
- Parallel Decoding: Multi-user support with continuous batching
- Multimodal: Process images and audio through API endpoints
- Reranking: Built-in reranking endpoint for search applications
- Function Calling: Tool use support for compatible models
- Speculative Decoding: Accelerated generation with draft models
- Web UI: Built-in interface for testing and debugging
## Server Configuration

### Basic Server Options
- `--host`: IP address to bind to. Use `0.0.0.0` to allow external connections. You can also bind to a UNIX socket by ending the address with `.sock`. Environment: `LLAMA_ARG_HOST`
- `--port`: Port to listen on. Environment: `LLAMA_ARG_PORT`
- `--path`: Path to serve static files from. Environment: `LLAMA_ARG_STATIC_PATH`
- `--api-prefix`: Prefix path the server serves from (without trailing slash). Environment: `LLAMA_ARG_API_PREFIX`

### Model Loading
- `-m, --model`: Path to the GGUF model file. Environment: `LLAMA_ARG_MODEL`
- `-hf, --hf-repo`: Hugging Face repository in the format `<user>/<model>[:quant]`. Automatically downloads the mmproj for multimodal models unless disabled with `--no-mmproj`. Example: `unsloth/phi-4-GGUF:q4_k_m`. Environment: `LLAMA_ARG_HF_REPO`
- `-a, --alias`: Model name aliases (comma-separated) to be used by the API. Environment: `LLAMA_ARG_ALIAS`

### Parallel Processing
- `-np, --parallel`: Number of parallel slots (concurrent requests). `-1` means auto. Environment: `LLAMA_ARG_N_PARALLEL`
- `-c, --ctx-size`: Size of the prompt context. `0` loads the value from the model. For parallel requests, multiply by the number of slots: `-c 16384 -np 4` supports 4 concurrent requests with 4096 tokens of context each. Environment: `LLAMA_ARG_CTX_SIZE`
- `-cb, --cont-batching`: Enable continuous batching (dynamic batching) for efficient parallel processing. Environment: `LLAMA_ARG_CONT_BATCHING`

### Authentication & Security
- `--api-key`: API key for authentication. Multiple keys can be comma-separated. Environment: `LLAMA_API_KEY`
- `--api-key-file`: Path to a file containing API keys (one per line).
- `--ssl-key-file`: Path to the PEM-encoded SSL private key for HTTPS. Environment: `LLAMA_ARG_SSL_KEY_FILE`
- `--ssl-cert-file`: Path to the PEM-encoded SSL certificate for HTTPS. Environment: `LLAMA_ARG_SSL_CERT_FILE`

## Usage Examples
### Starting the Server
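Some typical invocations, combining the options above (model paths are illustrative):

```shell
# Serve a local GGUF file on all interfaces
llama-server -m ./models/model.gguf --host 0.0.0.0 --port 8080

# Serve with an API key and 4 parallel slots sharing a 16384-token context
llama-server -m ./models/model.gguf --api-key sk-example -np 4 -c 16384
```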
### Docker Deployment
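A sketch using the server container image (the `ghcr.io/ggml-org/llama.cpp:server` tag is the upstream image; verify it against the registry for your version):

```shell
docker run -p 8080:8080 -v "$(pwd)/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/model.gguf --host 0.0.0.0 --port 8080
```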
### Docker Compose
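A sketch of a compose file (image tag and model path are illustrative):

```yaml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    command: -m /models/model.gguf --host 0.0.0.0 --port 8080
```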
## API Endpoints

### Health Check

`GET /health` or `GET /v1/health`

Public endpoint (no API key required).
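Check it with:

```shell
curl http://localhost:8080/health
```

A healthy server responds with `{"status":"ok"}`; while the model is still loading it responds with HTTP 503.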
### Chat Completions (OpenAI-Compatible)

`POST /v1/chat/completions`
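For example (include the `Authorization` header only if an API key is configured; in single-model mode the `model` field can be any string or a configured alias):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLAMA_API_KEY" \
  -d '{
    "model": "phi-4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'
```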
### Completions (Non-OpenAI Format)

`POST /completion`

llama.cpp's native completion endpoint with extended features.
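For example, using native request fields such as `n_predict`:

```shell
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 128,
    "temperature": 0.7
  }'
```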
### Embeddings

`POST /v1/embeddings`

Generate embeddings with embedding models:
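A sketch, assuming the server was started with an embedding model and the `--embeddings` flag:

```shell
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```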
### Reranking

`POST /reranking`

Rerank documents for search applications:
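A sketch, assuming the server was started with a reranking model and the `--reranking` flag:

```shell
curl http://localhost:8080/reranking \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of France?",
    "documents": [
      "Paris is the capital of France.",
      "Berlin is the capital of Germany."
    ]
  }'
```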
## Advanced Configuration
### Multimodal Support

Serve vision or audio models. The `/v1/chat/completions` endpoint accepts images in base64 format:
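For example, using the OpenAI-style `image_url` content part (`<BASE64_DATA>` is a placeholder for the actual base64-encoded image; the model must be served with its mmproj):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<BASE64_DATA>"}}
      ]
    }]
  }'
```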
### Monitoring Endpoints

- `--metrics`: Enable a Prometheus-compatible metrics endpoint at `/metrics`. Environment: `LLAMA_ARG_ENDPOINT_METRICS`
- `--slots`: Expose the slot monitoring endpoint for viewing active requests. Environment: `LLAMA_ARG_ENDPOINT_SLOTS`
- `--props`: Enable the `POST /props` endpoint for changing global properties. Environment: `LLAMA_ARG_ENDPOINT_PROPS`

### Grammar & JSON Schemas
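For example, restricting the native endpoint to a yes/no answer by passing a GBNF grammar in the request's `grammar` field (the grammar shown is illustrative):

```shell
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Is the sky blue? Answer:",
    "grammar": "root ::= \" yes\" | \" no\""
  }'
```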
Constrain outputs with a GBNF grammar or a JSON schema supplied in the request.

### Caching & Performance
- `--cache-prompt`: Enable prompt caching to reuse the KV cache from previous requests. Environment: `LLAMA_ARG_CACHE_PROMPT`
- `--cache-reuse`: Minimum chunk size to attempt reusing from the cache via KV shifting. Requires prompt caching to be enabled. Environment: `LLAMA_ARG_CACHE_REUSE`
- `--slot-prompt-similarity`: How closely a request prompt must match a slot's prompt to reuse that slot. `0.0` disables this feature.

### Router Mode
Serve multiple models simultaneously:

- Directory containing models for the router server. Environment: `LLAMA_ARG_MODELS_DIR`
- Maximum number of models to load simultaneously. `0` = unlimited. Environment: `LLAMA_ARG_MODELS_MAX`
- Automatically load models on demand. Environment: `LLAMA_ARG_MODELS_AUTOLOAD`

### Timeout & Throttling
- `--timeout`: Server read/write timeout in seconds. Environment: `LLAMA_ARG_TIMEOUT`
- `--threads-http`: Number of threads used to process HTTP requests. Environment: `LLAMA_ARG_THREADS_HTTP`
- Seconds of idleness before the server sleeps to save resources. `-1` disables.

### Web UI Configuration
- Enable the built-in web interface. Environment: `LLAMA_ARG_WEBUI`
- JSON configuration for WebUI defaults. Environment: `LLAMA_ARG_WEBUI_CONFIG`
- Path to a JSON file with WebUI configuration. Environment: `LLAMA_ARG_WEBUI_CONFIG_FILE`

## Environment Variables
Boolean options accept the following values:

- Enabled: `true`, `1`, `on`, `enabled`
- Disabled: `false`, `0`, `off`, `disabled`
- Negation: `LLAMA_ARG_NO_MMAP` disables mmap regardless of the value
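For example, configuring the server by environment instead of flags (model path is illustrative):

```shell
LLAMA_ARG_HOST=0.0.0.0 LLAMA_ARG_PORT=8080 LLAMA_ARG_MODEL=./models/model.gguf llama-server
```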
## Performance Optimization

### Best Practices
- Use `--cont-batching` for multiple concurrent users
- Enable `--cache-prompt` to reuse computation across similar requests
- Set `--cache-reuse` for improved performance with shared prefixes
- Use `--flash-attn on` on supported hardware
- Adjust `-np` (parallel slots) based on your concurrency needs
- Enable the `--metrics` endpoint to monitor production deployments
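Putting these practices together in one invocation (model path and `--cache-reuse` value are illustrative):

```shell
llama-server -m ./models/model.gguf \
  -np 4 -c 16384 \
  --cont-batching --cache-prompt --cache-reuse 256 \
  --flash-attn on --metrics
```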
## Building with SSL
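A sketch, assuming llama.cpp's CMake build and its `LLAMA_SERVER_SSL` option (verify the option name against your source tree):

```shell
cmake -B build -DLLAMA_SERVER_SSL=ON
cmake --build build --target llama-server
```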
To enable HTTPS, build with SSL support compiled in, then start the server with `--ssl-key-file` and `--ssl-cert-file`.

## See Also
- llama-cli - Interactive command-line interface
- llama-bench - Performance benchmarking
- Server Development Guide

