Features
- OpenAI-compatible API: Drop-in replacement for OpenAI’s API endpoints
- High Performance: Pure C/C++ implementation for maximum speed
- GPU Acceleration: Support for CUDA, Metal, and other backends
- Streaming Responses: Real-time token generation with Server-Sent Events
- Multiple Models: Router mode for managing multiple models simultaneously
- Multimodal Support: Vision and audio capabilities (experimental)
- Function Calling: Tool use support for compatible models
- Flexible Deployment: Docker, native binaries, or cloud platforms
Quick Start
Starting the Server
The server listens on http://127.0.0.1:8080 by default.
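A minimal invocation looks like this (the model path is a placeholder for your own GGUF file):

```shell
# Start the server with a local GGUF model; it will listen on 127.0.0.1:8080
llama-server -m ./models/my-model.gguf
```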
Common Server Arguments
- -m, --model - Path to the model file (GGUF format)
- -c, --ctx-size - Size of the prompt context (0 = loaded from model)
- -n, --predict - Number of tokens to predict (-1 = infinity)
- -ngl, --gpu-layers - Number of layers to store in VRAM (auto, all, or specific number)
- --host - IP address to bind to (default: 127.0.0.1)
- --port - Port to listen on (default: 8080)
- -np, --parallel - Number of parallel slots for concurrent requests (-1 = auto)
- --api-key - API key for authentication (can be comma-separated list for multiple keys)
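Put together, a typical launch that exercises several of these options might look like this (the model path and values are illustrative):

```shell
# Illustrative launch: 8K-token context, full GPU offload, four parallel slots,
# listening on all interfaces
llama-server -m ./models/my-model.gguf -c 8192 -ngl 99 \
  --host 0.0.0.0 --port 8080 --parallel 4
```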
Authentication
To enable API key authentication, start the server with the --api-key flag:
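For example (the key value is a placeholder):

```shell
# Require clients to present an API key
llama-server -m ./models/my-model.gguf --api-key sk-my-secret-key

# Clients then pass the key as a Bearer token
curl http://127.0.0.1:8080/v1/models \
  -H "Authorization: Bearer sk-my-secret-key"
```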
Without --api-key, the server runs in open mode. The health endpoint (/health) is always public regardless of authentication settings.
Using with OpenAI Client Libraries
The llama.cpp server is compatible with OpenAI’s client libraries:
Available Endpoints
OpenAI-Compatible Endpoints
- POST /v1/chat/completions - Chat-based text generation
- POST /v1/completions - Text completion
- POST /v1/embeddings - Generate text embeddings
- GET /v1/models - List available models
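Because these endpoints are OpenAI-compatible, the official OpenAI Python client can talk to them directly; a minimal sketch, assuming the server runs on the default host and port (the API key value is a placeholder and is only checked if --api-key is set):

```python
from openai import OpenAI

# Point the client at the local llama.cpp server instead of api.openai.com
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-no-key-required",  # placeholder; the field is required by the client
)

response = client.chat.completions.create(
    model="my-model",  # model ID as reported by GET /v1/models
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```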
Native llama.cpp Endpoints
- POST /completion - Native completion endpoint (not OAI-compatible)
- POST /embedding - Native embeddings endpoint (not OAI-compatible)
- POST /tokenize - Tokenize text
- POST /detokenize - Convert tokens to text
- GET /health - Health check endpoint
- GET /props - Server properties and configuration
- GET /slots - Monitor slot status and performance
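As a quick illustration of the native endpoints, /tokenize accepts a JSON body with a content field and returns the corresponding token IDs (default host and port assumed):

```shell
# Tokenize a string using the native endpoint
curl -X POST http://127.0.0.1:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello, world!"}'
```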
Additional Features
- POST /infill - Code infilling for completion
- POST /reranking - Document reranking
- GET /metrics - Prometheus-compatible metrics (requires the --metrics flag)
Model Configuration
Setting Model Alias
By default, the model ID is the file path. You can set a custom alias with the --alias flag.
Downloading Models from Hugging Face
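Models can also be pulled straight from Hugging Face with the -hf flag; a sketch (the repository name is illustrative):

```shell
# Download the model from Hugging Face (cached locally) and serve it
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```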
Health Check
Check if the server is ready:
- 200 OK with {"status": "ok"} - Server is ready
- 503 Service Unavailable with error message - Model is still loading
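A readiness probe is a single request (default host and port assumed); /health needs no API key:

```shell
# -i prints the status line, so both the 200 and 503 cases are visible
curl -i http://127.0.0.1:8080/health
```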
Environment Variables
Many arguments can be configured via environment variables using the LLAMA_ARG_ prefix (e.g., LLAMA_ARG_MODEL, LLAMA_ARG_PORT):
Error Handling
The server returns OpenAI-compatible error responses:
- authentication_error - Invalid or missing API key
- invalid_request_error - Malformed request
- unavailable_error - Server not ready (model loading)
- not_supported_error - Feature not enabled (e.g., metrics endpoint)
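An invalid API key, for instance, produces a response of roughly this shape (illustrative):

```json
{
  "error": {
    "code": 401,
    "message": "Invalid API Key",
    "type": "authentication_error"
  }
}
```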
Next Steps
- Chat Completions API - Interactive chat with models
- Completions API - Simple text completion
- Embeddings API - Generate vector embeddings

