Docker Deployment

Docker provides an easy way to run llama.cpp without building from source, with support for CPU and various GPU backends.

Prerequisites

  • Docker must be installed and running on your system
  • Create a folder to store models and intermediate files (e.g., /llama/models)
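For example (the path is illustrative; any directory the Docker daemon can read works):

```shell
# Create a host directory that will be bind-mounted into the container.
mkdir -p "$HOME/llama/models"
```

This directory is then passed to docker run with -v "$HOME/llama/models":/models.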

Available Images

llama.cpp provides pre-built Docker images in three variants:

  • Full: the complete toolset, including the CLI, conversion tools, and quantization
  • Light: only the llama-cli and llama-completion executables
  • Server: only llama-server, for API deployment

CPU Images

ghcr.io/ggml-org/llama.cpp:full
ghcr.io/ggml-org/llama.cpp:light
ghcr.io/ggml-org/llama.cpp:server
Platforms: linux/amd64, linux/arm64, linux/s390x

GPU Images

ghcr.io/ggml-org/llama.cpp:full-cuda
ghcr.io/ggml-org/llama.cpp:light-cuda
ghcr.io/ggml-org/llama.cpp:server-cuda
Platform: linux/amd64
GPU-enabled images are not currently tested by CI beyond being built. If you need different settings (e.g., different CUDA version), you’ll need to build locally.

Quick Start

Run the CLI

docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:light \
  -m /models/model.gguf -p "Hello, world!"
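The command above runs a single prompt and exits. For an interactive chat session, allocate a TTY with -it and enable conversation mode (-cnv in recent llama-cli releases; the flag may vary by version):

```shell
docker run -it -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:light \
  -m /models/model.gguf -cnv
```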

Run Server

docker run -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/model.gguf --port 8080 --host 0.0.0.0
Access the API at http://localhost:8080
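Once the container reports that the model is loaded, you can smoke-test it from the host. llama-server exposes a /health endpoint and an OpenAI-compatible API; the request body below is a minimal example:

```shell
# Liveness/readiness check.
curl http://localhost:8080/health

# Minimal OpenAI-compatible chat completion request.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```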

All-in-One Conversion

The full image includes model conversion tools:
docker run -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:full \
  --all-in-one "/models/" 7B
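The full image's entrypoint also exposes individual modes; for example, --run wraps llama-cli, so a converted model can be tested without switching images (the model filename below is illustrative):

```shell
docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full \
  --run -m /models/7B/ggml-model-q4_0.gguf \
  -p "Building a website can be done in 10 simple steps:" -n 512
```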

GPU Acceleration

NVIDIA GPU (CUDA)

Requires nvidia-container-toolkit installed.
docker run --gpus all -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/model.gguf \
  --n-gpu-layers 32 \
  --port 8080 --host 0.0.0.0
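If the toolkit is missing, a typical installation on Ubuntu looks like this (after adding NVIDIA's package repository; see NVIDIA's install guide for your distribution):

```shell
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```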

AMD GPU (ROCm)

docker run --device=/dev/kfd --device=/dev/dri \
  -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-rocm \
  -m /models/model.gguf \
  --n-gpu-layers 32

Docker Compose

Create a docker-compose.yml file:
version: '3.8'

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"
    command: >
      -m /models/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 32
      -c 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Run with:
docker compose up -d

Building Locally

Build CPU Image

docker build -t local/llama.cpp:full \
  --target full \
  -f .devops/full.Dockerfile .

Build CUDA Image

docker build -t local/llama.cpp:full-cuda \
  --target full \
  --build-arg CUDA_VERSION=12.4.0 \
  --build-arg CUDA_DOCKER_ARCH=all \
  -f .devops/cuda.Dockerfile .
Build arguments:

  • CUDA_VERSION: CUDA version to use (default: 12.4.0)
  • CUDA_DOCKER_ARCH: target GPU architectures (default: all)

Specify only the architectures you need for smaller images:
--build-arg CUDA_DOCKER_ARCH="70;75;80;86"
Build ROCm Image

docker build -t local/llama.cpp:server-rocm \
  --target server \
  -f .devops/rocm.Dockerfile .

Build Vulkan Image

docker build -t local/llama.cpp:server-vulkan \
  --target server \
  -f .devops/vulkan.Dockerfile .

Production Deployment

Health Check

Add health checks to your Docker configuration:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

Resource Limits

deploy:
  resources:
    limits:
      cpus: '4'
      memory: 16G
    reservations:
      cpus: '2'
      memory: 8G

Environment Variables

environment:
  - LLAMA_ARG_THREADS=8
  - LLAMA_ARG_CTX_SIZE=4096
  - LLAMA_ARG_N_GPU_LAYERS=32
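In recent llama.cpp releases these LLAMA_ARG_* variables map to the corresponding CLI flags (explicit flags take precedence), so the same settings work with a plain docker run; a sketch:

```shell
docker run -v /path/to/models:/models -p 8080:8080 \
  -e LLAMA_ARG_THREADS=8 \
  -e LLAMA_ARG_CTX_SIZE=4096 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/model.gguf --port 8080 --host 0.0.0.0
```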

Kubernetes Deployment

Create a k8s-deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
      - name: llama-server
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        args:
          - "-m"
          - "/models/model.gguf"
          - "--port"
          - "8080"
          - "--host"
          - "0.0.0.0"
          - "--n-gpu-layers"
          - "32"
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: models
          mountPath: /models
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: llama-models
---
apiVersion: v1
kind: Service
metadata:
  name: llama-server
spec:
  selector:
    app: llama-server
  ports:
  - port: 8080
    targetPort: 8080
  type: LoadBalancer
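To deploy and reach the service (port-forwarding is handy when no cloud LoadBalancer is available):

```shell
kubectl apply -f k8s-deployment.yaml
kubectl rollout status deployment/llama-server
# For local testing without a LoadBalancer:
kubectl port-forward service/llama-server 8080:8080
```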

Troubleshooting

GPU not detected

  • Ensure nvidia-container-toolkit is installed and configured
  • Check that nvidia-smi works inside a container:
    docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
  • Verify that the --gpus all flag is set

Out of memory

  • Reduce the context size: -c 2048
  • Reduce GPU layers: --n-gpu-layers 16
  • Use a smaller quantization, e.g. Q4_K_M instead of Q8_0
  • Increase Docker memory limits

Model fails to load

  • Check that volume mount paths exist and are readable
  • Run with matching user permissions:
    docker run --user $(id -u):$(id -g) ...

Next Steps

Server Configuration

Learn about server options and configuration

REST API

Use the OpenAI-compatible API