Docker Deployment

Docker provides an easy way to run llama.cpp without building from source, with support for CPU and various GPU backends.

Prerequisites

  • Docker must be installed and running on your system
  • Create a folder to store models and intermediate files (e.g., /llama/models)
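For example (the path is illustrative; any directory the Docker daemon can read works):

```shell
# Create a host directory that will be bind-mounted into the container.
mkdir -p "$HOME/llama/models"
```

This directory is then passed to docker run with -v "$HOME/llama/models":/models.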

Available Images

llama.cpp provides pre-built Docker images in three variants:

  • Full: the complete toolset, including the CLI, conversion tools, and quantization
  • Light: only the llama-cli and llama-completion executables
  • Server: only llama-server, for API deployment

CPU Images

ghcr.io/ggml-org/llama.cpp:full
ghcr.io/ggml-org/llama.cpp:light
ghcr.io/ggml-org/llama.cpp:server
Platforms: linux/amd64, linux/arm64, linux/s390x

GPU Images

ghcr.io/ggml-org/llama.cpp:full-cuda
ghcr.io/ggml-org/llama.cpp:light-cuda
ghcr.io/ggml-org/llama.cpp:server-cuda
Platform: linux/amd64
GPU-enabled images are not currently tested by CI beyond being built. If you need different settings (e.g., different CUDA version), you’ll need to build locally.

Quick Start

Run the CLI

docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:light \
  -m /models/model.gguf -p "Hello, world!"
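The command above runs a single prompt and exits. For an interactive chat session, allocate a TTY with -it and enable conversation mode (-cnv in recent llama-cli releases; the flag may vary by version):

```shell
docker run -it -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:light \
  -m /models/model.gguf -cnv
```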

Run Server

docker run -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/model.gguf --port 8080 --host 0.0.0.0
Access the API at http://localhost:8080
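Once the container reports that the model is loaded, you can smoke-test it from the host. llama-server exposes a /health endpoint and an OpenAI-compatible API; the request body below is a minimal example:

```shell
# Liveness/readiness check.
curl http://localhost:8080/health

# Minimal OpenAI-compatible chat completion request.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```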

All-in-One Conversion

The full image includes model conversion tools:
docker run -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:full \
  --all-in-one "/models/" 7B
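The full image's entrypoint also exposes individual modes; for example, --run wraps llama-cli, so a converted model can be tested without switching images (the model filename below is illustrative):

```shell
docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full \
  --run -m /models/7B/ggml-model-q4_0.gguf \
  -p "Building a website can be done in 10 simple steps:" -n 512
```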

GPU Acceleration

NVIDIA GPU (CUDA)

Requires nvidia-container-toolkit installed.
docker run --gpus all -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/model.gguf \
  --n-gpu-layers 32 \
  --port 8080 --host 0.0.0.0
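If the toolkit is missing, a typical installation on Ubuntu looks like this (after adding NVIDIA's package repository; see NVIDIA's install guide for your distribution):

```shell
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```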

AMD GPU (ROCm)

docker run --device=/dev/kfd --device=/dev/dri \
  -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-rocm \
  -m /models/model.gguf \
  --n-gpu-layers 32

Docker Compose

Create a docker-compose.yml file:
version: '3.8'

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"
    command: >
      -m /models/model.gguf
      --port 8080
      --host 0.0.0.0
      --n-gpu-layers 32
      -c 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Run with:
docker compose up -d

Building Locally

Build CPU Image

docker build -t local/llama.cpp:full \
  --target full \
  -f .devops/full.Dockerfile .

Build CUDA Image

docker build -t local/llama.cpp:full-cuda \
  --target full \
  --build-arg CUDA_VERSION=12.4.0 \
  --build-arg CUDA_DOCKER_ARCH=all \
  -f .devops/cuda.Dockerfile .
Build arguments:

  • CUDA_VERSION: CUDA version to use (default: 12.4.0)
  • CUDA_DOCKER_ARCH: target GPU architectures (default: all)

Specify only the architectures you need for smaller images:
--build-arg CUDA_DOCKER_ARCH="70;75;80;86"
Build ROCm Image

docker build -t local/llama.cpp:server-rocm \
  --target server \
  -f .devops/rocm.Dockerfile .

Build Vulkan Image

docker build -t local/llama.cpp:server-vulkan \
  --target server \
  -f .devops/vulkan.Dockerfile .

Production Deployment

Health Check

Add health checks to your Docker configuration:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

Resource Limits

deploy:
  resources:
    limits:
      cpus: '4'
      memory: 16G
    reservations:
      cpus: '2'
      memory: 8G

Environment Variables

environment:
  - LLAMA_ARG_THREADS=8
  - LLAMA_ARG_CTX_SIZE=4096
  - LLAMA_ARG_N_GPU_LAYERS=32
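In recent llama.cpp releases these LLAMA_ARG_* variables map to the corresponding CLI flags (explicit flags take precedence), so the same settings work with a plain docker run; a sketch:

```shell
docker run -v /path/to/models:/models -p 8080:8080 \
  -e LLAMA_ARG_THREADS=8 \
  -e LLAMA_ARG_CTX_SIZE=4096 \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/model.gguf --port 8080 --host 0.0.0.0
```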

Kubernetes Deployment

Create a k8s-deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
      - name: llama-server
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        args:
          - "-m"
          - "/models/model.gguf"
          - "--port"
          - "8080"
          - "--host"
          - "0.0.0.0"
          - "--n-gpu-layers"
          - "32"
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: models
          mountPath: /models
        resources:
          limits:
            nvidia.com/gpu: 1
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: llama-models
---
apiVersion: v1
kind: Service
metadata:
  name: llama-server
spec:
  selector:
    app: llama-server
  ports:
  - port: 8080
    targetPort: 8080
  type: LoadBalancer
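To deploy and reach the service (port-forwarding is handy when no cloud LoadBalancer is available):

```shell
kubectl apply -f k8s-deployment.yaml
kubectl rollout status deployment/llama-server
# For local testing without a LoadBalancer:
kubectl port-forward service/llama-server 8080:8080
```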

Troubleshooting

GPU not detected

  • Ensure nvidia-container-toolkit is installed and configured
  • Check that nvidia-smi works inside a container:
    docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
  • Verify that the --gpus all flag is set

Out of memory

  • Reduce the context size: -c 2048
  • Reduce GPU layers: --n-gpu-layers 16
  • Use a smaller quantization, e.g. Q4_K_M instead of Q8_0
  • Increase Docker memory limits

Model fails to load

  • Check that volume mount paths exist and are readable
  • Run with matching user permissions:
    docker run --user $(id -u):$(id -g) ...

Next Steps

Server Configuration

Learn about server options and configuration

REST API

Use the OpenAI-compatible API