Docker is the recommended installation method. It avoids manual dependency management and supports the full feature set — GPU inference, document Q&A, vision models, voice STT/TTS, and image generation.
All public h2oGPT images are hosted in Google Container Registry. For GPU use, the host needs an NVIDIA driver compatible with CUDA 12.1 or newer.
## Set up Docker

### Install Docker

```bash
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu jammy stable"
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo systemctl status docker
```

Replace `jammy` with `focal` if you are on Ubuntu 20.04.
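If you prefer not to hard-code the codename, you can read it from `/etc/os-release` instead. A minimal sketch (the `CODENAME` variable and the `jammy` fallback are illustrative choices, not part of the official instructions):

```shell
# Read the Ubuntu release codename (e.g. jammy, focal) from /etc/os-release,
# falling back to jammy if the field is missing.
CODENAME=$(. /etc/os-release && echo "${VERSION_CODENAME:-jammy}")
# Use it when adding the Docker apt repository line.
echo "deb [arch=amd64] https://download.docker.com/linux/ubuntu ${CODENAME} stable"
```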
### Add your user to the `docker` group

```bash
sudo usermod -aG docker $USER
newgrp docker
```

This avoids requiring `sudo` for every `docker` command. Alternatively, reboot for the group change to take effect.

### Install the NVIDIA Container Toolkit (GPU only)
Skip this step if you are running CPU-only inference.

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base
sudo apt install -y nvidia-container-runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify that `nvidia-smi` works inside Docker:

```bash
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
**Windows:** Install Docker Desktop for Windows. Enable WSL 2 integration during setup for GPU passthrough support.

**macOS:** Install Docker Desktop for Mac. Native GPU acceleration (Metal) is not available inside Docker on macOS; use the native macOS install for M1/M2 GPU inference.
## Pull the h2oGPT image

Ensure you have the latest image before running:

```bash
docker pull gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1
```
## Run h2oGPT

### Create required directories and API key file

```bash
mkdir -p ~/.cache/huggingface/hub/
mkdir -p ~/.triton/cache/
mkdir -p ~/.config/vllm/
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
echo '["key1","key2"]' > ~/h2ogpt_auth/h2ogpt_api_keys.json
```
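The `key1`/`key2` placeholders are not secure. One way to generate random keys in the same JSON list format is a short sketch like the following (an illustrative approach using Python's `secrets` module, assuming `python3` is available on the host):

```shell
# Generate two random API keys and write them in the JSON list format
# that h2oGPT reads from h2ogpt_api_keys.json.
KEY1=$(python3 -c 'import secrets; print(secrets.token_urlsafe(24))')
KEY2=$(python3 -c 'import secrets; print(secrets.token_urlsafe(24))')
mkdir -p ~/h2ogpt_auth
printf '["%s","%s"]\n' "$KEY1" "$KEY2" > ~/h2ogpt_auth/h2ogpt_api_keys.json
# Confirm the file is valid JSON before starting the container.
python3 -m json.tool ~/h2ogpt_auth/h2ogpt_api_keys.json
```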
### Run the container

The example below runs the Zephyr 7B Beta model with GPU access, Gradio on port 7860, and the OpenAI-compatible API on port 5000.

```bash
export GRADIO_SERVER_PORT=7860
export OPENAI_SERVER_PORT=5000
docker run \
  --gpus all \
  --runtime=nvidia \
  --shm-size=2g \
  -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
  -p $OPENAI_SERVER_PORT:$OPENAI_SERVER_PORT \
  --rm --init \
  --network host \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache/huggingface/hub/:/workspace/.cache/huggingface/hub \
  -v "${HOME}"/.config:/workspace/.config/ \
  -v "${HOME}"/.triton:/workspace/.triton/ \
  -v "${HOME}"/save:/workspace/save \
  -v "${HOME}"/user_path:/workspace/user_path \
  -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
  -v "${HOME}"/users:/workspace/users \
  -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
  -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
  -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
  -e GRADIO_SERVER_PORT=$GRADIO_SERVER_PORT \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1 /workspace/generate.py \
    --base_model=HuggingFaceH4/zephyr-7b-beta \
    --use_safetensors=True \
    --prompt_type=zephyr \
    --save_dir='/workspace/save/' \
    --auth_filename='/workspace/h2ogpt_auth/auth.db' \
    --h2ogpt_api_keys='/workspace/h2ogpt_auth/h2ogpt_api_keys.json' \
    --use_gpu_id=False \
    --user_path=/workspace/user_path \
    --langchain_mode="LLM" \
    --langchain_modes="['UserData', 'LLM']" \
    --score_model=None \
    --max_max_new_tokens=2048 \
    --max_new_tokens=1024 \
    --use_auth_token="${HUGGING_FACE_HUB_TOKEN}" \
    --openai_port=$OPENAI_SERVER_PORT
```

Open http://localhost:7860 in your browser. Add `-d` after `docker run` to run in detached background mode; use `docker ps` to get the container ID and `docker stop <hash>` to stop it.
For a single GPU, use `--gpus '"device=0"'`; for two GPUs, use `--gpus '"device=0,1"'` instead of `--gpus all`. To disable key-based access, delete the `--h2ogpt_api_keys` line. Replace `key1` and `key2` with real secret values before exposing the server to a network.
## Run offline (air-gapped)

Set the offline environment variables and run with pre-cached model weights:

```bash
export TRANSFORMERS_OFFLINE=1
export GRADIO_SERVER_PORT=7860
export OPENAI_SERVER_PORT=5000
export HF_HUB_OFFLINE=1
docker run --gpus all \
  --runtime=nvidia \
  --shm-size=2g \
  -e TRANSFORMERS_OFFLINE=$TRANSFORMERS_OFFLINE \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -e HF_HUB_OFFLINE=$HF_HUB_OFFLINE \
  -e HF_HOME="/workspace/.cache/huggingface/" \
  -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
  -p $OPENAI_SERVER_PORT:$OPENAI_SERVER_PORT \
  --rm --init \
  --network host \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache/huggingface/:/workspace/.cache/huggingface \
  -v "${HOME}"/save:/workspace/save \
  -v "${HOME}"/user_path:/workspace/user_path \
  -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
  -v "${HOME}"/users:/workspace/users \
  -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
  -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
  -e GRADIO_SERVER_PORT=$GRADIO_SERVER_PORT \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1 \
  /workspace/generate.py \
    --base_model=mistralai/Mistral-7B-Instruct-v0.2 \
    --use_safetensors=False \
    --prompt_type=mistral \
    --save_dir='/workspace/save/' \
    --use_gpu_id=False \
    --user_path=/workspace/user_path \
    --langchain_mode="LLM" \
    --langchain_modes="['UserData', 'MyData', 'LLM']" \
    --score_model=None \
    --max_max_new_tokens=2048 \
    --max_new_tokens=1024 \
    --openai_port=$OPENAI_SERVER_PORT \
    --gradio_offline_level=2
```
## Run with vLLM

You can run vLLM in one container and h2oGPT in another on the same host.

### Start the vLLM inference server

The example below serves h2oai/h2ogpt-4096-llama2-7b-chat on two GPUs:

```bash
unset CUDA_VISIBLE_DEVICES
mkdir -p $HOME/.cache/huggingface/hub
mkdir -p $HOME/.triton/cache/
mkdir -p $HOME/.config/vllm
docker run \
  --runtime=nvidia \
  --gpus '"device=0,1"' \
  --shm-size=10.24gb \
  -p 5000:5000 \
  --rm --init \
  --network host \
  -e NCCL_IGNORE_DISABLED_P2P=1 \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -e VLLM_NO_USAGE_STATS=1 \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache:$HOME/.cache/ \
  -v "${HOME}"/.config:$HOME/.config/ \
  vllm/vllm-openai:latest \
    --port=5000 \
    --host=0.0.0.0 \
    --model=h2oai/h2ogpt-4096-llama2-7b-chat \
    --tokenizer=hf-internal-testing/llama-tokenizer \
    --tensor-parallel-size=2 \
    --seed 1234 \
    --trust-remote-code
```
### Connect h2oGPT to the vLLM server

Add `--inference_server="vllm:0.0.0.0:5000"` to the h2oGPT `docker run` command, and set `--base_model` to match the model loaded in vLLM.
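As a sketch, the connection string follows a `vllm:<host>:<port>` pattern; here it is composed from the host and port used in the vLLM example above (the `VLLM_HOST`/`VLLM_PORT` variables are illustrative, not flags h2oGPT itself reads):

```shell
# Compose the --inference_server value from the vLLM host/port used above.
VLLM_HOST=0.0.0.0
VLLM_PORT=5000
INFERENCE_SERVER="vllm:${VLLM_HOST}:${VLLM_PORT}"
# These are the extra flags to append to the h2oGPT generate.py arguments.
echo "--inference_server=${INFERENCE_SERVER} --base_model=h2oai/h2ogpt-4096-llama2-7b-chat"
```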
Verify the vLLM endpoint
curl http://localhost:5000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "h2oai/h2ogpt-4096-llama2-7b-chat",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
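A successful response is OpenAI-style JSON with the generated text at `choices[0].text`. A minimal sketch of extracting that field (the sample payload below is illustrative, not real server output; in practice, pipe the `curl` output instead):

```shell
# Sample OpenAI-style completions payload (illustrative, not real server output).
RESPONSE='{"choices":[{"text":" city in California"}]}'
# Pull out the generated text with a small Python one-liner.
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```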
## Docker Compose

### (Optional) Edit model settings

Open `docker-compose.yml` and update the `environment` block to set your desired model and weights.

### Build and start

```bash
docker-compose up -d --build
```

### Tear down

```bash
docker-compose down --volumes --rmi all
```
## Build a custom image

GCR contains nightly and released images for x86. To build your own image after local changes (for example, to enable Metal support for GGUF files on macOS):

```bash
touch build_info.txt
docker build -t h2ogpt .
```

Then replace `gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1` with `h2ogpt:latest` in any `docker run` command.

For Metal (M1/M2) GGUF support, change `CMAKE_ARGS` in `docker_build_script_ubuntu.sh` to `-DLLAMA_METAL=on` and remove the `GGML_CUDA=1` line before building.
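The edit can be scripted with `sed`. A hedged sketch, demonstrated on a stand-in copy of the build script, assuming it sets `CMAKE_ARGS` and a `GGML_CUDA=1` line roughly as shown (inspect the actual `docker_build_script_ubuntu.sh` first, as its contents may differ):

```shell
# Create a stand-in copy of the build script; the real file is
# docker_build_script_ubuntu.sh and its exact contents may differ.
cat > /tmp/docker_build_script_demo.sh <<'EOF'
export GGML_CUDA=1
export CMAKE_ARGS="-DGGML_CUDA=on"
EOF
# Switch CMAKE_ARGS to Metal, then drop the CUDA line.
sed -i 's/^export CMAKE_ARGS=.*/export CMAKE_ARGS="-DLLAMA_METAL=on"/' /tmp/docker_build_script_demo.sh
sed -i '/GGML_CUDA=1/d' /tmp/docker_build_script_demo.sh
cat /tmp/docker_build_script_demo.sh
```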