Docker is the recommended installation method. It avoids manual dependency management and supports the full feature set — GPU inference, document Q&A, vision models, voice STT/TTS, and image generation. All public h2oGPT images are hosted in Google Container Registry; for GPU inference, the host driver must support CUDA 12.1 or newer.

Set up Docker

1. Install Docker

sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
sudo add-apt-repository -y "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu jammy stable"
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo systemctl status docker
Replace jammy with focal if you are on Ubuntu 20.04.
2. Add your user to the docker group

sudo usermod -aG docker $USER
newgrp docker
This lets you run docker commands without sudo. Note that newgrp docker applies the new group in the current shell only; alternatively, log out and back in (or reboot).
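To confirm the group change took effect, a quick check (a sketch, not part of the official instructions):

```shell
# Print the current user's groups; "docker" should appear after re-login or newgrp.
if id -nG | grep -qw docker; then
  echo "docker group active"
else
  echo "docker group not active yet - log out and back in, or run: newgrp docker"
fi
```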
3. Install the NVIDIA Container Toolkit (GPU only)

Skip this step if you are running CPU-only inference.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify that nvidia-smi works inside Docker:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Pull the h2oGPT image

Ensure you have the latest image before running:
docker pull gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1

Run h2oGPT

1. Create required directories and API key file

mkdir -p ~/.cache/huggingface/hub/
mkdir -p ~/.triton/cache/
mkdir -p ~/.config/vllm/
mkdir -p ~/.cache
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
echo '["key1","key2"]' > ~/h2ogpt_auth/h2ogpt_api_keys.json
2. Run the container

The example below runs the Zephyr 7B Beta model with GPU access, Gradio on port 7860, and the OpenAI API on port 5000.
export GRADIO_SERVER_PORT=7860
export OPENAI_SERVER_PORT=5000

docker run \
  --gpus all \
  --runtime=nvidia \
  --shm-size=2g \
  -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
  -p $OPENAI_SERVER_PORT:$OPENAI_SERVER_PORT \
  --rm --init \
  --network host \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache/huggingface/hub/:/workspace/.cache/huggingface/hub \
  -v "${HOME}"/.config:/workspace/.config/ \
  -v "${HOME}"/.triton:/workspace/.triton/ \
  -v "${HOME}"/save:/workspace/save \
  -v "${HOME}"/user_path:/workspace/user_path \
  -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
  -v "${HOME}"/users:/workspace/users \
  -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
  -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
  -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
  -e GRADIO_SERVER_PORT=$GRADIO_SERVER_PORT \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1 /workspace/generate.py \
    --base_model=HuggingFaceH4/zephyr-7b-beta \
    --use_safetensors=True \
    --prompt_type=zephyr \
    --save_dir='/workspace/save/' \
    --auth_filename='/workspace/h2ogpt_auth/auth.db' \
    --h2ogpt_api_keys='/workspace/h2ogpt_auth/h2ogpt_api_keys.json' \
    --use_gpu_id=False \
    --user_path=/workspace/user_path \
    --langchain_mode="LLM" \
    --langchain_modes="['UserData', 'LLM']" \
    --score_model=None \
    --max_max_new_tokens=2048 \
    --max_new_tokens=1024 \
    --use_auth_token="${HUGGING_FACE_HUB_TOKEN}" \
    --openai_port=$OPENAI_SERVER_PORT
Open http://localhost:7860 in your browser.
Add -d after docker run to run in detached (background) mode; use docker ps to find the container ID and docker stop <container-id> to stop it.
To pin specific GPUs, use --gpus '"device=0"' for a single GPU or --gpus '"device=0,1"' for two, instead of --gpus all. To remove key-based access, delete the --h2ogpt_api_keys line; otherwise, change key1 and key2 to real secret values before exposing the server to a network. Note that with --network host, the -p mappings are redundant: the ports bind directly on the host.
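Once the container reports ready, the OpenAI-compatible endpoint can be exercised with one of the keys from h2ogpt_api_keys.json; listing models is a cheap smoke test (a sketch, assuming the server is reachable on localhost:5000 and serves the standard /v1/models route):

```shell
# GET /v1/models is part of the OpenAI-compatible API surface; authenticate
# with an API key from h2ogpt_api_keys.json.
curl -s http://localhost:5000/v1/models \
  -H "Authorization: Bearer key1" || echo "server not reachable yet"
```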

Run offline (air-gapped)

Set the offline environment variables and run with model weights already present in ~/.cache/huggingface (download them once on a connected host, then transfer the cache):
export TRANSFORMERS_OFFLINE=1
export GRADIO_SERVER_PORT=7860
export OPENAI_SERVER_PORT=5000
export HF_HUB_OFFLINE=1

docker run --gpus all \
  --runtime=nvidia \
  --shm-size=2g \
  -e TRANSFORMERS_OFFLINE=$TRANSFORMERS_OFFLINE \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -e HF_HUB_OFFLINE=$HF_HUB_OFFLINE \
  -e HF_HOME="/workspace/.cache/huggingface/" \
  -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
  -p $OPENAI_SERVER_PORT:$OPENAI_SERVER_PORT \
  --rm --init \
  --network host \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache/huggingface/:/workspace/.cache/huggingface \
  -v "${HOME}"/save:/workspace/save \
  -v "${HOME}"/user_path:/workspace/user_path \
  -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
  -v "${HOME}"/users:/workspace/users \
  -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
  -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
  -e GRADIO_SERVER_PORT=$GRADIO_SERVER_PORT \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1 \
  /workspace/generate.py \
    --base_model=mistralai/Mistral-7B-Instruct-v0.2 \
    --use_safetensors=False \
    --prompt_type=mistral \
    --save_dir='/workspace/save/' \
    --use_gpu_id=False \
    --user_path=/workspace/user_path \
    --langchain_mode="LLM" \
    --langchain_modes="['UserData', 'MyData', 'LLM']" \
    --score_model=None \
    --max_max_new_tokens=2048 \
    --max_new_tokens=1024 \
    --openai_port=$OPENAI_SERVER_PORT \
    --gradio_offline_level=2

Run with vLLM

You can run vLLM in one container and h2oGPT in another on the same host.
1. Start the vLLM inference server

The example below runs h2oai/h2ogpt-4096-llama2-7b-chat on two GPUs:
unset CUDA_VISIBLE_DEVICES
mkdir -p $HOME/.cache/huggingface/hub
mkdir -p $HOME/.triton/cache/
mkdir -p $HOME/.config/vllm

docker run \
  --runtime=nvidia \
  --gpus '"device=0,1"' \
  --shm-size=10.24gb \
  -p 5000:5000 \
  --rm --init \
  --network host \
  -e NCCL_IGNORE_DISABLED_P2P=1 \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -e VLLM_NO_USAGE_STATS=1 \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache:$HOME/.cache/ \
  -v "${HOME}"/.config:$HOME/.config/ \
  vllm/vllm-openai:latest \
    --port=5000 \
    --host=0.0.0.0 \
    --model=h2oai/h2ogpt-4096-llama2-7b-chat \
    --tokenizer=hf-internal-testing/llama-tokenizer \
    --tensor-parallel-size=2 \
    --seed 1234 \
    --trust-remote-code
2. Connect h2oGPT to the vLLM server

Add --inference_server="vllm:0.0.0.0:5000" to the h2oGPT docker run command, and set --base_model to match the model loaded in vLLM.
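Concretely, the tail of the earlier h2oGPT docker run command would gain flags along these lines (a fragment to splice into the full command, keeping the trailing backslashes consistent):

```
    --inference_server="vllm:0.0.0.0:5000" \
    --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
```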
3. Verify the vLLM endpoint

curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "h2oai/h2ogpt-4096-llama2-7b-chat",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'

Docker Compose

1. (Optional) Edit model settings

Open docker-compose.yml and update the environment block to set your desired model and weights.
2. Build and start

docker-compose up -d --build
3. Open the UI

Navigate to http://localhost:7860 in your browser.

4. View logs

docker-compose logs -f
5. Tear down

docker-compose down --volumes --rmi all

Build a custom image

The GCR contains nightly and released images for x86. To build your own image after local changes (for example, to enable Metal support for GGUF files on macOS):
touch build_info.txt
docker build -t h2ogpt .
Then replace gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1 with h2ogpt:latest in any docker run command.
For Metal M1/M2 GGUF support, change CMAKE_ARGS in docker_build_script_ubuntu.sh to -DLLAMA_METAL=on and remove the GGML_CUDA=1 line before building.
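The two edits can be sketched with sed on a stand-in snippet (illustration only; the real target is docker_build_script_ubuntu.sh, whose exact contents may differ):

```shell
# Delete the GGML_CUDA=1 line and rewrite CMAKE_ARGS for Metal.
printf 'GGML_CUDA=1\nCMAKE_ARGS="-DGGML_CUDA=on"\n' \
  | sed -e '/^GGML_CUDA=1/d' -e 's/CMAKE_ARGS=.*/CMAKE_ARGS="-DLLAMA_METAL=on"/'
# → CMAKE_ARGS="-DLLAMA_METAL=on"
```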
