Docker is the recommended installation method. It avoids manual dependency management and supports the full feature set — GPU inference, document Q&A, vision models, voice STT/TTS, and image generation.
All public h2oGPT images are hosted in Google Container Registry. For GPU use, the host needs an NVIDIA driver compatible with CUDA 12.1 or newer.
## Set up Docker

### Install Docker

```bash
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu jammy stable"
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo systemctl status docker
```

Replace `jammy` with `focal` if you are on Ubuntu 20.04.
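If you prefer not to hard-code the codename, you can read it from `/etc/os-release` instead. A minimal sketch (the `CODENAME` variable and the `jammy` fallback are illustrative choices, not part of the official instructions):

```shell
# Read the Ubuntu release codename (e.g. jammy, focal) from /etc/os-release,
# falling back to jammy if the field is missing.
CODENAME=$(. /etc/os-release && echo "${VERSION_CODENAME:-jammy}")
# Use it when adding the Docker apt repository line.
echo "deb [arch=amd64] https://download.docker.com/linux/ubuntu ${CODENAME} stable"
```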
### Add your user to the `docker` group

```bash
sudo usermod -aG docker $USER
newgrp docker
```

This avoids requiring `sudo` for every `docker` command. Alternatively, reboot for the group change to take effect.

### Install the NVIDIA Container Toolkit (GPU only)
Skip this step if you are running CPU-only inference.

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base
sudo apt install -y nvidia-container-runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify that `nvidia-smi` works inside Docker:

```bash
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
**Windows:** Install Docker Desktop for Windows. Enable WSL 2 integration during setup for GPU passthrough support.

**macOS:** Install Docker Desktop for Mac. Native GPU acceleration (Metal) is not available inside Docker on macOS; use the native macOS install for M1/M2 GPU inference.
## Pull the h2oGPT image

Ensure you have the latest image before running:

```bash
docker pull gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1
```
## Run h2oGPT

### Create required directories and API key file

```bash
mkdir -p ~/.cache/huggingface/hub/
mkdir -p ~/.triton/cache/
mkdir -p ~/.config/vllm/
mkdir -p ~/save
mkdir -p ~/user_path
mkdir -p ~/db_dir_UserData
mkdir -p ~/users
mkdir -p ~/db_nonusers
mkdir -p ~/llamacpp_path
mkdir -p ~/h2ogpt_auth
echo '["key1","key2"]' > ~/h2ogpt_auth/h2ogpt_api_keys.json
```
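The `key1`/`key2` placeholders are not secure. One way to generate random keys in the same JSON list format is a short sketch like the following (an illustrative approach using Python's `secrets` module, assuming `python3` is available on the host):

```shell
# Generate two random API keys and write them in the JSON list format
# that h2oGPT reads from h2ogpt_api_keys.json.
KEY1=$(python3 -c 'import secrets; print(secrets.token_urlsafe(24))')
KEY2=$(python3 -c 'import secrets; print(secrets.token_urlsafe(24))')
mkdir -p ~/h2ogpt_auth
printf '["%s","%s"]\n' "$KEY1" "$KEY2" > ~/h2ogpt_auth/h2ogpt_api_keys.json
# Confirm the file is valid JSON before starting the container.
python3 -m json.tool ~/h2ogpt_auth/h2ogpt_api_keys.json
```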
### Run the container

The example below runs the Zephyr 7B Beta model with GPU access, Gradio on port 7860, and the OpenAI-compatible API on port 5000.

```bash
export GRADIO_SERVER_PORT=7860
export OPENAI_SERVER_PORT=5000
docker run \
  --gpus all \
  --runtime=nvidia \
  --shm-size=2g \
  -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
  -p $OPENAI_SERVER_PORT:$OPENAI_SERVER_PORT \
  --rm --init \
  --network host \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache/huggingface/hub/:/workspace/.cache/huggingface/hub \
  -v "${HOME}"/.config:/workspace/.config/ \
  -v "${HOME}"/.triton:/workspace/.triton/ \
  -v "${HOME}"/save:/workspace/save \
  -v "${HOME}"/user_path:/workspace/user_path \
  -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
  -v "${HOME}"/users:/workspace/users \
  -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
  -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
  -v "${HOME}"/h2ogpt_auth:/workspace/h2ogpt_auth \
  -e GRADIO_SERVER_PORT=$GRADIO_SERVER_PORT \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1 /workspace/generate.py \
    --base_model=HuggingFaceH4/zephyr-7b-beta \
    --use_safetensors=True \
    --prompt_type=zephyr \
    --save_dir='/workspace/save/' \
    --auth_filename='/workspace/h2ogpt_auth/auth.db' \
    --h2ogpt_api_keys='/workspace/h2ogpt_auth/h2ogpt_api_keys.json' \
    --use_gpu_id=False \
    --user_path=/workspace/user_path \
    --langchain_mode="LLM" \
    --langchain_modes="['UserData', 'LLM']" \
    --score_model=None \
    --max_max_new_tokens=2048 \
    --max_new_tokens=1024 \
    --use_auth_token="${HUGGING_FACE_HUB_TOKEN}" \
    --openai_port=$OPENAI_SERVER_PORT
```

Open http://localhost:7860 in your browser. Add `-d` after `docker run` to run in detached background mode; use `docker ps` to get the container ID and `docker stop <hash>` to stop it.
For a single GPU, use `--gpus '"device=0"'`; for two GPUs, use `--gpus '"device=0,1"'` instead of `--gpus all`. To disable key-based access, delete the `--h2ogpt_api_keys` line. Replace `key1` and `key2` with real secret values before exposing the server to a network.
## Run offline (air-gapped)

Set the offline environment variables and run with pre-cached model weights:

```bash
export TRANSFORMERS_OFFLINE=1
export GRADIO_SERVER_PORT=7860
export OPENAI_SERVER_PORT=5000
export HF_HUB_OFFLINE=1
docker run --gpus all \
  --runtime=nvidia \
  --shm-size=2g \
  -e TRANSFORMERS_OFFLINE=$TRANSFORMERS_OFFLINE \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -e HF_HUB_OFFLINE=$HF_HUB_OFFLINE \
  -e HF_HOME="/workspace/.cache/huggingface/" \
  -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
  -p $OPENAI_SERVER_PORT:$OPENAI_SERVER_PORT \
  --rm --init \
  --network host \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache/huggingface/:/workspace/.cache/huggingface \
  -v "${HOME}"/save:/workspace/save \
  -v "${HOME}"/user_path:/workspace/user_path \
  -v "${HOME}"/db_dir_UserData:/workspace/db_dir_UserData \
  -v "${HOME}"/users:/workspace/users \
  -v "${HOME}"/db_nonusers:/workspace/db_nonusers \
  -v "${HOME}"/llamacpp_path:/workspace/llamacpp_path \
  -e GRADIO_SERVER_PORT=$GRADIO_SERVER_PORT \
  gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1 \
  /workspace/generate.py \
    --base_model=mistralai/Mistral-7B-Instruct-v0.2 \
    --use_safetensors=False \
    --prompt_type=mistral \
    --save_dir='/workspace/save/' \
    --use_gpu_id=False \
    --user_path=/workspace/user_path \
    --langchain_mode="LLM" \
    --langchain_modes="['UserData', 'MyData', 'LLM']" \
    --score_model=None \
    --max_max_new_tokens=2048 \
    --max_new_tokens=1024 \
    --openai_port=$OPENAI_SERVER_PORT \
    --gradio_offline_level=2
```
## Run with vLLM

You can run vLLM in one container and h2oGPT in another on the same host.

### Start the vLLM inference server

The example below serves h2oai/h2ogpt-4096-llama2-7b-chat on two GPUs:

```bash
unset CUDA_VISIBLE_DEVICES
mkdir -p $HOME/.cache/huggingface/hub
mkdir -p $HOME/.triton/cache/
mkdir -p $HOME/.config/vllm
docker run \
  --runtime=nvidia \
  --gpus '"device=0,1"' \
  --shm-size=10.24gb \
  -p 5000:5000 \
  --rm --init \
  --network host \
  -e NCCL_IGNORE_DISABLED_P2P=1 \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  -e VLLM_NO_USAGE_STATS=1 \
  -u `id -u`:`id -g` \
  -v "${HOME}"/.cache:$HOME/.cache/ \
  -v "${HOME}"/.config:$HOME/.config/ \
  vllm/vllm-openai:latest \
    --port=5000 \
    --host=0.0.0.0 \
    --model=h2oai/h2ogpt-4096-llama2-7b-chat \
    --tokenizer=hf-internal-testing/llama-tokenizer \
    --tensor-parallel-size=2 \
    --seed 1234 \
    --trust-remote-code
```
### Connect h2oGPT to the vLLM server

Add `--inference_server="vllm:0.0.0.0:5000"` to the h2oGPT `docker run` command, and set `--base_model` to match the model loaded in vLLM.
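As a sketch, the connection string follows a `vllm:<host>:<port>` pattern; here it is composed from the host and port used in the vLLM example above (the `VLLM_HOST`/`VLLM_PORT` variables are illustrative, not flags h2oGPT itself reads):

```shell
# Compose the --inference_server value from the vLLM host/port used above.
VLLM_HOST=0.0.0.0
VLLM_PORT=5000
INFERENCE_SERVER="vllm:${VLLM_HOST}:${VLLM_PORT}"
# These are the extra flags to append to the h2oGPT generate.py arguments.
echo "--inference_server=${INFERENCE_SERVER} --base_model=h2oai/h2ogpt-4096-llama2-7b-chat"
```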
Verify the vLLM endpoint
curl http://localhost:5000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "h2oai/h2ogpt-4096-llama2-7b-chat",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
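A successful response is OpenAI-style JSON with the generated text at `choices[0].text`. A minimal sketch of extracting that field (the sample payload below is illustrative, not real server output; in practice, pipe the `curl` output instead):

```shell
# Sample OpenAI-style completions payload (illustrative, not real server output).
RESPONSE='{"choices":[{"text":" city in California"}]}'
# Pull out the generated text with a small Python one-liner.
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```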
## Docker Compose

### (Optional) Edit model settings

Open `docker-compose.yml` and update the `environment` block to set your desired model and weights.

### Build and start

```bash
docker-compose up -d --build
```

### Tear down

```bash
docker-compose down --volumes --rmi all
```
## Build a custom image

GCR contains nightly and released images for x86. To build your own image after local changes (for example, to enable Metal support for GGUF files on macOS):

```bash
touch build_info.txt
docker build -t h2ogpt .
```

Then replace `gcr.io/vorvan/h2oai/h2ogpt-runtime:0.2.1` with `h2ogpt:latest` in any `docker run` command.

For Metal (M1/M2) GGUF support, change `CMAKE_ARGS` in `docker_build_script_ubuntu.sh` to `-DLLAMA_METAL=on` and remove the `GGML_CUDA=1` line before building.
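The edit can be scripted with `sed`. A hedged sketch, demonstrated on a stand-in copy of the build script, assuming it sets `CMAKE_ARGS` and a `GGML_CUDA=1` line roughly as shown (inspect the actual `docker_build_script_ubuntu.sh` first, as its contents may differ):

```shell
# Create a stand-in copy of the build script; the real file is
# docker_build_script_ubuntu.sh and its exact contents may differ.
cat > /tmp/docker_build_script_demo.sh <<'EOF'
export GGML_CUDA=1
export CMAKE_ARGS="-DGGML_CUDA=on"
EOF
# Switch CMAKE_ARGS to Metal, then drop the CUDA line.
sed -i 's/^export CMAKE_ARGS=.*/export CMAKE_ARGS="-DLLAMA_METAL=on"/' /tmp/docker_build_script_demo.sh
sed -i '/GGML_CUDA=1/d' /tmp/docker_build_script_demo.sh
cat /tmp/docker_build_script_demo.sh
```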