This guide gets you from zero to a working h2oGPT instance using pip. For a full-featured setup with vision, audio, and image generation, use the Docker install instead. Prerequisites: Python 3.10, and either a CPU or an NVIDIA GPU with CUDA 11.8 or 12.1.

Install

1

Set the PyTorch index URL

Choose the index URL that matches your hardware before running pip install.
# CPU-only (on Windows, use "set" instead of "export")
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
# NVIDIA GPU with CUDA 12.1 (use cu118 for CUDA 11.8)
# export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu121"
2

Configure llama_cpp_python build flags

h2oGPT uses llama.cpp to run GGUF models. Set the build flags that match your hardware before installing.
# NVIDIA GPU build (skip these flags for a CPU-only build)
export GGML_CUDA=1
export CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all"
export FORCE_CMAKE=1
Building with all CUDA architectures takes several minutes, but it is required: llama_cpp_python fails if you omit all from CMAKE_CUDA_ARCHITECTURES.
3

Install h2oGPT
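A minimal sketch of the install itself, assuming you are cloning the h2oGPT GitHub repository and installing its pinned requirements (the repository URL and requirements layout are the standard h2oGPT project conventions; adjust if your checkout differs):

```shell
# Clone the repository; generate.py is run from the repo root
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt

# Install with the index URL and build flags from steps 1-2
# still exported in the current shell
pip install -r requirements.txt
```

Keep the environment variables from the previous two steps exported in the same shell session, since pip reads PIP_EXTRA_INDEX_URL and the llama_cpp_python build reads CMAKE_ARGS at install time.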

Run a GGUF model

1

Start h2oGPT with a Mistral GGUF model

python generate.py \
  --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  --prompt_type=mistral \
  --max_seq_len=4096
h2oGPT downloads the model on first run, then starts both the Gradio UI and the OpenAI API server. For a larger context window (requires more GPU memory):
python generate.py \
  --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  --prompt_type=mistral \
  --max_seq_len=32768
2

Llama 3 and other chat-template models

Newer models use a chat template instead of a named --prompt_type. Pass the Hugging Face tokenizer so h2oGPT can read the template:
python generate.py \
  --base_model=llama \
  --model_path_llama=https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf?download=true \
  --tokenizer_base_model=meta-llama/Meta-Llama-3-8B-Instruct \
  --max_seq_len=8192
Or for Phi-3:
python generate.py \
  --tokenizer_base_model=microsoft/Phi-3-mini-4k-instruct \
  --base_model=llama \
  --llama_cpp_model=https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf \
  --max_seq_len=4096
3

Open the Gradio UI

Navigate to http://localhost:7860 in your browser. The UI starts streaming responses as soon as the model loads.

Call the OpenAI-compatible API

h2oGPT also starts an OpenAI-compatible server at http://localhost:5000/v1. You can use it with any OpenAI SDK client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    messages=[{"role": "user", "content": "What is h2oGPT?"}],
)
print(response.choices[0].message.content)
The api_key value is ignored by default. To require key-based authentication, pass --h2ogpt_api_keys to generate.py with a JSON file listing valid keys.
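A minimal sketch of the key file, assuming it is a plain JSON list of accepted key strings (the filename and exact schema here are illustrative, inferred from the flag description above):

```json
["my-secret-key-1", "my-secret-key-2"]
```

Start h2oGPT with the file, e.g. --h2ogpt_api_keys=api_keys.json, and pass one of the listed keys as api_key in the client.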

Next steps

Docker install

Full capabilities including vision, audio, and image generation

Document Q&A

Chat with your own PDF, Word, and spreadsheet files

GPU inference

Load 8-bit or 4-bit quantized models, use multiple GPUs

API reference

Full OpenAI-compatible REST API documentation
