This guide gets you from zero to a working h2oGPT instance using pip. For a full-featured setup with vision, audio, and image generation, use the Docker install instead. Prerequisites: Python 3.10, and either a CPU or an NVIDIA GPU with CUDA 11.8 or 12.1.

Install

1

Set the PyTorch index URL

Choose the index URL that matches your hardware before running pip install.
# CPU-only (on Windows, use "set" instead of "export")
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
# NVIDIA GPU with CUDA 12.1 (use cu118 for CUDA 11.8)
# export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu121"
2

Configure llama_cpp_python build flags

h2oGPT uses llama.cpp to run GGUF models. Set the build flags that match your hardware before installing.
# NVIDIA GPU build (skip these flags for a CPU-only build)
export GGML_CUDA=1
export CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all"
export FORCE_CMAKE=1
Building with all CUDA architectures takes several minutes, but it is required: llama_cpp_python fails if you omit all from CMAKE_CUDA_ARCHITECTURES.
3

Install h2oGPT
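A minimal sketch of the install itself, assuming you are cloning the h2oGPT GitHub repository and installing its pinned requirements (the repository URL and requirements layout are the standard h2oGPT project conventions; adjust if your checkout differs):

```shell
# Clone the repository; generate.py is run from the repo root
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt

# Install with the index URL and build flags from steps 1-2
# still exported in the current shell
pip install -r requirements.txt
```

Keep the environment variables from the previous two steps exported in the same shell session, since pip reads PIP_EXTRA_INDEX_URL and the llama_cpp_python build reads CMAKE_ARGS at install time.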

Run a GGUF model

1

Start h2oGPT with a Mistral GGUF model

python generate.py \
  --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  --prompt_type=mistral \
  --max_seq_len=4096
h2oGPT downloads the model on first run, then starts both the Gradio UI and the OpenAI API server. For a larger context window (requires more GPU memory):
python generate.py \
  --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  --prompt_type=mistral \
  --max_seq_len=32768
2

Llama 3 and other chat-template models

Newer models use a chat template instead of a named --prompt_type. Pass the Hugging Face tokenizer so h2oGPT can read the template:
python generate.py \
  --base_model=llama \
  --model_path_llama=https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf?download=true \
  --tokenizer_base_model=meta-llama/Meta-Llama-3-8B-Instruct \
  --max_seq_len=8192
Or for Phi-3:
python generate.py \
  --tokenizer_base_model=microsoft/Phi-3-mini-4k-instruct \
  --base_model=llama \
  --llama_cpp_model=https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf \
  --max_seq_len=4096
3

Open the Gradio UI

Navigate to http://localhost:7860 in your browser. The UI starts streaming responses as soon as the model loads.

Call the OpenAI-compatible API

h2oGPT also starts an OpenAI-compatible server at http://localhost:5000/v1. You can use it with any OpenAI SDK client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    messages=[{"role": "user", "content": "What is h2oGPT?"}],
)
print(response.choices[0].message.content)
The api_key value is ignored by default. To require key-based authentication, pass --h2ogpt_api_keys to generate.py with a JSON file listing valid keys.
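A minimal sketch of the key file, assuming it is a plain JSON list of accepted key strings (the filename and exact schema here are illustrative, inferred from the flag description above):

```json
["my-secret-key-1", "my-secret-key-2"]
```

Start h2oGPT with the file, e.g. --h2ogpt_api_keys=api_keys.json, and pass one of the listed keys as api_key in the client.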

Next steps

Docker install

Full capabilities including vision, audio, and image generation

Document Q&A

Chat with your own PDF, Word, and spreadsheet files

GPU inference

Load 8-bit or 4-bit quantized models, use multiple GPUs

API reference

Full OpenAI-compatible REST API documentation
