h2oGPT lets you query and summarize your documents or chat with local private GPT LLMs entirely on your own hardware. No data leaves your network. It ships with a Gradio UI, a drop-in OpenAI-compatible API server, and support for dozens of model backends.

Key features

  • Private and offline — all inference runs locally; no data is sent to external servers unless you configure a remote backend
  • Apache 2.0 license — free for personal and commercial use
  • OpenAI-compatible API — h2oGPT acts as a drop-in replacement for the OpenAI API at localhost:5000/v1, supporting chat completions, embeddings, audio, image generation, and function/tool calling
  • Gradio UI — streaming chat UI with multi-model bake-off, document upload, authentication, and state preservation
  • Document Q&A — persistent vector databases (Chroma, Weaviate, FAISS) over PDFs, Word, Excel, images, video frames, YouTube, audio, code, Markdown, and more
  • Vision and image support — understand images with LLaVA, Claude-3, Gemini-Pro-Vision, and GPT-4-Vision; generate images with Stable Diffusion (SDXL, SD3) and Flux
  • Voice STT and TTS — Whisper speech-to-text with streaming audio; Microsoft SpeechT5 and Coqui TTS with voice cloning
  • Wide model support — LLaMa 2/3, Mistral, Falcon, Vicuna, WizardLM, and others via HuggingFace Transformers, llama.cpp GGUF, AutoGPTQ, ExLLaMa, vLLM, TGI, and more
  • Agents — autonomous agents for web search, document Q&A, Python code execution, and CSV analysis
  • Cross-platform — Linux, macOS (CPU and Metal M1/M2), Windows 10/11, and Docker
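
Because the API mirrors OpenAI's wire format, any OpenAI-style HTTP client can talk to it. Below is a minimal stdlib-only sketch; the model name "h2oai/h2ogpt" and the placeholder API key are illustrative assumptions, and the port assumes the default configuration:

```python
import json
import urllib.request

API_URL = "http://localhost:5000/v1/chat/completions"  # default local endpoint

def build_payload(prompt: str, model: str = "h2oai/h2ogpt") -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str) -> str:
    """POST the prompt to the local server and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 # placeholder key; adjust if your server enforces auth
                 "Authorization": "Bearer EMPTY"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running h2oGPT server):
#   print(ask("Summarize the uploaded documents."))
```

The same request shape works with the official OpenAI SDKs by pointing their base URL at localhost:5000/v1.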

Choose an installation method

Docker

Recommended for Linux, Windows, and macOS. Provides full capabilities including GPU inference, vision, audio, and image generation without manual dependency management.

Linux

Native install on Ubuntu x86_64 using Miniconda and pip. Supports CUDA 12.1/11.8 and CPU modes.

Windows

Install on Windows 10/11 using a single .bat script with Miniconda, Visual Studio build tools, and optional CUDA support.

macOS

Native install for Apple Silicon (M1/M2 Metal MPS) and Intel Macs using Miniconda and pip.

For the fastest path to a working setup on any platform, see the Quick Start guide.

Architecture overview

When you run python generate.py, h2oGPT starts two servers:
  • Gradio UI at http://localhost:7860 — the interactive chat and document interface
  • OpenAI-compatible API server at http://localhost:5000/v1 — a REST API that any OpenAI SDK client can connect to

Both servers share the same model backend. You can run a local model (llama.cpp, HuggingFace Transformers, AutoGPTQ) or point h2oGPT at a remote inference server such as vLLM, TGI, oLLaMa, or the OpenAI API.
Some optional packages — DocTR, Unstructured, Florence-2, Stable Diffusion — download additional model weights at runtime. Progress is shown in the console.
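
As a quick sanity check that both servers came up, you can probe their default ports over HTTP. A small sketch (the ports are the defaults noted above and are configurable at launch):

```python
import urllib.error
import urllib.request

# Default endpoints started by python generate.py (assumed defaults;
# both ports can be changed via launch options).
ENDPOINTS = {
    "ui": "http://localhost:7860",
    "api": "http://localhost:5000/v1/models",
}

def probe_servers(timeout: float = 2.0) -> dict[str, bool]:
    """Return which of the two h2oGPT servers respond over HTTP."""
    status = {}
    for name, url in ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                status[name] = True
        except (urllib.error.URLError, OSError):
            status[name] = False
    return status

# Usage:
#   print(probe_servers())  # e.g. {'ui': True, 'api': True} once both are up
```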
