
Starting the server

The OpenAI-compatible server is enabled by default when you run generate.py. To be explicit, pass --openai_server=True:
python generate.py --openai_server=True
The server starts on port 5000 by default. On macOS, ports 5000 and 7000 are reserved by AirPlay, so the default shifts to port 5001. To use a custom port:
python generate.py --openai_server=True --openai_port=14365

Base URL

http://localhost:5000/v1
Replace localhost and 5000 with your server's host and port. Both http and https work, depending on whether the server sits behind a TLS-terminating proxy or has SSL configured directly.

Authentication

By default the server runs without API key enforcement. To require an API key:
python generate.py --enforce_h2ogpt_api_key=True --h2ogpt_api_keys="['your-secret-key']"
Keys can also be stored in a JSON file:
python generate.py --enforce_h2ogpt_api_key=True --h2ogpt_api_keys=h2ogpt_api_keys.json
Pass the key from client code as a standard Bearer token:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="your-secret-key",
)
If no key enforcement is configured, pass api_key="EMPTY" or any non-empty string.
Set --enforce_h2ogpt_ui_key=True to separately require authentication for the Gradio UI while keeping the API open, or vice versa.
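For clients other than the OpenAI SDK, the same key travels as a standard Authorization header. A minimal sketch of the headers for a raw HTTP request (the key value is a placeholder for whatever you passed via --h2ogpt_api_keys):

```python
import json

# Placeholder; substitute the value configured with --h2ogpt_api_keys.
API_KEY = "your-secret-key"

# Standard Bearer-token headers accepted by the server.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
print(json.dumps(headers, indent=2))
```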

Parallel workers

To scale throughput, launch multiple isolated FastAPI worker processes:
python generate.py --openai_server=True --openai_workers=4
FastAPI handles concurrency and load balancing across all workers on the same IP and port.

Available endpoints

Method  Path                       Description
GET     /v1/models                 List all loaded models
GET     /v1/models/{model}         Get info for a specific model
POST    /v1/chat/completions       Chat completions (streaming and non-streaming)
POST    /v1/completions            Text completions
POST    /v1/embeddings             Generate embedding vectors
POST    /v1/audio/transcriptions   Speech-to-text (Whisper)
POST    /v1/audio/speech           Text-to-speech
POST    /v1/images/generations     Image generation
POST    /v1/files                  Upload files
GET     /v1/files                  List uploaded files
GET     /health                    Health check
GET     /version                   Server version

Endpoint reference

Chat completions

/v1/chat/completions and /v1/completions — multi-turn chat, vision input, tool calling, JSON mode.
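The request body follows the standard OpenAI chat schema. A minimal sketch of a non-streaming request payload for POST /v1/chat/completions (the model name is an example):

```python
import json

# Example chat completion request body; set "stream": True to receive
# server-sent-event streaming instead of a single response.
payload = {
    "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is h2oGPT?"},
    ],
    "max_tokens": 128,
    "stream": False,
}
print(json.dumps(payload))
```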

Embeddings

/v1/embeddings — generate dense embedding vectors for text inputs.
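Per the OpenAI embeddings schema, the input field accepts a single string or a list of strings. A sketch of the request body (the model name here is illustrative, not a guarantee of what your server has loaded):

```python
import json

# Example embeddings request body; "input" may be one string or a list.
payload = {
    "model": "BAAI/bge-large-en-v1.5",  # illustrative embedding model name
    "input": ["first passage", "second passage"],
}
print(json.dumps(payload))
```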

Audio

/v1/audio/transcriptions and /v1/audio/speech — Whisper STT and TTS.
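For text-to-speech, the request mirrors the OpenAI audio/speech schema. A sketch of the body, assuming that schema applies here; the model and voice values are placeholders and depend on the TTS model loaded on your server:

```python
import json

# Hypothetical TTS request body for POST /v1/audio/speech.
payload = {
    "model": "tts-1",              # placeholder; use your loaded TTS model
    "input": "Hello from h2oGPT.",  # text to synthesize
    "voice": "default",            # placeholder voice name
}
print(json.dumps(payload))
```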

Images

/v1/images/generations — generate images with sdxl_turbo, SD3, Flux, and more.
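A sketch of an image-generation request body; the field names follow the OpenAI images schema, and sdxl_turbo is one of the models named above (the size value is an assumption to check against your model):

```python
import json

# Example image generation request body for POST /v1/images/generations.
payload = {
    "model": "sdxl_turbo",
    "prompt": "A watercolor painting of a lighthouse at dawn",
    "n": 1,                 # number of images to generate
    "size": "1024x1024",    # assumed size string; verify for your model
}
print(json.dumps(payload))
```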

Gradio client

Use the Gradio Python client to call h2oGPT APIs directly, including streaming and document Q&A.

h2oGPT-specific parameters

In addition to standard OpenAI parameters, h2oGPT accepts extended parameters via extra_body. These map directly to h2oGPT’s internal evaluate() parameters, for example:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

client_kwargs = dict(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    max_tokens=200,
    stream=False,
    messages=[{"role": "user", "content": "Summarize this document."}],
    # Extended h2oGPT parameters go through extra_body.
    extra_body=dict(langchain_mode="UserData", top_k_docs=5),
)
response = client.chat.completions.create(**client_kwargs)
print(response.choices[0].message.content)
See the H2oGPTParams model in openai_server/server.py for the complete list of accepted fields.
