
Starting the server

The OpenAI-compatible server is enabled by default when you run generate.py. To be explicit, pass --openai_server=True:
python generate.py --openai_server=True
The server starts on port 5000 by default. On macOS, ports 5000 and 7000 are reserved by AirPlay, so the default shifts to port 5001. To use a custom port:
python generate.py --openai_server=True --openai_port=14365

Base URL

http://localhost:5000/v1
Replace localhost and 5000 with your server's host and port. Both http and https work, depending on whether the server sits behind a TLS-terminating proxy or has SSL configured directly.

Authentication

By default the server runs without API key enforcement. To require an API key:
python generate.py --enforce_h2ogpt_api_key=True --h2ogpt_api_keys="['your-secret-key']"
Keys can also be stored in a JSON file:
python generate.py --enforce_h2ogpt_api_key=True --h2ogpt_api_keys=h2ogpt_api_keys.json
Pass the key from client code as a standard Bearer token:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="your-secret-key",
)
If no key enforcement is configured, pass api_key="EMPTY" or any non-empty string.
Set --enforce_h2ogpt_ui_key=True to separately require authentication for the Gradio UI while keeping the API open, or vice versa.
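For clients other than the OpenAI SDK, the same key travels as a standard Authorization header. A minimal sketch of the headers for a raw HTTP request (the key value is a placeholder for whatever you passed via --h2ogpt_api_keys):

```python
import json

# Placeholder; substitute the value configured with --h2ogpt_api_keys.
API_KEY = "your-secret-key"

# Standard Bearer-token headers accepted by the server.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
print(json.dumps(headers, indent=2))
```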

Parallel workers

To scale throughput, launch multiple isolated FastAPI worker processes:
python generate.py --openai_server=True --openai_workers=4
FastAPI handles concurrency and load balancing across all workers on the same IP and port.

Available endpoints

Method  Path                       Description
GET     /v1/models                 List all loaded models
GET     /v1/models/{model}         Get info for a specific model
POST    /v1/chat/completions       Chat completions (streaming and non-streaming)
POST    /v1/completions            Text completions
POST    /v1/embeddings             Generate embedding vectors
POST    /v1/audio/transcriptions   Speech-to-text (Whisper)
POST    /v1/audio/speech           Text-to-speech
POST    /v1/images/generations     Image generation
POST    /v1/files                  Upload files
GET     /v1/files                  List uploaded files
GET     /health                    Health check
GET     /version                   Server version

Endpoint reference

Chat completions

/v1/chat/completions and /v1/completions — multi-turn chat, vision input, tool calling, JSON mode.
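The request body follows the standard OpenAI chat schema. A minimal sketch of a non-streaming request payload for POST /v1/chat/completions (the model name is an example):

```python
import json

# Example chat completion request body; set "stream": True to receive
# server-sent-event streaming instead of a single response.
payload = {
    "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is h2oGPT?"},
    ],
    "max_tokens": 128,
    "stream": False,
}
print(json.dumps(payload))
```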

Embeddings

/v1/embeddings — generate dense embedding vectors for text inputs.
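Per the OpenAI embeddings schema, the input field accepts a single string or a list of strings. A sketch of the request body (the model name here is illustrative, not a guarantee of what your server has loaded):

```python
import json

# Example embeddings request body; "input" may be one string or a list.
payload = {
    "model": "BAAI/bge-large-en-v1.5",  # illustrative embedding model name
    "input": ["first passage", "second passage"],
}
print(json.dumps(payload))
```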

Audio

/v1/audio/transcriptions and /v1/audio/speech — Whisper STT and TTS.
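For text-to-speech, the request mirrors the OpenAI audio/speech schema. A sketch of the body, assuming that schema applies here; the model and voice values are placeholders and depend on the TTS model loaded on your server:

```python
import json

# Hypothetical TTS request body for POST /v1/audio/speech.
payload = {
    "model": "tts-1",              # placeholder; use your loaded TTS model
    "input": "Hello from h2oGPT.",  # text to synthesize
    "voice": "default",            # placeholder voice name
}
print(json.dumps(payload))
```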

Images

/v1/images/generations — generate images with sdxl_turbo, SD3, Flux, and more.
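A sketch of an image-generation request body; the field names follow the OpenAI images schema, and sdxl_turbo is one of the models named above (the size value is an assumption to check against your model):

```python
import json

# Example image generation request body for POST /v1/images/generations.
payload = {
    "model": "sdxl_turbo",
    "prompt": "A watercolor painting of a lighthouse at dawn",
    "n": 1,                 # number of images to generate
    "size": "1024x1024",    # assumed size string; verify for your model
}
print(json.dumps(payload))
```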

Gradio client

Use the Gradio Python client to call h2oGPT APIs directly, including streaming and document Q&A.

h2oGPT-specific parameters

In addition to standard OpenAI parameters, h2oGPT accepts extended parameters via extra_body. These map directly to h2oGPT’s internal evaluate() parameters, for example:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

client_kwargs = dict(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    max_tokens=200,
    stream=False,
    messages=[{"role": "user", "content": "Summarize this document."}],
    # Extended h2oGPT parameters go through extra_body.
    extra_body=dict(langchain_mode="UserData", top_k_docs=5),
)
response = client.chat.completions.create(**client_kwargs)
print(response.choices[0].message.content)
See the H2oGPTParams model in openai_server/server.py for the complete list of accepted fields.
