
Endpoints

Method  Path
POST    /v1/chat/completions
POST    /v1/completions
Both endpoints accept the same base parameters. /v1/completions uses a prompt string rather than a messages array.

Request parameters

model
string
The model identifier to use for generation. Corresponds to the base_model value returned by GET /v1/models. If omitted, the server selects the first available model.
messages
object[]
required
Array of message objects forming the conversation history. Each object must have role (system, user, or assistant) and content (string or array for vision).
max_tokens
number
default:"256"
Maximum number of new tokens to generate.
stream
boolean
default:"false"
When true, tokens are returned as server-sent events as they are generated.
temperature
number
default:"0.3"
Sampling temperature between 0 and 2. Lower values produce more deterministic output.
top_p
number
default:"1.0"
Nucleus sampling probability mass.
seed
number
default:"0"
Random seed. 0 means a fresh random seed is chosen for each request.
stop
string | string[]
One or more sequences at which generation stops.
frequency_penalty
number
default:"0.0"
Penalizes tokens based on how frequently they have appeared so far.
presence_penalty
number
default:"0.0"
Penalizes tokens that have appeared at all in the generated text so far.
user
string
An optional user identifier. When h2oGPT authentication is enabled, pass username:password here to authenticate.
response_format
object
Controls the output format. See JSON mode below.
tools
object[]
List of tools the model may call. Each tool must have type: "function" and a function object with name and description.
tool_choice
string
Set to "auto" to let the model decide which tool to call.
extra_body
object
h2oGPT-specific parameters. Any field from H2oGPTParams in openai_server/server.py can be passed here, for example langchain_mode, top_k_docs, system_prompt, and chat_conversation.
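Put together, a request body combining the sampling parameters above might look like the following sketch (the values are illustrative, not recommendations):

```python
import json

# Illustrative request body for POST /v1/chat/completions.
payload = {
    "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
    "messages": [{"role": "user", "content": "Write a haiku about rivers."}],
    "max_tokens": 128,
    "temperature": 0.3,
    "top_p": 0.9,
    "seed": 1234,            # fixed seed for reproducible sampling
    "stop": ["\n\n"],        # stop at the first blank line
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}
print(json.dumps(payload, indent=2))
```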

Response

id
string
Unique identifier for this completion.
object
string
"chat.completion" for non-streaming responses, or "chat.completion.chunk" for streamed chunks.
created
number
Unix timestamp when the completion was created.
model
string
The model used for generation.
choices
object[]
Array of generated choices. Each contains index, message (or delta when streaming), and finish_reason.
usage
object
Token usage statistics with prompt_tokens, completion_tokens, and total_tokens.
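A representative, abbreviated non-streaming response body showing the fields above (values are made up for illustration):

```python
import json

# Parse a sample chat.completion body and read out the common fields.
raw = """{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12}
}"""
body = json.loads(raw)
print(body["choices"][0]["message"]["content"])  # the generated text
print(body["usage"]["total_tokens"])             # prompt + completion tokens
```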

Non-streaming example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "Who are you?"}]

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=messages,
    max_tokens=200,
    stream=False,
)

print(response.choices[0].message.content)

Streaming example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "Who are you?"}]

responses = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=messages,
    max_tokens=200,
    stream=True,
)

text = ""
for chunk in responses:
    delta = chunk.choices[0].delta.content
    if delta:
        text += delta
        print(delta, end="", flush=True)

Text completions

POST /v1/completions accepts a prompt string instead of a messages array:
export OPENAI_API_KEY=EMPTY

curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "prompt": "Who are you?",
    "max_tokens": 200,
    "temperature": 0,
    "seed": 1234,
    "h2ogpt_key": "EMPTY"
  }'

Vision / image understanding

To send an image alongside a text prompt, use the image_url content type in the messages array. The URL can be an https:// URL or a base64-encoded data URI.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

messages = [{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the image please",
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg",
            },
        },
    ],
}]

response = client.chat.completions.create(
    model="OpenGVLab/InternVL-Chat-V1-5",
    messages=messages,
    max_tokens=200,
    stream=False,
)
print(response.choices[0].message.content)
For local images, encode the file as a base64 data URI:
import base64

with open("image.jpeg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

url = f"data:image/jpeg;base64,{b64}"
Vision requires a vision-capable model such as OpenGVLab/InternVL-Chat-V1-5 or THUDM/cogvlm2-llama3-chat-19B. Launch h2oGPT with that model included in visible_models.
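The resulting data URI goes in the same image_url slot as an https:// URL. A minimal sketch, using stand-in bytes rather than a real file:

```python
import base64

# Stand-in bytes; in practice read these from your image file.
image_bytes = b"\xff\xd8\xff\xe0 not a real JPEG"
b64 = base64.b64encode(image_bytes).decode("utf-8")

# This dict drops into the "content" array of a user message.
image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}
print(image_part["image_url"]["url"].split(",")[0])  # data:image/jpeg;base64
```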

Tool calling

Pass a list of function definitions in tools and set tool_choice="auto":
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.content)
The server uses guided JSON generation internally to select the best-matching tool.
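When the model elects to call a tool, the assistant message carries a tool_calls array rather than plain text content. The shapes below follow the OpenAI chat format; the values are made up for illustration:

```python
import json

# Illustrative assistant message as it might appear in choices[0].message.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}
for call in message["tool_calls"]:
    # arguments arrive as a JSON-encoded string, so parse before use
    args = json.loads(call["function"]["arguments"])
    print(call["function"]["name"], args["city"])  # get_weather Paris
```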

JSON mode

Set response_format to force JSON output:
response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Return a JSON object with keys name and age."}],
    response_format={"type": "json_object"},
)
For structured output with a schema, use json_schema:
schema = {
    "name": "person",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    },
}

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Return a person object."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
Supported response_format.type values: text, json_object, json_code, json_schema.
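In either JSON mode, the assistant's content is a JSON string, so parse it before use. A sketch with sample content rather than a live response:

```python
import json

# Sample content string as returned under choices[0].message.content
# when response_format enforces JSON output.
content = '{"name": "Ada", "age": 36}'
person = json.loads(content)
print(person["name"], person["age"])  # Ada 36
```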

Authentication with user credentials

When the h2oGPT server uses --auth_access=closed, pass user as username:password:
response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=200,
    user="myuser:mypassword",
)

Using extra_body for h2oGPT parameters

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Summarize the documents in UserData."}],
    max_tokens=400,
    extra_body=dict(
        langchain_mode="UserData",
        top_k_docs=5,
        langchain_action="Query",
    ),
)
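In the openai Python client, extra_body fields are merged into the top level of the request JSON, so the call above is roughly equivalent to sending this body (a sketch of the wire format, not an additional API):

```python
import json

# Equivalent raw request body: h2oGPT-specific fields sit alongside
# the standard OpenAI parameters.
body = {
    "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
    "messages": [{"role": "user", "content": "Summarize the documents in UserData."}],
    "max_tokens": 400,
    "langchain_mode": "UserData",
    "top_k_docs": 5,
    "langchain_action": "Query",
}
print(json.dumps(body)[:40])
```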
