
Endpoints

Method  Path
POST    /v1/chat/completions
POST    /v1/completions
Both endpoints accept the same base parameters. /v1/completions uses a prompt string rather than a messages array.

Request parameters

model
string
The model identifier to use for generation. Corresponds to the base_model value returned by GET /v1/models. If omitted, the server selects the first available model.
messages
object[]
required
Array of message objects forming the conversation history. Each object must have role (system, user, or assistant) and content (string or array for vision).
max_tokens
number
default:"256"
Maximum number of new tokens to generate.
stream
boolean
default:"false"
When true, tokens are returned as server-sent events as they are generated.
temperature
number
default:"0.3"
Sampling temperature between 0 and 2. Lower values produce more deterministic output.
top_p
number
default:"1.0"
Nucleus sampling probability mass.
seed
number
default:"0"
Random seed. 0 means a fresh random seed is chosen for each request.
stop
string | string[]
One or more sequences at which generation stops.
frequency_penalty
number
default:"0.0"
Penalizes tokens based on how frequently they have appeared so far.
presence_penalty
number
default:"0.0"
Penalizes tokens that have appeared at all in the generated text so far.
user
string
An optional user identifier. When h2oGPT authentication is enabled, pass username:password here to authenticate.
response_format
object
Controls the output format. See JSON mode below.
tools
object[]
List of tools the model may call. Each tool must have type: "function" and a function object with name and description.
tool_choice
string
Set to "auto" to let the model decide which tool to call.
extra_body
object
h2oGPT-specific parameters. Any field from H2oGPTParams in openai_server/server.py can be passed here, for example langchain_mode, top_k_docs, system_prompt, and chat_conversation.
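Put together, a request body combining the sampling parameters above might look like the following sketch (the values are illustrative, not recommendations):

```python
import json

# Illustrative request body for POST /v1/chat/completions.
payload = {
    "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
    "messages": [{"role": "user", "content": "Write a haiku about rivers."}],
    "max_tokens": 128,
    "temperature": 0.3,
    "top_p": 0.9,
    "seed": 1234,            # fixed seed for reproducible sampling
    "stop": ["\n\n"],        # stop at the first blank line
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}
print(json.dumps(payload, indent=2))
```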

Response

id
string
Unique identifier for this completion.
object
string
"chat.completion" for non-streaming responses, or "chat.completion.chunk" for streamed chunks.
created
number
Unix timestamp when the completion was created.
model
string
The model used for generation.
choices
object[]
Array of generated choices. Each contains index, message (or delta when streaming), and finish_reason.
usage
object
Token usage statistics with prompt_tokens, completion_tokens, and total_tokens.
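A representative, abbreviated non-streaming response body showing the fields above (values are made up for illustration):

```python
import json

# Parse a sample chat.completion body and read out the common fields.
raw = """{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12}
}"""
body = json.loads(raw)
print(body["choices"][0]["message"]["content"])  # the generated text
print(body["usage"]["total_tokens"])             # prompt + completion tokens
```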

Non-streaming example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "Who are you?"}]

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=messages,
    max_tokens=200,
    stream=False,
)

print(response.choices[0].message.content)

Streaming example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "Who are you?"}]

responses = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=messages,
    max_tokens=200,
    stream=True,
)

text = ""
for chunk in responses:
    delta = chunk.choices[0].delta.content
    if delta:
        text += delta
        print(delta, end="", flush=True)

Text completions

POST /v1/completions accepts a prompt string instead of a messages array:
export OPENAI_API_KEY=EMPTY

curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "prompt": "Who are you?",
    "max_tokens": 200,
    "temperature": 0,
    "seed": 1234,
    "h2ogpt_key": "EMPTY"
  }'

Vision / image understanding

To send an image alongside a text prompt, use the image_url content type in the messages array. The URL can be an https:// URL or a base64-encoded data URI.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

messages = [{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the image please",
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg",
            },
        },
    ],
}]

response = client.chat.completions.create(
    model="OpenGVLab/InternVL-Chat-V1-5",
    messages=messages,
    max_tokens=200,
    stream=False,
)
print(response.choices[0].message.content)
For local images, encode the file as a base64 data URI:
import base64

with open("image.jpeg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

url = f"data:image/jpeg;base64,{b64}"
Vision requires a vision-capable model such as OpenGVLab/InternVL-Chat-V1-5 or THUDM/cogvlm2-llama3-chat-19B. Launch h2oGPT with that model included in visible_models.
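The resulting data URI goes in the same image_url slot as an https:// URL. A minimal sketch, using stand-in bytes rather than a real file:

```python
import base64

# Stand-in bytes; in practice read these from your image file.
image_bytes = b"\xff\xd8\xff\xe0 not a real JPEG"
b64 = base64.b64encode(image_bytes).decode("utf-8")

# This dict drops into the "content" array of a user message.
image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}
print(image_part["image_url"]["url"].split(",")[0])  # data:image/jpeg;base64
```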

Tool calling

Pass a list of function definitions in tools and set tool_choice="auto":
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="EMPTY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.content)
The server uses guided JSON generation internally to select the best-matching tool.
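When the model elects to call a tool, the assistant message carries a tool_calls array rather than plain text content. The shapes below follow the OpenAI chat format; the values are made up for illustration:

```python
import json

# Illustrative assistant message as it might appear in choices[0].message.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}
for call in message["tool_calls"]:
    # arguments arrive as a JSON-encoded string, so parse before use
    args = json.loads(call["function"]["arguments"])
    print(call["function"]["name"], args["city"])  # get_weather Paris
```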

JSON mode

Set response_format to force JSON output:
response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Return a JSON object with keys name and age."}],
    response_format={"type": "json_object"},
)
For structured output with a schema, use json_schema:
schema = {
    "name": "person",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    },
}

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Return a person object."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
Supported response_format.type values: text, json_object, json_code, json_schema.
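In either JSON mode, the assistant's content is a JSON string, so parse it before use. A sketch with sample content rather than a live response:

```python
import json

# Sample content string as returned under choices[0].message.content
# when response_format enforces JSON output.
content = '{"name": "Ada", "age": 36}'
person = json.loads(content)
print(person["name"], person["age"])  # Ada 36
```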

Authentication with user credentials

When the h2oGPT server uses --auth_access=closed, pass user as username:password:
response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=200,
    user="myuser:mypassword",
)

Using extra_body for h2oGPT parameters

response = client.chat.completions.create(
    model="h2oai/h2ogpt-4096-llama2-70b-chat",
    messages=[{"role": "user", "content": "Summarize the documents in UserData."}],
    max_tokens=400,
    extra_body=dict(
        langchain_mode="UserData",
        top_k_docs=5,
        langchain_action="Query",
    ),
)
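In the openai Python client, extra_body fields are merged into the top level of the request JSON, so the call above is roughly equivalent to sending this body (a sketch of the wire format, not an additional API):

```python
import json

# Equivalent raw request body: h2oGPT-specific fields sit alongside
# the standard OpenAI parameters.
body = {
    "model": "h2oai/h2ogpt-4096-llama2-70b-chat",
    "messages": [{"role": "user", "content": "Summarize the documents in UserData."}],
    "max_tokens": 400,
    "langchain_mode": "UserData",
    "top_k_docs": 5,
    "langchain_action": "Query",
}
print(json.dumps(body)[:40])
```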
