POST /v1/chat/completions

Create a model response for the given conversation history. This endpoint is fully compatible with the OpenAI Chat Completions API and supports both streaming and non-streaming responses.

Request Body

model
string
required
The model to use for completion. Examples: gemini-2.5-pro, claude-sonnet-4, gpt-4o, or any model supported by your configured providers.
messages
array
required
An array of message objects representing the conversation history. Each message object contains:
  • role (string, required): One of system, user, assistant, or tool
  • content (string or array, required): The message content. Can be a string or an array of content parts for multimodal inputs
  • name (string, optional): The name of the message author
  • tool_calls (array, optional): Tool calls made by the assistant
  • tool_call_id (string, optional): The ID of the tool call this message is responding to (for tool messages)
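The four roles compose a complete tool-use round trip. A minimal sketch in Python using plain dicts (the tool-call ID and function name are illustrative, not produced by the API):

```python
# Illustrative conversation history exercising all four roles.
# The tool_call id and function name are made up for this example.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"location": "Paris"}'},
        }],
    },
    # The tool message answers the assistant's call via tool_call_id.
    {"role": "tool", "tool_call_id": "call_abc123", "content": '{"temp_c": 18}'},
]

roles = [m["role"] for m in messages]
```

Note that the tool message's `tool_call_id` must match the `id` of the assistant's tool call it is answering.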
stream
boolean
default:"false"
If set to true, the server will send partial message deltas as Server-Sent Events (SSE). If false, the server will wait until the generation is complete before sending the full response.
temperature
number
default:"1.0"
Sampling temperature between 0 and 2. Higher values make output more random, lower values make it more deterministic.
top_p
number
default:"1.0"
Nucleus sampling parameter. The model considers only the tokens whose cumulative probability mass is within top_p.
n
integer
default:"1"
Number of chat completion choices to generate. Note: Not all providers support values greater than 1.
max_tokens
integer
The maximum number of tokens to generate. If not specified, the model will generate until it reaches a natural stopping point.
stop
string or array
Up to 4 sequences where the API will stop generating further tokens.
presence_penalty
number
default:"0"
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far.
frequency_penalty
number
default:"0"
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text.
tools
array
A list of tools (functions) the model may call. Each tool object contains:
  • type (string): Must be function
  • function (object): Function definition with name, description, and parameters
tool_choice
string or object
Controls which (if any) function is called by the model. Options:
  • none: Model will not call any function
  • auto: Model can choose to call a function or generate a message
  • required: Model must call one or more functions
  • Object with specific function name to force that function
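The string and object forms side by side, as Python values (the function name is illustrative):

```python
# String form: let the model decide whether to call a function.
tool_choice_auto = "auto"

# Object form: force the model to call one named function.
# The function name is illustrative.
tool_choice_forced = {
    "type": "function",
    "function": {"name": "get_weather"},
}
```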
reasoning_effort
string
Controls the level of reasoning for models that support extended thinking. Options:
  • none: No extended reasoning
  • low: Minimal reasoning effort
  • medium: Moderate reasoning effort
  • high: Maximum reasoning effort
  • auto: Let the model decide
This is mapped to provider-specific thinking configurations (e.g., Gemini’s thinkingConfig).
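The actual mapping is internal to the proxy; a hypothetical sketch of the idea, with made-up token budgets (the real values and logic live in the proxy, not here):

```python
# Hypothetical mapping from reasoning_effort to a Gemini-style
# thinkingConfig. The budget numbers are invented for illustration.
EFFORT_TO_BUDGET = {
    "none": 0,
    "low": 1024,
    "medium": 8192,
    "high": 24576,
    "auto": -1,  # -1 conventionally means "let the model decide"
}

def to_thinking_config(effort: str) -> dict:
    """Translate an OpenAI-style reasoning_effort into a thinkingConfig."""
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown reasoning_effort: {effort}")
    return {"thinkingBudget": EFFORT_TO_BUDGET[effort]}
```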
modalities
array
Response modalities for multimodal models. Supported values: text, image. Example: ["text", "image"] for models that can generate both text and images.

Response Format

Non-Streaming Response

id
string
A unique identifier for the chat completion.
object
string
Always chat.completion.
created
integer
Unix timestamp (in seconds) of when the completion was created.
model
string
The model used for the completion.
choices
array
Array of completion choices. Each choice contains:
index
integer
The index of this choice in the array.
message
object
The generated message.
role
string
Always assistant.
content
string
The content of the message.
tool_calls
array
If the model called functions, this contains the function call details.
finish_reason
string
Why the model stopped generating tokens. Possible values:
  • stop: Natural completion
  • length: Maximum token limit reached
  • tool_calls: Model called a function
  • content_filter: Content filtered by safety systems
usage
object
Token usage statistics.
prompt_tokens
integer
Number of tokens in the prompt.
completion_tokens
integer
Number of tokens in the completion.
total_tokens
integer
Total tokens used (prompt + completion).

Streaming Response

When stream: true, the server sends chunks as Server-Sent Events (SSE). Each chunk follows this format:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
Each chunk contains a delta object with the incremental changes to the message. The stream ends with a data: [DONE] message.
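Consuming the stream amounts to reading `data:` lines, stopping at `[DONE]`, and concatenating `delta.content`. A minimal parser over chunks like those shown above (no network; the lines are inlined):

```python
import json

def accumulate_sse(lines):
    """Concatenate delta.content from Chat Completions SSE lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Abbreviated chunks matching the example stream above:
stream = [
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
```

Running `accumulate_sse(stream)` on the chunks above yields `"Hello there"`.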

Examples

Basic Chat Completion

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Streaming Response

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "claude-sonnet-4",
    "messages": [
      {"role": "user", "content": "Tell me a short story."}
    ],
    "stream": true
  }'

Function Calling

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather in a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              }
            },
            "required": ["location"]
          }
        }
      }
    ]
  }'
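When the response comes back with `finish_reason: "tool_calls"`, the client runs the function locally and appends a tool message to the history before calling the endpoint again. A sketch of that follow-up step (the assistant message is abbreviated from a typical reply; the `run_tool` implementation is a stand-in):

```python
import json

def build_tool_followup(messages, assistant_message, run_tool):
    """Append the assistant's tool calls and their results to the history."""
    messages = messages + [assistant_message]
    for call in assistant_message.get("tool_calls", []):
        args = json.loads(call["function"]["arguments"])
        result = run_tool(call["function"]["name"], args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages

# Abbreviated assistant reply, as returned in choices[0].message:
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": '{"location": "San Francisco, CA"}'},
    }],
}

def run_tool(name, args):  # stand-in for the real weather lookup
    return {"location": args["location"], "temp_f": 61}

history = build_tool_followup(
    [{"role": "user", "content": "What is the weather in San Francisco?"}],
    assistant_msg, run_tool)
```

The resulting `history` is then sent back as the `messages` array of the next request, letting the model compose its final answer from the tool output.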

With Reasoning Effort

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gemini-2.5-flash-thinking-exp",
    "messages": [
      {"role": "user", "content": "Solve this complex math problem: ..."}
    ],
    "reasoning_effort": "high"
  }'

Implementation Details

The /v1/chat/completions endpoint is implemented in sdk/api/handlers/openai/openai_handlers.go. Key behaviors:
  • Provider Translation: Requests are automatically translated to the target provider’s format (Gemini, Claude, etc.)
  • Function Calling: Tool calls are preserved across provider translations
  • Streaming: SSE streaming uses chunked transfer encoding with immediate flushing
  • Auto-conversion: Some clients send OpenAI Responses-format payloads to this endpoint; these are automatically converted to Chat Completions format
If the request doesn’t include a messages field but has input or instructions, it will be automatically treated as an OpenAI Responses-format request and converted.

Error Responses

All errors follow the OpenAI error format:
{
  "error": {
    "message": "Invalid request: model field is required",
    "type": "invalid_request_error",
    "code": "invalid_request"
  }
}
Common HTTP status codes:
  • 400 Bad Request: Invalid request parameters
  • 401 Unauthorized: Missing or invalid API key
  • 429 Too Many Requests: Rate limit exceeded
  • 500 Internal Server Error: Server-side error
  • 503 Service Unavailable: Provider temporarily unavailable
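A client-side sketch of handling these errors: unwrap the error envelope for logging, and retry only the transient statuses (429, 500, 503), not client errors:

```python
def should_retry(status: int) -> bool:
    """Retry transient rate-limit/server errors; never client errors."""
    return status in (429, 500, 503)

def parse_error(body: dict) -> str:
    """Extract a printable message from the OpenAI-style error envelope."""
    err = body.get("error", {})
    return f'{err.get("type", "unknown")}: {err.get("message", "")}'
```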
