POST /v1/chat/completions

Create a model response for the given conversation history. This endpoint is fully compatible with the OpenAI Chat Completions API and supports both streaming and non-streaming responses.

Request Body

model
string
required
The model to use for completion. Examples: gemini-2.5-pro, claude-sonnet-4, gpt-4o, or any model supported by your configured providers.
messages
array
required
An array of message objects representing the conversation history. Each message object contains:
  • role (string, required): One of system, user, assistant, or tool
  • content (string or array, required): The message content. Can be a string or an array of content parts for multimodal inputs
  • name (string, optional): The name of the message author
  • tool_calls (array, optional): Tool calls made by the assistant
  • tool_call_id (string, optional): The ID of the tool call this message is responding to (for tool messages)
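The four roles compose a complete tool-use round trip. A minimal sketch in Python using plain dicts (the tool-call ID and function name are illustrative, not produced by the API):

```python
# Illustrative conversation history exercising all four roles.
# The tool_call id and function name are made up for this example.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"location": "Paris"}'},
        }],
    },
    # The tool message answers the assistant's call via tool_call_id.
    {"role": "tool", "tool_call_id": "call_abc123", "content": '{"temp_c": 18}'},
]

roles = [m["role"] for m in messages]
```

Note that the tool message's `tool_call_id` must match the `id` of the assistant's tool call it is answering.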
stream
boolean
default:"false"
If set to true, the server will send partial message deltas as Server-Sent Events (SSE). If false, the server will wait until the generation is complete before sending the full response.
temperature
number
default:"1.0"
Sampling temperature between 0 and 2. Higher values make output more random, lower values make it more deterministic.
top_p
number
default:"1.0"
Nucleus sampling parameter. The model considers only the tokens whose cumulative probability mass is within top_p.
n
integer
default:"1"
Number of chat completion choices to generate. Note: Not all providers support values greater than 1.
max_tokens
integer
The maximum number of tokens to generate. If not specified, the model will generate until it reaches a natural stopping point.
stop
string or array
Up to 4 sequences where the API will stop generating further tokens.
presence_penalty
number
default:"0"
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far.
frequency_penalty
number
default:"0"
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text.
tools
array
A list of tools (functions) the model may call. Each tool object contains:
  • type (string): Must be function
  • function (object): Function definition with name, description, and parameters
tool_choice
string or object
Controls which (if any) function is called by the model. Options:
  • none: Model will not call any function
  • auto: Model can choose to call a function or generate a message
  • required: Model must call one or more functions
  • Object with specific function name to force that function
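The string and object forms side by side, as Python values (the function name is illustrative):

```python
# String form: let the model decide whether to call a function.
tool_choice_auto = "auto"

# Object form: force the model to call one named function.
# The function name is illustrative.
tool_choice_forced = {
    "type": "function",
    "function": {"name": "get_weather"},
}
```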
reasoning_effort
string
Controls the level of reasoning for models that support extended thinking. Options:
  • none: No extended reasoning
  • low: Minimal reasoning effort
  • medium: Moderate reasoning effort
  • high: Maximum reasoning effort
  • auto: Let the model decide
This is mapped to provider-specific thinking configurations (e.g., Gemini’s thinkingConfig).
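The actual mapping is internal to the proxy; a hypothetical sketch of the idea, with made-up token budgets (the real values and logic live in the proxy, not here):

```python
# Hypothetical mapping from reasoning_effort to a Gemini-style
# thinkingConfig. The budget numbers are invented for illustration.
EFFORT_TO_BUDGET = {
    "none": 0,
    "low": 1024,
    "medium": 8192,
    "high": 24576,
    "auto": -1,  # -1 conventionally means "let the model decide"
}

def to_thinking_config(effort: str) -> dict:
    """Translate an OpenAI-style reasoning_effort into a thinkingConfig."""
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown reasoning_effort: {effort}")
    return {"thinkingBudget": EFFORT_TO_BUDGET[effort]}
```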
modalities
array
Response modalities for multimodal models. Supported values: text, image. Example: ["text", "image"] for models that can generate both text and images.

Response Format

Non-Streaming Response

id
string
A unique identifier for the chat completion.
object
string
Always chat.completion.
created
integer
Unix timestamp (in seconds) of when the completion was created.
model
string
The model used for the completion.
choices
array
Array of completion choices. Each choice contains:
index
integer
The index of this choice in the array.
message
object
The generated message.
role
string
Always assistant.
content
string
The content of the message.
tool_calls
array
If the model called functions, this contains the function call details.
finish_reason
string
Why the model stopped generating tokens. Possible values:
  • stop: Natural completion
  • length: Maximum token limit reached
  • tool_calls: Model called a function
  • content_filter: Content filtered by safety systems
usage
object
Token usage statistics.
prompt_tokens
integer
Number of tokens in the prompt.
completion_tokens
integer
Number of tokens in the completion.
total_tokens
integer
Total tokens used (prompt + completion).

Streaming Response

When stream: true, the server sends chunks as Server-Sent Events (SSE). Each chunk follows this format:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
Each chunk contains a delta object with the incremental changes to the message. The stream ends with a data: [DONE] message.
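Consuming the stream amounts to reading `data:` lines, stopping at `[DONE]`, and concatenating `delta.content`. A minimal parser over chunks like those shown above (no network; the lines are inlined):

```python
import json

def accumulate_sse(lines):
    """Concatenate delta.content from Chat Completions SSE lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Abbreviated chunks matching the example stream above:
stream = [
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
```

Running `accumulate_sse(stream)` on the chunks above yields `"Hello there"`.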

Examples

Basic Chat Completion

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Streaming Response

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "claude-sonnet-4",
    "messages": [
      {"role": "user", "content": "Tell me a short story."}
    ],
    "stream": true
  }'

Function Calling

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gemini-2.5-pro",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather in a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              }
            },
            "required": ["location"]
          }
        }
      }
    ]
  }'
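When the response comes back with `finish_reason: "tool_calls"`, the client runs the function locally and appends a tool message to the history before calling the endpoint again. A sketch of that follow-up step (the assistant message is abbreviated from a typical reply; the `run_tool` implementation is a stand-in):

```python
import json

def build_tool_followup(messages, assistant_message, run_tool):
    """Append the assistant's tool calls and their results to the history."""
    messages = messages + [assistant_message]
    for call in assistant_message.get("tool_calls", []):
        args = json.loads(call["function"]["arguments"])
        result = run_tool(call["function"]["name"], args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages

# Abbreviated assistant reply, as returned in choices[0].message:
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": '{"location": "San Francisco, CA"}'},
    }],
}

def run_tool(name, args):  # stand-in for the real weather lookup
    return {"location": args["location"], "temp_f": 61}

history = build_tool_followup(
    [{"role": "user", "content": "What is the weather in San Francisco?"}],
    assistant_msg, run_tool)
```

The resulting `history` is then sent back as the `messages` array of the next request, letting the model compose its final answer from the tool output.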

With Reasoning Effort

curl http://localhost:8317/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gemini-2.5-flash-thinking-exp",
    "messages": [
      {"role": "user", "content": "Solve this complex math problem: ..."}
    ],
    "reasoning_effort": "high"
  }'

Implementation Details

The /v1/chat/completions endpoint is implemented in sdk/api/handlers/openai/openai_handlers.go. Key behaviors:
  • Provider Translation: Requests are automatically translated to the target provider’s format (Gemini, Claude, etc.)
  • Function Calling: Tool calls are preserved across provider translations
  • Streaming: SSE streaming uses chunked transfer encoding with immediate flushing
  • Auto-conversion: Some clients send OpenAI Responses-format payloads to this endpoint; these are automatically converted to Chat Completions format
If the request doesn’t include a messages field but has input or instructions, it will be automatically treated as an OpenAI Responses-format request and converted.

Error Responses

All errors follow the OpenAI error format:
{
  "error": {
    "message": "Invalid request: model field is required",
    "type": "invalid_request_error",
    "code": "invalid_request"
  }
}
Common HTTP status codes:
  • 400 Bad Request: Invalid request parameters
  • 401 Unauthorized: Missing or invalid API key
  • 429 Too Many Requests: Rate limit exceeded
  • 500 Internal Server Error: Server-side error
  • 503 Service Unavailable: Provider temporarily unavailable
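A client-side sketch of handling these errors: unwrap the error envelope for logging, and retry only the transient statuses (429, 500, 503), not client errors:

```python
def should_retry(status: int) -> bool:
    """Retry transient rate-limit/server errors; never client errors."""
    return status in (429, 500, 503)

def parse_error(body: dict) -> str:
    """Extract a printable message from the OpenAI-style error envelope."""
    err = body.get("error", {})
    return f'{err.get("type", "unknown")}: {err.get("message", "")}'
```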
