## Endpoint

## Request Body
- `model` (string): The model ID to use for generation. Available models can be retrieved from the `/v1/models` endpoint.
- `messages` (array): An array of message objects for chat-style completions. Either `messages` or `prompt` must be provided. Each message has:
  - `role` (string): One of `"system"`, `"user"`, or `"assistant"`
  - `content` (string): The message content
- `prompt` (string): A raw text prompt for completion. Either `messages` or `prompt` must be provided.
- `max_tokens` (integer): The maximum number of tokens to generate.
- `temperature` (float): Sampling temperature. Use `0.0` for greedy decoding.
- `top_k` (integer): Top-k sampling parameter. `-1` disables top-k filtering.
- `top_p` (float): Top-p (nucleus) sampling parameter.
- `n` (integer): Number of completions to generate (currently only `n=1` is supported).
- `stream` (boolean): Whether to stream the response using Server-Sent Events (SSE).
- `stop` (array of strings): List of stop sequences (currently not implemented).
- `presence_penalty` (float): Presence penalty (currently not implemented).
- `frequency_penalty` (float): Frequency penalty (currently not implemented).
- `ignore_eos` (boolean): Whether to ignore the end-of-sequence token and continue generation.
## Response Format

### Streaming Response
When `stream=true`, the endpoint returns Server-Sent Events (SSE) with the following format:

- `id` (string): Unique completion ID (format: `cmpl-{uid}`).
- `object` (string): Always `"text_completion.chunk"` for streaming responses.
- `choices` (array): Array of completion choices. Each choice contains:
  - `delta` (object): Incremental content update
    - `role` (string): `"assistant"` (only in the first chunk)
    - `content` (string): Generated text fragment
  - `index` (integer): Choice index (always `0`)
  - `finish_reason` (string or null): `null` during generation, `"stop"` when complete

The stream is terminated by a `data: [DONE]` message.
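Assembled from the field descriptions above, a single streamed event followed by the terminator might look like this (illustrative values):

```
data: {"id": "cmpl-abc123", "object": "text_completion.chunk", "choices": [{"delta": {"role": "assistant", "content": "Hello"}, "index": 0, "finish_reason": null}]}

data: [DONE]
```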
## Examples
### Chat Completion with Messages
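A minimal sketch using only the Python standard library. The server address (`http://localhost:8000`) and endpoint path (`/v1/completions`), as well as the model ID, are assumptions — adjust them to your deployment.

```python
import json
import urllib.request

# Assumed server address and endpoint path -- adjust to your deployment.
API_URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "my-model",  # hypothetical model ID; list real ones via /v1/models
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 64,
    "temperature": 0.7,
}

def send_completion(body: dict) -> dict:
    """POST the request body and return the decoded JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# response = send_completion(payload)  # requires a running server
```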
### Streaming Response
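With `stream=true`, the client reads the response line by line and decodes each `data:` event according to the format described above. A sketch of the parsing step, exercised here on a sample line rather than a live response:

```python
import json

def parse_sse_line(line: str):
    """Parse one SSE line: return the decoded chunk dict, the string
    "[DONE]" for the stream terminator, or None for non-data lines."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):].strip()
    if data == "[DONE]":
        return "[DONE]"
    return json.loads(data)

# Sample input shaped per the streaming format above (not a live response).
sample = ('data: {"id": "cmpl-abc123", "object": "text_completion.chunk", '
          '"choices": [{"delta": {"content": "Hello"}, "index": 0, '
          '"finish_reason": null}]}')
chunk = parse_sse_line(sample)
text = chunk["choices"][0]["delta"]["content"]
```

Against a live server, the same parser would be applied to each decoded line of the HTTP response body until `data: [DONE]` is seen.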
### Using Python OpenAI Client
### Streaming with Python OpenAI Client
### Raw Prompt (Non-Chat)
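For non-chat usage, `prompt` replaces `messages` in the request body (the endpoint requires exactly one of the two). A standard-library sketch; the URL and model ID are assumptions:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/completions"  # assumed path

# Raw text completion: "prompt" is used instead of "messages".
payload = {
    "model": "my-model",  # hypothetical model ID
    "prompt": "Once upon a time",
    "max_tokens": 32,
}

def complete(body: dict) -> dict:
    """POST the request body and return the decoded JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# result = complete(payload)  # requires a running server
```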
### Advanced Sampling Parameters
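A request body exercising the sampling parameters documented above (illustrative values; the model ID is a placeholder):

```python
# Request body combining the documented sampling parameters.
payload = {
    "model": "my-model",               # hypothetical model ID
    "prompt": "List three prime numbers:",
    "max_tokens": 32,
    "temperature": 0.0,                # 0.0 triggers greedy decoding
    "top_k": 40,                       # -1 would disable top-k filtering
    "top_p": 0.95,                     # nucleus sampling threshold
    "n": 1,                            # only n=1 is currently supported
}
```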
## Notes
- The endpoint always returns streaming responses (Server-Sent Events format)
- Non-streaming responses are not yet implemented
- The `stop`, `presence_penalty`, and `frequency_penalty` parameters are accepted but not yet implemented
- Greedy decoding is automatically used when `temperature=0.0` or `top_k=1`
- Client disconnection triggers automatic request cancellation to free resources