Endpoint

POST /v1/chat/completions
The chat completions endpoint accepts both chat-style messages and raw text prompts, and is compatible with the OpenAI API request format.

Request Body

model
string
required
The model ID to use for generation. You can retrieve available models from the /v1/models endpoint.
messages
array
An array of message objects for chat-style completions. Each message has:
  • role (string): One of “system”, “user”, or “assistant”
  • content (string): The message content
Either messages or prompt must be provided.
prompt
string
A raw text prompt for completion. Either messages or prompt must be provided.
max_tokens
integer
default:"16"
The maximum number of tokens to generate.
temperature
float
default:"1.0"
Sampling temperature. Use 0.0 for greedy decoding.
top_k
integer
default:"-1"
Top-k sampling parameter. -1 disables top-k filtering.
top_p
float
default:"1.0"
Top-p (nucleus) sampling parameter.
n
integer
default:"1"
Number of completions to generate (currently only n=1 is supported).
stream
boolean
default:"false"
Whether to stream the response using Server-Sent Events (SSE).
stop
array
default:"[]"
List of stop sequences (currently not implemented).
presence_penalty
float
default:"0.0"
Presence penalty (currently not implemented).
frequency_penalty
float
default:"0.0"
Frequency penalty (currently not implemented).
ignore_eos
boolean
default:"false"
Whether to ignore the end-of-sequence token and continue generation.
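
Putting the fields together, a complete request body with the documented defaults looks roughly like this (a sketch; only model plus one of messages or prompt is required):

```python
import json

# Sketch of a full request body using the documented defaults.
# Only "model" plus one of "messages"/"prompt" is required.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 16,     # default
    "temperature": 1.0,   # default; 0.0 selects greedy decoding
    "top_k": -1,          # default; -1 disables top-k filtering
    "top_p": 1.0,         # default
    "n": 1,               # only n=1 is currently supported
    "stream": False,      # default
    "stop": [],           # accepted but not yet implemented
    "ignore_eos": False,  # default
}

body = json.dumps(payload)
```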

Response Format

Streaming Response

When stream=true, the endpoint returns Server-Sent Events (SSE) with the following format:
id
string
Unique completion ID (format: cmpl-{uid}).
object
string
Always “text_completion.chunk” for streaming responses.
choices
array
Array of completion choices. Each choice contains:
  • delta (object): Incremental content update
    • role (string): “assistant” (only in first chunk)
    • content (string): Generated text fragment
  • index (integer): Choice index (always 0)
  • finish_reason (string): null during generation, “stop” when complete
The stream ends with a data: [DONE] message.
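
This framing can be consumed without an SDK. Here is a minimal sketch of a line-based parser (it assumes each event arrives as a single data: line):

```python
import json

def parse_sse_line(line):
    """Return the decoded chunk for a 'data:' line, or None for [DONE] / non-data lines."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)

def collect_text(lines):
    """Concatenate the delta.content fragments from a sequence of SSE lines."""
    parts = []
    for line in lines:
        chunk = parse_sse_line(line)
        if chunk is None:
            continue
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

# A sample stream: role-only first chunk, two content fragments,
# an empty finish chunk, then the [DONE] sentinel.
stream_lines = [
    'data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}',
    'data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{"content":"Code"},"index":0,"finish_reason":null}]}',
    'data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{" content":" flows"}.strip() if False else {"content":" flows"},"index":0,"finish_reason":null}]}'.replace('{" content":" flows"}.strip() if False else ', ''),
    'data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}',
    'data: [DONE]',
]

text = collect_text(stream_lines)
```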

Examples

Chat Completion with Messages

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Streaming Response

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming."}
    ],
    "max_tokens": 50,
    "stream": true
  }'
Example streaming output:
data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{"content":"Code"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{"content":" flows"},"index":0,"finish_reason":null}]}

data: {"id":"cmpl-1","object":"text_completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: [DONE]

Using Python OpenAI Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # API key not required
)

# Non-streaming
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=150,
    temperature=0.7
)

print(response.choices[0].message.content)

Streaming with Python OpenAI Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Count to 10."}],
    max_tokens=50,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Raw Prompt (Non-Chat)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.8
  }'

Advanced Sampling Parameters

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Generate creative text."}],
    max_tokens=200,
    temperature=0.9,
    top_p=0.95,
    extra_body={"top_k": 50}  # top_k is not in the OpenAI schema, so pass it via extra_body
)

print(response.choices[0].message.content)

Notes

  • The endpoint always returns streaming responses (Server-Sent Events format)
  • Non-streaming responses are not yet implemented
  • The stop, presence_penalty, and frequency_penalty parameters are accepted but not yet implemented
  • Greedy decoding is automatically used when temperature=0.0 or top_k=1
  • Client disconnection triggers automatic request cancellation to free resources
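
The sampling notes above can be made concrete with a self-contained sketch of the conventional temperature / top-k / top-p pipeline (this illustrates standard sampler semantics, not this server's exact implementation):

```python
import math

def sample_filter(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Sketch of the usual sampling pipeline: scale logits by temperature,
    keep the top-k candidates, then keep the smallest set of tokens whose
    cumulative probability reaches top_p. Returns {token_index: probability}."""
    if temperature == 0.0 or top_k == 1:
        # Greedy decoding: all probability mass on the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort candidates by descending probability; apply top-k, then top-p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the surviving candidates.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}
```

For example, temperature=0.0 collapses to the argmax token, top_k=2 keeps the two most likely candidates, and a small top_p keeps only the head of the distribution.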
