
Overview

The Chat Completions API provides an OpenAI-compatible interface for conversational AI with Qwen models. It supports chat conversations, function calling, streaming responses, and custom generation parameters.

Endpoint

POST http://localhost:8000/v1/chat/completions

Request

Request Body

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is quantum computing?"
    }
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_length": 2048,
  "stream": false
}

Parameters

model
string
required
Model identifier (the server currently returns "gpt-3.5-turbo" in responses regardless of input)
messages
array
required
Array of message objects forming the conversation. Each message has:
  • role: One of "system", "user", "assistant", or "function"
  • content: Message text content
  • function_call: (Optional) Function call object for assistant messages
temperature
float
default: None
Sampling temperature between 0 and 2:
  • < 0.01: Effectively greedy decoding (sets top_k=1)
  • 0.7-0.9: Balanced creativity
  • > 1.0: More random
Note: Tuning top_p is recommended over temperature.
top_p
float
default: None
Nucleus sampling probability threshold (0-1)
top_k
int
default: None
Limits sampling to top K tokens
max_length
int
default: None
Maximum total sequence length (input + output tokens)
stream
boolean
default: false
Whether to stream partial responses as Server-Sent Events (SSE)
stop
array[string]
default: None
Up to 4 sequences where generation should stop:
"stop": ["\n\n", "User:"]
functions
array
default: None
List of function definitions for function calling:
"functions": [
  {
    "name": "get_weather",
    "description": "Get current weather",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string"}
      }
    }
  }
]
Note: Not supported in stream mode.
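
The parameters above can be combined in a single request body. The sketch below builds one in Python; the values and the prompt are illustrative, and the URL assumes the server's default local address.

```python
# A minimal request body exercising the sampling parameters documented
# above; all values here are illustrative, not recommendations.
import json

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about the sea."},
    ],
    "temperature": 0.7,  # balanced creativity
    "top_p": 0.9,        # nucleus sampling threshold
    "top_k": 50,         # sample only from the 50 most likely tokens
    "max_length": 1024,  # input + output token budget
    "stop": ["\n\n"],    # halt at the first blank line
    "stream": False,
}

# Sending it requires a running server:
# import requests
# resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
# print(resp.json()["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```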

Response

Non-Streaming Response

{
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "created": 1677652288,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing is a type of computing that uses quantum-mechanical phenomena..."
      },
      "finish_reason": "stop"
    }
  ]
}

Response Fields

model
string
Model identifier from request
object
string
Object type: "chat.completion" or "chat.completion.chunk" (streaming)
created
integer
Unix timestamp of when the completion was created
choices
array
Array of completion choices (usually length 1)
choices[].index
integer
Choice index (always 0)
choices[].message
object
Generated message with:
  • role: Always "assistant"
  • content: Generated text
  • function_call: (Optional) Function call object
choices[].finish_reason
string
Reason generation stopped:
  • "stop": Natural completion or stop sequence
  • "length": Reached max_length
  • "function_call": Model wants to call a function

Streaming Responses

When stream=true, responses are sent as Server-Sent Events:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'

Stream Response Format

Each chunk is a JSON object prefixed with "data: "; the stream terminates with a "data: [DONE]" sentinel:
data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}

data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}]}

data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
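
A client has to strip the "data: " prefix, skip the [DONE] sentinel, and concatenate the delta.content fields. The helpers below (parse_sse_line, collect_content) are hypothetical names, not part of the API; with a real server you would feed them lines from requests.post(..., stream=True).iter_lines() instead of the replayed list.

```python
# Sketch of consuming the SSE stream shown above. parse_sse_line and
# collect_content are illustrative helpers, not library functions.
import json

def parse_sse_line(line: str):
    """Return the decoded chunk dict, or None for blanks and [DONE]."""
    line = line.strip()
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data == "[DONE]":
        return None
    return json.loads(data)

def collect_content(lines):
    """Concatenate delta.content across a stream of SSE lines."""
    parts = []
    for line in lines:
        chunk = parse_sse_line(line)
        if chunk is None:
            continue
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

# Replaying the example stream from above:
stream = [
    'data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]}',
    'data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}]}',
    'data: {"model":"gpt-3.5-turbo","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
print(collect_content(stream))  # prints: 1, 2
```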

Function Calling

The API supports function calling for tool use:
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "user", "content": "What's the weather in Boston?"}
        ],
        "functions": [
            {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name"
                        }
                    },
                    "required": ["location"]
                }
            }
        ]
    }
)

result = response.json()
choice = result["choices"][0]

if choice["finish_reason"] == "function_call":
    func_call = choice["message"]["function_call"]
    print(f"Function: {func_call['name']}")
    print(f"Arguments: {func_call['arguments']}")

Function Call Response

{
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I need to check the weather.",
        "function_call": {
          "name": "get_weather",
          "arguments": "{\"location\": \"Boston\"}"
        }
      },
      "finish_reason": "function_call"
    }
  ]
}
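
After a function_call response, the usual next step is to run the function locally and send its result back as a "function" role message so the model can answer in natural language. The sketch below follows the OpenAI convention of including a name field on that message; whether this server requires it is an assumption, and get_weather here is a stand-in implementation.

```python
# Follow-up turn after a function_call response. get_weather is a
# stand-in; a real implementation would call an actual weather API.
import json

def get_weather(location: str) -> dict:
    return {"location": location, "temperature_c": 18, "condition": "cloudy"}

# The assistant message returned above:
assistant_msg = {
    "role": "assistant",
    "content": "I need to check the weather.",
    "function_call": {
        "name": "get_weather",
        "arguments": "{\"location\": \"Boston\"}",
    },
}

messages = [
    {"role": "user", "content": "What's the weather in Boston?"},
    assistant_msg,
]

# Execute the requested function and append its result to the conversation.
args = json.loads(assistant_msg["function_call"]["arguments"])
result = get_weather(**args)
messages.append({
    "role": "function",
    "name": "get_weather",          # name field assumed, per OpenAI convention
    "content": json.dumps(result),
})

# A second POST with these messages (and the same "functions" list)
# lets the model produce the final natural-language answer:
# requests.post("http://localhost:8000/v1/chat/completions",
#               json={"model": "gpt-3.5-turbo", "messages": messages})
```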

Python Client Example

import openai

# Configure client (uses the legacy openai<1.0 interface)
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # Not required for local server

# Create chat completion
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain neural networks simply."}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)

Multi-turn Conversation

import requests

url = "http://localhost:8000/v1/chat/completions"
messages = [{"role": "system", "content": "You are a helpful assistant."}]

# First turn
messages.append({"role": "user", "content": "Hello!"})
response = requests.post(url, json={"model": "gpt-3.5-turbo", "messages": messages})
assistant_msg = response.json()["choices"][0]["message"]
messages.append(assistant_msg)

print(f"Assistant: {assistant_msg['content']}")

# Second turn
messages.append({"role": "user", "content": "Tell me a joke"})
response = requests.post(url, json={"model": "gpt-3.5-turbo", "messages": messages})
assistant_msg = response.json()["choices"][0]["message"]
messages.append(assistant_msg)

print(f"Assistant: {assistant_msg['content']}")

Error Responses

Invalid Request

{
  "detail": "Invalid request: Expecting at least one user message."
}

Status Codes

  • 200: Success
  • 400: Bad request (invalid parameters or message format)
  • 401: Unauthorized (if API authentication is enabled)
  • 500: Internal server error
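
The status codes above can be handled with a small dispatch around the HTTP call. check_response below is a hypothetical helper, not part of any client library; with requests you would pass it resp.status_code and resp.json().

```python
# Sketch of mapping the documented status codes to actionable messages.
# check_response is an illustrative helper, not an API of this server.
def check_response(status_code: int, body: dict) -> str:
    if status_code == 200:
        return "ok"
    if status_code == 400:
        return f"bad request: {body.get('detail', 'unknown')}"
    if status_code == 401:
        return "unauthorized: check API credentials"
    if status_code == 500:
        return "server error: retry or inspect server logs"
    return f"unexpected status {status_code}"

# With requests, the same check wraps the real call:
# resp = requests.post(url, json=payload)
# print(check_response(resp.status_code, resp.json()))
```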
