Overview
The proxy chat completions endpoint provides OpenAI-compatible chat completions with additional features like authentication, rate limiting, budgets, and centralized logging.
Endpoint
POST {PROXY_BASE_URL}/v1/chat/completions
Alternate routes:
POST /chat/completions
POST /engines/{model}/chat/completions
POST /openai/deployments/{model}/chat/completions
Authentication
Authenticate with a Bearer token in the Authorization header: Authorization: Bearer sk-litellm-xxx...
Request Headers
Content-Type (string, default: "application/json"): Content type of the request body.
The proxy also accepts optional headers for access control and request tracking (the tracking header names appear in the extra_headers example below):
A team ID header for team-based access control.
x-litellm-metadata: JSON-stringified metadata for request tracking.
x-litellm-user-id: End-user ID for tracking and analytics.
x-litellm-tags: Comma-separated tags for request categorization.
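As a sketch, the optional tracking headers can be assembled in Python before sending a request; the header names match the extra_headers example later in this document, and json.dumps keeps the metadata a valid JSON string:

```python
import json

def build_litellm_headers(user_id=None, metadata=None, tags=None):
    """Assemble optional LiteLLM tracking headers for a proxy request."""
    headers = {"Content-Type": "application/json"}
    if user_id:
        headers["x-litellm-user-id"] = user_id
    if metadata:
        # Metadata must be a JSON-stringified object.
        headers["x-litellm-metadata"] = json.dumps(metadata)
    if tags:
        # Tags are sent as a single comma-separated string.
        headers["x-litellm-tags"] = ",".join(tags)
    return headers

headers = build_litellm_headers(
    user_id="user-123",
    metadata={"environment": "production"},
    tags=["tag1", "tag2"],
)
```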
Request Body
The request body follows the OpenAI chat completions format:
Model to use for completion.
Array of message objects.
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
}
Sampling temperature (0-2).
Maximum tokens to generate.
Whether to stream the response.
Tools available for function calling.
See completion() API for all available parameters.
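Putting the parameters above together, the request body is plain JSON; a minimal sketch of building one in Python (the parameter values are illustrative):

```python
import json

payload = {
    "model": "gpt-4",    # model to use for completion
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,  # sampling temperature, 0-2
    "max_tokens": 256,   # cap on generated tokens
    "stream": False,     # set True for streamed chunks
}
body = json.dumps(payload)
```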
Response
Success Response (200)
Unique identifier for the completion.
Object type ("chat.completion", or "chat.completion.chunk" for streaming).
Unix timestamp of creation.
Model used for the completion.
Array of completion choices.
Token usage information (prompt_tokens, completion_tokens, total_tokens).
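The fields above map onto a JSON object; a sketch of reading them from a sample (non-streaming) response body, with illustrative values:

```python
import json

# Example response body in the shape described above.
raw = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "gpt-4",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12}
}
"""
resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
total_tokens = resp["usage"]["total_tokens"]
```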
Error Responses
401 Unauthorized: invalid or missing authentication token.
{
  "error": {
    "message": "Invalid API key",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
429 Too Many Requests: rate limit exceeded.
{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error"
  }
}
400 Bad Request: invalid request parameters.
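A sketch of dispatching on the error bodies above (pure stdlib; the status codes follow the descriptions in this section):

```python
import json

def classify_error(status_code, body):
    """Map a proxy error response to a short action hint."""
    err = json.loads(body).get("error", {})
    if status_code == 401:
        return "check API key: " + err.get("message", "")
    if status_code == 429:
        return "back off and retry: " + err.get("message", "")
    return "fix request: " + err.get("message", "")

hint = classify_error(
    429, '{"error": {"message": "Rate limit exceeded", "type": "rate_limit_error"}}'
)
```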
Examples
Basic Request
curl -X POST http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-litellm-xxx" \
-d '{
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'
Python Request
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
Streaming Request
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
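Under the hood the stream arrives as server-sent events; a minimal sketch of parsing the data: lines yourself (the OpenAI client above does this for you):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from raw SSE lines of a streaming response."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # sentinel terminating the stream
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            yield delta["content"]

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_content(sample))
```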
Custom Headers
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "x-litellm-user-id": "user-123",
        "x-litellm-metadata": '{"environment": "production"}',
        "x-litellm-tags": "tag1,tag2"
    }
)
Function Calling
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=tools
)
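When the model decides to call a tool, the assistant message carries the call with JSON-encoded arguments; a sketch of decoding one, shown on a plain dict in the response shape with a hypothetical get_weather handler:

```python
import json

def handle_tool_calls(message, handlers):
    """Decode tool calls from an assistant message and run matching handlers."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        results.append(handlers[fn["name"]](**args))
    return results

message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": '{"location": "NYC"}'},
    }],
}
results = handle_tool_calls(
    message, {"get_weather": lambda location: f"weather for {location}"}
)
```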
Proxy-Specific Features
Budget Tracking
The proxy automatically tracks spending against key/team budgets:
# Key will be rejected if budget exceeded
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# Response includes cost tracking
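As an illustration of how per-request cost can be derived from token usage, a sketch with hypothetical per-1k-token prices (real prices vary by model and are configured on the proxy):

```python
def request_cost(usage, prompt_price, completion_price):
    """Compute cost in USD from token usage and per-1k-token prices."""
    return (usage["prompt_tokens"] / 1000 * prompt_price
            + usage["completion_tokens"] / 1000 * completion_price)

# Hypothetical prices per 1k tokens.
cost = request_cost(
    {"prompt_tokens": 900, "completion_tokens": 100},
    prompt_price=0.03, completion_price=0.06,
)
```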
Rate Limiting
Keys can have TPM (tokens per minute) and RPM (requests per minute) limits:
# Requests are automatically throttled
# 429 error returned if limits exceeded
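A client-side sketch of backing off after 429 responses with capped exponential delays (the retry count, base delay, and cap are illustrative):

```python
def backoff_delays(retries=5, base=1.0, cap=30.0):
    """Capped exponential backoff schedule, in seconds, for 429 retries."""
    delays = []
    for attempt in range(retries):
        delays.append(min(cap, base * (2 ** attempt)))
    return delays

# A caller would sleep for the next delay after each 429 response.
schedule = backoff_delays()
```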
Model Aliases
Use proxy-defined model aliases:
response = client.chat.completions.create(
    model="gpt-4",  # can map to a specific deployment
    messages=[{"role": "user", "content": "Hello"}]
)
Automatic Retries & Fallbacks
Proxy handles retries and fallbacks automatically:
# If primary deployment fails, proxy tries fallback
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
Monitoring & Logging
All requests are logged with:
Request/response details
Token usage
Costs
Latency
Errors
User/team information
Custom metadata
Access logs through:
Admin UI at /ui
Spend tracking endpoints
Custom callback integrations