Overview
The proxy chat completions endpoint provides OpenAI-compatible chat completions with additional features like authentication, rate limiting, budgets, and centralized logging.
Endpoint
POST {PROXY_BASE_URL}/v1/chat/completions
Alternate routes:
POST /chat/completions
POST /engines/{model}/chat/completions
POST /openai/deployments/{model}/chat/completions
Authentication
Authenticate with a Bearer token in the Authorization header: Authorization: Bearer sk-litellm-xxx...
Request Headers
Content-Type (string, default: "application/json"): Content type of the request body.
The proxy also accepts optional headers for access control and request tracking (the tracking header names appear in the extra_headers example below):
A team ID header for team-based access control.
x-litellm-metadata: JSON-stringified metadata for request tracking.
x-litellm-user-id: End-user ID for tracking and analytics.
x-litellm-tags: Comma-separated tags for request categorization.
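As a sketch, the optional tracking headers can be assembled in Python before sending a request; the header names match the extra_headers example later in this document, and json.dumps keeps the metadata a valid JSON string:

```python
import json

def build_litellm_headers(user_id=None, metadata=None, tags=None):
    """Assemble optional LiteLLM tracking headers for a proxy request."""
    headers = {"Content-Type": "application/json"}
    if user_id:
        headers["x-litellm-user-id"] = user_id
    if metadata:
        # Metadata must be a JSON-stringified object.
        headers["x-litellm-metadata"] = json.dumps(metadata)
    if tags:
        # Tags are sent as a single comma-separated string.
        headers["x-litellm-tags"] = ",".join(tags)
    return headers

headers = build_litellm_headers(
    user_id="user-123",
    metadata={"environment": "production"},
    tags=["tag1", "tag2"],
)
```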
Request Body
The request body follows the OpenAI chat completions format:
Model to use for completion.
Array of message objects.
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
}
Sampling temperature (0-2).
Maximum tokens to generate.
Whether to stream the response.
Tools available for function calling.
See completion() API for all available parameters.
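Putting the parameters above together, the request body is plain JSON; a minimal sketch of building one in Python (the parameter values are illustrative):

```python
import json

payload = {
    "model": "gpt-4",    # model to use for completion
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,  # sampling temperature, 0-2
    "max_tokens": 256,   # cap on generated tokens
    "stream": False,     # set True for streamed chunks
}
body = json.dumps(payload)
```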
Response
Success Response (200)
Unique identifier for the completion.
Object type ("chat.completion", or "chat.completion.chunk" for streaming).
Unix timestamp of creation.
Model used for the completion.
Array of completion choices.
Token usage information (prompt_tokens, completion_tokens, total_tokens).
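The fields above map onto a JSON object; a sketch of reading them from a sample (non-streaming) response body, with illustrative values:

```python
import json

# Example response body in the shape described above.
raw = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "gpt-4",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12}
}
"""
resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
total_tokens = resp["usage"]["total_tokens"]
```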
Error Responses
401 Unauthorized: invalid or missing authentication token.
{
  "error": {
    "message": "Invalid API key",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
429 Too Many Requests: rate limit exceeded.
{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error"
  }
}
400 Bad Request: invalid request parameters.
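A sketch of dispatching on the error bodies above (pure stdlib; the status codes follow the descriptions in this section):

```python
import json

def classify_error(status_code, body):
    """Map a proxy error response to a short action hint."""
    err = json.loads(body).get("error", {})
    if status_code == 401:
        return "check API key: " + err.get("message", "")
    if status_code == 429:
        return "back off and retry: " + err.get("message", "")
    return "fix request: " + err.get("message", "")

hint = classify_error(
    429, '{"error": {"message": "Rate limit exceeded", "type": "rate_limit_error"}}'
)
```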
Examples
Basic Request
curl -X POST http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-litellm-xxx" \
-d '{
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'
Python Request
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
Streaming Request
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
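Under the hood the stream arrives as server-sent events; a minimal sketch of parsing the data: lines yourself (the OpenAI client above does this for you):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from raw SSE lines of a streaming response."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # sentinel terminating the stream
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            yield delta["content"]

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_content(sample))
```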
Custom Headers
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "x-litellm-user-id": "user-123",
        "x-litellm-metadata": '{"environment": "production"}',
        "x-litellm-tags": "tag1,tag2"
    }
)
Function Calling
import openai

client = openai.OpenAI(
    api_key="sk-litellm-xxx",
    base_url="http://localhost:4000"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=tools
)
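When the model decides to call a tool, the assistant message carries the call with JSON-encoded arguments; a sketch of decoding one, shown on a plain dict in the response shape with a hypothetical get_weather handler:

```python
import json

def handle_tool_calls(message, handlers):
    """Decode tool calls from an assistant message and run matching handlers."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        results.append(handlers[fn["name"]](**args))
    return results

message = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": '{"location": "NYC"}'},
    }],
}
results = handle_tool_calls(
    message, {"get_weather": lambda location: f"weather for {location}"}
)
```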
Proxy-Specific Features
Budget Tracking
The proxy automatically tracks spending against key/team budgets:
# Key will be rejected if budget exceeded
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# Response includes cost tracking
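As an illustration of how per-request cost can be derived from token usage, a sketch with hypothetical per-1k-token prices (real prices vary by model and are configured on the proxy):

```python
def request_cost(usage, prompt_price, completion_price):
    """Compute cost in USD from token usage and per-1k-token prices."""
    return (usage["prompt_tokens"] / 1000 * prompt_price
            + usage["completion_tokens"] / 1000 * completion_price)

# Hypothetical prices per 1k tokens.
cost = request_cost(
    {"prompt_tokens": 900, "completion_tokens": 100},
    prompt_price=0.03, completion_price=0.06,
)
```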
Rate Limiting
Keys can have TPM (tokens per minute) and RPM (requests per minute) limits:
# Requests are automatically throttled
# 429 error returned if limits exceeded
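A client-side sketch of backing off after 429 responses with capped exponential delays (the retry count, base delay, and cap are illustrative):

```python
def backoff_delays(retries=5, base=1.0, cap=30.0):
    """Capped exponential backoff schedule, in seconds, for 429 retries."""
    delays = []
    for attempt in range(retries):
        delays.append(min(cap, base * (2 ** attempt)))
    return delays

# A caller would sleep for the next delay after each 429 response.
schedule = backoff_delays()
```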
Model Aliases
Use proxy-defined model aliases:
response = client.chat.completions.create(
    model="gpt-4",  # can map to a specific deployment
    messages=[{"role": "user", "content": "Hello"}]
)
Automatic Retries & Fallbacks
Proxy handles retries and fallbacks automatically:
# If primary deployment fails, proxy tries fallback
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
Monitoring & Logging
All requests are logged with:
Request/response details
Token usage
Costs
Latency
Errors
User/team information
Custom metadata
Access logs through:
Admin UI at /ui
Spend tracking endpoints
Custom callback integrations