Streaming Responses
CLI Proxy API supports two methods for receiving real-time streaming responses:
- Server-Sent Events (SSE): Standard HTTP streaming via the
/v1/chat/completions endpoint
- WebSocket: Bidirectional streaming via the
/v1/responses endpoint (OpenAI Responses format)
Server-Sent Events (SSE)
Enabling Streaming
Set stream: true in your chat completions request:
curl http://localhost:8317/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "gemini-2.5-pro",
"messages": [
{"role": "user", "content": "Write a poem about code"}
],
"stream": true
}'
The server sends chunks as data: prefixed JSON objects:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"content":"In"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"content":" circuits"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{"content":" deep"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1694268190,"model":"gemini-2.5-pro","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":15,"total_tokens":25}}
data: [DONE]
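A client consumes this format by splitting the stream into lines, stripping the data: prefix, and stopping at the [DONE] sentinel. A minimal parsing sketch (a standalone helper for illustration, not part of the proxy):

```python
import json

def parse_sse_text(lines):
    """Assemble assistant text from SSE 'data:' lines, stopping at [DONE]."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and keep-alive blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)
```

Fed the example stream above, this returns the string "In circuits deep".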
Chunk Structure
Each chunk contains the following fields:
- id: Unique identifier for the completion (same across all chunks).
- object: Always chat.completion.chunk.
- created: Unix timestamp when the completion was created.
- model: The model used for generation.
- choices: Array of completion choices (typically one element). Each choice contains:
  - index: Choice index (0 for single-choice requests).
  - delta: Incremental changes to the message:
    - role: Set to assistant in the first chunk only.
    - content: Text content delta for this chunk.
    - tool_calls: Incremental tool call updates (if function calling is used).
  - finish_reason: Why generation stopped. null for intermediate chunks, set in the final chunk:
    - stop: Natural completion
    - length: Max tokens reached
    - tool_calls: Model called a function
    - content_filter: Content filtered
- usage: Token usage statistics (only in the final chunk, alongside finish_reason).
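A client can fold a stream of parsed chunk objects into a single final message using these fields. A sketch (the helper name is illustrative):

```python
def merge_chunks(chunks):
    """Fold parsed chat.completion.chunk objects into a final message."""
    message = {"role": None, "content": ""}
    finish_reason = None
    usage = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice["delta"]
        if "role" in delta:          # the first chunk carries the role
            message["role"] = delta["role"]
        if "content" in delta:       # subsequent chunks carry text deltas
            message["content"] += delta["content"]
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]
        if chunk.get("usage"):       # only present in the final chunk
            usage = chunk["usage"]
    return message, finish_reason, usage
```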
Response Headers
SSE responses are sent with the following headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Access-Control-Allow-Origin: *
Keep-Alive Behavior
The SSE connection uses chunked transfer encoding and flushes after each chunk. The server automatically handles:
- Immediate Flush: Each chunk is sent immediately to minimize latency
- Connection Management: Graceful handling of client disconnects
- Error Streaming: Errors during streaming are sent as error chunks (not HTTP errors)
WebSocket Streaming
For more advanced use cases, CLI Proxy API provides a WebSocket endpoint that supports:
- Bidirectional communication
- Multi-turn conversations without reconnecting
- Incremental input appending
- Session management
WebSocket Endpoint
ws://localhost:8317/v1/responses
Note: Use wss:// for secure WebSocket connections over TLS.
Authentication
WebSocket authentication is controlled by the websocket_auth configuration setting:
websocket_auth: true # Require authentication (default)
When enabled, authenticate via query parameter or WebSocket subprotocol:
// Via query parameter
const ws = new WebSocket('ws://localhost:8317/v1/responses?api_key=YOUR_API_KEY');
// Via subprotocol (recommended)
const ws = new WebSocket('ws://localhost:8317/v1/responses', ['api-key', 'YOUR_API_KEY']);
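If you build the query-parameter form programmatically, URL-encoding the key avoids malformed URLs when keys contain reserved characters. A small Python sketch (purely illustrative):

```python
from urllib.parse import urlencode

def ws_url(base, api_key=None):
    """Append an api_key query parameter to a WebSocket endpoint URL."""
    if api_key is None:
        return base
    return f"{base}?{urlencode({'api_key': api_key})}"
```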
Request Message Types
Send JSON text messages with one of these types:
1. response.create
Start a new conversation:
{
"type": "response.create",
"model": "gemini-2.5-pro",
"input": [
{"role": "user", "content": "Hello, how are you?"}
],
"instructions": "You are a helpful assistant.",
"stream": true
}
2. response.append
Continue an existing conversation:
{
"type": "response.append",
"input": [
{"role": "user", "content": "Tell me more."}
]
}
The server automatically merges the previous response output with the new input.
For providers that support WebSocket v2 mode (like Codex), use previous_response_id:
{
"type": "response.create",
"model": "codex-model",
"previous_response_id": "resp_abc123",
"input": [
{"role": "user", "content": "Additional context"}
]
}
This sends only the incremental input without expanding the full conversation history.
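The three request shapes above can be built with small helpers. A sketch (the function names are illustrative, not part of the API):

```python
import json

def response_create(model, user_text, instructions=None, previous_response_id=None):
    """Build a response.create message (optionally v2-style incremental)."""
    msg = {
        "type": "response.create",
        "model": model,
        "input": [{"role": "user", "content": user_text}],
        "stream": True,
    }
    if instructions:
        msg["instructions"] = instructions
    if previous_response_id:  # WebSocket v2 mode: send only the increment
        msg["previous_response_id"] = previous_response_id
    return json.dumps(msg)

def response_append(user_text):
    """Build a response.append message to continue the current session."""
    return json.dumps({
        "type": "response.append",
        "input": [{"role": "user", "content": user_text}],
    })
```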
Response Events
The server sends JSON text messages with different event types:
response.created
{
"type": "response.created",
"sequence_number": 0,
"response": {
"id": "resp_abc123",
"object": "response",
"created_at": 1694268190,
"status": "in_progress",
"model": "gemini-2.5-pro",
"output": []
}
}
Content Deltas
Streaming content chunks:
{
"type": "content.delta",
"sequence_number": 1,
"content_index": 0,
"delta": {
"type": "text",
"text": "Hello"
}
}
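Delta events can arrive for more than one content part, so clients typically accumulate them keyed by content_index. A sketch:

```python
def accumulate_deltas(events):
    """Collect text from content.delta events, keyed by content_index."""
    parts = {}
    for event in events:
        if event.get("type") != "content.delta":
            continue
        if event["delta"].get("type") == "text":
            idx = event["content_index"]
            parts[idx] = parts.get(idx, "") + event["delta"]["text"]
    return [parts[i] for i in sorted(parts)]
```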
response.completed
{
"type": "response.completed",
"sequence_number": 42,
"response": {
"id": "resp_abc123",
"object": "response",
"created_at": 1694268190,
"status": "completed",
"model": "gemini-2.5-pro",
"output": [
{
"type": "message",
"role": "assistant",
"content": "Hello! I'm doing well, thank you for asking."
}
],
"usage": {
"input_tokens": 10,
"output_tokens": 15,
"total_tokens": 25
}
}
}
error
{
"type": "error",
"status": 400,
"error": {
"type": "invalid_request_error",
"message": "Missing model in response.create request"
}
}
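Because the connection stays open after an error message (see Error Handling below), clients usually decide per status code whether a retry is worthwhile. A sketch of that decision (the status set is a common convention, not defined by the proxy):

```python
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def should_retry(event):
    """Return True if an error event looks transient and worth retrying."""
    return event.get("type") == "error" and event.get("status") in RETRYABLE_STATUSES
```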
WebSocket Implementation
From openai_responses_websocket.go:49-194:
func (h *OpenAIResponsesAPIHandler) ResponsesWebsocket(c *gin.Context) {
conn, err := responsesWebsocketUpgrader.Upgrade(c.Writer, c.Request, nil)
if err != nil {
return
}
// Session management
passthroughSessionID := uuid.NewString()
// Message loop
for {
msgType, payload, err := conn.ReadMessage()
// ... process request ...
// Forward to upstream provider
dataChan, errChan := h.ExecuteStreamWithAuthManager(...)
// Stream response back to client
h.forwardResponsesWebsocket(conn, dataChan, errChan)
}
}
Key features:
- Session Pinning: Once a provider is selected for the first request, subsequent requests in the session use the same provider
- Error Handling: Errors are sent as WebSocket messages, not connection closes
- Automatic Reconnect: Clients can reconnect with the same
x-codex-turn-state header to resume a session
JavaScript Example
const ws = new WebSocket('ws://localhost:8317/v1/responses');
ws.onopen = () => {
// Start conversation
ws.send(JSON.stringify({
type: 'response.create',
model: 'gemini-2.5-pro',
input: [{ role: 'user', content: 'Hello!' }],
stream: true
}));
};
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
switch (message.type) {
case 'response.created':
console.log('Response started:', message.response.id);
break;
case 'content.delta':
process.stdout.write(message.delta.text);
break;
case 'response.completed':
console.log('\nCompleted:', message.response);
// Continue conversation
ws.send(JSON.stringify({
type: 'response.append',
input: [{ role: 'user', content: 'Tell me more.' }]
}));
break;
case 'error':
console.error('Error:', message.error);
ws.close();
break;
}
};
ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
ws.onclose = () => {
console.log('WebSocket closed');
};
Python Example (SSE)
import requests
import json
url = 'http://localhost:8317/v1/chat/completions'
headers = {
'Content-Type': 'application/json',
'Authorization': 'Bearer YOUR_API_KEY'
}
data = {
'model': 'gemini-2.5-pro',
'messages': [{'role': 'user', 'content': 'Count to 10'}],
'stream': True
}
with requests.post(url, headers=headers, json=data, stream=True) as response:
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
content = line[6:]
if content == '[DONE]':
break
chunk = json.loads(content)
if chunk['choices'][0]['delta'].get('content'):
print(chunk['choices'][0]['delta']['content'], end='', flush=True)
Configuration
Disable WebSocket Authentication
For development or internal networks, disable it in the configuration:
websocket_auth: false # Allow unauthenticated WebSocket connections
Keep-Alive Settings
For long-running streams, the proxy includes keep-alive mechanisms:
- SSE: Automatic flush after each chunk
- WebSocket: Ping/pong frames handled by underlying gorilla/websocket library
- HTTP Timeouts: Configurable via reverse proxy (Nginx, etc.)
WebSocket connections remain open until explicitly closed by the client or an error occurs. Make sure to implement proper connection management in production.
Error Handling
SSE Errors
Errors during SSE streaming are sent as error chunks:
data: {"error":{"message":"Model quota exceeded","type":"rate_limit_error"},"status":429}
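Because these errors arrive in-band after the HTTP 200 response has already been committed, clients must inspect each JSON payload rather than relying on the status line. One way to surface them (illustrative helper, not part of the proxy):

```python
import json

class StreamError(Exception):
    """Raised when an SSE data payload carries an in-band error."""
    def __init__(self, status, message):
        super().__init__(message)
        self.status = status

def decode_chunk(payload):
    """Parse one SSE data payload; raise StreamError for in-band errors."""
    obj = json.loads(payload)
    if "error" in obj:
        raise StreamError(obj.get("status"), obj["error"]["message"])
    return obj
```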
WebSocket Errors
WebSocket errors are sent as error event messages:
{
"type": "error",
"status": 503,
"error": {
"type": "server_error",
"message": "Upstream provider unavailable"
}
}
The connection remains open after error messages, allowing retry logic.
Best Practices
- Connection Reuse: For multiple requests, use WebSocket instead of opening new SSE connections
- Timeout Handling: Implement client-side timeouts for stalled streams
- Graceful Degradation: Fall back to non-streaming if streaming fails
- Buffer Management: Process chunks immediately to avoid memory buildup
- Error Recovery: Implement exponential backoff for reconnection attempts
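The reconnection advice above is commonly implemented as capped exponential backoff with jitter. A sketch (parameter defaults are illustrative):

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Delay in seconds for the given retry attempt (0-based), with full jitter."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```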
The streaming implementation automatically handles provider-specific quirks and normalizes responses to the OpenAI format. See sdk/api/handlers/openai/openai_handlers.go for the SSE implementation and openai_responses_websocket.go for WebSocket handling.