Overview
The chat completions endpoint provides OpenAI-compatible chat completion functionality with unified access to multiple LLM providers through the Helicone AI Gateway.

Authentication
All requests to the AI Gateway require authentication using your Helicone API key in the Authorization header:
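For example, in Python (the key shown is a placeholder, not a real credential):

```python
# Build the required Authorization header for the AI Gateway.
HELICONE_API_KEY = "sk-helicone-..."  # placeholder: use your Helicone API key

headers = {
    "Authorization": f"Bearer {HELICONE_API_KEY}",
    "Content-Type": "application/json",
}
```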
Endpoint
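The endpoint accepts POST requests with a JSON body. A minimal sketch using Python's standard library; the host below is a placeholder and the OpenAI-compatible path /v1/chat/completions is an assumption, so confirm the exact base URL in your Helicone dashboard:

```python
import json
from urllib import request

# Placeholder host: take the real gateway base URL from your Helicone dashboard.
BASE_URL = "https://ai-gateway.helicone.ai"

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = request.Request(
    BASE_URL + "/v1/chat/completions",  # OpenAI-compatible path (assumed)
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer <your-helicone-api-key>",
        "Content-Type": "application/json",
    },
    method="POST",
)
# response = request.urlopen(req)  # uncomment to actually send the request
```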
Request Parameters
- model: The model identifier to use for the completion (e.g., gpt-4, claude-3-opus-20240229).
- messages: Array of message objects representing the conversation history. Each message must have a role and content. Supported roles:
  - system: System instructions
  - user: User messages
  - assistant: Assistant responses
  - tool: Tool/function call results
  - function: Legacy function call results
  - developer: Developer-level instructions
- temperature: Sampling temperature between 0 and 2. Higher values make output more random. Default varies by model.
- max_tokens: Maximum number of tokens to generate in the completion.
- max_completion_tokens: Maximum number of completion tokens to generate (alternative to max_tokens).
- top_p: Nucleus sampling parameter, an alternative to temperature. Value between 0 and 1.
- top_k: Top-K sampling parameter for limiting token selection.
- stream: Whether to stream the response as Server-Sent Events (SSE).
- stream_options: Options for streaming responses.
- stop: Up to 4 sequences where the API will stop generating further tokens.
- n: Number of chat completion choices to generate (1-128).
- presence_penalty: Penalizes new tokens based on whether they appear in the text so far (-2.0 to 2.0).
- frequency_penalty: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0).
- logit_bias: Modifies the likelihood of specified tokens appearing in the completion.
- logprobs: Whether to return log probabilities of output tokens.
- top_logprobs: Number of most likely tokens to return at each position (0-20). Requires logprobs: true.
- response_format: Format for the model's output.
- tools: List of tools the model can call. Use this for function calling.
- tool_choice: Controls which tool the model should use. Options: none, auto, required, or a specific tool.
- parallel_tool_calls: Whether to enable parallel function calling.
- user: Unique identifier for the end user, for monitoring and abuse detection.
- seed: Random seed for deterministic sampling.
- service_tier: Service tier to use. Options: auto, default, flex, scale, priority.
- reasoning_effort: Amount of reasoning effort for reasoning models. Options: minimal, low, medium, high.
- reasoning: Options for reasoning models.
- metadata: Custom metadata to attach to the request for tracking and filtering in Helicone.
- cache_control: Cache control settings for prompt caching.
- prompt_cache_key: Key for prompt caching to reuse previous prompts.
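Putting several of the parameters above together, a request body might look like this (all values are illustrative):

```python
# Illustrative request body combining common parameters from the list above.
body = {
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize SSE in one sentence."},
    ],
    "temperature": 0.7,       # 0 to 2; higher is more random
    "max_tokens": 256,        # cap completion length to control cost
    "stop": ["\n\n"],         # up to 4 stop sequences
    "seed": 42,               # deterministic sampling where supported
    "metadata": {"feature": "docs-demo"},  # Helicone tracking/filtering
}
```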
Response Format
Non-Streaming Response
- id: Unique identifier for the completion.
- object: Object type, always chat.completion.
- created: Unix timestamp of when the completion was created.
- model: The model used for the completion.
- choices: Array of completion choices.
- usage: Token usage information.
Streaming Response
When stream: true, the response is returned as Server-Sent Events (SSE). Each event contains a JSON chunk object with an incremental delta, and the stream is terminated by a final data: [DONE] message.
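A minimal sketch of consuming such a stream, assuming each event arrives as a data: <json> line in standard SSE form (the sample chunks below are illustrative):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from SSE lines until the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Illustrative chunks, shaped like chat.completion.chunk objects:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # Hello
```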
Error Responses
Error information when a request fails.
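A hedged sketch of surfacing such errors, assuming the body carries an error object with type and message fields:

```python
import json

def raise_for_gateway_error(body_text):
    """Raise if the parsed response body contains an error object."""
    body = json.loads(body_text)
    if "error" in body:
        err = body["error"]
        raise RuntimeError(f"{err.get('type', 'error')}: {err.get('message', '')}")
    return body

# Illustrative failing body:
try:
    raise_for_gateway_error(
        '{"error": {"type": "invalid_request_error", "message": "model is required"}}'
    )
except RuntimeError as exc:
    print(exc)  # invalid_request_error: model is required
```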
Example Responses
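As an illustration, a non-streaming response has this general shape (the identifier, timestamp, and content below are made up):

```python
# Illustrative non-streaming chat.completion response body.
example_response = {
    "id": "chatcmpl-abc123",   # unique completion identifier
    "object": "chat.completion",
    "created": 1700000000,     # Unix timestamp
    "model": "gpt-4",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 7, "total_tokens": 16},
}

print(example_response["choices"][0]["message"]["content"])
```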
Advanced Features
Function Calling
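Function calling follows the OpenAI tools schema. A hedged sketch, where get_weather and its parameter schema are illustrative:

```python
# Illustrative tools array for function calling (get_weather is made up).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

request_body = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```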
Define tools that the model can use in the tools array of the request.

Vision (Image Input)
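Image input uses content parts instead of a plain string. A sketch, where the image URL is a placeholder:

```python
# Illustrative message mixing text and an image URL via content parts.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }
]
```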
Include images in your messages using content parts.

JSON Mode
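JSON mode is requested through response_format; json_object is the standard OpenAI-compatible value:

```python
# Illustrative request forcing valid JSON output.
json_body = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "List two primary colors as JSON."}],
    "response_format": {"type": "json_object"},  # forces valid JSON output
}
```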
Force the model to output valid JSON by setting response_format.

Rate Limits
Rate limits are applied at the organization level and vary based on your Helicone plan. Monitor your usage through the Helicone dashboard.

Best Practices
- Always include error handling for API calls
- Use streaming for better user experience with long responses
- Set appropriate max_tokens to control costs
- Use metadata to track and filter requests in Helicone
- Implement retry logic with exponential backoff for transient errors