Overview
The chat completions endpoint provides OpenAI-compatible chat completion functionality with unified access to multiple LLM providers through the Helicone AI Gateway.

Authentication
All requests to the AI Gateway require authentication using your Helicone API key in the Authorization header:
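For example, in Python (the key shown is a placeholder, not a real credential):

```python
# Build the required Authorization header for the AI Gateway.
HELICONE_API_KEY = "sk-helicone-..."  # placeholder: use your Helicone API key

headers = {
    "Authorization": f"Bearer {HELICONE_API_KEY}",
    "Content-Type": "application/json",
}
```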
Endpoint
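The endpoint accepts POST requests with a JSON body. A minimal sketch using Python's standard library; the host below is a placeholder and the OpenAI-compatible path /v1/chat/completions is an assumption, so confirm the exact base URL in your Helicone dashboard:

```python
import json
from urllib import request

# Placeholder host: take the real gateway base URL from your Helicone dashboard.
BASE_URL = "https://ai-gateway.helicone.ai"

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = request.Request(
    BASE_URL + "/v1/chat/completions",  # OpenAI-compatible path (assumed)
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer <your-helicone-api-key>",
        "Content-Type": "application/json",
    },
    method="POST",
)
# response = request.urlopen(req)  # uncomment to actually send the request
```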
Request Parameters
- model: The model identifier to use for the completion (e.g., gpt-4, claude-3-opus-20240229).
- messages: Array of message objects representing the conversation history. Each message must have a role and content. Supported roles:
  - system: System instructions
  - user: User messages
  - assistant: Assistant responses
  - tool: Tool/function call results
  - function: Legacy function call results
  - developer: Developer-level instructions
- temperature: Sampling temperature between 0 and 2. Higher values make output more random. Default varies by model.
- max_tokens: Maximum number of tokens to generate in the completion.
- max_completion_tokens: Maximum number of completion tokens to generate (alternative to max_tokens).
- top_p: Nucleus sampling parameter, an alternative to temperature. Value between 0 and 1.
- top_k: Top-K sampling parameter for limiting token selection.
- stream: Whether to stream the response as Server-Sent Events (SSE).
- stream_options: Options for streaming responses.
- stop: Up to 4 sequences where the API will stop generating further tokens.
- n: Number of chat completion choices to generate (1-128).
- presence_penalty: Penalizes new tokens based on whether they appear in the text so far (-2.0 to 2.0).
- frequency_penalty: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0).
- logit_bias: Modifies the likelihood of specified tokens appearing in the completion.
- logprobs: Whether to return log probabilities of output tokens.
- top_logprobs: Number of most likely tokens to return at each position (0-20). Requires logprobs: true.
- response_format: Format for the model's output.
- tools: List of tools the model can call. Use this for function calling.
- tool_choice: Controls which tool the model should use. Options: none, auto, required, or a specific tool.
- parallel_tool_calls: Whether to enable parallel function calling.
- user: Unique identifier for the end user, for monitoring and abuse detection.
- seed: Random seed for deterministic sampling.
- service_tier: Service tier to use. Options: auto, default, flex, scale, priority.
- reasoning_effort: Amount of reasoning effort for reasoning models. Options: minimal, low, medium, high.
- reasoning: Options for reasoning models.
- metadata: Custom metadata to attach to the request for tracking and filtering in Helicone.
- cache_control: Cache control settings for prompt caching.
- prompt_cache_key: Key for prompt caching to reuse previous prompts.
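Putting several of the parameters above together, a request body might look like this (all values are illustrative):

```python
# Illustrative request body combining common parameters from the list above.
body = {
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize SSE in one sentence."},
    ],
    "temperature": 0.7,       # 0 to 2; higher is more random
    "max_tokens": 256,        # cap completion length to control cost
    "stop": ["\n\n"],         # up to 4 stop sequences
    "seed": 42,               # deterministic sampling where supported
    "metadata": {"feature": "docs-demo"},  # Helicone tracking/filtering
}
```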
Response Format
Non-Streaming Response
- id: Unique identifier for the completion.
- object: Object type, always chat.completion.
- created: Unix timestamp of when the completion was created.
- model: The model used for the completion.
- choices: Array of completion choices.
- usage: Token usage information.
Streaming Response
When stream: true, the response is returned as Server-Sent Events (SSE). Each event contains a JSON chunk object with an incremental delta, and the stream is terminated by a final data: [DONE] message.
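A minimal sketch of consuming such a stream, assuming each event arrives as a data: <json> line in standard SSE form (the sample chunks below are illustrative):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from SSE lines until the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Illustrative chunks, shaped like chat.completion.chunk objects:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # Hello
```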
Error Responses
Error information when a request fails.
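A hedged sketch of surfacing such errors, assuming the body carries an error object with type and message fields:

```python
import json

def raise_for_gateway_error(body_text):
    """Raise if the parsed response body contains an error object."""
    body = json.loads(body_text)
    if "error" in body:
        err = body["error"]
        raise RuntimeError(f"{err.get('type', 'error')}: {err.get('message', '')}")
    return body

# Illustrative failing body:
try:
    raise_for_gateway_error(
        '{"error": {"type": "invalid_request_error", "message": "model is required"}}'
    )
except RuntimeError as exc:
    print(exc)  # invalid_request_error: model is required
```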
Example Responses
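As an illustration, a non-streaming response has this general shape (the identifier, timestamp, and content below are made up):

```python
# Illustrative non-streaming chat.completion response body.
example_response = {
    "id": "chatcmpl-abc123",   # unique completion identifier
    "object": "chat.completion",
    "created": 1700000000,     # Unix timestamp
    "model": "gpt-4",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 7, "total_tokens": 16},
}

print(example_response["choices"][0]["message"]["content"])
```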
Advanced Features
Function Calling
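Function calling follows the OpenAI tools schema. A hedged sketch, where get_weather and its parameter schema are illustrative:

```python
# Illustrative tools array for function calling (get_weather is made up).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

request_body = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```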
Define tools that the model can use in the tools array of the request.

Vision (Image Input)
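Image input uses content parts instead of a plain string. A sketch, where the image URL is a placeholder:

```python
# Illustrative message mixing text and an image URL via content parts.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }
]
```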
Include images in your messages using content parts.

JSON Mode
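JSON mode is requested through response_format; json_object is the standard OpenAI-compatible value:

```python
# Illustrative request forcing valid JSON output.
json_body = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "List two primary colors as JSON."}],
    "response_format": {"type": "json_object"},  # forces valid JSON output
}
```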
Force the model to output valid JSON by setting response_format.

Rate Limits
Rate limits are applied at the organization level and vary based on your Helicone plan. Monitor your usage through the Helicone dashboard.

Best Practices
- Always include error handling for API calls
- Use streaming for better user experience with long responses
- Set appropriate max_tokens to control costs
- Use metadata to track and filter requests in Helicone
- Implement retry logic with exponential backoff for transient errors