Overview

Helicone’s rate limiting feature allows you to control API usage by enforcing request or cost-based quotas. Set limits per user, organization, custom property, or globally to prevent overuse and manage budgets.
Rate limiting is ideal for:
  • Preventing individual users from exceeding quotas
  • Managing costs across teams or departments
  • Enforcing fair usage in multi-tenant applications
  • Protecting against runaway API consumption

Key Benefits

Flexible Policies

Rate limit by requests, cost (in cents), time windows, and custom segments

Fine-Grained Control

Apply limits per user, property, or globally across your entire organization

Cost Management

Set budget limits in cents to prevent unexpected spending

Instant Feedback

When a limit is exceeded, callers immediately receive a 429 response with retry information

Quick Start

Apply rate limiting by adding the Helicone-RateLimit-Policy header:
import { OpenAI } from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    // 100 requests per minute
    "Helicone-RateLimit-Policy": "100;w=60",
  },
});

try {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello!" }],
  });
} catch (error) {
  if (error.status === 429) {
    console.log("Rate limit exceeded!");
    console.log("Retry after:", error.headers.get("Retry-After"));
  }
}

Policy Format

The Helicone-RateLimit-Policy header uses this format:
[quota];w=[window_seconds];u=[unit];s=[segment]

Required Parameters

  • quota: Maximum number of requests or cents allowed
  • w: Time window in seconds (minimum: 60, maximum: 31536000)

Optional Parameters

  • u: Unit of measurement (request or cents, default: request)
  • s: Segmentation type (user, custom property name, or omit for global)
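
These parameters compose mechanically, so policy strings can be built programmatically. A minimal sketch; `buildRateLimitPolicy` and its field names are our own, not part of any Helicone SDK:

```typescript
type RateLimitUnit = "request" | "cents";

interface RateLimitPolicy {
  quota: number;          // max requests (or cents) allowed per window
  windowSeconds: number;  // 60 .. 31,536,000
  unit?: RateLimitUnit;   // omitted → server default of "request"
  segment?: string;       // "user", a custom property name, or omitted for global
}

function buildRateLimitPolicy(p: RateLimitPolicy): string {
  if (p.windowSeconds < 60 || p.windowSeconds > 31536000) {
    throw new Error("window must be between 60 and 31,536,000 seconds");
  }
  let policy = `${p.quota};w=${p.windowSeconds}`;
  if (p.unit) policy += `;u=${p.unit}`;
  if (p.segment) policy += `;s=${p.segment}`;
  return policy;
}
```

For example, `buildRateLimitPolicy({ quota: 100, windowSeconds: 60, segment: "user" })` yields the per-user policy `"100;w=60;s=user"` used throughout this page.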

Policy Examples

Request-Based Limits

// 1000 requests per hour (global)
"Helicone-RateLimit-Policy": "1000;w=3600"

// 100 requests per minute per user
"Helicone-RateLimit-Policy": "100;w=60;s=user"

// 5000 requests per day per organization (a custom property;
// pair it with a Helicone-Property-Organization header)
"Helicone-RateLimit-Policy": "5000;w=86400;s=organization"

Cost-Based Limits

// $50 per day (5000 cents, global)
"Helicone-RateLimit-Policy": "5000;w=86400;u=cents"

// $10 per hour per user (1000 cents)
"Helicone-RateLimit-Policy": "1000;w=3600;u=cents;s=user"

// $0.005 per minute for testing (half a cent)
"Helicone-RateLimit-Policy": "0.5;w=60;u=cents"

Custom Property Segments

// 200 requests per hour per team
"Helicone-RateLimit-Policy": "200;w=3600;s=team"
"Helicone-Property-Team": "engineering"

// 1000 requests per day per customer
"Helicone-RateLimit-Policy": "1000;w=86400;s=customer-id"
"Helicone-Property-Customer-Id": "acme-corp"

Segmentation Types

With no s parameter, the policy is global: the rate limit applies across all requests in your organization:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-RateLimit-Policy": "10000;w=3600",
  },
});
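
For per-user segmentation, the same policy takes s=user and the Helicone-User-Id header identifies the caller. A sketch following the patterns above ("user-123" is a placeholder ID):

```typescript
import { OpenAI } from "openai";

// Per-user segmentation: each distinct Helicone-User-Id gets its own
// 100-requests-per-minute bucket.
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-RateLimit-Policy": "100;w=60;s=user",
    "Helicone-User-Id": "user-123",
  },
});
```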

Response Headers

When rate limiting is active, Helicone adds headers to every response:
  • X-RateLimit-Limit: maximum quota for the time window (e.g. 100)
  • X-RateLimit-Remaining: remaining quota in the current window (e.g. 87)
  • X-RateLimit-Reset: Unix timestamp when the window resets (e.g. 1678901234)
  • X-RateLimit-Policy: the active policy string (e.g. 100;w=60;s=user)
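
These headers can be read off any gateway response to track quota proactively, before a 429 ever occurs. A minimal parsing sketch; `parseRateLimitHeaders` and the `RateLimitStatus` shape are our own names, only the header names come from Helicone:

```typescript
interface RateLimitStatus {
  limit: number;        // maximum quota for the window
  remaining: number;    // quota left in the current window
  resetAt: Date;        // derived from the Unix timestamp in X-RateLimit-Reset
  policy: string | null;
}

function parseRateLimitHeaders(headers: Headers): RateLimitStatus | null {
  const limit = headers.get("X-RateLimit-Limit");
  if (limit === null) return null; // rate limiting not active on this request
  return {
    limit: Number(limit),
    remaining: Number(headers.get("X-RateLimit-Remaining") ?? "0"),
    resetAt: new Date(Number(headers.get("X-RateLimit-Reset")) * 1000),
    policy: headers.get("X-RateLimit-Policy"),
  };
}
```

Pass it the `headers` of a raw `fetch` response to the gateway (or of an SDK error) to decide whether to throttle client-side.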

Rate Limit Exceeded (429 Response)

When limits are exceeded:
{
  "status": 429,
  "headers": {
    "X-RateLimit-Limit": "100",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1678901294",
    "Retry-After": "60"
  },
  "body": {
    "error": {
      "message": "Rate limit exceeded",
      "type": "rate_limit_exceeded"
    }
  }
}

Common Time Windows

  • 1 minute (60 s): 100;w=60
  • 5 minutes (300 s): 500;w=300
  • 1 hour (3,600 s): 1000;w=3600
  • 1 day (86,400 s): 10000;w=86400
  • 1 week (604,800 s): 50000;w=604800
  • 1 month / 30 days (2,592,000 s): 200000;w=2592000

Advanced Use Cases

Implement different limits for different user tiers:
const getRateLimitPolicy = (userTier: string): string => {
  // Record<string, string> so indexing with an arbitrary tier type-checks
  const policies: Record<string, string> = {
    free: "100;w=3600;s=user",
    pro: "1000;w=3600;s=user",
    enterprise: "10000;w=3600;s=user",
  };
  return policies[userTier] ?? policies.free;
};

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-RateLimit-Policy": getRateLimitPolicy(user.tier),
    "Helicone-User-Id": user.id,
  },
});
Set cost limits per department:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    // $100 per day per department
    "Helicone-RateLimit-Policy": "10000;w=86400;u=cents;s=department",
    "Helicone-Property-Department": department,
  },
});
Handle rate limits with fallback logic:
async function makeRequest(prompt: string, retries = 3) {
  try {
    return await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    });
  } catch (error) {
    if (error.status === 429 && retries > 0) {
      const retryAfter = parseInt(error.headers.get("Retry-After") || "60", 10);
      await new Promise(resolve => setTimeout(resolve, retryAfter * 1000));
      return makeRequest(prompt, retries - 1);
    }
    throw error;
  }
}
Use decimal values for testing cost limits:
// $0.01 per minute for integration tests
"Helicone-RateLimit-Policy": "1;w=60;u=cents"

// $0.005 per minute (half a cent)
"Helicone-RateLimit-Policy": "0.5;w=60;u=cents"

Best Practices

  1. Start conservative: Begin with lower limits and increase based on usage patterns
  2. Monitor metrics: Track rate limit hits in your Helicone dashboard
  3. Implement retry logic: Handle 429 responses gracefully with exponential backoff
  4. Use appropriate segments: Choose user, property, or global based on your use case
  5. Set realistic windows: Align time windows with your application’s usage patterns
  6. Combine with caching: Use caching to reduce requests and stay under limits
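
Best practice 3 can be sketched as a small generic wrapper, assuming the same error shape (a status field and optional headers) as the retry example above; withBackoff, maxRetries, and baseDelayMs are our own names:

```typescript
// Retries 429s with exponentially growing delays, never undercutting the
// server's Retry-After hint (seconds). Any other error is rethrown as-is.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error?.status !== 429 || attempt >= maxRetries) throw error;
      const retryAfterMs = Number(error.headers?.get?.("Retry-After") ?? 0) * 1000;
      const delay = Math.max(retryAfterMs, baseDelayMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `withBackoff(() => client.chat.completions.create({ ... }))` wraps any call from the examples above.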

Limitations

  • Minimum time window: 60 seconds
  • Maximum time window: 31,536,000 seconds (1 year)
  • Policy validation happens on every request
  • Cost-based limits use Helicone’s cost calculations

Related Features

  • Caching: reduce requests with intelligent caching
  • Webhooks: get notified when limits are exceeded
