1000 requests per day or 60 requests per minute. By implementing rate limits, you can prevent abuse while protecting your resources from being overwhelmed by excessive traffic.
Why Rate Limit
- Prevent abuse of the API: Limit the total requests a user can make in a given period to control cost.
- Protect resources from excessive traffic: Maintain availability for all users.
- Control operational cost: Limit the total number of requests sent and total cost.
- Comply with third-party API usage policies: Each model provider has their own rate limit for your key. Helicone’s rate limit is bounded by your provider’s policy.
Quick Start
Set up rate limiting by adding theHelicone-RateLimit-Policy header to your requests:
Configuration Reference
TheHelicone-RateLimit-Policy header uses this format:
Parameters
Maximum number of requests (or cost in cents) allowed within the time window.Example:
1000 for 1000 requestsTime window in seconds. Minimum is 60 seconds.Example:
3600 for 1 hour, 86400 for 1 dayUnit type:
request (default) or cents for cost-based limiting.Example: u=cents to limit by spending instead of request countSegment type:
user for per-user limits, or custom property name for per-property limits. Omit for global limits.Example: s=user or s=organizationRate Limiting Scopes
Helicone supports three types of rate limiting based on who or what you want to limit:Global Rate Limiting
Applies the same limit across all requests using your API key. Use case: “Limit my entire application to 10,000 requests per hour”Per-User Rate Limiting
Applies separate limits for each user ID. Use case: “Each user can make 1,000 requests per day”Per-Property Rate Limiting
Applies separate limits for each custom property value. Use case: “Each organization can make 5,000 requests per hour”Common Use Cases
Global Application Limits
Limit your entire application’s usage:Per-User Limits
Limit each user individually:Per-user rate limiting requires the
Helicone-User-Id header. See User Metrics for more details.Cost-Based Limits
Limit by spending instead of request count:Custom Property Limits
Limit by custom properties like organization or tier:Extracting Rate Limit Response Headers
Extracting the headers allows you to test your rate limit policy in a local environment before deploying to production. If your rate limit policy is active, the following headers will be returned:Helicone-RateLimit-Limit: The quota for the number of requests allowed in the time window.Helicone-RateLimit-Policy: The active rate limit policy.Helicone-RateLimit-Remaining: The remaining quota in the current window.
If a request is rate-limited, a 429 rate limit error will be returned.
Rate Limit Dashboard
Monitor your rate limit usage in the Helicone dashboard:- Rate limit occurrences - Number of requests that hit rate limits
- Trends over time - Visualize rate limit patterns
- By user/property - See which users or segments are being limited
- Tier information - View limits for different subscription tiers
Rate Limit Tiers
Helicone enforces rate limits on logging to prevent overwhelming our infrastructure:| Tier | Rate Limit |
|---|---|
| Free | 834 logs / 5 seconds |
| Pro | 8,334 logs / 5 seconds |
| Enterprise | Custom |
Important: These limits apply to logging only. Your requests are never dropped - Helicone will always forward your request to the provider even if logging is rate-limited.
Latency Considerations
Using rate limits adds a small amount of latency to your requests. This feature is deployed with Cloudflare’s key-value data store, which is a low-latency service that stores data in a small number of centralized data centers and caches that data in Cloudflare’s data centers after access. The latency add-on is minimal compared to multi-second LLM requests.Best Practices
Start Conservative
Begin with higher limits and tighten based on actual usage patterns
Use Per-User Limits
Prevent individual users from consuming all resources
Cost-Based for Expensive Models
Use cost limits for expensive models like GPT-4 to control spending
Monitor and Adjust
Regularly review rate limit hits and adjust thresholds accordingly
Coming Soon
- Token-based rate limiting - Limit by number of tokens instead of just request count or cost
- Multiple rate limit policies - Apply multiple rate limiting criteria to a single request (e.g., limit by both request count AND cost simultaneously)
Questions?
If you have any questions or need help, please reach out to us:- Join our Discord community
- Email us at [email protected]
- Check out our GitHub repository