Overview
Rate limits determine how many requests you can make to an API within a specific time window. Understanding and optimizing for these limits is crucial for building reliable applications with free LLM APIs.
Types of Rate Limits
- **Request-based limits**: the number of API calls per time period (e.g., 20 requests/minute)
- **Token-based limits**: the number of tokens processed per time period (e.g., 60,000 tokens/minute)
- **Quota-based limits**: the total allowance over longer periods (e.g., 1,000 requests/month)
Provider Rate Limits Comparison
High Volume Providers
Providers with 10,000+ Requests/Day
| Provider | Requests/Minute | Requests/Day | Tokens/Minute | Notes |
|---|---|---|---|---|
| Cerebras | 30 | 14,400 | 60,000-64,000 | Varies by model |
| Google AI Studio (Gemma) | 30 | 14,400 | 15,000 | Per model |
| Groq (Llama 3.1 8B) | - | 14,400 | 6,000 | - |
| Groq (Llama Guard) | - | 14,400 | 15,000 | - |
These providers are suitable for production applications with consistent traffic.
Medium Volume Providers
Providers with 1,000-10,000 Requests/Day
| Provider | Requests/Minute | Requests/Day | Tokens/Minute | Notes |
|---|---|---|---|---|
| Mistral Codestral | 30 | 2,000 | - | Code-focused |
| Groq (Llama 3.3 70B) | - | 1,000 | 12,000 | Larger model |
| Groq (Compound) | - | 250 | 70,000 | High token limit |
Good for development and small to medium applications.
Low Volume Providers
Providers with <1,000 Requests/Day
| Provider | Requests/Minute | Requests/Day | Monthly Quota | Notes |
|---|---|---|---|---|
| OpenRouter | 20 | 50 | - | 1,000/day with $10 topup |
| Cohere | 20 | - | 1,000/month | Shared across models |
| Google AI Studio (Gemini) | 5-15 | 20-500 | - | Varies by model |
| NVIDIA NIM | 40 | - | - | Per-minute only |
These are best for prototyping and low-traffic applications.
Token Limit Strategies
High Token Throughput Providers
| Provider | Tokens/Minute | Tokens/Day | Tokens/Month | Best For |
|---|---|---|---|---|
| Mistral La Plateforme | 500,000 | - | 1,000,000,000 | Long documents, content generation |
| Google AI Studio (Gemini) | 250,000 | - | - | Multimodal, analysis |
| Groq (Compound) | 70,000 | - | - | Fast inference |
| Cerebras | 60,000-64,000 | 1,000,000 | - | Consistent high volume |
Token limits are often more important than request limits for applications processing large documents or generating long-form content.
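To see which limit actually binds for a given workload, divide the token budget by the average tokens per request and compare it with the request cap. A small sketch (the function name is illustrative, not from any provider SDK):

```python
def effective_requests_per_minute(req_per_min, tokens_per_min, avg_tokens_per_request):
    # The sustainable rate is whichever cap is hit first: the request
    # limit, or the token limit divided by tokens consumed per request.
    token_bound = tokens_per_min / avg_tokens_per_request
    return min(req_per_min, token_bound)

# Cerebras: 30 req/min and 60,000 tokens/min. A 4,000-token document
# job is token-bound at 15 req/min; a 500-token chat job is request-bound.
print(effective_requests_per_minute(30, 60_000, 4_000))  # 15.0
print(effective_requests_per_minute(30, 60_000, 500))    # 30
```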
Handling Rate Limits
1. Exponential Backoff
Implement retry logic with exponential backoff:
```python
import time
import random

from openai import OpenAI, RateLimitError

def call_with_backoff(client, model, messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, 16s
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait_time:.2f}s...")
            time.sleep(wait_time)

# Usage
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-api-key",
)

response = call_with_backoff(
    client,
    "meta-llama/llama-3.3-70b-instruct:free",
    [{"role": "user", "content": "Hello!"}],
)
```
2. Rate Limit Headers
Monitor rate limit headers in API responses:
```python
import requests

response = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={"model": "llama-3.3-70b-versatile", "messages": messages},
)

# Check rate limit headers
remaining = response.headers.get("x-ratelimit-remaining-requests")
limit = response.headers.get("x-ratelimit-limit-requests")
reset = response.headers.get("x-ratelimit-reset-requests")

print(f"Remaining: {remaining}/{limit}, Resets: {reset}")
```
Not all providers return rate limit headers. Check the provider’s documentation for specifics.
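When headers are available, a 429 response often also carries a `Retry-After` delay worth honoring before the next attempt. A minimal helper (the function name is illustrative; `requests` normalizes header case, so the lowercase key matches `Retry-After`):

```python
def retry_wait_seconds(headers, default=5.0):
    # Prefer the server-suggested Retry-After delay (in seconds); fall
    # back to a fixed default when the header is absent or not numeric
    # (Retry-After may also be an HTTP date, which float() rejects).
    value = headers.get("retry-after")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default

# After a 429: time.sleep(retry_wait_seconds(response.headers))
```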
3. Request Queuing
Implement a queue to stay within rate limits:
```python
import asyncio
import time

class RateLimiter:
    def __init__(self, max_per_minute=20):
        self.max_per_minute = max_per_minute
        self.requests = []

    async def acquire(self):
        # Drop timestamps older than one minute
        now = time.time()
        self.requests = [t for t in self.requests if now - t < 60]

        # Wait until a slot in the rolling window frees up
        while len(self.requests) >= self.max_per_minute:
            await asyncio.sleep(1)
            self.requests = [t for t in self.requests if time.time() - t < 60]

        self.requests.append(time.time())

    async def call_api(self, client, model, messages):
        # client must be an async client (e.g., AsyncOpenAI)
        await self.acquire()
        return await client.chat.completions.create(
            model=model,
            messages=messages,
        )

# Usage
limiter = RateLimiter(max_per_minute=20)  # OpenRouter limit
response = await limiter.call_api(client, model, messages)
```
4. Token Counting
Count tokens before sending requests:
```python
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_total_tokens(prompt, max_tokens=100):
    # Input tokens plus the response budget (max_tokens)
    return count_tokens(prompt) + max_tokens

# Check before calling
prompt = "Explain quantum computing"
if estimate_total_tokens(prompt) > 4000:  # Stay under the token limit
    print("Prompt too long, truncating...")
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(prompt)
    prompt = encoding.decode(tokens[:2000])  # Truncate by tokens, not characters
```
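Note that tiktoken implements OpenAI's tokenizers, so counts for Llama or Mistral models will differ somewhat. When the provider's own tokenizer isn't handy, a rough characters-per-token heuristic is often enough for budgeting (the ~4 characters/token figure is a rule of thumb for English text, not exact):

```python
def approx_tokens(text, chars_per_token=4):
    # Rough estimate: English prose averages about 4 characters per token.
    # Leave a safety margin when budgeting against a hard token limit.
    return max(1, len(text) // chars_per_token)

print(approx_tokens("Explain quantum computing"))  # 6
```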
Multi-Provider Load Balancing
Distribute requests across multiple providers:
```python
import os
import random
from typing import Dict, List

from openai import OpenAI

class MultiProviderClient:
    def __init__(self, providers: List[Dict]):
        """
        providers = [
            {
                'name': 'openrouter',
                'client': openrouter_client,
                'model': 'meta-llama/llama-3.3-70b-instruct:free',
                'limit_per_day': 50
            },
            {
                'name': 'groq',
                'client': groq_client,
                'model': 'llama-3.3-70b-versatile',
                'limit_per_day': 1000
            }
        ]
        """
        self.providers = providers
        self.usage = {p['name']: 0 for p in providers}

    def get_available_provider(self):
        # Keep only providers still under their daily limit
        available = [
            p for p in self.providers
            if self.usage[p['name']] < p['limit_per_day']
        ]
        if not available:
            raise Exception("All providers exhausted")

        # Weighted random selection based on remaining quota
        weights = [
            p['limit_per_day'] - self.usage[p['name']]
            for p in available
        ]
        return random.choices(available, weights=weights)[0]

    def call(self, messages):
        provider = self.get_available_provider()
        self.usage[provider['name']] += 1
        return provider['client'].chat.completions.create(
            model=provider['model'],
            messages=messages,
        )

# Usage
providers = [
    {
        'name': 'openrouter',
        'client': OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_KEY")),
        'model': 'meta-llama/llama-3.3-70b-instruct:free',
        'limit_per_day': 50,
    },
    {
        'name': 'groq',
        'client': OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_KEY")),
        'model': 'llama-3.3-70b-versatile',
        'limit_per_day': 1000,
    },
    {
        'name': 'cerebras',
        'client': OpenAI(base_url="https://api.cerebras.ai/v1", api_key=os.getenv("CEREBRAS_KEY")),
        'model': 'llama3.3-70b',
        'limit_per_day': 14400,
    },
]

multi_client = MultiProviderClient(providers)
response = multi_client.call([{"role": "user", "content": "Hello!"}])
```
Pro Tip: Use Vercel AI Gateway for automatic provider routing, caching, and observability.
Monitoring and Alerts
Track your usage to avoid hitting limits:
```python
import logging
from datetime import datetime

class UsageTracker:
    def __init__(self):
        self.daily_counts = {}  # per-provider request counts
        self.last_reset = datetime.now().date()
        self.logger = logging.getLogger(__name__)

    def track_request(self, provider, tokens_used):
        # Reset daily counters when the date rolls over
        today = datetime.now().date()
        if today > self.last_reset:
            self.daily_counts = {}
            self.last_reset = today

        self.daily_counts[provider] = self.daily_counts.get(provider, 0) + 1
        count = self.daily_counts[provider]

        # Log usage
        self.logger.info(
            f"Provider: {provider}, "
            f"Daily requests: {count}, "
            f"Tokens: {tokens_used}"
        )

        # Alert if approaching limit (e.g., 80% of OpenRouter's 50/day)
        if provider == 'openrouter' and count >= 40:
            self.logger.warning(
                f"Approaching daily limit for {provider}: "
                f"{count}/50 requests used"
            )

tracker = UsageTracker()
tracker.track_request('openrouter', 150)
```
Best Practices
- **Cache responses**: store and reuse responses for identical requests to reduce API calls
- **Batch requests**: group multiple queries when possible to maximize token efficiency
- **Use streaming**: stream responses for faster perceived performance without extra requests
- **Implement fallbacks**: keep backup providers ready for when the primary provider hits its limits
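The caching idea can be sketched in a few lines: hash the exact request payload as the cache key, and pass the API call in as a plain callable so the helper works with any client (names below are illustrative):

```python
import hashlib
import json

_cache = {}

def cached_call(api_fn, model, messages):
    # Key the cache on the exact request payload so only identical
    # requests are reused; sort_keys makes the key order-insensitive.
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = api_fn(model=model, messages=messages)
    return _cache[key]

# Usage: cached_call(client.chat.completions.create, model, messages)
```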
Reduce token usage:

- Use concise system messages
- Avoid repeating context in conversations
- Use smaller models when appropriate
- Truncate long inputs intelligently
```python
# Bad: repeating full context every time
messages = [
    {"role": "system", "content": "You are a helpful assistant with extensive knowledge..."},
    {"role": "user", "content": "Question 1"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Question 2"}
]

# Good: minimal context, reference previous answers
messages = [
    {"role": "system", "content": "Helpful assistant."},
    {"role": "user", "content": "Q1"}
]
```
Spread requests throughout the day. For providers with daily limits (e.g., Cerebras at 14,400 req/day):

- Max sustainable rate: ~10 requests/minute (14,400 / 1,440 minutes)
- Consider off-peak processing for batch jobs
- Queue non-urgent requests
```python
from datetime import datetime

def should_process_now(priority='normal'):
    hour = datetime.now().hour

    # Process high-priority requests immediately
    if priority == 'high':
        return True

    # Defer normal-priority requests during peak hours (9am-5pm)
    if 9 <= hour <= 17:
        return False

    return True
```
Provider-Specific Tips
OpenRouter

Rate limit: 20 req/min, 50 req/day (1,000/day with $10 topup)

Optimization strategies:

- Use the $10 lifetime topup for a 20x increase
- Models share a common quota, so choose wisely
- Consider switching to Groq for high-volume workloads

OpenRouter's topup is one-time and never expires, making it excellent value.

Mistral La Plateforme

Rate limit: 1 req/sec, 500K tokens/min, 1B tokens/month per model

Optimization strategies:

- Token limits are per model, so spread load across multiple models
- 1B tokens/month is ~33M tokens/day
- Best for high-token workloads

The free tier requires opting into data training.

Cerebras

Rate limit: 30 req/min, 14,400 req/day, 1M tokens/day

Optimization strategies:

- One of the highest daily request limits
- Token limit is shared across models
- Great for production workloads
- A strong candidate for your primary provider

Groq

Rate limit: varies by model (250-14,400 req/day)

Optimization strategies:

- Use smaller models (Llama 3.1 8B) for 14,400 req/day
- Compound models have the highest token limits (70K/min)
- Ultra-fast inference, ideal for real-time apps
- Use model-specific limits strategically
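Putting these tips together, a simple fallback chain tries providers in preference order and moves on only when one is rate limited. The sketch below takes the provider calls as plain callables plus a predicate for spotting rate-limit errors (e.g. `lambda e: isinstance(e, RateLimitError)` for OpenAI-compatible clients); all names are illustrative:

```python
def call_with_fallback(provider_calls, messages, is_rate_limited):
    # provider_calls: callables taking `messages`, in preference order
    # (e.g. functools.partial around each client's create method).
    last_error = None
    for call in provider_calls:
        try:
            return call(messages)
        except Exception as exc:
            if not is_rate_limited(exc):
                raise  # real errors propagate immediately
            last_error = exc  # rate limited: try the next provider
    if last_error is None:
        raise ValueError("no providers given")
    raise last_error
```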
Next Steps
- Best Practices: learn more optimization techniques and security practices
- Provider Comparison: compare providers to find the best fit for your use case