
Overview

Rate limits determine how many requests you can make to an API within a specific time window. Understanding and optimizing for these limits is crucial for building reliable applications with free LLM APIs.

Types of Rate Limits

Request-Based

Limits the number of API calls per time period (e.g., 20 requests/minute)

Token-Based

Limits the number of tokens processed per time period (e.g., 60,000 tokens/minute)

Quota-Based

Total allowance over longer periods (e.g., 1,000 requests/month)

Provider Rate Limits Comparison

High Volume Providers

| Provider | Requests/Minute | Requests/Day | Tokens/Minute | Notes |
|---|---|---|---|---|
| Cerebras | 30 | 14,400 | 60,000-64,000 | Varies by model |
| Google AI Studio (Gemma) | 30 | 14,400 | 15,000 | Per model |
| Groq (Llama 3.1 8B) | - | 14,400 | 6,000 | - |
| Groq (Llama Guard) | - | 14,400 | 15,000 | - |
These providers are suitable for production applications with consistent traffic.

Medium Volume Providers

| Provider | Requests/Minute | Requests/Day | Tokens/Minute | Notes |
|---|---|---|---|---|
| Mistral Codestral | 30 | 2,000 | - | Code-focused |
| Groq (Llama 3.3 70B) | - | 1,000 | 12,000 | Larger model |
| Groq (Compound) | - | 250 | 70,000 | High token limit |
Good for development and small to medium applications.

Low Volume Providers

| Provider | Requests/Minute | Requests/Day | Monthly Quota | Notes |
|---|---|---|---|---|
| OpenRouter | 20 | 50 | - | 1,000/day with $10 topup |
| Cohere | 20 | - | 1,000/month | Shared across models |
| Google AI Studio (Gemini) | 5-15 | 20-500 | - | Varies by model |
| NVIDIA NIM | 40 | - | - | Per-minute only |
These are best for prototyping and low-traffic applications.

Token Limit Strategies

High Token Throughput Providers

| Provider | Tokens/Minute | Tokens/Day | Tokens/Month | Best For |
|---|---|---|---|---|
| Mistral La Plateforme | 500,000 | - | 1,000,000,000 | Long documents, content generation |
| Google AI Studio (Gemini) | 250,000 | - | - | Multimodal, analysis |
| Groq (Compound) | 70,000 | - | - | Fast inference |
| Cerebras | 60,000-64,000 | 1,000,000 | - | Consistent high volume |
Token limits are often more important than request limits for applications processing large documents or generating long-form content.

Handling Rate Limits

1. Exponential Backoff

Implement retry logic with exponential backoff:
import time
import random
from openai import OpenAI, RateLimitError

def call_with_backoff(client, model, messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages
            )
            return response
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait_time:.2f}s...")
            time.sleep(wait_time)

# Usage
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-api-key"
)

response = call_with_backoff(
    client,
    "meta-llama/llama-3.3-70b-instruct:free",
    [{"role": "user", "content": "Hello!"}]
)

2. Rate Limit Headers

Monitor rate limit headers in API responses:
import requests

response = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={"model": "llama-3.3-70b-versatile", "messages": messages}
)

# Check rate limit headers
remaining = response.headers.get('x-ratelimit-remaining-requests')
limit = response.headers.get('x-ratelimit-limit-requests')
reset = response.headers.get('x-ratelimit-reset-requests')

print(f"Remaining: {remaining}/{limit}, Resets: {reset}")
Not all providers return rate limit headers. Check the provider’s documentation for specifics.

3. Request Queuing

Implement a queue to stay within rate limits:
import asyncio
import time

class RateLimiter:
    def __init__(self, max_per_minute=20):
        self.max_per_minute = max_per_minute
        self.requests = []  # timestamps of requests in the current window
    
    async def acquire(self):
        # Remove requests older than 1 minute
        now = time.time()
        self.requests = [t for t in self.requests if now - t < 60]
        
        # Wait if at limit
        while len(self.requests) >= self.max_per_minute:
            await asyncio.sleep(1)
            self.requests = [t for t in self.requests if time.time() - t < 60]
        
        self.requests.append(time.time())
    
    async def call_api(self, client, model, messages):
        # Note: client must be an AsyncOpenAI instance for await to work
        await self.acquire()
        return await client.chat.completions.create(
            model=model,
            messages=messages
        )

# Usage
limiter = RateLimiter(max_per_minute=20)  # OpenRouter limit
response = await limiter.call_api(client, model, messages)

4. Token Counting

Count tokens before sending requests:
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_total_tokens(prompt, max_tokens=100):
    input_tokens = count_tokens(prompt)
    return input_tokens + max_tokens

# Check before calling
prompt = "Explain quantum computing"
if estimate_total_tokens(prompt) > 4000:  # Stay under the provider's token limit
    print("Prompt too long, truncating...")
    prompt = prompt[:2000]  # Rough character-based truncation

Multi-Provider Load Balancing

Distribute requests across multiple providers:
from typing import List, Dict
import random

class MultiProviderClient:
    def __init__(self, providers: List[Dict]):
        """
        providers = [
            {
                'name': 'openrouter',
                'client': openrouter_client,
                'model': 'meta-llama/llama-3.3-70b-instruct:free',
                'limit_per_day': 50
            },
            {
                'name': 'groq',
                'client': groq_client,
                'model': 'llama-3.3-70b-versatile',
                'limit_per_day': 1000
            }
        ]
        """
        self.providers = providers
        self.usage = {p['name']: 0 for p in providers}
    
    def get_available_provider(self):
        # Filter providers under their daily limit
        available = [
            p for p in self.providers
            if self.usage[p['name']] < p['limit_per_day']
        ]
        
        if not available:
            raise Exception("All providers exhausted")
        
        # Weighted random selection based on remaining quota
        weights = [
            p['limit_per_day'] - self.usage[p['name']]
            for p in available
        ]
        return random.choices(available, weights=weights)[0]
    
    def call(self, messages):
        provider = self.get_available_provider()
        self.usage[provider['name']] += 1
        
        return provider['client'].chat.completions.create(
            model=provider['model'],
            messages=messages
        )

# Usage
import os
from openai import OpenAI

providers = [
    {
        'name': 'openrouter',
        'client': OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_KEY")),
        'model': 'meta-llama/llama-3.3-70b-instruct:free',
        'limit_per_day': 50
    },
    {
        'name': 'groq',
        'client': OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_KEY")),
        'model': 'llama-3.3-70b-versatile',
        'limit_per_day': 1000
    },
    {
        'name': 'cerebras',
        'client': OpenAI(base_url="https://api.cerebras.ai/v1", api_key=os.getenv("CEREBRAS_KEY")),
        'model': 'llama3.3-70b',
        'limit_per_day': 14400
    }
]

multi_client = MultiProviderClient(providers)
response = multi_client.call([{"role": "user", "content": "Hello!"}])
Pro Tip: Use Vercel AI Gateway for automatic provider routing, caching, and observability.

Monitoring and Alerts

Track your usage to avoid hitting limits:
import logging
from datetime import datetime

class UsageTracker:
    def __init__(self):
        self.daily_count = 0
        self.last_reset = datetime.now().date()
        self.logger = logging.getLogger(__name__)
    
    def track_request(self, provider, tokens_used):
        # Reset daily counter
        today = datetime.now().date()
        if today > self.last_reset:
            self.daily_count = 0
            self.last_reset = today
        
        self.daily_count += 1
        
        # Log usage
        self.logger.info(
            f"Provider: {provider}, "
            f"Daily requests: {self.daily_count}, "
            f"Tokens: {tokens_used}"
        )
        
        # Alert if approaching limit (e.g., 80% of OpenRouter's 50/day)
        if provider == 'openrouter' and self.daily_count >= 40:
            self.logger.warning(
                f"Approaching daily limit for {provider}: "
                f"{self.daily_count}/50 requests used"
            )

tracker = UsageTracker()
tracker.track_request('openrouter', 150)

Best Practices

Cache Responses

Store and reuse responses for identical requests to reduce API calls
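A minimal in-memory sketch of this idea: key the cache on a hash of the model and messages so identical requests never hit the API twice. The `cached_call` helper and its dict-based store are illustrative; production code would use a TTL and a shared store such as Redis.

```python
import hashlib
import json

# Illustrative in-memory cache; swap for Redis or similar in production
_cache = {}

def cached_call(client, model, messages):
    # Deterministic key from the full request payload
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:
        return _cache[key]  # Reuse stored response; no API call, no quota spent
    response = client.chat.completions.create(model=model, messages=messages)
    _cache[key] = response
    return response
```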

Batch Requests

Group multiple queries when possible to maximize token efficiency
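One hedged way to batch: fold several independent questions into a single numbered prompt, spending one request instead of N. The `batch_prompt` helper is a sketch, not a provider feature; you still need to parse the numbered answers out of the single response.

```python
def batch_prompt(questions):
    # Combine independent questions into one numbered prompt so a
    # single request answers all of them (one request instead of N)
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return (
        "Answer each of the following questions. "
        "Number your answers to match:\n" + numbered
    )
```

Send the result as one user message; this trades a little parsing effort for an N-fold reduction in request count.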

Use Streaming

Stream responses to provide faster perceived performance without extra requests
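With OpenAI-compatible clients, passing `stream=True` yields chunks as tokens are generated; the streamed call still counts as one request. A sketch, with the `print_stream` helper as an assumption:

```python
def print_stream(client, model, messages):
    # stream=True yields chunks as tokens arrive; a streamed response
    # still counts as a single request against the rate limit
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True
    )
    pieces = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            pieces.append(delta)
    print()
    return "".join(pieces)
```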

Implement Fallbacks

Have backup providers ready when primary provider hits limits
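A minimal fallback sketch, assuming each provider is a dict of `client` and `model` as in the load-balancing example above: try providers in order and move on when one fails (e.g. raises `openai.RateLimitError`).

```python
def call_with_fallback(providers, messages):
    # Try each provider in order; fall through to the next on failure.
    # `providers` is a list of {'client': ..., 'model': ...} dicts.
    last_err = None
    for p in providers:
        try:
            return p["client"].chat.completions.create(
                model=p["model"], messages=messages
            )
        except Exception as err:  # e.g. openai.RateLimitError
            last_err = err
    raise last_err
```

In practice you would catch only rate-limit and transient errors rather than bare `Exception`.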

Reduce Token Usage

Trim prompts and context to stay within token limits:
  • Use concise system messages
  • Avoid repeating context in conversations
  • Use smaller models when appropriate
  • Truncate long inputs intelligently
# Bad: Repeating full context every time
messages = [
    {"role": "system", "content": "You are a helpful assistant with extensive knowledge..."},
    {"role": "user", "content": "Question 1"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Question 2"}
]

# Good: Concise system message, trimmed context
messages = [
    {"role": "system", "content": "Helpful assistant."},
    {"role": "user", "content": "Q1"}
]
Spread Requests Throughout the Day

For providers with daily limits (e.g., Cerebras at 14,400 req/day):
  • Max sustainable rate: ~10 requests/minute
  • Consider off-peak processing for batch jobs
  • Queue non-urgent requests
from datetime import datetime

def should_process_now(priority='normal'):
    hour = datetime.now().hour
    
    # Process high priority immediately
    if priority == 'high':
        return True
    
    # Queue normal requests during peak hours (9am-5pm)
    if 9 <= hour < 17:
        return False
    
    return True

Provider-Specific Tips

OpenRouter

Rate Limit: 20 req/min, 50 req/day (1,000/day with $10 topup)

Optimization Strategies:
  • Use the $10 lifetime topup for 20x increase
  • Models share a common quota - choose wisely
  • Consider switching to Groq for high-volume workloads
OpenRouter’s topup is one-time and never expires, making it excellent value.

Next Steps

Best Practices

Learn more optimization techniques and security practices

Provider Comparison

Compare providers to find the best fit for your use case
