Overview
Rate limits determine how many requests you can make to an API within a specific time window. Understanding and optimizing for these limits is crucial for building reliable applications with free LLM APIs.
Types of Rate Limits
- **Request-based limits**: the number of API calls per time period (e.g., 20 requests/minute)
- **Token-based limits**: the number of tokens processed per time period (e.g., 60,000 tokens/minute)
- **Quota-based limits**: the total allowance over longer periods (e.g., 1,000 requests/month)
Provider Rate Limits Comparison
High Volume Providers
Providers with 10,000+ Requests/Day
| Provider | Requests/Minute | Requests/Day | Tokens/Minute | Notes |
|---|---|---|---|---|
| Cerebras | 30 | 14,400 | 60,000-64,000 | Varies by model |
| Google AI Studio (Gemma) | 30 | 14,400 | 15,000 | Per model |
| Groq (Llama 3.1 8B) | - | 14,400 | 6,000 | - |
| Groq (Llama Guard) | - | 14,400 | 15,000 | - |
These providers are suitable for production applications with consistent traffic.
Medium Volume Providers
Providers with 1,000-10,000 Requests/Day
| Provider | Requests/Minute | Requests/Day | Tokens/Minute | Notes |
|---|---|---|---|---|
| Mistral Codestral | 30 | 2,000 | - | Code-focused |
| Groq (Llama 3.3 70B) | - | 1,000 | 12,000 | Larger model |
| Groq (Compound) | - | 250 | 70,000 | High token limit |
Good for development and small to medium applications.
Low Volume Providers
Providers with <1,000 Requests/Day
| Provider | Requests/Minute | Requests/Day | Monthly Quota | Notes |
|---|---|---|---|---|
| OpenRouter | 20 | 50 | - | 1,000/day with $10 topup |
| Cohere | 20 | - | 1,000/month | Shared across models |
| Google AI Studio (Gemini) | 5-15 | 20-500 | - | Varies by model |
| NVIDIA NIM | 40 | - | - | Per-minute only |
These are best for prototyping and low-traffic applications.
Token Limit Strategies
High Token Throughput Providers
| Provider | Tokens/Minute | Tokens/Day | Tokens/Month | Best For |
|---|---|---|---|---|
| Mistral La Plateforme | 500,000 | - | 1,000,000,000 | Long documents, content generation |
| Google AI Studio (Gemini) | 250,000 | - | - | Multimodal, analysis |
| Groq (Compound) | 70,000 | - | - | Fast inference |
| Cerebras | 60,000-64,000 | 1,000,000 | - | Consistent high volume |
Token limits are often more important than request limits for applications processing large documents or generating long-form content.
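To see which limit actually binds for a given workload, divide the token budget by the average tokens per request and compare it with the request cap. A small sketch (the function name is illustrative, not from any provider SDK):

```python
def effective_requests_per_minute(req_per_min, tokens_per_min, avg_tokens_per_request):
    # The sustainable rate is whichever cap is hit first: the request
    # limit, or the token limit divided by tokens consumed per request.
    token_bound = tokens_per_min / avg_tokens_per_request
    return min(req_per_min, token_bound)

# Cerebras: 30 req/min and 60,000 tokens/min. A 4,000-token document
# job is token-bound at 15 req/min; a 500-token chat job is request-bound.
print(effective_requests_per_minute(30, 60_000, 4_000))  # 15.0
print(effective_requests_per_minute(30, 60_000, 500))    # 30
```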
Handling Rate Limits
1. Exponential Backoff
Implement retry logic with exponential backoff:
```python
import time
import random

from openai import OpenAI, RateLimitError

def call_with_backoff(client, model, messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, 16s
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait_time:.2f}s...")
            time.sleep(wait_time)

# Usage
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-api-key",
)

response = call_with_backoff(
    client,
    "meta-llama/llama-3.3-70b-instruct:free",
    [{"role": "user", "content": "Hello!"}],
)
```
2. Rate Limit Headers
Monitor rate limit headers in API responses:
```python
import requests

response = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={"model": "llama-3.3-70b-versatile", "messages": messages},
)

# Check rate limit headers
remaining = response.headers.get("x-ratelimit-remaining-requests")
limit = response.headers.get("x-ratelimit-limit-requests")
reset = response.headers.get("x-ratelimit-reset-requests")

print(f"Remaining: {remaining}/{limit}, Resets: {reset}")
```
Not all providers return rate limit headers. Check the provider’s documentation for specifics.
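When headers are available, a 429 response often also carries a `Retry-After` delay worth honoring before the next attempt. A minimal helper (the function name is illustrative; `requests` normalizes header case, so the lowercase key matches `Retry-After`):

```python
def retry_wait_seconds(headers, default=5.0):
    # Prefer the server-suggested Retry-After delay (in seconds); fall
    # back to a fixed default when the header is absent or not numeric
    # (Retry-After may also be an HTTP date, which float() rejects).
    value = headers.get("retry-after")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default

# After a 429: time.sleep(retry_wait_seconds(response.headers))
```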
3. Request Queuing
Implement a queue to stay within rate limits:
```python
import asyncio
import time

class RateLimiter:
    def __init__(self, max_per_minute=20):
        self.max_per_minute = max_per_minute
        self.requests = []

    async def acquire(self):
        # Drop timestamps older than one minute
        now = time.time()
        self.requests = [t for t in self.requests if now - t < 60]

        # Wait until a slot in the rolling window frees up
        while len(self.requests) >= self.max_per_minute:
            await asyncio.sleep(1)
            self.requests = [t for t in self.requests if time.time() - t < 60]

        self.requests.append(time.time())

    async def call_api(self, client, model, messages):
        # client must be an async client (e.g., AsyncOpenAI)
        await self.acquire()
        return await client.chat.completions.create(
            model=model,
            messages=messages,
        )

# Usage
limiter = RateLimiter(max_per_minute=20)  # OpenRouter limit
response = await limiter.call_api(client, model, messages)
```
4. Token Counting
Count tokens before sending requests:
```python
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def estimate_total_tokens(prompt, max_tokens=100):
    # Input tokens plus the response budget (max_tokens)
    return count_tokens(prompt) + max_tokens

# Check before calling
prompt = "Explain quantum computing"
if estimate_total_tokens(prompt) > 4000:  # Stay under the token limit
    print("Prompt too long, truncating...")
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(prompt)
    prompt = encoding.decode(tokens[:2000])  # Truncate by tokens, not characters
```
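Note that tiktoken implements OpenAI's tokenizers, so counts for Llama or Mistral models will differ somewhat. When the provider's own tokenizer isn't handy, a rough characters-per-token heuristic is often enough for budgeting (the ~4 characters/token figure is a rule of thumb for English text, not exact):

```python
def approx_tokens(text, chars_per_token=4):
    # Rough estimate: English prose averages about 4 characters per token.
    # Leave a safety margin when budgeting against a hard token limit.
    return max(1, len(text) // chars_per_token)

print(approx_tokens("Explain quantum computing"))  # 6
```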
Multi-Provider Load Balancing
Distribute requests across multiple providers:
```python
import os
import random
from typing import Dict, List

from openai import OpenAI

class MultiProviderClient:
    def __init__(self, providers: List[Dict]):
        """
        providers = [
            {
                'name': 'openrouter',
                'client': openrouter_client,
                'model': 'meta-llama/llama-3.3-70b-instruct:free',
                'limit_per_day': 50
            },
            {
                'name': 'groq',
                'client': groq_client,
                'model': 'llama-3.3-70b-versatile',
                'limit_per_day': 1000
            }
        ]
        """
        self.providers = providers
        self.usage = {p['name']: 0 for p in providers}

    def get_available_provider(self):
        # Keep only providers still under their daily limit
        available = [
            p for p in self.providers
            if self.usage[p['name']] < p['limit_per_day']
        ]
        if not available:
            raise Exception("All providers exhausted")

        # Weighted random selection based on remaining quota
        weights = [
            p['limit_per_day'] - self.usage[p['name']]
            for p in available
        ]
        return random.choices(available, weights=weights)[0]

    def call(self, messages):
        provider = self.get_available_provider()
        self.usage[provider['name']] += 1
        return provider['client'].chat.completions.create(
            model=provider['model'],
            messages=messages,
        )

# Usage
providers = [
    {
        'name': 'openrouter',
        'client': OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_KEY")),
        'model': 'meta-llama/llama-3.3-70b-instruct:free',
        'limit_per_day': 50,
    },
    {
        'name': 'groq',
        'client': OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_KEY")),
        'model': 'llama-3.3-70b-versatile',
        'limit_per_day': 1000,
    },
    {
        'name': 'cerebras',
        'client': OpenAI(base_url="https://api.cerebras.ai/v1", api_key=os.getenv("CEREBRAS_KEY")),
        'model': 'llama3.3-70b',
        'limit_per_day': 14400,
    },
]

multi_client = MultiProviderClient(providers)
response = multi_client.call([{"role": "user", "content": "Hello!"}])
```
Pro Tip: Use Vercel AI Gateway for automatic provider routing, caching, and observability.
Monitoring and Alerts
Track your usage to avoid hitting limits:
```python
import logging
from datetime import datetime

class UsageTracker:
    def __init__(self):
        self.daily_counts = {}  # per-provider request counts
        self.last_reset = datetime.now().date()
        self.logger = logging.getLogger(__name__)

    def track_request(self, provider, tokens_used):
        # Reset daily counters when the date rolls over
        today = datetime.now().date()
        if today > self.last_reset:
            self.daily_counts = {}
            self.last_reset = today

        self.daily_counts[provider] = self.daily_counts.get(provider, 0) + 1
        count = self.daily_counts[provider]

        # Log usage
        self.logger.info(
            f"Provider: {provider}, "
            f"Daily requests: {count}, "
            f"Tokens: {tokens_used}"
        )

        # Alert if approaching limit (e.g., 80% of OpenRouter's 50/day)
        if provider == 'openrouter' and count >= 40:
            self.logger.warning(
                f"Approaching daily limit for {provider}: "
                f"{count}/50 requests used"
            )

tracker = UsageTracker()
tracker.track_request('openrouter', 150)
```
Best Practices
- **Cache responses**: store and reuse responses for identical requests to reduce API calls
- **Batch requests**: group multiple queries when possible to maximize token efficiency
- **Use streaming**: stream responses for faster perceived performance without extra requests
- **Implement fallbacks**: keep backup providers ready for when the primary provider hits its limits
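The caching idea can be sketched in a few lines: hash the exact request payload as the cache key, and pass the API call in as a plain callable so the helper works with any client (names below are illustrative):

```python
import hashlib
import json

_cache = {}

def cached_call(api_fn, model, messages):
    # Key the cache on the exact request payload so only identical
    # requests are reused; sort_keys makes the key order-insensitive.
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = api_fn(model=model, messages=messages)
    return _cache[key]

# Usage: cached_call(client.chat.completions.create, model, messages)
```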
Reduce token usage:

- Use concise system messages
- Avoid repeating context in conversations
- Use smaller models when appropriate
- Truncate long inputs intelligently
```python
# Bad: repeating full context every time
messages = [
    {"role": "system", "content": "You are a helpful assistant with extensive knowledge..."},
    {"role": "user", "content": "Question 1"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Question 2"}
]

# Good: minimal context, reference previous answers
messages = [
    {"role": "system", "content": "Helpful assistant."},
    {"role": "user", "content": "Q1"}
]
```
Spread requests throughout the day. For providers with daily limits (e.g., Cerebras at 14,400 req/day):

- Max sustainable rate: ~10 requests/minute (14,400 / 1,440 minutes)
- Consider off-peak processing for batch jobs
- Queue non-urgent requests
```python
from datetime import datetime

def should_process_now(priority='normal'):
    hour = datetime.now().hour

    # Process high-priority requests immediately
    if priority == 'high':
        return True

    # Defer normal-priority requests during peak hours (9am-5pm)
    if 9 <= hour <= 17:
        return False

    return True
```
Provider-Specific Tips
OpenRouter

Rate limit: 20 req/min, 50 req/day (1,000/day with $10 topup)

Optimization strategies:

- Use the $10 lifetime topup for a 20x increase
- Models share a common quota, so choose wisely
- Consider switching to Groq for high-volume workloads

OpenRouter's topup is one-time and never expires, making it excellent value.

Mistral La Plateforme

Rate limit: 1 req/sec, 500K tokens/min, 1B tokens/month per model

Optimization strategies:

- Token limits are per model, so spread load across multiple models
- 1B tokens/month is ~33M tokens/day
- Best for high-token workloads

The free tier requires opting into data training.

Cerebras

Rate limit: 30 req/min, 14,400 req/day, 1M tokens/day

Optimization strategies:

- One of the highest daily request limits
- Token limit is shared across models
- Great for production workloads
- A strong candidate for your primary provider

Groq

Rate limit: varies by model (250-14,400 req/day)

Optimization strategies:

- Use smaller models (Llama 3.1 8B) for 14,400 req/day
- Compound models have the highest token limits (70K/min)
- Ultra-fast inference, ideal for real-time apps
- Use model-specific limits strategically
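Putting these tips together, a simple fallback chain tries providers in preference order and moves on only when one is rate limited. The sketch below takes the provider calls as plain callables plus a predicate for spotting rate-limit errors (e.g. `lambda e: isinstance(e, RateLimitError)` for OpenAI-compatible clients); all names are illustrative:

```python
def call_with_fallback(provider_calls, messages, is_rate_limited):
    # provider_calls: callables taking `messages`, in preference order
    # (e.g. functools.partial around each client's create method).
    last_error = None
    for call in provider_calls:
        try:
            return call(messages)
        except Exception as exc:
            if not is_rate_limited(exc):
                raise  # real errors propagate immediately
            last_error = exc  # rate limited: try the next provider
    if last_error is None:
        raise ValueError("no providers given")
    raise last_error
```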
Next Steps
- Best Practices: learn more optimization techniques and security practices
- Provider Comparison: compare providers to find the best fit for your use case