
Overview

LLM Gateway provides rate limiting at multiple levels to control costs, prevent abuse, and ensure fair usage across your organization.

API Key Usage Limits

Set hard limits on total token usage per API key:
apps/gateway/src/chat/chat.ts
if (apiKey.usageLimit && Number(apiKey.usage) >= Number(apiKey.usageLimit)) {
  throw new HTTPException(401, {
    message: "Unauthorized: LLMGateway API key reached its usage limit."
  });
}

Setting Usage Limits

  1. Go to API Keys page
  2. Click on an API key
  3. Set “Usage Limit” (in tokens)
  4. Save changes
Usage limits are measured in total tokens (prompt + completion tokens combined).

Free Model Rate Limits

Free models have special rate limiting:
apps/gateway/src/chat/tools/validate-free-model-usage.ts
export async function validateFreeModelUsage(
  context: Context,
  organizationId: string,
  modelId: string,
  modelInfo: ModelDefinition,
  options?: { skipEmailVerification?: boolean }
) {
  // Check email verification
  if (!options?.skipEmailVerification) {
    const org = await db.query.organization.findFirst({
      where: { id: { eq: organizationId } },
      with: { owner: true }
    });
    
    if (!org?.owner?.emailVerified) {
      throw new HTTPException(403, {
        message: "Email verification required to use free models"
      });
    }
  }
  
  // Apply rate limiting
  const usage = await getFreeModelUsage(organizationId, modelId);
  if (usage.requestsToday >= FREE_MODEL_DAILY_LIMIT) {
    throw new HTTPException(429, {
      message: "Free model daily rate limit exceeded"
    });
  }
}

Free Model Limits

Plan                  | Daily Requests | Monthly Requests
Free (verified email) | 100            | 3,000
Pro                   | Unlimited      | Unlimited
Enterprise            | Unlimited      | Unlimited
Free models require email verification and are subject to stricter rate limits.
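
The daily cap can be pictured as a per-organization, per-model counter that resets each day. A sketch of that bookkeeping (the in-memory store is an assumption for illustration; the Gateway tracks this server-side):

```python
from collections import defaultdict
from datetime import date

FREE_MODEL_DAILY_LIMIT = 100  # Free plan: 100 requests/day

# (organization_id, model_id, day) -> request count; illustrative in-memory store
_counts: dict[tuple[str, str, date], int] = defaultdict(int)

def record_free_model_request(org_id: str, model_id: str, today: date) -> bool:
    """Return True if the request is allowed, False once the daily limit is hit."""
    key = (org_id, model_id, today)
    if _counts[key] >= FREE_MODEL_DAILY_LIMIT:
        return False  # gateway would respond with HTTP 429 here
    _counts[key] += 1
    return True
```

Keying the counter by date means the limit resets automatically at the day boundary without any cleanup job.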

Provider Rate Limits

Each provider has its own rate limits:

OpenAI

  • Tier 1: 500 requests/minute, 60,000 tokens/minute
  • Tier 2: 5,000 requests/minute, 800,000 tokens/minute
  • Tier 3+: Higher limits based on usage

Anthropic

  • Free tier: 5 requests/minute
  • Build tier: 1,000 requests/minute
  • Scale tier: Custom limits

Google AI Studio

  • Free quota: 15 requests/minute, 1M tokens/day
  • Paid tier: 1,000 requests/minute, 4M tokens/minute

Google Vertex AI

  • Based on Google Cloud quotas
  • Configurable per project
When a provider rate limit is hit, LLM Gateway automatically retries with exponential backoff.

Automatic Retry on 429

The gateway handles rate limit errors automatically:
apps/gateway/src/chat/tools/retry-with-fallback.ts
export function shouldRetryRequest(
  statusCode: number,
  errorType: string,
  attempt: number,
): boolean {
  if (attempt >= MAX_RETRIES) {
    return false;
  }
  
  // Retry on rate limits (429)
  if (statusCode === 429) {
    return true;
  }
  
  return false;
}

Retry Strategy

  1. First retry: Wait 1 second
  2. Second retry: Wait 2 seconds
  3. Third retry: Wait 4 seconds
  4. If all fail: Try alternative provider (if available)
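
The waits above double on each attempt. A one-line sketch of that schedule (the function name is illustrative):

```python
def retry_delay(attempt: int) -> int:
    """Delay in seconds before retry `attempt` (1-based): 1s, 2s, 4s."""
    return 2 ** (attempt - 1)

print([retry_delay(a) for a in (1, 2, 3)])  # [1, 2, 4]
```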

Credit-Based Rate Limiting

In credits mode, rate limiting is based on available credits:
apps/gateway/src/chat/chat.ts
if (project.mode === "credits") {
  const regularCredits = parseFloat(organization.credits ?? "0");
  const devPlanCreditsRemaining = organization.devPlan !== "none"
    ? parseFloat(organization.devPlanCreditsLimit ?? "0") -
      parseFloat(organization.devPlanCreditsUsed ?? "0")
    : 0;
  const totalAvailableCredits = regularCredits + devPlanCreditsRemaining;
  
  if (totalAvailableCredits <= 0) {
    throw new HTTPException(402, {
      message: "Organization has insufficient credits"
    });
  }
}

How It Works

  • Credits deducted after each request
  • Requests blocked when credits reach $0
  • Dev plan credits separate from regular credits
  • Auto-renewal available for subscription plans

Project-Level Limits

Each project inherits limits from its organization:
interface Project {
  mode: "api-keys" | "credits" | "hybrid";
  status: "active" | "deleted";
}

// Deleted projects are blocked
if (project.status === "deleted") {
  throw new HTTPException(410, {
    message: "Project has been archived and is no longer accessible"
  });
}

Organization-Level Limits

Plan Limits

Feature          | Free     | Pro     | Enterprise
API Keys/Project | 5        | 20      | Unlimited
Projects/Org     | 3        | 10      | Unlimited
Team Members     | 1        | 5       | Unlimited
Log Retention    | 3 days   | 90 days | Custom
Rate Limits      | Standard | Higher  | Custom

Monitoring Usage

Track usage in real time from the dashboard's usage charts:
  • Requests per day
  • Tokens per day
  • Cost per day
  • By model/provider

Handling Rate Limit Errors

When rate limits are exceeded:
{
  "error": {
    "message": "Rate limit exceeded. Please try again later.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
HTTP Status Codes:
  • 429: Rate limit exceeded (retry with backoff)
  • 402: Payment required (insufficient credits)
  • 401: Usage limit reached (API key limit hit)
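
Only the 429 is transient; 402 and 401 require action on your side, so retrying them is wasted work. A hedged sketch of that client-side decision (the action labels are illustrative):

```python
def handle_gateway_status(status: int) -> str:
    """Map gateway error statuses to a client action."""
    if status == 429:
        return "retry-with-backoff"  # transient rate limit
    if status == 402:
        return "top-up-credits"      # insufficient credits; retrying won't help
    if status == 401:
        return "raise-key-limit"     # API key hit its usage limit
    return "fail"

print(handle_gateway_status(429))  # retry-with-backoff
```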

Implementing Client-Side Rate Limiting

Add retry logic in your client so it backs off on 429 responses instead of failing immediately:
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.llmgateway.io/v1",
    api_key="YOUR_API_KEY"
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                time.sleep(wait_time)
            else:
                raise

Best Practices

Set Conservative Limits

Start with lower limits and increase as needed

Monitor Usage

Track usage trends to predict when limits will be hit

Implement Backoff

Use exponential backoff when retrying

Use Multiple Keys

Distribute load across multiple API keys
Configure alerts in your monitoring system to notify you when usage approaches 80% of the limit.
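
The 80% alert threshold is a simple comparison against the key's usage limit; a minimal sketch (function name and threshold parameter are illustrative):

```python
def usage_alert(used_tokens: int, usage_limit: int, threshold: float = 0.8) -> bool:
    """True once usage reaches the alert threshold (default 80%) of the limit."""
    return used_tokens >= threshold * usage_limit

print(usage_alert(800_000, 1_000_000))  # True: at 80%, time to alert
print(usage_alert(500_000, 1_000_000))  # False: well under the threshold
```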

Bypass Rate Limiting

Rate limiting can be disabled for specific scenarios:
// Disable for onboarding flows
if (onboarding) {
  // Skip the email-verification check for free models
  await validateFreeModelUsage(
    c,
    project.organizationId,
    usedModel,
    modelInfo,
    { skipEmailVerification: true }
  );
}
Only use rate limit bypasses for controlled onboarding flows. Never expose this in production APIs.
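
The effect of the skipEmailVerification option can be sketched in Python (hypothetical function mirroring the validation shown earlier, not the Gateway's actual code):

```python
def validate_free_model(email_verified: bool, skip_email_verification: bool = False) -> None:
    """Raise unless the owner's email is verified, or the check is skipped (onboarding only)."""
    if not skip_email_verification and not email_verified:
        raise PermissionError("Email verification required to use free models")

# During onboarding the check is skipped, so this does not raise:
validate_free_model(email_verified=False, skip_email_verification=True)
```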
