Overview
LLM Gateway provides rate limiting at multiple levels to control costs, prevent abuse, and ensure fair usage across your organization.
API Key Usage Limits
Set hard limits on total token usage per API key:
apps/gateway/src/chat/chat.ts

```typescript
if (apiKey.usageLimit && Number(apiKey.usage) >= Number(apiKey.usageLimit)) {
  throw new HTTPException(401, {
    message: "Unauthorized: LLMGateway API key reached its usage limit.",
  });
}
```
Setting Usage Limits
**Dashboard**

1. Go to the API Keys page
2. Click on an API key
3. Set “Usage Limit” (in tokens)
4. Save changes

**API**

```bash
curl https://api.llmgateway.io/keys/api/limit/key_xyz789 \
  -X PATCH \
  -H "Authorization: Bearer YOUR_SESSION_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "usageLimit": "1000000"
  }'
```

**On Creation**

```bash
curl https://api.llmgateway.io/keys/api \
  -X POST \
  -H "Authorization: Bearer YOUR_SESSION_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Limited API Key",
    "projectId": "proj_abc123",
    "usageLimit": "500000"
  }'
```
Usage limits are measured in total tokens (prompt + completion tokens combined).
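As an illustrative sketch (not the gateway's actual implementation; the `KeyUsage` shape and helper names are hypothetical), the accounting works like this: prompt and completion tokens both count toward the same cap.

```typescript
// Hypothetical model of per-key usage accounting.
interface KeyUsage {
  usage: number;      // total tokens consumed so far
  usageLimit: number; // configured cap, e.g. 1_000_000
}

// Record a completed request: both prompt and completion tokens count.
function recordUsage(
  key: KeyUsage,
  promptTokens: number,
  completionTokens: number,
): KeyUsage {
  return { ...key, usage: key.usage + promptTokens + completionTokens };
}

function isOverLimit(key: KeyUsage): boolean {
  return key.usage >= key.usageLimit;
}

const key = recordUsage({ usage: 999_500, usageLimit: 1_000_000 }, 300, 250);
console.log(key.usage, isOverLimit(key)); // 1000050 true
```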
Free Model Rate Limits
Free models have special rate limiting:
apps/gateway/src/chat/tools/validate-free-model-usage.ts

```typescript
export async function validateFreeModelUsage(
  context: Context,
  organizationId: string,
  modelId: string,
  modelInfo: ModelDefinition,
  options?: { skipEmailVerification?: boolean },
) {
  // Check email verification
  if (!options?.skipEmailVerification) {
    const org = await db.query.organization.findFirst({
      where: { id: { eq: organizationId } },
      with: { owner: true },
    });
    if (!org?.owner?.emailVerified) {
      throw new HTTPException(403, {
        message: "Email verification required to use free models",
      });
    }
  }

  // Apply rate limiting
  const usage = await getFreeModelUsage(organizationId, modelId);
  if (usage.requestsToday >= FREE_MODEL_DAILY_LIMIT) {
    throw new HTTPException(429, {
      message: "Free model daily rate limit exceeded",
    });
  }
}
```
Free Model Limits
| Plan | Daily Requests | Monthly Requests |
|---|---|---|
| Free (verified email) | 100 | 3,000 |
| Pro | Unlimited | Unlimited |
| Enterprise | Unlimited | Unlimited |
Free models require email verification and are subject to stricter rate limits.
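To avoid surprise 429s, clients can track the free-tier quota locally. This is a minimal sketch under the 100 requests/day figure from the table above; the counter and function names are illustrative, not part of the gateway API.

```typescript
// Illustrative client-side tracker for the free-tier daily quota.
const FREE_DAILY_LIMIT = 100;
const counts = new Map<string, number>(); // ISO date -> requests made that day

// Returns true if the request may be sent, consuming one unit of quota.
function tryConsume(now: Date = new Date()): boolean {
  const day = now.toISOString().slice(0, 10);
  const used = counts.get(day) ?? 0;
  if (used >= FREE_DAILY_LIMIT) {
    return false; // quota exhausted for today
  }
  counts.set(day, used + 1);
  return true;
}
```

The map is keyed by calendar day, so the quota resets automatically at midnight UTC.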
Provider Rate Limits
Each provider has its own rate limits:
OpenAI
- Tier 1: 500 requests/minute, 60,000 tokens/minute
- Tier 2: 5,000 requests/minute, 800,000 tokens/minute
- Tier 3+: Higher limits based on usage
Anthropic
- Free tier: 5 requests/minute
- Build tier: 1,000 requests/minute
- Scale tier: Custom limits
Google AI Studio
- Free quota: 15 requests/minute, 1M tokens/day
- Paid tier: 1,000 requests/minute, 4M tokens/minute
Google Vertex AI
- Based on Google Cloud quotas
- Configurable per project
When a request hits a provider rate limit, LLM Gateway automatically retries it with exponential backoff.
Automatic Retry on 429
The gateway handles rate limit errors automatically:
apps/gateway/src/chat/tools/retry-with-fallback.ts

```typescript
export function shouldRetryRequest(
  statusCode: number,
  errorType: string,
  attempt: number,
): boolean {
  if (attempt >= MAX_RETRIES) {
    return false;
  }
  // Retry on rate limits (429)
  if (statusCode === 429) {
    return true;
  }
  return false;
}
```
Retry Strategy
- First retry: Wait 1 second
- Second retry: Wait 2 seconds
- Third retry: Wait 4 seconds
- If all fail: Try alternative provider (if available)
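The waits above follow a simple doubling schedule. A minimal sketch of that calculation (the helper name is illustrative):

```typescript
// 2^attempt seconds between retries: attempt 0 -> 1s, 1 -> 2s, 2 -> 4s.
function backoffMs(attempt: number): number {
  return Math.pow(2, attempt) * 1000;
}

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Before retry n, a caller would: await delay(backoffMs(n));
console.log([0, 1, 2].map(backoffMs)); // [ 1000, 2000, 4000 ]
```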
Credit-Based Rate Limiting
In credits mode, rate limiting is based on available credits:
apps/gateway/src/chat/chat.ts

```typescript
if (project.mode === "credits") {
  const regularCredits = parseFloat(organization.credits ?? "0");
  const devPlanCreditsRemaining = organization.devPlan !== "none"
    ? parseFloat(organization.devPlanCreditsLimit ?? "0") -
      parseFloat(organization.devPlanCreditsUsed ?? "0")
    : 0;
  const totalAvailableCredits = regularCredits + devPlanCreditsRemaining;

  if (totalAvailableCredits <= 0) {
    throw new HTTPException(402, {
      message: "Organization has insufficient credits",
    });
  }
}
```
How It Works
- Credits deducted after each request
- Requests blocked when credits reach $0
- Dev plan credits separate from regular credits
- Auto-renewal available for subscription plans
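A sketch of how a deduction could play out, assuming dev plan credits are drawn down before regular credits (the `deduct` helper and draw-down order are assumptions for illustration, not the gateway's confirmed behavior):

```typescript
// Hypothetical credit accounting: dev plan credits are tracked separately
// from regular credits, and the request is billed after it completes.
interface OrgCredits {
  credits: number;            // regular credits, in dollars
  devPlanCreditsLimit: number;
  devPlanCreditsUsed: number;
}

function deduct(org: OrgCredits, cost: number): OrgCredits {
  const devRemaining = org.devPlanCreditsLimit - org.devPlanCreditsUsed;
  // Draw from dev plan credits first, then fall back to regular credits.
  const fromDev = Math.min(cost, Math.max(devRemaining, 0));
  return {
    ...org,
    devPlanCreditsUsed: org.devPlanCreditsUsed + fromDev,
    credits: org.credits - (cost - fromDev),
  };
}

// A $1.00 request against $0.50 of remaining dev plan credits:
const after = deduct({ credits: 5, devPlanCreditsLimit: 10, devPlanCreditsUsed: 9.5 }, 1.0);
// dev plan covers $0.50; regular credits cover the remaining $0.50
```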
Project-Level Limits
Each project inherits limits from its organization:
```typescript
interface Project {
  mode: "api-keys" | "credits" | "hybrid";
  status: "active" | "deleted";
}

// Deleted projects are blocked
if (project.status === "deleted") {
  throw new HTTPException(410, {
    message: "Project has been archived and is no longer accessible",
  });
}
```
Organization-Level Limits
Plan Limits
| Feature | Free | Pro | Enterprise |
|---|---|---|---|
| API Keys/Project | 5 | 20 | Unlimited |
| Projects/Org | 3 | 10 | Unlimited |
| Team Members | 1 | 5 | Unlimited |
| Log Retention | 3 days | 90 days | Custom |
| Rate Limits | Standard | Higher | Custom |
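The per-plan quotas above can be expressed as a lookup table. This is an illustrative sketch using the numbers from the table; the constant and function names are not part of the gateway codebase.

```typescript
// Plan quotas from the table above (Infinity = unlimited).
const PLAN_LIMITS = {
  free: { apiKeysPerProject: 5, projectsPerOrg: 3, teamMembers: 1 },
  pro: { apiKeysPerProject: 20, projectsPerOrg: 10, teamMembers: 5 },
  enterprise: {
    apiKeysPerProject: Infinity,
    projectsPerOrg: Infinity,
    teamMembers: Infinity,
  },
};

// Check one quota: may this project get another API key?
function canCreateApiKey(
  plan: keyof typeof PLAN_LIMITS,
  existingKeys: number,
): boolean {
  return existingKeys < PLAN_LIMITS[plan].apiKeysPerProject;
}

console.log(canCreateApiKey("free", 5)); // false (free caps at 5)
console.log(canCreateApiKey("pro", 5));  // true  (pro allows 20)
```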
Monitoring Usage
Track usage in real-time:
**Dashboard**

View usage charts:

- Requests per day
- Tokens per day
- Cost per day
- By model/provider

**API**

```bash
curl "https://api.llmgateway.io/logs?projectId=proj_abc&limit=100" \
  -H "Authorization: Bearer YOUR_SESSION_TOKEN"
```

**Usage Field**

```bash
curl "https://api.llmgateway.io/keys/api?projectId=proj_abc" \
  -H "Authorization: Bearer YOUR_SESSION_TOKEN"
```

The response includes the current usage:

```json
{
  "apiKeys": [{
    "usage": "45000",
    "usageLimit": "1000000"
  }]
}
```
Handling Rate Limit Errors
When rate limits are exceeded:
```json
{
  "error": {
    "message": "Rate limit exceeded. Please try again later.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
```
HTTP Status Codes:
- 429: Rate limit exceeded (retry with backoff)
- 402: Payment required (insufficient credits)
- 401: Usage limit reached (API key limit hit)
Implementing Client-Side Rate Limiting
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.llmgateway.io/v1",
    api_key="YOUR_API_KEY",
)

def make_request_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                time.sleep(wait_time)
            else:
                raise
```
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.llmgateway.io/v1',
  apiKey: process.env.LLMGATEWAY_API_KEY,
});

async function makeRequestWithRetry(
  messages: any[],
  maxRetries = 3,
) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({
        model: 'gpt-4o',
        messages,
      });
    } catch (error: any) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const waitTime = Math.pow(2, attempt) * 1000;
        await new Promise(resolve => setTimeout(resolve, waitTime));
      } else {
        throw error;
      }
    }
  }
}
```
Best Practices
- **Set Conservative Limits**: Start with lower limits and increase them as needed.
- **Monitor Usage**: Track usage trends to predict when limits will be hit.
- **Implement Backoff**: Use exponential backoff when retrying failed requests.
- **Use Multiple Keys**: Distribute load across multiple API keys.
Configure alerts in your monitoring system to notify you when usage approaches 80% of the limit.
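An 80% check can be computed directly from the `usage` and `usageLimit` fields returned by the keys API (string-encoded numbers, as shown above). The `nearLimit` helper is an illustrative sketch, not part of the gateway:

```typescript
// Flag keys at or above a fraction of their configured usage limit.
interface ApiKeyInfo {
  usage: string;             // e.g. "45000"
  usageLimit: string | null; // null when no limit is configured
}

function nearLimit(key: ApiKeyInfo, threshold = 0.8): boolean {
  if (!key.usageLimit) {
    return false; // no limit configured, nothing to alert on
  }
  return Number(key.usage) / Number(key.usageLimit) >= threshold;
}

console.log(nearLimit({ usage: "850000", usageLimit: "1000000" })); // true
console.log(nearLimit({ usage: "450000", usageLimit: "1000000" })); // false
```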
Bypass Rate Limiting
Rate limiting can be disabled for specific scenarios:
```typescript
// Relax checks for onboarding flows
if (onboarding) {
  // Skip the email verification check (the daily rate limit still applies)
  await validateFreeModelUsage(
    c,
    project.organizationId,
    usedModel,
    modelInfo,
    { skipEmailVerification: true },
  );
}
```
Only use rate limit bypasses for controlled onboarding flows. Never expose this in production APIs.