
Overview

LLM Gateway Core implements distributed rate limiting using the token bucket algorithm with Redis Lua scripts. This ensures atomic operations and prevents race conditions in multi-instance deployments.

Why Rate Limiting?

  • Protect Resources - Prevent abuse and ensure fair resource allocation
  • Cost Control - Limit expensive API calls to cloud providers
  • SLA Compliance - Enforce usage quotas for different API key tiers
  • Stability - Prevent system overload from traffic spikes

Token Bucket Algorithm

The token bucket algorithm works as follows:
  1. Each client has a bucket with a maximum capacity of tokens
  2. Tokens are refilled at a constant rate per second
  3. Each request consumes one token
  4. Requests are rejected when the bucket is empty
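The four steps above can be sketched as a minimal in-process bucket (illustration only; the production limiter below keeps this state in Redis so all gateway instances share it):

```python
import time

class TokenBucket:
    """Minimal single-process token bucket, mirroring the four steps above."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # new buckets start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
results = [bucket.allow() for _ in range(5)]
print(results)  # typically [True, True, True, False, False]; refill during the loop is negligible
```

The Redis-backed implementation below performs the same arithmetic, but inside a Lua script so the read-modify-write cycle stays atomic.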

Implementation

The RedisRateLimiter class in app/core/rate_limiter.py implements the token bucket:
import time
import redis
from app.core.config import settings

class RedisRateLimiter:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.client = redis.from_url(settings.REDIS_URL, decode_responses=True)
        
        # Lua script for atomic token bucket update
        self._lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        
        local state = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(state[1]) or capacity
        local last_refill = tonumber(state[2]) or now
        
        local elapsed = math.max(0, now - last_refill)
        local refill = elapsed * refill_rate
        tokens = math.min(capacity, tokens + refill)
        
        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 10)
            return 1
        else
            return 0
        end
        """
        self._script_hash = self.client.script_load(self._lua_script)

    def allow(self, key: str) -> bool:
        try:
            now = time.time()
            result = self.client.evalsha(
                self._script_hash, 
                1, 
                f"ratelimit:{key}", 
                self.capacity, 
                self.refill_rate, 
                now
            )
            return bool(result)
        except redis.exceptions.NoScriptError:
            # Reload script if it was flushed from Redis
            self._script_hash = self.client.script_load(self._lua_script)
            return self.allow(key)
        except Exception as e:
            print(f"[RATELIMIT ERROR] key={key}: {e}")
            # Fail open or closed? In real world usually fail open but log heavily
            return True 
Source: app/core/rate_limiter.py:1-56

Lua Script Breakdown

The Lua script ensures atomic operations across multiple steps:

1. Fetch Current State

local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now
Retrieves the current token count and last refill timestamp. Defaults to full capacity for new keys.

2. Calculate Token Refill

local elapsed = math.max(0, now - last_refill)
local refill = elapsed * refill_rate
tokens = math.min(capacity, tokens + refill)
Calculates how many tokens to add based on elapsed time, capped at capacity.

3. Check and Consume Token

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 10)
    return 1
else
    return 0
end
If tokens are available:
  • Decrements token count
  • Updates state in Redis
  • Sets expiration to prevent memory leaks
  • Returns 1 (allow)
Otherwise returns 0 (reject).
The entire script executes atomically in Redis, preventing race conditions in concurrent scenarios.

Integration with API

Rate limiting is enforced via FastAPI dependency in app/api/v1/chat.py:
from fastapi import APIRouter, Depends, Request, HTTPException
from app.core.rate_limiter import RedisRateLimiter
from app.core.metrics import RATE_LIMIT_ALLOWED, RATE_LIMIT_BLOCKED
from app.core.config import settings

router = APIRouter()

# Initialize the rate limiter (Redis-backed)
rate_limiter = RedisRateLimiter(
    capacity=settings.RATE_LIMITER_CAPACITY,
    refill_rate=settings.RATE_LIMITER_REFILL_RATE
)

def get_client_key(request: Request) -> str:
    """Extracts a unique key for the client (API Key or IP)."""
    return request.headers.get("X-API-Key") or request.client.host

async def rate_limit_dependency(request: Request):
    """
    FastAPI dependency to enforce rate limiting and record metrics.
    Also validates the API key if provided.
    """
    api_key = request.headers.get("X-API-Key")
    valid_keys = [k.strip() for k in settings.API_KEYS.split(",") if k.strip()]

    if api_key not in valid_keys:
        raise HTTPException(
            status_code=401,
            detail="Invalid or missing API Key"
        )

    key = get_client_key(request)
    if not rate_limiter.allow(key):
        RATE_LIMIT_BLOCKED.inc()
        raise HTTPException(
            status_code=429,
            detail="Too many requests. Please wait before trying again."
        )

    RATE_LIMIT_ALLOWED.inc()

@router.post("", response_model=ChatResponse, dependencies=[Depends(rate_limit_dependency)])
async def chat(request: ChatRequest):
    """
    Entry point for all chat completions.
    Processes the chat request and returns a chat response.
    """
    return await chat_service.chat(request)
Source: app/api/v1/chat.py:1-53

Rate Limit Key Strategy

The rate limiter uses a composite key strategy:
def get_client_key(request: Request) -> str:
    """Extracts a unique key for the client (API Key or IP)."""
    return request.headers.get("X-API-Key") or request.client.host
Priority:
  1. API Key - If provided, rate limit per API key
  2. IP Address - Fallback to IP-based limiting
This allows:
  • Different rate limits for different API key tiers
  • IP-based protection against unauthenticated abuse
  • Flexible quota management
The key is prefixed with ratelimit: in Redis to namespace it and avoid collisions with cache keys.

Configuration

Rate limiting is configured via environment variables:
# Token bucket capacity (maximum tokens)
RATE_LIMITER_CAPACITY=100

# Tokens added per second
RATE_LIMITER_REFILL_RATE=1.0

Example Configurations

# 10 requests per minute
RATE_LIMITER_CAPACITY=10
RATE_LIMITER_REFILL_RATE=0.167  # ≈ 10/60
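A quota expressed as "N requests per window" converts to a refill rate by simple division. A quick sketch (the helper name is illustrative, not part of the codebase):

```python
def refill_rate_for(requests: int, per_seconds: float) -> float:
    """Convert an 'N requests per window' quota into tokens per second."""
    return requests / per_seconds

print(round(refill_rate_for(10, 60), 3))   # 10/minute  -> 0.167
print(round(refill_rate_for(100, 60), 3))  # 100/minute -> 1.667
```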

Burst Handling

The token bucket algorithm naturally handles bursts:
  • Capacity determines maximum burst size
  • Refill rate determines sustained throughput
Example: capacity=100, refill_rate=1.0
  • Client can burst up to 100 requests immediately
  • After burst, limited to 1 request/second
  • Bucket refills to allow future bursts
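The burst-and-recovery behaviour can be checked with arithmetic that mirrors the refill math in the Lua script (the helper is illustrative, not part of the codebase):

```python
capacity, refill_rate = 100, 1.0

def tokens_after(burst: int, elapsed: float) -> float:
    """Tokens available `elapsed` seconds after consuming `burst` tokens
    from a full bucket (same refill formula as the Lua script)."""
    return min(capacity, max(0.0, capacity - burst) + elapsed * refill_rate)

print(tokens_after(100, 0))    # 0.0   -> next request rejected
print(tokens_after(100, 30))   # 30.0  -> 30 requests available again
print(tokens_after(100, 500))  # 100.0 -> capped at capacity
```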

Error Handling

The rate limiter implements multiple error handling strategies:

Script Reload

except redis.exceptions.NoScriptError:
    # Reload script if it was flushed from Redis
    self._script_hash = self.client.script_load(self._lua_script)
    return self.allow(key)
Automatically reloads the Lua script if Redis was restarted or flushed.

Fail-Open Strategy

except Exception as e:
    print(f"[RATELIMIT ERROR] key={key}: {e}")
    # Fail open or closed? In real world usually fail open but log heavily
    return True
If Redis is unavailable, the system fails open (allows requests) to maintain availability. This prevents total outage if Redis goes down.
Failing open means rate limits won’t be enforced during Redis outages. Monitor Redis health and consider failing closed for high-security scenarios.

Metrics and Monitoring

Rate limiting records Prometheus metrics:
RATE_LIMIT_ALLOWED.inc()  # Request allowed
RATE_LIMIT_BLOCKED.inc()  # Request rejected (429)

Key Metrics

  • Block Rate - RATE_LIMIT_BLOCKED / (RATE_LIMIT_ALLOWED + RATE_LIMIT_BLOCKED)
  • Top Offenders - API keys/IPs with highest block rate
  • Rate Limit Errors - Redis connection failures
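The block rate in the list above is a simple ratio; a sketch of the calculation (metric names are whatever your Prometheus setup exposes):

```python
def block_rate(allowed: int, blocked: int) -> float:
    """Fraction of requests rejected with 429."""
    total = allowed + blocked
    return blocked / total if total else 0.0

print(block_rate(950, 50))  # 0.05, i.e. 5% of requests blocked
```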

Redis State Structure

Each rate limit key stores a hash in Redis:
HMGET ratelimit:api-key-123 tokens last_refill

# Returns:
# 1) "47.5"          # Current token count (float)
# 2) "1678901234.56" # Last refill timestamp (unix time)

Expiration

Keys automatically expire to prevent memory leaks:
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 10)
Expiration time = time to fully refill bucket + 10 second buffer
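Worked example for the default configuration (capacity=100, refill_rate=1.0):

```python
import math

capacity, refill_rate = 100, 1.0
# Time to fully refill an empty bucket, plus a 10-second buffer
ttl = math.ceil(capacity / refill_rate) + 10
print(ttl)  # 110 seconds
```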

Testing Rate Limits

Test Script

import httpx
import asyncio

async def test_rate_limit():
    async with httpx.AsyncClient() as client:
        for i in range(105):
            response = await client.post(
                "http://localhost:8000/chat",
                headers={"X-API-Key": "test-key"},
                json={
                    "messages": [{"role": "user", "content": f"Request {i}"}]
                }
            )
            print(f"Request {i}: {response.status_code}")
            if response.status_code == 429:
                print(f"Rate limited at request {i}")
                break

asyncio.run(test_rate_limit())

Expected Output

Request 0: 200
Request 1: 200
...
Request 99: 200
Request 100: 429
Rate limited at request 100
Note: tokens refill while the loop runs, so the exact cutoff may land slightly after request 100.

Advanced: Per-Tier Rate Limits

You can implement different rate limits for different API key tiers:
class TieredRateLimiter:
    def __init__(self):
        self.tiers = {
            "free": RedisRateLimiter(capacity=10, refill_rate=0.166),
            "pro": RedisRateLimiter(capacity=100, refill_rate=1.666),
            "enterprise": RedisRateLimiter(capacity=1000, refill_rate=16.666),
        }
    
    def allow(self, api_key: str) -> bool:
        tier = self.get_tier(api_key)  # Look up tier from database
        # Fall back to the free tier for unknown keys instead of raising KeyError
        return self.tiers.get(tier, self.tiers["free"]).allow(api_key)

Best Practices

  • Capacity should allow reasonable bursts, while refill rate controls sustained load. Test with realistic traffic patterns.
  • Enable AOF or RDB persistence to preserve rate limit state across Redis restarts.
  • High block rates may indicate legitimate users hitting limits. Consider adjusting capacity or implementing tiered limits.
  • In high-security scenarios, fail closed (reject requests) during Redis outages instead of failing open.
  • Return a Retry-After header in 429 responses to help clients back off appropriately.
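A minimal sketch of computing a Retry-After value from the bucket parameters (the helper is illustrative, not part of the existing codebase):

```python
import math

def retry_after_seconds(tokens_needed: float, refill_rate: float) -> int:
    """Seconds until enough tokens accumulate to admit the request,
    rounded up so clients never retry too early."""
    return math.ceil(tokens_needed / refill_rate)

print(retry_after_seconds(1, 1.0))    # 1 second at 1 token/sec
print(retry_after_seconds(1, 0.167))  # 6 seconds at ~10 requests/min
```

In the dependency, this value could be attached to the 429 via FastAPI's `HTTPException(status_code=429, headers={"Retry-After": ...})`.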

Response Headers

You can enhance the rate limiter by returning informational headers; in this sketch, `tokens` and `now` come from the limiter's internal state:
response.headers["X-RateLimit-Limit"] = str(self.capacity)
response.headers["X-RateLimit-Remaining"] = str(int(tokens))
response.headers["X-RateLimit-Reset"] = str(int(now + (1.0 / self.refill_rate)))
This allows clients to track their quota and implement intelligent retry logic.

Next Steps

  • Architecture - See how rate limiting fits into the system
  • Caching - Learn about Redis caching implementation
  • Monitoring - Set up metrics and alerts
  • API Reference - Complete API documentation
