What is LLM Gateway Core?

LLM Gateway Core is a production-grade infrastructure component designed to abstract multiple Large Language Model (LLM) providers behind a single, unified API. It provides reliable and cost-effective LLM access through intelligent routing, distributed caching, atomic rate limiting, and comprehensive observability.

Unified API

Single endpoint for multiple LLM providers - switch between Google Gemini and Ollama without changing your code

Intelligent Routing

Dynamic provider selection based on request hints: online, local, fast, or secure modes

Distributed Cache

Redis-backed response caching reduces latency and API costs

Rate Limiting

Token bucket algorithm via Redis Lua scripts for atomic, distributed request throttling
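The refill-and-consume logic behind the token bucket can be sketched in a few lines. This is a minimal, single-process illustration of the algorithm only; the gateway itself runs the equivalent logic atomically inside Redis via a Lua script, so this class, its names, and its parameters are assumptions for explanation, not the gateway's implementation.

```python
import time

class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the gateway maps a refusal to an HTTP 429

bucket = TokenBucket(capacity=3, rate=1.0)
results = [bucket.allow() for _ in range(4)]  # a burst of 4 against capacity 3
```

A burst larger than the capacity exhausts the bucket, so the fourth call is refused; moving this state into Redis and wrapping it in a Lua script makes the read-refill-consume sequence atomic across gateway instances.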

Why Use LLM Gateway Core?

Cost Optimization

Reduce LLM API costs through intelligent caching. Repeated queries are served from Redis instead of making expensive provider calls.

Provider Flexibility

Avoid vendor lock-in by abstracting provider-specific APIs. Switch between cloud and local models based on your needs:
  • Google Gemini for high-performance cloud inference
  • Ollama for private, on-premise deployments

Production-Ready Reliability

The gateway implements distributed rate limiting to protect your infrastructure from request spikes and to ensure fair resource allocation across clients. Requests that exceed the configured thresholds receive standard HTTP 429 (Too Many Requests) responses.

Full Observability

Comprehensive monitoring through Prometheus and Grafana provides visibility into:
  • Request rates and latency by provider
  • Cache hit rates and performance
  • Rate limiting metrics
  • System health indicators

Key Capabilities

System Architecture

The gateway is built on a high-performance FastAPI backend with a provider-agnostic interface:
from fastapi import FastAPI
from app.api.v1 import chat, health, metrics

app = FastAPI(title="LLM Gateway Core")

# Each module exposes an APIRouter, mounted under the versioned prefix
app.include_router(chat.router, prefix="/api/v1")
app.include_router(health.router, prefix="/api/v1")
app.include_router(metrics.router, prefix="/api/v1")

Core Components

API Layer

FastAPI-based REST API providing standardized chat completion endpoints at /api/v1/chat.

Provider Router

Dynamically selects the optimal model provider based on request hints:
  • online, fast → Google Gemini
  • local, secure → Ollama

Redis Integration

  • Distributed Cache: stores provider responses with a configurable TTL
  • Rate Limiter: atomic token bucket implementation for fair request throttling

Monitoring Stack

Full observability with Prometheus for metrics collection and Grafana for visualization.

Streamlit Frontend

Clean, responsive interface for demonstration and testing purposes.
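The hint-to-provider mapping above can be sketched as a simple lookup table. The mapping itself comes from the routing rules listed in this document, but the function name, dictionary, and fallback behavior below are illustrative assumptions, not the gateway's actual router code.

```python
# Hypothetical sketch of hint-based provider selection.
HINT_TO_PROVIDER = {
    "online": "gemini",
    "fast": "gemini",
    "local": "ollama",
    "secure": "ollama",
}

def select_provider(model_hint: str, default: str = "gemini") -> str:
    """Map a request's model_hint to a provider name, with a default fallback."""
    return HINT_TO_PROVIDER.get(model_hint, default)
```

Keeping the mapping in data rather than branching logic makes it easy to add a new provider or hint without touching the selection code.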

Integrated Providers

from app.providers.gemini import GeminiProvider

# High-performance cloud integration
# Used for 'online' and 'fast' request modes
provider = GeminiProvider()

Request Flow

1. Client Authentication

Client sends a request with an X-API-Key header. The gateway validates it against the configured API keys.
2. Rate Limiting Check

A Redis-backed rate limiter enforces per-client quotas using the token bucket algorithm.
3. Cache Lookup

The gateway checks the Redis cache for an identical request. Cache hits return immediately.
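One common way to detect an "identical request" is to derive the cache key from a canonical serialization of the request body. The key scheme below (prefix, SHA-256 digest) is an assumption for illustration; the gateway's actual key format is not specified here.

```python
import hashlib
import json

def cache_key(payload: dict, prefix: str = "llmgw:cache") -> str:
    """Derive a deterministic Redis key from a request payload.

    sort_keys ensures that JSON field order does not change the key,
    so logically identical requests map to the same cache entry.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"

k1 = cache_key({"model_hint": "online", "prompt": "hi"})
k2 = cache_key({"prompt": "hi", "model_hint": "online"})  # same fields, different order
```

Because the serialization is canonical, k1 and k2 are equal, and the second request would be served from Redis.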
4. Provider Routing

The router selects a provider based on the model_hint value (online, local, fast, or secure).
5. LLM Inference

The selected provider executes the request with retry logic and timeout protection.
6. Response Caching

Successful responses are cached in Redis with a configurable TTL.
7. Metrics Collection

Prometheus metrics are updated for observability and monitoring.
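The labelled counters that Prometheus scrapes can be illustrated with a plain dictionary keyed by label values. The metric and label names below are hypothetical, not the gateway's actual metric set; a real deployment would use the prometheus_client library, which exposes such counters at a /metrics endpoint in the Prometheus text format.

```python
from collections import Counter

# Hypothetical labelled counter: one entry per (provider, cache outcome) pair.
requests_total = Counter()

def record_request(provider: str, cache_hit: bool) -> None:
    """Increment the request counter for one label combination (one time series)."""
    requests_total[(provider, "hit" if cache_hit else "miss")] += 1

record_request("gemini", cache_hit=False)
record_request("gemini", cache_hit=True)
record_request("ollama", cache_hit=False)
```

Each distinct label combination becomes its own time series, which is what lets Grafana break down request rates and cache hit rates by provider.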

Quickstart

Get up and running in 5 minutes

Installation

Detailed setup and configuration