What is LLM Gateway Core?
LLM Gateway Core is a production-grade infrastructure component that abstracts multiple Large Language Model (LLM) providers behind a single, unified API. It provides reliable, cost-effective LLM access through intelligent routing, distributed caching, atomic rate limiting, and comprehensive observability.
Unified API
Single endpoint for multiple LLM providers - switch between Google Gemini and Ollama without changing your code
Intelligent Routing
Dynamic provider selection based on request hints: online, local, fast, or secure modes
Distributed Cache
Redis-backed response caching reduces latency and API costs
Rate Limiting
Token bucket algorithm via Redis Lua scripts for atomic, distributed request throttling
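The capabilities above are all exercised through one endpoint. A minimal client sketch using only the standard library; the `/api/v1/chat` path, `X-API-Key` header, and hint values come from this page, while the payload shape, port, and key value are illustrative assumptions:

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, prompt: str, hint: str):
    """Build an HTTP request for the gateway's chat endpoint.
    The payload fields ("messages", "hint") are assumptions for illustration."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "hint": hint,  # "online", "local", "fast", or "secure"
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/chat",
        data=json.dumps(payload).encode(),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8000", "my-key", "Hello", hint="local")
# urllib.request.urlopen(req) would send it to a running gateway
```

Switching from a cloud to an on-premise model is then a one-word change to the hint, with no other client code affected.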
Why Use LLM Gateway Core?
Cost Optimization
Reduce LLM API costs through intelligent caching. Repeated queries are served from Redis instead of making expensive provider calls.
Provider Flexibility
Avoid vendor lock-in by abstracting provider-specific APIs. Switch between cloud and local models based on your needs:
- Google Gemini for high-performance cloud inference
- Ollama for private, on-premise deployments
Production-Ready Reliability
The gateway implements distributed rate limiting to protect your infrastructure from request spikes and ensures fair resource allocation across clients.
Full Observability
Comprehensive monitoring through Prometheus and Grafana provides visibility into:
- Request rates and latency by provider
- Cache hit rates and performance
- Rate limiting metrics
- System health indicators
System Architecture
The gateway is built on a high-performance FastAPI backend with a provider-agnostic interface.
Core Components
API Layer
FastAPI-based REST API providing standardized chat completion endpoints at /api/v1/chat.
Provider Router
Dynamically selects the optimal model provider based on request hints:
- online, fast → Google Gemini
- local, secure → Ollama
- Distributed Cache: Persistently stores provider responses with configurable TTL
- Rate Limiter: Atomic token bucket implementation for fair request throttling
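Serving repeated queries from the cache hinges on a deterministic key per request. A sketch, assuming a SHA-256 digest of the normalized request body and an illustrative key prefix:

```python
import hashlib
import json

def cache_key(provider: str, payload: dict) -> str:
    """Derive a deterministic Redis key from the provider and request body.
    The "llmgw:cache:" prefix is an illustrative naming convention."""
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()  # sort_keys normalizes field order
    ).hexdigest()
    return f"llmgw:cache:{provider}:{digest}"

# With a redis-py client, a response could then be stored with a TTL:
#   r.setex(cache_key("gemini", payload), 3600, response_json)
```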
Integrated Providers
- Google Gemini: high-performance cloud inference
- Ollama: private, on-premise model serving
Request Flow
Client Authentication
Client sends a request with the X-API-Key header. The gateway validates it against the configured API keys.
Rate Limiting Check
The Redis-backed rate limiter enforces per-client quotas using the token bucket algorithm.
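The token bucket check can be sketched in pure Python; in the gateway the same refill-and-consume step runs inside a Redis Lua script so it stays atomic across distributed instances:

```python
import time

class TokenBucket:
    """Pure-Python sketch of token bucket throttling. Capacity and refill
    rate here are illustrative; real quotas come from gateway configuration."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity, self.tokens + (now - self.last) * self.refill_rate
        )
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A client with a burst of 5 is throttled on the sixth immediate request.
bucket = TokenBucket(capacity=5, refill_rate=1.0)
results = [bucket.allow() for _ in range(6)]
```

Running refill and consume as a single Lua script means no other client can observe or modify the bucket between the two steps, which is what makes the limiter safe across multiple gateway replicas sharing one Redis.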
Quick Links
Quickstart
Get up and running in 5 minutes
Installation
Detailed setup and configuration