
Prerequisites

Before you begin, ensure you have:
  • Docker and Docker Compose installed
  • Google Gemini API Key (get one at Google AI Studio)
  • (Optional) Local Ollama instance for local model support
This quickstart uses Google Gemini for cloud-based inference. For local-only deployment with Ollama, see the Installation guide.

Deploy with Docker Compose

1. Clone or navigate to the source directory

cd llm-gateway-core
2. Create environment configuration

Create a .env file in the project root with your configuration:
.env
# Provider Configuration
PROVIDER_TIMEOUT_SECONDS=60
PROVIDER_MAX_RETRIES=3
GEMINI_API_KEY=your_api_key_here

# Redis Configuration
REDIS_URL=redis://redis:6379/0

# Ollama Configuration (optional)
OLLAMA_BASE_URL=http://host.docker.internal:11434

# API Authentication
API_KEYS=sk-gateway-123

# Rate Limiting
RATE_LIMITER_CAPACITY=5
RATE_LIMITER_REFILL_RATE=1

# Cache Configuration
CACHE_TTL_SECONDS=60
Replace your_api_key_here with your actual Gemini API key. Never commit .env files to version control.
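Each of these settings is a plain KEY=VALUE pair. As an illustration of how they are consumed, here is a minimal Python sketch of a .env parser — the key names match the file above, but the gateway's actual configuration loader may differ:

```python
# Illustrative sketch: parse the .env values above and sanity-check the
# numeric knobs before starting the stack. Not the gateway's real loader.
NUMERIC_KEYS = {
    "PROVIDER_TIMEOUT_SECONDS",
    "PROVIDER_MAX_RETRIES",
    "RATE_LIMITER_CAPACITY",
    "RATE_LIMITER_REFILL_RATE",
    "CACHE_TTL_SECONDS",
}

def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key] = int(value) if key in NUMERIC_KEYS else value
    return config

sample = """\
# Provider Configuration
PROVIDER_TIMEOUT_SECONDS=60
GEMINI_API_KEY=your_api_key_here
RATE_LIMITER_CAPACITY=5
"""
cfg = parse_env(sample)
```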
3. Start the gateway stack

Deploy all services with a single command:
docker-compose up -d --build
Docker Compose will start:
  • Gateway API on port 8000
  • Redis for caching and rate limiting
  • Prometheus for metrics collection
  • Grafana for monitoring dashboards
  • Streamlit Frontend for testing
4. Verify deployment

Check that all services are running:
docker-compose ps
Test the health endpoint:
curl http://localhost:8000/api/v1/health
Expected response:
{"status": "ok"}

Make Your First API Request

The gateway is now ready to process chat completion requests. The API uses a standardized request format that works across all providers.

Request Schema

Request Format
{
  "messages": [
    {
      "role": "user",
      "content": "Your prompt here"
    }
  ],
  "model_hint": "online",
  "max_tokens": 512,
  "temperature": 0.7
}
Authentication Required: All requests must include the X-API-Key header with a valid API key from your .env configuration.

Send a Chat Completion Request

curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -H "X-API-Key: sk-gateway-123" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Explain what a Large Language Model is in one sentence."
      }
    ],
    "model_hint": "online",
    "max_tokens": 100,
    "temperature": 0.7
  }'
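The same request can be issued from Python using only the standard library. This sketch mirrors the curl example above — the URL and API key are taken from this guide, so adjust them if you changed API_KEYS in your .env:

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8000/api/v1/chat"  # from the steps above
API_KEY = "sk-gateway-123"                          # matches API_KEYS in .env

def build_chat_request(prompt: str, model_hint: str = "online",
                       max_tokens: int = 100, temperature: float = 0.7):
    """Build a urllib Request matching the gateway's chat schema."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "model_hint": model_hint,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )

req = build_chat_request("Explain what a Large Language Model is in one sentence.")
# With the stack running: urllib.request.urlopen(req).read() returns the JSON body.
```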

Response Format

Response
{
  "id": "chat-1234567890",
  "provider": "gemini",
  "content": "A Large Language Model is an AI system trained on vast amounts of text data to understand and generate human-like language.",
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 28,
    "total_tokens": 43
  }
}
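The usage block follows the usual accounting: total_tokens is the sum of prompt and completion tokens. A quick sanity check against the sample response above:

```python
import json

sample = """{
  "id": "chat-1234567890",
  "provider": "gemini",
  "content": "A Large Language Model is ...",
  "usage": {"prompt_tokens": 15, "completion_tokens": 28, "total_tokens": 43}
}"""

response = json.loads(sample)
usage = response["usage"]
# 15 prompt + 28 completion = 43 total, as in the sample response.
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```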

Provider Selection with Model Hints

The gateway routes requests to different providers based on the model_hint parameter:
{
  "messages": [{"role": "user", "content": "Hello"}],
  "model_hint": "online"
}
In addition to "online", the hints "fast" and "gemini" are accepted. If no model_hint is provided, the gateway defaults to Ollama (local provider).
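Conceptually the routing is a small lookup. A hedged sketch — the hint names come from this page, but the gateway's actual routing table may differ:

```python
from typing import Optional

# Hints that route to the cloud provider, per the note above.
ONLINE_HINTS = {"online", "fast", "gemini"}

def select_provider(model_hint: Optional[str]) -> str:
    """Map a model_hint to a provider name; no hint falls back to local Ollama."""
    return "gemini" if model_hint in ONLINE_HINTS else "ollama"

assert select_provider("fast") == "gemini"
assert select_provider(None) == "ollama"
```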

Test Rate Limiting

The gateway enforces rate limits to protect system resources. By default:
  • Capacity: 5 requests per client
  • Refill Rate: 1 token per second
Try exceeding the rate limit:
for i in {1..10}; do
  curl -X POST http://localhost:8000/api/v1/chat \
    -H "Content-Type: application/json" \
    -H "X-API-Key: sk-gateway-123" \
    -d '{
      "messages": [{"role": "user", "content": "Test '$i'"}],
      "model_hint": "online"
    }'
  echo ""
done
After the first 5 requests drain the bucket, subsequent requests receive a 429 response:
{
  "detail": "Too many requests. Please wait before trying again."
}
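The limiter is a token bucket: a bucket of RATE_LIMITER_CAPACITY tokens that refills at RATE_LIMITER_REFILL_RATE tokens per second, with each request consuming one token. A self-contained simulation of the default settings — illustrative only, since the gateway enforces this server-side, backed by Redis:

```python
# Token-bucket sketch matching the defaults above: capacity 5, refill 1/s.
class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # bucket starts full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1)
# 10 back-to-back requests at t=0: the first 5 pass, the rest get a 429.
results = [bucket.allow(0.0) for _ in range(10)]
```

One second later the bucket has refilled a single token, so exactly one more request would succeed — matching the refill rate of 1 token per second.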

Access the Web Interface

The Streamlit frontend provides a user-friendly interface for testing:
open http://localhost:8501
Features:
  • Select execution mode (Online/Local)
  • Submit queries interactively
  • View formatted responses
  • Real-time testing without hand-writing API requests

View Monitoring Dashboards

Access Grafana for real-time metrics visualization:
open http://localhost:3000
Default credentials:
  • Username: admin
  • Password: admin
The dashboard shows:
  • Request rates by provider
  • Cache hit/miss ratios
  • Rate limiting metrics
  • Response latencies

Next Steps

Installation Guide

Learn about detailed configuration options and production deployment

API Reference

Explore the complete API documentation

Troubleshooting

Gateway not responding

Check service logs:
docker-compose logs gateway

Rate limit errors

Adjust rate limiting in .env:
RATE_LIMITER_CAPACITY=10
RATE_LIMITER_REFILL_RATE=2
Restart services:
docker-compose restart gateway

Gemini authentication errors

Verify your API key is correctly set in .env and restart:
docker-compose restart gateway
