
Prerequisites

Before you begin, ensure you have:
  • Docker and Docker Compose installed
  • Google Gemini API Key (get one at Google AI Studio)
  • (Optional) Local Ollama instance for local model support
This quickstart uses Google Gemini for cloud-based inference. For local-only deployment with Ollama, see the Installation guide.

Deploy with Docker Compose

1. Clone or navigate to the source directory

cd llm-gateway-core
2. Create environment configuration

Create a .env file in the project root with your configuration:
.env
# Provider Configuration
PROVIDER_TIMEOUT_SECONDS=60
PROVIDER_MAX_RETRIES=3
GEMINI_API_KEY=your_api_key_here

# Redis Configuration
REDIS_URL=redis://redis:6379/0

# Ollama Configuration (optional)
OLLAMA_BASE_URL=http://host.docker.internal:11434

# API Authentication
API_KEYS=sk-gateway-123

# Rate Limiting
RATE_LIMITER_CAPACITY=5
RATE_LIMITER_REFILL_RATE=1

# Cache Configuration
CACHE_TTL_SECONDS=60
Replace your_api_key_here with your actual Gemini API key. Never commit .env files to version control.
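Each of these settings is a plain KEY=VALUE pair. As an illustration of how they are consumed, here is a minimal Python sketch of a .env parser — the key names match the file above, but the gateway's actual configuration loader may differ:

```python
# Illustrative sketch: parse the .env values above and sanity-check the
# numeric knobs before starting the stack. Not the gateway's real loader.
NUMERIC_KEYS = {
    "PROVIDER_TIMEOUT_SECONDS",
    "PROVIDER_MAX_RETRIES",
    "RATE_LIMITER_CAPACITY",
    "RATE_LIMITER_REFILL_RATE",
    "CACHE_TTL_SECONDS",
}

def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key] = int(value) if key in NUMERIC_KEYS else value
    return config

sample = """\
# Provider Configuration
PROVIDER_TIMEOUT_SECONDS=60
GEMINI_API_KEY=your_api_key_here
RATE_LIMITER_CAPACITY=5
"""
cfg = parse_env(sample)
```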
3. Start the gateway stack

Deploy all services with a single command:
docker-compose up -d --build
Docker Compose will start:
  • Gateway API on port 8000
  • Redis for caching and rate limiting
  • Prometheus for metrics collection
  • Grafana for monitoring dashboards
  • Streamlit Frontend for testing
4. Verify deployment

Check that all services are running:
docker-compose ps
Test the health endpoint:
curl http://localhost:8000/api/v1/health
Expected response:
{"status": "ok"}

Make Your First API Request

The gateway is now ready to process chat completion requests. The API uses a standardized request format that works across all providers.

Request Schema

Request Format
{
  "messages": [
    {
      "role": "user",
      "content": "Your prompt here"
    }
  ],
  "model_hint": "online",
  "max_tokens": 512,
  "temperature": 0.7
}
Authentication Required: All requests must include the X-API-Key header with a valid API key from your .env configuration.

Send a Chat Completion Request

curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -H "X-API-Key: sk-gateway-123" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Explain what a Large Language Model is in one sentence."
      }
    ],
    "model_hint": "online",
    "max_tokens": 100,
    "temperature": 0.7
  }'
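The same request can be issued from Python using only the standard library. This sketch mirrors the curl example above — the URL and API key are taken from this guide, so adjust them if you changed API_KEYS in your .env:

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8000/api/v1/chat"  # from the steps above
API_KEY = "sk-gateway-123"                          # matches API_KEYS in .env

def build_chat_request(prompt: str, model_hint: str = "online",
                       max_tokens: int = 100, temperature: float = 0.7):
    """Build a urllib Request matching the gateway's chat schema."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "model_hint": model_hint,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )

req = build_chat_request("Explain what a Large Language Model is in one sentence.")
# With the stack running: urllib.request.urlopen(req).read() returns the JSON body.
```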

Response Format

Response
{
  "id": "chat-1234567890",
  "provider": "gemini",
  "content": "A Large Language Model is an AI system trained on vast amounts of text data to understand and generate human-like language.",
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 28,
    "total_tokens": 43
  }
}
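The usage block follows the usual accounting: total_tokens is the sum of prompt and completion tokens. A quick sanity check against the sample response above:

```python
import json

sample = """{
  "id": "chat-1234567890",
  "provider": "gemini",
  "content": "A Large Language Model is ...",
  "usage": {"prompt_tokens": 15, "completion_tokens": 28, "total_tokens": 43}
}"""

response = json.loads(sample)
usage = response["usage"]
# 15 prompt + 28 completion = 43 total, as in the sample response.
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```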

Provider Selection with Model Hints

The gateway routes requests to different providers based on the model_hint parameter:
{
  "messages": [{"role": "user", "content": "Hello"}],
  "model_hint": "online"
}
In addition to "online", the hints "fast" and "gemini" are accepted. If no model_hint is provided, the gateway defaults to Ollama (local provider).
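Conceptually the routing is a small lookup. A hedged sketch — the hint names come from this page, but the gateway's actual routing table may differ:

```python
from typing import Optional

# Hints that route to the cloud provider, per the note above.
ONLINE_HINTS = {"online", "fast", "gemini"}

def select_provider(model_hint: Optional[str]) -> str:
    """Map a model_hint to a provider name; no hint falls back to local Ollama."""
    return "gemini" if model_hint in ONLINE_HINTS else "ollama"

assert select_provider("fast") == "gemini"
assert select_provider(None) == "ollama"
```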

Test Rate Limiting

The gateway enforces rate limits to protect system resources. By default:
  • Capacity: 5 requests per client
  • Refill Rate: 1 token per second
Try exceeding the rate limit:
for i in {1..10}; do
  curl -X POST http://localhost:8000/api/v1/chat \
    -H "Content-Type: application/json" \
    -H "X-API-Key: sk-gateway-123" \
    -d '{
      "messages": [{"role": "user", "content": "Test '$i'"}],
      "model_hint": "online"
    }'
  echo ""
done
After the first 5 requests drain the bucket, subsequent requests receive a 429 response:
{
  "detail": "Too many requests. Please wait before trying again."
}
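The limiter is a token bucket: a bucket of RATE_LIMITER_CAPACITY tokens that refills at RATE_LIMITER_REFILL_RATE tokens per second, with each request consuming one token. A self-contained simulation of the default settings — illustrative only, since the gateway enforces this server-side, backed by Redis:

```python
# Token-bucket sketch matching the defaults above: capacity 5, refill 1/s.
class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # bucket starts full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1)
# 10 back-to-back requests at t=0: the first 5 pass, the rest get a 429.
results = [bucket.allow(0.0) for _ in range(10)]
```

One second later the bucket has refilled a single token, so exactly one more request would succeed — matching the refill rate of 1 token per second.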

Access the Web Interface

The Streamlit frontend provides a user-friendly interface for testing:
open http://localhost:8501
Features:
  • Select execution mode (Online/Local)
  • Submit queries interactively
  • View formatted responses
  • Real-time testing without hand-writing API requests

View Monitoring Dashboards

Access Grafana for real-time metrics visualization:
open http://localhost:3000
Default credentials:
  • Username: admin
  • Password: admin
The dashboard shows:
  • Request rates by provider
  • Cache hit/miss ratios
  • Rate limiting metrics
  • Response latencies

Next Steps

Installation Guide

Learn about detailed configuration options and production deployment

API Reference

Explore the complete API documentation

Troubleshooting

Gateway not responding

Check service logs:
docker-compose logs gateway

Rate limit errors

Adjust rate limiting in .env:
RATE_LIMITER_CAPACITY=10
RATE_LIMITER_REFILL_RATE=2
Restart services:
docker-compose restart gateway

Gemini authentication errors

Verify your API key is correctly set in .env and restart:
docker-compose restart gateway
