Complete reference of all Prometheus metrics exposed by LLM Gateway
LLM Gateway exposes Prometheus metrics that track caching performance, provider interactions, gateway performance, and rate limiting. All metrics are defined in app/core/metrics.py.
Description: Incremented when a request finds a matching cached response and returns it without calling the provider.
When incremented: Every time cache_middleware.py finds a valid cached entry for a request.
Usage example:
```python
from app.core.metrics import CACHE_HITS

CACHE_HITS.inc()  # Increment by 1
```
PromQL queries:
```promql
# Total cache hits
cache_hits_total

# Cache hit rate over 5 minutes
rate(cache_hits_total[5m])
```
Description: Incremented when a request does not find a cached response and must call the provider.
When incremented: Every time cache_middleware.py fails to find a cached entry for a request.
PromQL queries:
```promql
# Cache hit ratio (percentage)
100 * cache_hits_total / (cache_hits_total + cache_misses_total)

# Cache miss rate over 5 minutes
rate(cache_misses_total[5m])
```
Monitor the cache hit ratio to guide your caching strategy. A low ratio may indicate that requests rarely repeat, that cache keys are too specific, or that the cache TTL is too short.
provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Incremented each time the gateway makes a call to a provider's API.
When incremented: Every successful or failed provider API call.
PromQL queries:
```promql
# Total calls by provider
sum by (provider) (provider_calls_total)

# Call rate per provider (last 5 minutes)
rate(provider_calls_total[5m])

# OpenAI-specific calls
provider_calls_total{provider="openai"}
```
provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Incremented when a provider API call fails due to errors, timeouts, or rate limits.
When incremented: On any provider error response or exception during an API call.
PromQL queries:
```promql
# Failure rate by provider
rate(provider_failures_total[5m])

# Error percentage by provider
100 * provider_failures_total / provider_calls_total

# Providers with errors in the last hour
increase(provider_failures_total[1h]) > 0
```
provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Tracks the time taken for provider API calls, from request start to response completion.
Histogram buckets: Default Prometheus buckets (0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, +Inf)
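A hedged usage sketch of how a provider call might be timed before the duration is observed into the histogram. The real gateway presumably calls something like `PROVIDER_CALL_LATENCY.labels(provider=...).observe(elapsed)`; the `timed_call` helper and the `observed` list below are illustrative stand-ins, not the actual implementation.

```python
import time

# Stand-in for Histogram.observe(); the real code records into the
# provider_call_latency_seconds histogram from app/core/metrics.py.
observed: list[float] = []

def timed_call(call_provider):
    """Run a provider call and record how long it took, even on failure."""
    start = time.perf_counter()
    try:
        return call_provider()
    finally:
        elapsed = time.perf_counter() - start
        observed.append(elapsed)  # PROVIDER_CALL_LATENCY.labels(...).observe(elapsed)
```

Recording in a `finally` block ensures failed calls are timed too, which keeps the histogram consistent with provider_failures_total.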
Description: Incremented for every request received by the gateway, regardless of outcome.
When incremented: On every incoming HTTP request to chat completion endpoints.
PromQL queries:
```promql
# Total requests
gateway_requests_total

# Request rate (requests per second)
rate(gateway_requests_total[1m])

# Total requests in last hour
increase(gateway_requests_total[1h])
```
Gateway latency includes cache lookup time, provider call time, and processing overhead. Compare with provider_call_latency_seconds to understand caching impact.
PromQL queries:
```promql
# Current active requests
gateway_active_requests

# Maximum concurrent requests (5 min window)
max_over_time(gateway_active_requests[5m])

# Average concurrent requests
avg_over_time(gateway_active_requests[5m])
```
Description: Incremented when a request passes rate limiting checks and is allowed to proceed.
When incremented: When the token bucket has sufficient tokens for the request.
PromQL queries:
```promql
# Total allowed requests
rate_limit_allowed_total

# Allowed request rate
rate(rate_limit_allowed_total[5m])
```
Description: Incremented when a request is rejected due to rate limit exhaustion (returns 429 Too Many Requests).
When incremented: When the token bucket is empty and cannot fulfill the request.
PromQL queries:
```promql
# Total blocked requests
rate_limit_blocked_total

# Block rate
rate(rate_limit_blocked_total[5m])

# Percentage of requests blocked
100 * rate_limit_blocked_total / (rate_limit_allowed_total + rate_limit_blocked_total)
```
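The token-bucket check behind these two counters can be sketched in a few lines of stdlib Python. The real logic lives in rate_limiter.py; the `TokenBucket` class, its capacity, and its refill rate here are illustrative assumptions.

```python
import time

class TokenBucket:
    """Hedged sketch of a token-bucket rate limiter (stdlib only)."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens proportionally to the time since the last check.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # RATE_LIMIT_ALLOWED.inc()
        return False       # RATE_LIMIT_BLOCKED.inc() -> respond 429
```

Each `allow()` call increments exactly one of the two counters, so their sum equals the total number of rate-limited decisions, which is what makes the blocked-percentage query above meaningful.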
A high block rate may indicate:
Rate limits may be too restrictive
Unexpected traffic surge
Potential abuse or bot traffic
Consider adjusting rate limit configuration in rate_limiter.py.