LLM Gateway exposes Prometheus metrics that track caching performance, provider interactions, gateway performance, and rate limiting. All metrics are defined in app/core/metrics.py.

Metric Types

The gateway uses three Prometheus metric types:

Counter

A cumulative value that only increases (resets to zero on restart)

Histogram

Samples observations and counts them in configurable buckets

Gauge

A value that can go up or down, representing current state
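To make the Histogram type concrete, here is a minimal pure-Python sketch of how observations land in cumulative buckets. This is illustrative only; the gateway uses the prometheus_client library, and the bucket bounds below are a truncated subset of the defaults:

```python
import math

# Cumulative histogram buckets: bucket[i] counts every observation <= BUCKETS[i].
# Subset of the default Prometheus bounds, plus +Inf as the final catch-all.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, math.inf]

def observe_all(observations):
    """Return cumulative bucket counts for a list of observed values."""
    counts = [0] * len(BUCKETS)
    for value in observations:
        for i, upper_bound in enumerate(BUCKETS):
            if value <= upper_bound:
                counts[i] += 1  # cumulative: every bucket at or above the value counts it
    return counts

counts = observe_all([0.003, 0.02, 0.3, 4.2])
# counts[0] == 1 (only 0.003 <= 0.005); counts[-1] == 4 (+Inf sees everything)
```

Because the buckets are cumulative, PromQL's histogram_quantile can later estimate percentiles from these counts without storing individual observations.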

Cache Metrics

Metrics tracking the performance of the Redis-based response cache.

cache_hits_total

Type: Counter
Counts the total number of successful cache hits
Description: Incremented when a request finds a matching cached response and returns it without calling the provider.
When incremented: Every time cache_middleware.py finds a valid cached entry for a request.
Usage example:
from app.core.metrics import CACHE_HITS

CACHE_HITS.inc()  # Increment by 1
PromQL queries:
# Total cache hits
cache_hits_total

# Cache hit rate over 5 minutes
rate(cache_hits_total[5m])

cache_misses_total

Type: Counter
Counts the total number of cache misses
Description: Incremented when a request does not find a cached response and must call the provider.
When incremented: Every time cache_middleware.py fails to find a cached entry for a request.
PromQL queries:
# Cache hit ratio (percentage)
100 * cache_hits_total / (cache_hits_total + cache_misses_total)

# Cache miss rate over 5 minutes
rate(cache_misses_total[5m])
Monitor the cache hit ratio to optimize caching strategy. A low ratio may indicate:
  • Cache TTL is too short
  • Requests are highly variable
  • Cache key generation needs refinement
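On the last point, a deterministic key generation sketch may help. The field names below are hypothetical (the gateway's actual scheme lives in cache_middleware.py); the point is that logically identical requests must serialize identically:

```python
import hashlib
import json

def cache_key(model, messages, temperature=0.0):
    """Hash the request fields that determine the response.
    Field set is illustrative, not the gateway's actual key scheme."""
    # sort_keys makes serialization order-independent, so logically
    # identical requests always produce the same digest.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("gpt-4", [{"role": "user", "content": "hi"}])
k2 = cache_key("gpt-4", [{"role": "user", "content": "hi"}])
# Identical requests yield identical keys, so the second lookup can be a hit.
```

If volatile fields (timestamps, request IDs) leak into the hashed payload, every key becomes unique and the hit ratio collapses; excluding them is the usual refinement.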

Provider Metrics

Metrics tracking interactions with LLM providers (OpenAI, Anthropic, Ollama, etc.).

provider_calls_total

Type: Counter
Counts total provider API calls
Labels:
  • provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Incremented each time the gateway makes a call to a provider’s API.
When incremented: Every successful or failed provider API call.
PromQL queries:
# Total calls by provider
sum by (provider) (provider_calls_total)

# Call rate per provider (last 5 minutes)
rate(provider_calls_total[5m])

# OpenAI specific calls
provider_calls_total{provider="openai"}

provider_failures_total

Type: Counter
Counts failed provider API calls
Labels:
  • provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Incremented when a provider API call fails due to errors, timeouts, or rate limits.
When incremented: On any provider error response or exception during an API call.
PromQL queries:
# Failure rate by provider
rate(provider_failures_total[5m])

# Error percentage by provider
100 * provider_failures_total / provider_calls_total

# Providers with errors in the last hour
increase(provider_failures_total[1h]) > 0
A high failure rate may indicate:
  • Provider API outages
  • Invalid API keys
  • Rate limit exhaustion
  • Network connectivity issues
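The relationship between provider_calls_total and provider_failures_total can be sketched with a small wrapper: every attempt increments the call counter, and only exceptions increment the failure counter. The metric names mirror the gateway's, but this registry and wrapper are illustrative, not its code:

```python
from prometheus_client import CollectorRegistry, Counter

# Separate registry so this sketch does not pollute the process default.
registry = CollectorRegistry()
CALLS = Counter("provider_calls_total", "Total provider API calls",
                ["provider"], registry=registry)
FAILURES = Counter("provider_failures_total", "Failed provider API calls",
                   ["provider"], registry=registry)

def call_provider(provider, fn):
    CALLS.labels(provider=provider).inc()  # count every attempt, success or not
    try:
        return fn()
    except Exception:
        FAILURES.labels(provider=provider).inc()  # count only failures
        raise
```

With this pattern, the error-percentage query above is simply failures over calls, matched per provider label.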

provider_call_latency_seconds

Type: Histogram
Measures provider API call latency in seconds
Labels:
  • provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Tracks the time taken for provider API calls from request start to response completion.
Histogram buckets: Default Prometheus buckets (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, +Inf)
Usage example:
from app.core.metrics import PROVIDER_LATENCY
import time

start = time.time()
try:
    response = await provider_client.call()
finally:
    duration = time.time() - start
    PROVIDER_LATENCY.labels(provider="openai").observe(duration)
PromQL queries:
# Average latency by provider (5 min)
rate(provider_call_latency_seconds_sum[5m]) / rate(provider_call_latency_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(provider_call_latency_seconds_bucket[5m]))

# 99th percentile latency by provider
histogram_quantile(0.99, rate(provider_call_latency_seconds_bucket[5m]))

Gateway Performance Metrics

Metrics tracking overall gateway request handling and performance.

gateway_requests_total

Type: Counter
Counts total requests to the gateway
Description: Incremented for every request received by the gateway, regardless of outcome.
When incremented: On every incoming HTTP request to chat completion endpoints.
PromQL queries:
# Total requests
gateway_requests_total

# Request rate (requests per second)
rate(gateway_requests_total[1m])

# Total requests in last hour
increase(gateway_requests_total[1h])

gateway_request_latency_seconds

Type: Histogram
Measures end-to-end request latency in seconds
Description: Tracks the total time from request receipt to response completion, including cache lookups, provider calls, and processing.
Usage example:
from app.core.metrics import REQUEST_LATENCY
import time

start = time.time()
try:
    response = await process_request(request)
finally:
    REQUEST_LATENCY.observe(time.time() - start)
PromQL queries:
# Average request latency
rate(gateway_request_latency_seconds_sum[5m]) / rate(gateway_request_latency_seconds_count[5m])

# 50th percentile (median)
histogram_quantile(0.50, rate(gateway_request_latency_seconds_bucket[5m]))

# 95th percentile
histogram_quantile(0.95, rate(gateway_request_latency_seconds_bucket[5m]))

# 99th percentile
histogram_quantile(0.99, rate(gateway_request_latency_seconds_bucket[5m]))
Gateway latency includes cache lookup time, provider call time, and processing overhead. Compare with provider_call_latency_seconds to understand caching impact.
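histogram_quantile works by finding the bucket where the target rank falls and interpolating linearly within it. A rough pure-Python sketch of the idea (Prometheus's real implementation operates on bucket rates and handles several edge cases this sketch ignores):

```python
def estimate_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.
    buckets: list of (upper_bound, cumulative_count) pairs, bounds ascending."""
    total = buckets[-1][1]
    rank = q * total  # the rank of the observation we want
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly between this bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 100 observations: 50 at/below 0.1s, 90 at/below 0.5s, all at/below 1.0s
p95 = estimate_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100)])
# rank 95 falls in the (0.5, 1.0] bucket, halfway through it -> 0.75s
```

This is why quantile accuracy depends on bucket layout: a percentile landing in a wide bucket is interpolated coarsely, so bucket bounds should straddle the latencies you care about.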

gateway_active_requests

Type: Gauge
Number of requests currently being processed
Description: Tracks the current number of in-flight requests being processed by the gateway.
When changed:
  • Incremented when request processing starts
  • Decremented when request processing completes
Usage example:
from app.core.metrics import ACTIVE_REQUESTS

ACTIVE_REQUESTS.inc()  # Request started
try:
    response = await process_request()
finally:
    ACTIVE_REQUESTS.dec()  # Request completed
PromQL queries:
# Current active requests
gateway_active_requests

# Maximum concurrent requests (5 min window)
max_over_time(gateway_active_requests[5m])

# Average concurrent requests
avg_over_time(gateway_active_requests[5m])
Monitor this metric to:
  • Detect traffic spikes
  • Identify potential bottlenecks
  • Set autoscaling thresholds
  • Plan capacity requirements

Rate Limiter Metrics

Metrics tracking the token bucket rate limiter behavior.

rate_limit_allowed_total

Type: Counter
Counts requests allowed by the rate limiter
Description: Incremented when a request passes rate limiting checks and is allowed to proceed.
When incremented: When the token bucket has sufficient tokens for the request.
PromQL queries:
# Total allowed requests
rate_limit_allowed_total

# Allowed request rate
rate(rate_limit_allowed_total[5m])

rate_limit_blocked_total

Type: Counter
Counts requests blocked by the rate limiter
Description: Incremented when a request is rejected due to rate limit exhaustion (returns 429 Too Many Requests).
When incremented: When the token bucket is empty and cannot fulfill the request.
PromQL queries:
# Total blocked requests
rate_limit_blocked_total

# Block rate
rate(rate_limit_blocked_total[5m])

# Percentage of requests blocked
100 * rate_limit_blocked_total / (rate_limit_allowed_total + rate_limit_blocked_total)
A high block rate indicates:
  • Rate limits may be too restrictive
  • Unexpected traffic surge
  • Potential abuse or bot traffic
Consider adjusting rate limit configuration in rate_limiter.py.
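The allowed/blocked split maps directly onto a token bucket. A self-contained sketch of the mechanism (the gateway's actual limiter lives in rate_limiter.py; the capacity and rate here are illustrative):

```python
import time

class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/sec up to `capacity`."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # would increment rate_limit_allowed_total
        return False      # would increment rate_limit_blocked_total (429)

bucket = TokenBucket(capacity=2, rate=1.0)
results = [bucket.allow() for _ in range(3)]  # third call finds the bucket empty
```

Raising `capacity` absorbs larger bursts, while raising `rate` lifts the sustained throughput ceiling; the two knobs move the blocked percentage independently.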

Querying Metrics

Query metrics through the Prometheus web UI: navigate to http://localhost:9090/graph and enter PromQL queries:
# Cache hit ratio over time
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))

Example PromQL Queries

Performance Monitoring

# Requests per second
rate(gateway_requests_total[1m])

# Average response time
rate(gateway_request_latency_seconds_sum[5m]) / rate(gateway_request_latency_seconds_count[5m])

# Cache effectiveness
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) * 100

Provider Health

# Provider error rate
rate(provider_failures_total[5m]) / rate(provider_calls_total[5m]) * 100

# Slowest provider (avg latency)
topk(3, rate(provider_call_latency_seconds_sum[5m]) / rate(provider_call_latency_seconds_count[5m]))

Capacity Planning

# Peak concurrent requests
max_over_time(gateway_active_requests[1h])

# Rate limit utilization
rate(rate_limit_blocked_total[5m]) / rate(rate_limit_allowed_total[5m]) * 100

Next Steps

Set Up Grafana Dashboards

Learn how to visualize these metrics in Grafana with pre-built dashboards
