LLM Gateway exposes Prometheus metrics that track caching performance, provider interactions, gateway performance, and rate limiting. All metrics are defined in app/core/metrics.py.

Metric Types

The gateway uses three Prometheus metric types:

Counter

A cumulative value that only increases (resets to zero on restart)

Histogram

Samples observations and counts them in configurable buckets

Gauge

A value that can go up or down, representing current state
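To make the Histogram type concrete, here is a minimal pure-Python sketch of how observations land in cumulative buckets. This is illustrative only; the gateway uses the prometheus_client library, and the bucket bounds below are a truncated subset of the defaults:

```python
import math

# Cumulative histogram buckets: bucket[i] counts every observation <= BUCKETS[i].
# Subset of the default Prometheus bounds, plus +Inf as the final catch-all.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, math.inf]

def observe_all(observations):
    """Return cumulative bucket counts for a list of observed values."""
    counts = [0] * len(BUCKETS)
    for value in observations:
        for i, upper_bound in enumerate(BUCKETS):
            if value <= upper_bound:
                counts[i] += 1  # cumulative: every bucket at or above the value counts it
    return counts

counts = observe_all([0.003, 0.02, 0.3, 4.2])
# counts[0] == 1 (only 0.003 <= 0.005); counts[-1] == 4 (+Inf sees everything)
```

Because the buckets are cumulative, PromQL's histogram_quantile can later estimate percentiles from these counts without storing individual observations.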

Cache Metrics

Metrics tracking the performance of the Redis-based response cache.

cache_hits_total

Type: Counter
Counts the total number of successful cache hits
Description: Incremented when a request finds a matching cached response and returns it without calling the provider.
When incremented: Every time cache_middleware.py finds a valid cached entry for a request.
Usage example:
from app.core.metrics import CACHE_HITS

CACHE_HITS.inc()  # Increment by 1
PromQL queries:
# Total cache hits
cache_hits_total

# Cache hit rate over 5 minutes
rate(cache_hits_total[5m])

cache_misses_total

Type: Counter
Counts the total number of cache misses
Description: Incremented when a request does not find a cached response and must call the provider.
When incremented: Every time cache_middleware.py fails to find a cached entry for a request.
PromQL queries:
# Cache hit ratio (percentage)
100 * cache_hits_total / (cache_hits_total + cache_misses_total)

# Cache miss rate over 5 minutes
rate(cache_misses_total[5m])
Monitor the cache hit ratio to optimize caching strategy. A low ratio may indicate:
  • Cache TTL is too short
  • Requests are highly variable
  • Cache key generation needs refinement
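On the last point, a deterministic key generation sketch may help. The field names below are hypothetical (the gateway's actual scheme lives in cache_middleware.py); the point is that logically identical requests must serialize identically:

```python
import hashlib
import json

def cache_key(model, messages, temperature=0.0):
    """Hash the request fields that determine the response.
    Field set is illustrative, not the gateway's actual key scheme."""
    # sort_keys makes serialization order-independent, so logically
    # identical requests always produce the same digest.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("gpt-4", [{"role": "user", "content": "hi"}])
k2 = cache_key("gpt-4", [{"role": "user", "content": "hi"}])
# Identical requests yield identical keys, so the second lookup can be a hit.
```

If volatile fields (timestamps, request IDs) leak into the hashed payload, every key becomes unique and the hit ratio collapses; excluding them is the usual refinement.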

Provider Metrics

Metrics tracking interactions with LLM providers (OpenAI, Anthropic, Ollama, etc.).

provider_calls_total

Type: Counter
Counts total provider API calls
Labels:
  • provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Incremented each time the gateway makes a call to a provider’s API.
When incremented: Every successful or failed provider API call.
PromQL queries:
# Total calls by provider
sum by (provider) (provider_calls_total)

# Call rate per provider (last 5 minutes)
rate(provider_calls_total[5m])

# OpenAI specific calls
provider_calls_total{provider="openai"}

provider_failures_total

Type: Counter
Counts failed provider API calls
Labels:
  • provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Incremented when a provider API call fails due to errors, timeouts, or rate limits.
When incremented: On any provider error response or exception during an API call.
PromQL queries:
# Failure rate by provider
rate(provider_failures_total[5m])

# Error percentage by provider
100 * provider_failures_total / provider_calls_total

# Providers with errors in the last hour
increase(provider_failures_total[1h]) > 0
A high failure rate may indicate:
  • Provider API outages
  • Invalid API keys
  • Rate limit exhaustion
  • Network connectivity issues
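The relationship between provider_calls_total and provider_failures_total can be sketched with a small wrapper: every attempt increments the call counter, and only exceptions increment the failure counter. The metric names mirror the gateway's, but this registry and wrapper are illustrative, not its code:

```python
from prometheus_client import CollectorRegistry, Counter

# Separate registry so this sketch does not pollute the process default.
registry = CollectorRegistry()
CALLS = Counter("provider_calls_total", "Total provider API calls",
                ["provider"], registry=registry)
FAILURES = Counter("provider_failures_total", "Failed provider API calls",
                   ["provider"], registry=registry)

def call_provider(provider, fn):
    CALLS.labels(provider=provider).inc()  # count every attempt, success or not
    try:
        return fn()
    except Exception:
        FAILURES.labels(provider=provider).inc()  # count only failures
        raise
```

With this pattern, the error-percentage query above is simply failures over calls, matched per provider label.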

provider_call_latency_seconds

Type: Histogram
Measures provider API call latency in seconds
Labels:
  • provider - Provider name (e.g., “openai”, “anthropic”, “ollama”)
Description: Tracks the time taken for provider API calls from request start to response completion.
Histogram buckets: Default Prometheus buckets (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, +Inf)
Usage example:
from app.core.metrics import PROVIDER_LATENCY
import time

start = time.time()
try:
    response = await provider_client.call()
finally:
    duration = time.time() - start
    PROVIDER_LATENCY.labels(provider="openai").observe(duration)
PromQL queries:
# Average latency by provider (5 min)
rate(provider_call_latency_seconds_sum[5m]) / rate(provider_call_latency_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(provider_call_latency_seconds_bucket[5m]))

# 99th percentile latency by provider
histogram_quantile(0.99, rate(provider_call_latency_seconds_bucket[5m]))

Gateway Performance Metrics

Metrics tracking overall gateway request handling and performance.

gateway_requests_total

Type: Counter
Counts total requests to the gateway
Description: Incremented for every request received by the gateway, regardless of outcome.
When incremented: On every incoming HTTP request to chat completion endpoints.
PromQL queries:
# Total requests
gateway_requests_total

# Request rate (requests per second)
rate(gateway_requests_total[1m])

# Total requests in last hour
increase(gateway_requests_total[1h])

gateway_request_latency_seconds

Type: Histogram
Measures end-to-end request latency in seconds
Description: Tracks the total time from request receipt to response completion, including cache lookups, provider calls, and processing.
Usage example:
from app.core.metrics import REQUEST_LATENCY
import time

start = time.time()
try:
    response = await process_request(request)
finally:
    REQUEST_LATENCY.observe(time.time() - start)
PromQL queries:
# Average request latency
rate(gateway_request_latency_seconds_sum[5m]) / rate(gateway_request_latency_seconds_count[5m])

# 50th percentile (median)
histogram_quantile(0.50, rate(gateway_request_latency_seconds_bucket[5m]))

# 95th percentile
histogram_quantile(0.95, rate(gateway_request_latency_seconds_bucket[5m]))

# 99th percentile
histogram_quantile(0.99, rate(gateway_request_latency_seconds_bucket[5m]))
Gateway latency includes cache lookup time, provider call time, and processing overhead. Compare with provider_call_latency_seconds to understand caching impact.
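histogram_quantile works by finding the bucket where the target rank falls and interpolating linearly within it. A rough pure-Python sketch of the idea (Prometheus's real implementation operates on bucket rates and handles several edge cases this sketch ignores):

```python
def estimate_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.
    buckets: list of (upper_bound, cumulative_count) pairs, bounds ascending."""
    total = buckets[-1][1]
    rank = q * total  # the rank of the observation we want
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly between this bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 100 observations: 50 at/below 0.1s, 90 at/below 0.5s, all at/below 1.0s
p95 = estimate_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100)])
# rank 95 falls in the (0.5, 1.0] bucket, halfway through it -> 0.75s
```

This is why quantile accuracy depends on bucket layout: a percentile landing in a wide bucket is interpolated coarsely, so bucket bounds should straddle the latencies you care about.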

gateway_active_requests

Type: Gauge
Number of requests currently being processed
Description: Tracks the current number of in-flight requests being processed by the gateway.
When changed:
  • Incremented when request processing starts
  • Decremented when request processing completes
Usage example:
from app.core.metrics import ACTIVE_REQUESTS

ACTIVE_REQUESTS.inc()  # Request started
try:
    response = await process_request()
finally:
    ACTIVE_REQUESTS.dec()  # Request completed
PromQL queries:
# Current active requests
gateway_active_requests

# Maximum concurrent requests (5 min window)
max_over_time(gateway_active_requests[5m])

# Average concurrent requests
avg_over_time(gateway_active_requests[5m])
Monitor this metric to:
  • Detect traffic spikes
  • Identify potential bottlenecks
  • Set autoscaling thresholds
  • Plan capacity requirements

Rate Limiter Metrics

Metrics tracking the token bucket rate limiter behavior.

rate_limit_allowed_total

Type: Counter
Counts requests allowed by the rate limiter
Description: Incremented when a request passes rate limiting checks and is allowed to proceed.
When incremented: When the token bucket has sufficient tokens for the request.
PromQL queries:
# Total allowed requests
rate_limit_allowed_total

# Allowed request rate
rate(rate_limit_allowed_total[5m])

rate_limit_blocked_total

Type: Counter
Counts requests blocked by the rate limiter
Description: Incremented when a request is rejected due to rate limit exhaustion (returns 429 Too Many Requests).
When incremented: When the token bucket is empty and cannot fulfill the request.
PromQL queries:
# Total blocked requests
rate_limit_blocked_total

# Block rate
rate(rate_limit_blocked_total[5m])

# Percentage of requests blocked
100 * rate_limit_blocked_total / (rate_limit_allowed_total + rate_limit_blocked_total)
A high block rate indicates:
  • Rate limits may be too restrictive
  • Unexpected traffic surge
  • Potential abuse or bot traffic
Consider adjusting rate limit configuration in rate_limiter.py.
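The allowed/blocked split maps directly onto a token bucket. A self-contained sketch of the mechanism (the gateway's actual limiter lives in rate_limiter.py; the capacity and rate here are illustrative):

```python
import time

class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/sec up to `capacity`."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # would increment rate_limit_allowed_total
        return False      # would increment rate_limit_blocked_total (429)

bucket = TokenBucket(capacity=2, rate=1.0)
results = [bucket.allow() for _ in range(3)]  # third call finds the bucket empty
```

Raising `capacity` absorbs larger bursts, while raising `rate` lifts the sustained throughput ceiling; the two knobs move the blocked percentage independently.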

Querying Metrics

Query metrics through the Prometheus web UI: navigate to http://localhost:9090/graph and enter PromQL queries:
# Cache hit ratio over time
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))

Example PromQL Queries

Performance Monitoring

# Requests per second
rate(gateway_requests_total[1m])

# Average response time
rate(gateway_request_latency_seconds_sum[5m]) / rate(gateway_request_latency_seconds_count[5m])

# Cache effectiveness
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) * 100

Provider Health

# Provider error rate
rate(provider_failures_total[5m]) / rate(provider_calls_total[5m]) * 100

# Slowest provider (avg latency)
topk(3, rate(provider_call_latency_seconds_sum[5m]) / rate(provider_call_latency_seconds_count[5m]))

Capacity Planning

# Peak concurrent requests
max_over_time(gateway_active_requests[1h])

# Rate limit utilization
rate(rate_limit_blocked_total[5m]) / rate(rate_limit_allowed_total[5m]) * 100

Next Steps

Set Up Grafana Dashboards

Learn how to visualize these metrics in Grafana with pre-built dashboards
