Architecture
Umbra’s monitoring stack provides real-time visibility into vLLM model performance using a containerized Prometheus and Grafana deployment.
Components
Prometheus
Scrapes the vLLM /metrics endpoint and stores time-series data
Grafana
Visualizes metrics through pre-configured dashboards
vLLM
Exposes detailed runtime metrics at the /metrics endpoint
Docker Network
Internal network for service communication
Data Flow
- vLLM exposes a /metrics endpoint with detailed runtime metrics about the model
- Prometheus continuously scrapes this endpoint and stores the data as structured time series
- Grafana displays these metrics using pre-configured dashboards
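To make the first two steps of this flow concrete, here is a minimal Python sketch of reading the kind of text a /metrics endpoint returns. It assumes the standard Prometheus text exposition format and deliberately simplifies it (no timestamps, no escaped label values). The sample payload, the `example-model` label value, and the `parse_metrics` helper are illustrative assumptions, not part of Umbra's stack.

```python
import re

# Matches a sample line: metric name, optional {labels}, then the value.
# Simplified: lines with trailing timestamps or escaped quotes are skipped.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and "# HELP" / "# TYPE" comments carry no samples.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, shaped like what a scraper would receive.
SAMPLE_PAYLOAD = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 3.0
vllm:num_requests_waiting{model_name="example-model"} 1.0
"""

for name, labels, value in parse_metrics(SAMPLE_PAYLOAD):
    print(name, labels, value)
```

Prometheus performs this parsing itself on every scrape and stores each sample with a timestamp; the sketch only shows what a single scrape contains.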
Available Dashboards
The monitoring stack includes three pre-configured dashboards:
User Metrics Overview
Tracks user-facing performance metrics:
- TTFT (Time to First Token)
- End-to-end latency
- Queue waiting time
- Number of running requests
Machine Metrics Overview
Monitors hardware resource utilization:
- GPU usage and memory
- CPU workload
- Running and waiting requests
- System resource consumption
vLLM Tokens Dashboard
Provides token-level metrics for throughput analysis.
Configuration Strategy
The monitoring stack uses a secure two-step configuration process:
Environment Variables
Sensitive credentials and endpoints are stored in a .env file (never committed to Git)
Template Processing
Configuration files are generated from .template files with whitelisted variable substitution
Why Templates?
This approach protects internal Grafana and Prometheus variables (like $job or $datasource) from being accidentally replaced while allowing safe injection of secrets.
Whitelisted Variables
Prometheus (prometheus.yml.template):
- ${SCHEME} - HTTP or HTTPS protocol
- ${VLLM_TARGET} - vLLM endpoint address
- ${VLLM_METRICS_AUTH_TOKEN} - Bearer token for metrics access
- ${VLLM_SCRAPE_JOB_NAME} - Prometheus job name
- ${GRAFANA_DATASOURCE_UID} - Data source identifier
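The whitelisting idea can be sketched in a few lines of Python. The tool Umbra actually uses to render its templates is not described here, so treat this as an illustration of the technique only: `render_template`, the regex, and the sample template string are assumptions, while the variable names come from the list above.

```python
import re

# Variable names taken from the whitelist above.
WHITELIST = {
    "SCHEME",
    "VLLM_TARGET",
    "VLLM_METRICS_AUTH_TOKEN",
    "VLLM_SCRAPE_JOB_NAME",
    "GRAFANA_DATASOURCE_UID",
}

# Only the braced form ${NAME} is considered; a bare $job is never matched.
PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render_template(template, values, whitelist=WHITELIST):
    """Substitute ${NAME} only when NAME is whitelisted and has a value."""

    def replace(match):
        name = match.group(1)
        if name in whitelist and name in values:
            return values[name]
        # Anything else (e.g. ${datasource}) is left exactly as written.
        return match.group(0)

    return PLACEHOLDER_RE.sub(replace, template)


# Hypothetical template line mixing whitelisted and internal variables.
template = "${SCHEME}://${VLLM_TARGET} job=$job ds=${datasource}"
values = {"SCHEME": "https", "VLLM_TARGET": "vllm:8000"}
print(render_template(template, values))
```

Because substitution is driven by an explicit allow-list rather than by whatever happens to be in the environment, dashboard variables such as $job and ${datasource} pass through unchanged even if a variable with the same name exists.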
Access
Once running, the monitoring interfaces are available at:
- Grafana: http://localhost:4000 (or custom GRAFANA_PORT)
- Prometheus: Internal Docker network only (not exposed publicly)
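As a quick way to confirm that Grafana is reachable, the following Python sketch builds the URL from GRAFANA_PORT and queries Grafana's /api/health endpoint. It assumes GRAFANA_PORT is available as an environment variable and that the default is 4000 as stated above; the helper names are hypothetical.

```python
import json
import os
import urllib.request


def grafana_base_url(env=None):
    """Build the local Grafana URL, falling back to port 4000."""
    env = os.environ if env is None else env
    port = env.get("GRAFANA_PORT", "4000")  # assumed variable name and default
    return f"http://localhost:{port}"


def grafana_is_healthy(base_url, timeout=5.0):
    """Return True if /api/health answers and reports a healthy database."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"
```

Calling `grafana_is_healthy(grafana_base_url())` from the host returns False until the container is up. Prometheus is not checked here because it is only reachable on the internal Docker network.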
