Available Dashboards
The monitoring stack includes three pre-configured dashboards that provide comprehensive visibility into vLLM performance:
- User Metrics: User-facing performance indicators
- Machine Metrics: Hardware resource utilization
- vLLM Tokens: Token throughput analysis
User Metrics Overview
The User Metrics Overview dashboard tracks user-facing performance indicators critical for production deployments.
Key Metrics
Time to First Token (TTFT)
Measures the latency between request arrival and the first generated token.
- Metric: vllm:time_to_first_token_seconds
- Importance: Directly impacts perceived responsiveness
- Target: Lower is better (typically less than 1 s for good UX)
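For example, a p95 TTFT panel could use a query along these lines (a sketch; it assumes the metric is exported as a standard Prometheus histogram with a _bucket series):

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
)
```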
End-to-End Latency
Total time from request submission to complete response.
- Metric: vllm:e2e_request_latency_seconds
- Includes: Queue time + processing time + network overhead
- Use case: SLA monitoring and capacity planning
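As a sketch, an average end-to-end latency panel can divide the histogram's running sum by its count (assuming the standard _sum and _count series exist for this metric):

```promql
rate(vllm:e2e_request_latency_seconds_sum[5m])
  / rate(vllm:e2e_request_latency_seconds_count[5m])
```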
Queue Waiting Time
Time requests spend waiting before processing begins.
- Metric: vllm:request_queue_time_seconds
- Indicator: High queue times suggest capacity constraints
- Action: Scale up vLLM instances when consistently elevated
Running Requests
Number of requests currently being processed.
- Metric: vllm:num_requests_running
- Normal range: Depends on batch size and throughput
- Alert threshold: Configure based on your vLLM capacity
Dashboard Panels
The dashboard is organized into logical sections:
- Overview: High-level request rate and latency percentiles
- Latency Breakdown: TTFT, queue time, and processing time
- Request Queue: Running vs waiting requests over time
- Throughput: Requests per second and tokens per second
Machine Metrics Overview
The Machine Metrics Overview dashboard monitors hardware resource utilization inside the TEE.
Key Metrics
GPU Usage
Real-time GPU utilization and memory consumption.
- GPU Memory: vllm:gpu_memory_usage_bytes
- KV Cache Usage: vllm:gpu_cache_usage_perc
KV Cache: The key-value cache stores attention states. High cache usage (>90%) can cause request queuing.
CPU Workload
CPU usage for preprocessing and request orchestration.
- Metric: vllm:cpu_usage_percent
- Baseline: vLLM is GPU-bound; CPU should stay below 50%
- High CPU: May indicate tokenization bottlenecks
Request Queue Depth
Number of requests waiting for GPU availability.
- Running: vllm:num_requests_running
- Waiting: vllm:num_requests_waiting
- Swapped: vllm:num_requests_swapped
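A queue-depth panel typically plots these three series together; a single backlog figure can be derived by summing the not-yet-running requests (a hedged sketch):

```promql
# Total backlog across instances (waiting + swapped)
sum(vllm:num_requests_waiting + vllm:num_requests_swapped)
```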
Memory Breakdown
Detailed memory allocation across system components.
- Total GPU Memory: Available VRAM
- Model Weights: Fixed allocation for model parameters
- KV Cache: Dynamic allocation for attention states
- Activation Memory: Temporary tensors during computation
Resource Alerts
Consider setting alerts for:
- GPU Memory >95%: Risk of OOM errors
- KV Cache >90%: Request queuing likely
- Waiting Requests >10: Capacity saturation
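These thresholds could be encoded as Prometheus alerting rules, roughly as follows (a sketch: the rule names and `for` durations are illustrative, and it assumes the cache metric is reported as a 0-1 fraction):

```yaml
groups:
  - name: vllm-capacity
    rules:
      - alert: VllmKvCacheHigh
        expr: vllm:gpu_cache_usage_perc > 0.90   # assumes a 0-1 fraction
        for: 2m                                  # require a sustained violation
        labels:
          severity: warning
        annotations:
          summary: "KV cache above 90%; request queuing likely"
      - alert: VllmWaitingRequestsHigh
        expr: vllm:num_requests_waiting > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "More than 10 waiting requests; capacity saturated"
```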
vLLM Tokens Dashboard
Provides token-level granularity for throughput analysis:
- Tokens Generated: Total output tokens over time
- Tokens per Second: Instantaneous generation rate
- Token Efficiency: Tokens per GPU utilization percent
- Batch Size Impact: Correlation between batch size and throughput
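The Tokens per Second panel is typically driven by a token counter; assuming vLLM's vllm:generation_tokens_total counter is available, the instantaneous rate might be computed as:

```promql
rate(vllm:generation_tokens_total[5m])
```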
Customizing Dashboards
Editing Panels
To modify an existing panel:
- Click the panel title
- Select Edit
- Modify the query, visualization, or thresholds
- Click Apply to save changes
Adding Custom Panels
To add a new panel:
- Write a Prometheus query: Use the query editor to select vLLM metrics (see the Prometheus Queries section).
- Configure visualization: Choose a graph type (Time series, Gauge, Stat, etc.) and customize display options.
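A minimal panel definition of the kind these steps produce might look like this (a sketch following Grafana's dashboard JSON model; the query expression is illustrative, and the datasource UID placeholder matches this stack's template variable):

```json
{
  "title": "Tokens per Second",
  "type": "timeseries",
  "datasource": { "type": "prometheus", "uid": "${GRAFANA_DATASOURCE_UID}" },
  "targets": [
    {
      "expr": "rate(vllm:generation_tokens_total[5m])",
      "legendFormat": "tokens/s"
    }
  ]
}
```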
Persisting Changes
Dashboards are provisioned from template files. To persist customizations:
- Export the dashboard:
  - Dashboard settings > JSON Model
  - Copy the JSON
- Update the template:
  - Edit grafana/provisioning/dashboards/[dashboard_name].json.template
  - Paste the JSON, preserving the template variables ${VLLM_SCRAPE_JOB_NAME} and ${GRAFANA_DATASOURCE_UID}
- Rebuild the stack so the templates are re-rendered with your environment values.
Always maintain the template variable placeholders when editing JSON templates, or the configuration injection will fail.
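To illustrate why the placeholders matter, here is a minimal sketch of the kind of substitution the provisioning step performs at startup (the sed-based rendering and the example values are assumptions; your stack may use envsubst or another mechanism):

```shell
# Render a dashboard template by substituting the placeholder variables.
# The job name ("vllm") and datasource UID ("prometheus") are illustrative.
TEMPLATE='{"expr":"up{job=\"${VLLM_SCRAPE_JOB_NAME}\"}","uid":"${GRAFANA_DATASOURCE_UID}"}'
RENDERED=$(printf '%s' "$TEMPLATE" \
  | sed -e 's/${VLLM_SCRAPE_JOB_NAME}/vllm/' \
        -e 's/${GRAFANA_DATASOURCE_UID}/prometheus/')
printf '%s\n' "$RENDERED"
```

If a placeholder is renamed or deleted, the substitution silently leaves the literal `${...}` string in the dashboard JSON, which is why the warning above applies.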
Prometheus Queries
Common vLLM Metrics
These metrics are exposed by vLLM and available in Prometheus (exact names can vary slightly between vLLM versions; check your instance's /metrics endpoint):
Latency Metrics
- vllm:time_to_first_token_seconds: Time to first token (histogram)
- vllm:e2e_request_latency_seconds: End-to-end request latency (histogram)
- vllm:time_per_output_token_seconds: Time per output token (histogram)
Throughput Metrics
- vllm:request_success_total: Successfully completed requests (counter)
- vllm:generation_tokens_total: Generated output tokens (counter)
Resource Metrics
- vllm:gpu_cache_usage_perc: KV cache usage
- vllm:num_requests_running / vllm:num_requests_waiting / vllm:num_requests_swapped: Request queue state
Query Tips
Using rate() for counters
Counter metrics (like vllm:request_success_total) only ever increase. Use rate() to calculate per-second rates; the [5m] window smooths out short spikes.
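For example (a sketch; the counter name is taken from vLLM's metric set):

```promql
# Per-second rate of successfully completed requests, smoothed over 5 minutes
rate(vllm:request_success_total[5m])
```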
Histogram quantiles for percentiles
For latency metrics, use histogram_quantile() to calculate percentiles.
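For example, a p99 end-to-end latency query might look like this (a sketch; it assumes the metric exposes the standard histogram _bucket series):

```promql
histogram_quantile(
  0.99,
  sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m]))
)
```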
Filtering by job
If you have multiple vLLM instances, filter by job name:
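For example (the literal job name is illustrative; in this stack it is injected via ${VLLM_SCRAPE_JOB_NAME}):

```promql
rate(vllm:request_success_total{job="vllm"}[5m])
```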
Dashboard Best Practices
Time Range Selection
- Real-time monitoring: Last 5-15 minutes
- Incident investigation: Last 1-6 hours
- Capacity planning: Last 7-30 days
Refresh Intervals
- Production monitoring: 10-30 seconds
- Development: 1 minute
- Historical analysis: Manual refresh
Alert Thresholds
Set conservative thresholds to avoid alert fatigue:
- Warning: 80% of capacity
- Critical: 95% of capacity
- Require: 2-3 consecutive violations before alerting
Grafana’s unified alerting can send notifications to Slack, PagerDuty, email, and other channels.
Troubleshooting
No data in panels
- Check Prometheus is scraping: Confirm the vLLM scrape job is present in the Prometheus configuration and that at least one scrape interval has elapsed.
- Verify the metrics endpoint: Confirm the vLLM server is exposing Prometheus metrics on its /metrics endpoint.
- Inspect Prometheus targets:
  - Open the Prometheus UI (internal network)
  - Navigate to Status > Targets
  - Ensure the vLLM target is "UP"
Incorrect metric values
- Verify time range: Ensure the dashboard time picker matches your expectations
- Check aggregation: Some queries use rate() or avg(), which can smooth data
- Compare with raw metrics: Query Prometheus directly to verify values
Dashboard won’t save
Provisioned dashboards are read-only. To make changes:
- Save as a new dashboard (not provisioned)
- Or export and update the JSON template (see Persisting Changes)
