
Available Dashboards

The monitoring stack includes three pre-configured dashboards that provide comprehensive visibility into vLLM performance:

  • User Metrics: User-facing performance indicators
  • Machine Metrics: Hardware resource utilization
  • vLLM Tokens: Token throughput analysis

User Metrics Overview

The User Metrics Overview dashboard tracks user-facing performance indicators critical for production deployments.

Key Metrics

Time to First Token (TTFT)

Measures the latency between request arrival and the first generated token.
  • Metric: vllm:time_to_first_token_seconds
  • Importance: Directly impacts perceived responsiveness
  • Target: Lower is better (typically less than 1s for good UX)

End-to-End Latency

Total time from request submission to complete response.
  • Metric: vllm:e2e_request_latency_seconds
  • Includes: Queue time + processing time + network overhead
  • Use case: SLA monitoring and capacity planning

Queue Waiting Time

Time requests spend waiting before processing begins.
  • Metric: vllm:request_queue_time_seconds
  • Indicator: High queue times suggest capacity constraints
  • Action: Scale up vLLM instances when consistently elevated

Running Requests

Number of requests currently being processed.
  • Metric: vllm:num_requests_running
  • Normal range: Depends on batch size and throughput
  • Alert threshold: Configure based on your vLLM capacity

Dashboard Panels

The dashboard is organized into logical sections:
  1. Overview: High-level request rate and latency percentiles
  2. Latency Breakdown: TTFT, queue time, and processing time
  3. Request Queue: Running vs waiting requests over time
  4. Throughput: Requests per second and tokens per second

Machine Metrics Overview

The Machine Metrics Overview dashboard monitors hardware resource utilization inside the TEE.

Key Metrics

GPU Usage

Real-time GPU memory consumption and KV cache occupancy.
  • GPU Memory: vllm:gpu_memory_usage_bytes
  • KV Cache Usage: vllm:gpu_cache_usage_perc
KV Cache: The key-value cache stores attention states. High cache usage (>90%) can cause request queuing.

CPU Workload

CPU usage for preprocessing and request orchestration.
  • Metric: vllm:cpu_usage_percent
  • Baseline: vLLM is GPU-bound; CPU should be less than 50%
  • High CPU: May indicate tokenization bottlenecks

Request Queue Depth

Number of requests waiting for GPU availability.
  • Running: vllm:num_requests_running
  • Waiting: vllm:num_requests_waiting
  • Swapped: vllm:num_requests_swapped

Memory Breakdown

Detailed memory allocation across system components.
  • Total GPU Memory: Available VRAM
  • Model Weights: Fixed allocation for model parameters
  • KV Cache: Dynamic allocation for attention states
  • Activation Memory: Temporary tensors during computation
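As a rough illustration of how these components add up, here is a minimal sketch in Python. All figures (an 80 GB GPU, a 40 GB model, a 0.9 memory-utilization fraction, a 2 GB activation reserve) are hypothetical, not measurements from a real deployment:

```python
# Hypothetical VRAM budget sketch. vLLM reserves a configurable fraction of
# total VRAM (gpu_memory_utilization, default 0.9); what remains after model
# weights and activation headroom goes to the KV cache.
def kv_cache_budget_gb(total_vram_gb, weights_gb,
                       gpu_memory_utilization=0.9, activation_gb=2.0):
    """Estimate VRAM left for the KV cache, in GB."""
    usable = total_vram_gb * gpu_memory_utilization
    return usable - weights_gb - activation_gb

# e.g. an 80 GB GPU with ~40 GB of weights leaves ~30 GB for KV cache
budget = kv_cache_budget_gb(80.0, 40.0)
```

A larger KV cache budget directly increases how many concurrent sequences can be batched before requests start queuing.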

Resource Alerts

Consider setting alerts for:
  • GPU Memory >95%: Risk of OOM errors
  • KV Cache >90%: Request queuing likely
  • Waiting Requests >10: Capacity saturation
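The alert conditions above can be sketched as a simple threshold check. This is an illustrative Python fragment, not part of the stack; the metric names and sample values are invented for the example:

```python
# Illustrative evaluation of the resource alerts above against one
# snapshot of samples. Names and values are hypothetical.
THRESHOLDS = {
    "gpu_memory_perc": 95.0,   # risk of OOM errors
    "kv_cache_perc": 90.0,     # request queuing likely
    "waiting_requests": 10.0,  # capacity saturation
}

def firing_alerts(samples):
    """Return the names of metrics currently above their threshold."""
    return [name for name, value in samples.items()
            if value > THRESHOLDS.get(name, float("inf"))]

alerts = firing_alerts(
    {"gpu_memory_perc": 96.2, "kv_cache_perc": 71.0, "waiting_requests": 3}
)
```

In practice you would express these as Grafana or Prometheus alert rules rather than code, but the comparison logic is the same.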

vLLM Tokens Dashboard

Provides token-level granularity for throughput analysis:
  • Tokens Generated: Total output tokens over time
  • Tokens per Second: Instantaneous generation rate
  • Token Efficiency: Tokens generated per percentage point of GPU utilization
  • Batch Size Impact: Correlation between batch size and throughput

Customizing Dashboards

Editing Panels

To modify an existing panel:
  1. Click the panel title
  2. Select Edit
  3. Modify the query, visualization, or thresholds
  4. Click Apply to save changes
Dashboard changes are ephemeral unless you export and commit them. See the persistence section below.

Adding Custom Panels

  1. Create a new panel: Click Add panel in the dashboard toolbar.
  2. Write a Prometheus query: Use the query editor to select vLLM metrics (see Prometheus Queries section).
  3. Configure visualization: Choose graph type (Time series, Gauge, Stat, etc.) and customize display options.
  4. Set thresholds: Define warning and critical levels using absolute values or percentages.

Persisting Changes

Dashboards are provisioned from template files. To persist customizations:
  1. Export the dashboard:
    • Dashboard settings > JSON Model
    • Copy the JSON
  2. Update the template:
    • Edit grafana/provisioning/dashboards/[dashboard_name].json.template
    • Paste the JSON, preserving template variables:
      • ${VLLM_SCRAPE_JOB_NAME}
      • ${GRAFANA_DATASOURCE_UID}
  3. Rebuild the stack:
    make docker-stop
    make docker-up
    
Always maintain the template variable placeholders when editing JSON templates, or the configuration injection will fail.
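A quick sanity check before rebuilding can catch a pasted-over placeholder. This is a hypothetical helper, not part of the stack; the sample JSON string is invented for the example:

```python
# Sketch of a pre-rebuild check: verify the exported dashboard JSON still
# contains the template placeholders the stack substitutes at startup.
REQUIRED_PLACEHOLDERS = ("${VLLM_SCRAPE_JOB_NAME}", "${GRAFANA_DATASOURCE_UID}")

def missing_placeholders(template_text):
    """Return any required placeholders absent from the template text."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in template_text]

# Fabricated excerpt of an exported dashboard, missing the scrape-job variable
sample = '{"datasource": {"uid": "${GRAFANA_DATASOURCE_UID}"}}'
problems = missing_placeholders(sample)
```

Run a check like this against the edited `.json.template` file before `make docker-up`; an empty result means the placeholders survived the paste.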

Prometheus Queries

Common vLLM Metrics

These metrics are exposed by vLLM and available in Prometheus:

Latency Metrics

# Average TTFT over last 5 minutes
rate(vllm:time_to_first_token_seconds_sum[5m]) 
  / rate(vllm:time_to_first_token_seconds_count[5m])

# 95th percentile end-to-end latency
histogram_quantile(0.95, 
  rate(vllm:e2e_request_latency_seconds_bucket[5m])
)

# Queue waiting time (avg)
rate(vllm:request_queue_time_seconds_sum[5m]) 
  / rate(vllm:request_queue_time_seconds_count[5m])

Throughput Metrics

# Requests per second
rate(vllm:request_success_total[1m])

# Tokens generated per second
rate(vllm:generation_tokens_total[1m])

# Current running requests
vllm:num_requests_running

Resource Metrics

# GPU memory usage (bytes)
vllm:gpu_memory_usage_bytes

# GPU cache usage (percentage)
vllm:gpu_cache_usage_perc

# CPU usage
vllm:cpu_usage_percent

Query Tips

Counter metrics (like vllm:request_success_total) always increase. Use rate() to calculate per-second rates:
rate(vllm:request_success_total[5m])
The [5m] window smooths out spikes.
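What rate() computes can be shown in miniature: the per-second increase between two counter samples, with a monotonicity guard for counter resets. The timestamps and values below are invented for illustration:

```python
# Minimal sketch of rate() over a single pair of counter samples.
# Prometheus averages over every sample in the window; this shows the
# two-point case, plus the reset handling (counter drops to zero on restart).
def counter_rate(t0, v0, t1, v1):
    """Per-second rate of a monotonically increasing counter."""
    if v1 < v0:     # counter reset, e.g. vLLM restarted
        v0 = 0.0
    return (v1 - v0) / (t1 - t0)

# 600 successful requests over a 300 s window -> 2 req/s
rps = counter_rate(0.0, 1000.0, 300.0, 1600.0)
```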
For latency metrics, use histogram_quantile() to calculate percentiles:
# P50 (median)
histogram_quantile(0.50, rate(vllm:e2e_request_latency_seconds_bucket[5m]))

# P99
histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))
If you have multiple vLLM instances, filter by job name:
vllm:num_requests_running{job="production-vllm"}

Dashboard Best Practices

Time Range Selection

  • Real-time monitoring: Last 5-15 minutes
  • Incident investigation: Last 1-6 hours
  • Capacity planning: Last 7-30 days

Refresh Intervals

  • Production monitoring: 10-30 seconds
  • Development: 1 minute
  • Historical analysis: Manual refresh

Alert Thresholds

Set conservative thresholds to avoid alert fatigue:
  • Warning: 80% of capacity
  • Critical: 95% of capacity
  • Require: 2-3 consecutive violations before alerting
Grafana’s unified alerting can send notifications to Slack, PagerDuty, email, and other channels.
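The "consecutive violations" rule is a simple debounce: fire only after the threshold has been breached N evaluations in a row. A minimal sketch, with invented sample values:

```python
# Sketch of debounced alerting: a lone spike is ignored, a sustained
# breach fires. In Grafana this corresponds to the rule's pending period.
def should_alert(samples, threshold, consecutive=3):
    """True if `threshold` is exceeded `consecutive` times in a row."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False

noisy = should_alert([96, 80, 82, 97, 81], threshold=95)   # isolated spikes
sustained = should_alert([96, 97, 98, 81], threshold=95)   # real breach
```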

Troubleshooting

No data in panels

  1. Check Prometheus is scraping:
    docker compose logs prometheus | grep "vllm"
    
  2. Verify metrics endpoint:
    curl http://your-vllm:8000/metrics
    
  3. Inspect Prometheus targets:
    • Open Prometheus UI (internal network)
    • Navigate to Status > Targets
    • Ensure vLLM target is “UP”
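When inspecting the raw /metrics output from step 2, it can help to extract the metric names programmatically. This is an illustrative helper; the sample payload is a fabricated excerpt, not real scrape output:

```python
# Sketch: pull metric family names out of Prometheus text exposition format,
# so you can confirm the expected vllm:* metrics are actually exported.
def metric_names(exposition_text):
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE and blanks
            continue
        # the name is everything before the first '{' or space
        names.add(line.split("{")[0].split(" ")[0])
    return names

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="llama"} 4
vllm:request_success_total{model_name="llama"} 1532
"""
```

If the names you query in Grafana are not in this set, the panel will show no data regardless of the dashboard configuration.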

Incorrect metric values

  • Verify time range: Ensure the dashboard time picker matches your expectations
  • Check aggregation: Some queries use rate() or avg() which can smooth data
  • Compare with raw metrics: Query Prometheus directly to verify values

Dashboard won’t save

Provisioned dashboards are read-only. To make changes:
  1. Save as a new dashboard (not provisioned)
  2. Or export and update the JSON template (see Persisting Changes)
