Quick Start
Access Grafana
Open your browser to:
Anonymous access is enabled by default for development. You can also log in with:
- Username: admin
- Password: admin
Verify data source
Prometheus should already be configured as the default data source. Verify by navigating to: Configuration → Data Sources → Prometheus
You should see status: “Data source is working”
Automatic Provisioning
Grafana is automatically configured through provisioning files mounted from the repository.
Data Source Configuration
The Prometheus data source is provisioned via grafana/provisioning/datasources/datasource.yml:
datasource.yml
- Display name for the data source in Grafana
- Data source type (prometheus, graphite, influxdb, etc.)
- proxy means the Grafana server queries Prometheus (recommended for Docker)
- Prometheus server URL (uses Docker service name prometheus:9090)
- Makes this the default data source for new panels
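A minimal sketch of what this file might look like, reconstructed from the field descriptions above. The display name `Prometheus` and the `http://` scheme are assumptions; the repository's actual file may differ.

```yaml
# grafana/provisioning/datasources/datasource.yml (illustrative sketch, not the repository's file)
apiVersion: 1

datasources:
  - name: Prometheus             # display name shown in Grafana (assumed value)
    type: prometheus             # data source type
    access: proxy                # the Grafana server queries Prometheus on the browser's behalf
    url: http://prometheus:9090  # Docker service name and port; scheme assumed
    isDefault: true              # default data source for new panels
```

Grafana reads files in this directory at startup, so a change here typically requires restarting the Grafana container.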
Dashboard Provisioning
Dashboard auto-loading is configured via grafana/provisioning/dashboards/dashboards.yml:
dashboards.yml
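A dashboard provider file for this layout might look like the sketch below. The provider name, the in-container path `/etc/grafana/provisioning/dashboards`, and the option values are assumptions chosen for illustration.

```yaml
# grafana/provisioning/dashboards/dashboards.yml (illustrative sketch)
apiVersion: 1

providers:
  - name: default                # hypothetical provider name
    orgId: 1
    folder: ""                   # empty string places dashboards in the General folder
    type: file                   # load dashboards from JSON files on disk
    disableDeletion: false
    updateIntervalSeconds: 30    # how often Grafana rescans the directory (assumed value)
    options:
      path: /etc/grafana/provisioning/dashboards  # assumed mount point inside the container
```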
Place dashboard JSON files in grafana/provisioning/dashboards/ to auto-load them on Grafana startup.
Docker Configuration
Grafana runs as a Docker service configured in docker-compose.yml:
docker-compose.yml
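The service definition is not reproduced here, so the following is only a sketch of a typical setup. The image tag, the port mapping (3000 is Grafana's default), the `Viewer` role, the volume mount, and the `depends_on` entry are assumptions. The `GF_*` names are Grafana's standard environment overrides and correspond to the settings described under Environment Variables.

```yaml
# docker-compose.yml excerpt (illustrative sketch)
services:
  grafana:
    image: grafana/grafana:latest            # assumed image and tag
    ports:
      - "3000:3000"                          # Grafana's default HTTP port
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin     # admin user password
      - GF_AUTH_ANONYMOUS_ENABLED=true       # anonymous access without login
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer    # role for anonymous users (assumed value)
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning  # assumed mount of the provisioning files
    depends_on:
      - prometheus                           # service name taken from the data source URL
```

Anonymous access with default credentials suits local development only; disable it and change the password before exposing Grafana elsewhere.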
Environment Variables
- Admin user password (default: admin)
- Enables anonymous access without login
- Role for anonymous users (Viewer, Editor, or Admin)
Creating Dashboards
Dashboard Example: Gateway Overview
Create a comprehensive dashboard to monitor gateway health:
Add request rate panel
Panel 1: Request Rate
- Query: rate(gateway_requests_total[1m])
- Panel type: Graph
- Title: “Requests per Second”
- Y-axis label: “req/s”
Add latency panel
Panel 2: Request Latency (P95)
- Query: histogram_quantile(0.95, rate(gateway_request_latency_seconds_bucket[5m]))
- Panel type: Graph
- Title: “Request Latency (95th percentile)”
- Y-axis label: “seconds”
Add active requests panel
Panel 3: Active Requests
- Query: gateway_active_requests
- Panel type: Stat
- Title: “Active Requests”
Add cache hit rate panel
Panel 4: Cache Hit Rate
- Query: rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) * 100
- Panel type: Gauge
- Title: “Cache Hit Rate (%)”
- Thresholds: Red < 50%, Yellow 50-80%, Green > 80%
Dashboard Example: Provider Performance
Monitor individual provider performance:
Provider Metrics Dashboard
Panel 1: Calls by Provider
- Panel type: Bar gauge
- Legend: {{provider}}

- Panel type: Time series
- Legend: {{provider}} - P95

- Panel type: Time series
- Legend: {{provider}}
- Y-axis: Percentage

- Panel type: Stat
- Calculation: Total
Essential Panels
Recommended panels for monitoring LLM Gateway:
- Performance Metrics
- Cache Performance
- Provider Health
Alert Configuration
Set up alerts to notify you of issues:
Set evaluation interval
- Evaluate every: 1m
- For: 5m
Recommended Alerts
High Error Rate
High Latency
Cache Hit Rate Drop
High Rate Limit Blocks
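The rule definitions for these alerts are not preserved here. As a rough illustration, two of them can be written as a Prometheus-style rule file using the metrics from the dashboard examples. This is a different mechanism from Grafana-managed alert rules. The group name, the 2-second latency threshold, and the severity labels are assumptions. The 50% cache threshold mirrors the red gauge threshold. The error-rate and rate-limit alerts are omitted because their metric names do not appear in this guide.

```yaml
# Illustrative Prometheus alerting rules (sketch; thresholds are assumptions)
groups:
  - name: llm-gateway-alerts       # hypothetical group name
    interval: 1m                   # matches "Evaluate every: 1m"
    rules:
      - alert: HighLatency
        # P95 latency aggregated across all series; 2s is an illustrative threshold
        expr: histogram_quantile(0.95, sum by (le) (rate(gateway_request_latency_seconds_bucket[5m]))) > 2
        for: 5m                    # matches "For: 5m"
        labels:
          severity: warning
        annotations:
          summary: "Gateway P95 latency is above 2 seconds"
      - alert: CacheHitRateDrop
        # Hit rate as a percentage, aggregated across all series
        expr: |
          sum(rate(cache_hits_total[5m]))
            / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
            * 100 < 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate is below 50%"
```

The same expressions can be pasted into a Grafana alert rule query if alerts are managed in Grafana instead.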
Dashboard Variables
Add variables to make dashboards dynamic:
Create provider variable
- Name: provider
- Type: Query
- Query: label_values(provider_calls_total, provider)
- Multi-value: Enabled
- Include All option: Enabled
Exporting and Sharing
Export Dashboard JSON
Share Dashboard Link
Generate shareable links:
- Click Share → Link
- Configure options:
- Lock time range: Preserves current time selection
- Shorten URL: Creates shorter link
- Copy and share the URL
Best Practices
Dashboard Organization
- Group by concern: Create separate dashboards for Performance, Providers, Cache, etc.
- Use folders: Organize dashboards into folders (“LLM Gateway”, “Infrastructure”)
- Consistent naming: Use clear, descriptive dashboard names
- Add descriptions: Include dashboard purpose and key metrics in description
Panel Configuration
- Descriptive titles: Make panel purpose immediately clear
- Appropriate visualizations: Time series for trends, gauges for ratios, stats for current values
- Set units: Configure Y-axis units (seconds, percentage, requests/sec)
- Use legends: Enable legends with {{label}} syntax for multi-series
- Color coding: Use consistent colors (green = good, yellow = warning, red = critical)
Query Optimization
- Use appropriate intervals: Match rate() interval to your needs (1m for real-time, 5m for trends)
- Limit time range: Don’t query more data than necessary
- Use recording rules: Pre-compute expensive queries in Prometheus
- Minimize resolution: Adjust Min interval in panel settings
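As an example of the recording-rule suggestion, the two heaviest dashboard queries could be pre-computed in Prometheus along these lines. The group name, the evaluation interval, and the recorded series names are hypothetical. The file would also need to be listed under `rule_files` in the Prometheus configuration.

```yaml
# Illustrative Prometheus recording rules (sketch)
groups:
  - name: llm-gateway-recording    # hypothetical group name
    interval: 30s                  # assumed evaluation interval
    rules:
      # Pre-computed P95 request latency over a 5-minute window
      - record: gateway:request_latency_seconds:p95_5m
        expr: histogram_quantile(0.95, sum by (le) (rate(gateway_request_latency_seconds_bucket[5m])))
      # Pre-computed cache hit rate as a percentage
      - record: cache:hit_rate_percent:5m
        expr: |
          sum(rate(cache_hits_total[5m]))
            / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
            * 100
```

Panels can then query `gateway:request_latency_seconds:p95_5m` directly instead of re-evaluating the histogram on every refresh.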
Performance
- Limit panels per dashboard: 10-15 panels max for fast loading
- Use template variables: Filter data without multiple dashboards
- Set refresh intervals: Balance freshness vs load (30s-1m for most cases)
- Cache query results: Enable query caching in data source settings where available (query caching is a Grafana Enterprise and Grafana Cloud feature)
Troubleshooting
No data in panels
Possible causes:
- Prometheus not scraping: Check Prometheus UI at http://localhost:9090/targets
- Wrong time range: Adjust time picker to include data
- Incorrect query: Validate PromQL in Prometheus UI first
- Gateway not running: Ensure gateway is up and exposing metrics
Grafana can't connect to Prometheus
Error: “Bad Gateway” or “Service Unavailable”
Fix:
- Verify that Prometheus is running.
- Check that the data source URL uses the Docker service name (prometheus:9090).
- Restart Grafana.
Dashboard not saving
Error: “Dashboard save failed”
Cause: Provisioned dashboards are read-only by default.
Fix:
- Save as new dashboard with different name
- Or modify provisioning config to allow edits:
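One way to allow edits is the `allowUiUpdates` option on the dashboard provider. The sketch below assumes a provider named `default` and an in-container path of `/etc/grafana/provisioning/dashboards`; adjust both to match the actual dashboards.yml.

```yaml
# grafana/provisioning/dashboards/dashboards.yml (illustrative sketch)
apiVersion: 1

providers:
  - name: default                # hypothetical provider name
    type: file
    allowUiUpdates: true         # permit saving provisioned dashboards from the UI
    options:
      path: /etc/grafana/provisioning/dashboards  # assumed path inside the container
```

Changes saved from the UI are stored in Grafana's database, not written back to the JSON file, so they may be overwritten the next time the file on disk changes. Export the dashboard JSON and commit it if the edit should persist.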
Alerts not firing
Check:
- Alert rule is active (not paused)
- Notification channel is configured (called a contact point in newer Grafana versions)
- Evaluation interval allows condition to persist
- Query returns expected values in Explore tab
- Use Test button in alert rule editor
- Check Alerting → Alert rules for evaluation history
Next Steps
Metrics Reference
Learn about all available metrics and their meanings
Observability Overview
Understand the complete observability architecture