Monitoring Architecture
Components
Metrics Exporters:- YB-Master metrics endpoint:
http://master-ip:7000/metrics - YB-TServer metrics endpoint:
http://tserver-ip:9000/metrics - YCQL metrics endpoint:
http://tserver-ip:12000/metrics - YSQL metrics endpoint:
http://tserver-ip:13000/metrics
- Prometheus (recommended)
- Custom metrics collectors
- Cloud-native monitoring (CloudWatch, Stackdriver, Azure Monitor)
- Grafana dashboards
- YugabyteDB Platform UI
- Custom dashboards
- Prometheus Alertmanager
- PagerDuty integration
- Email/Slack notifications
Quick Start: Web Interfaces
Built-in Web Consoles
YB-Master Console:- Cluster overview
- Table and tablet information
- Master leader status
- Recent operations log
- Replication status
- Tablet server details
- Hosted tablets list
- Resource utilization
- Connection statistics
- Slow operation traces
- PostgreSQL statistics
- Query performance metrics
- Connection pool status
- Statement statistics
- CQL request metrics
- Keyspace statistics
- Connection metrics
Key Metrics
Cluster Health Metrics
Master Metrics:Performance Metrics
Latency Metrics:Resource Utilization
CPU Metrics:Replication Metrics
Compaction Metrics
Prometheus Setup
Installation
Install Prometheus:http://localhost:9090
Node Exporter Setup
Monitor system-level metrics:Grafana Setup
Installation
http://localhost:3000 (default: admin/admin)
Configure Data Source
- Navigate to Configuration > Data Sources
- Add Prometheus data source
- URL:
http://localhost:9090 - Save & Test
Import YugabyteDB Dashboards
YugabyteDB provides pre-built Grafana dashboards:- Grafana UI > Dashboards > Import
- Upload JSON files from repository
- Select Prometheus data source
- Cluster Overview: High-level cluster health
- Node Details: Per-node resource utilization
- Table Details: Per-table performance metrics
- YSQL Metrics: PostgreSQL-specific metrics
- Replication Metrics: Cross-region replication status
Alert Configuration
Prometheus Alert Rules
Createalert_rules.yml:
Alertmanager Configuration
Createalertmanager.yml:
Query Performance Monitoring
YSQL Query Statistics
Enable pg_stat_statements:Active Session History (ASH)
YugabyteDB collects active session samples for query analysis:Log Monitoring
Log Locations
Key Log Patterns
Watch for errors:Log Aggregation
Using ELK Stack:Custom Metrics
Exporting Custom Metrics
YugabyteDB exposes metrics in Prometheus format:Creating Derived Metrics
Prometheus PromQL for calculated metrics:Best Practices
Monitoring Strategy
- Implement monitoring before production
- Establish baseline metrics for normal operation
- Set alerts on leading indicators (CPU, disk, latency)
- Review dashboards regularly for trends
- Test alert routing before incidents
Metric Retention
- Real-time: 15-second granularity, 24 hours
- Recent: 1-minute granularity, 7 days
- Historical: 5-minute granularity, 90 days
- Long-term: 1-hour granularity, 1+ years
Dashboard Organization
Critical Metrics Dashboard:- Cluster status
- Node availability
- Error rates
- Critical alerts
- Latency percentiles
- Throughput trends
- Resource utilization
- Compaction status
- Disk usage trends
- Memory growth
- Connection counts
- Table size growth
Alert Hygiene
- Avoid alert fatigue - tune thresholds appropriately
- Include runbook links in alert descriptions
- Use severity levels correctly (critical vs warning)
- Set escalation policies for unacknowledged alerts
- Review and adjust alert rules quarterly
Troubleshooting Monitoring
Metrics Not Appearing
Check endpoints:High Cardinality Issues
Limit label cardinality to prevent Prometheus overload:- Avoid labels with unbounded values
- Use recording rules for expensive queries
- Aggregate high-cardinality metrics
- Consider metric federation for large clusters
Next Steps
- Performance Tuning - Optimize based on metrics
- Troubleshooting - Diagnose issues from metrics
- Admin Guide - Administrative monitoring tasks
- Backup and Restore - Monitor backup health

