Monitoring Options
CockroachDB Cloud provides multiple ways to monitor your clusters:
- Cloud Console Metrics: Built-in performance dashboards
- Exported Metrics: Integration with Datadog, Prometheus, and CloudWatch
- SQL Activity: Query performance and insights
- Alerts: Automated notifications for issues
Cloud Console Metrics
All CockroachDB Cloud plans include access to performance metrics in the Cloud Console.
Access Metrics
Select Tab
Choose from available metric categories:
- Overview: High-level cluster health
- SQL: Query performance
- Request Units: RU consumption (Basic/Standard)
- Changefeeds: CDC performance
- Row-Level TTL: TTL job metrics
- Custom: Create custom charts
Overview Metrics
Key cluster health indicators:
CPU Usage:
- Shows SQL and storage CPU utilization
- Alert if consistently >70%
- Scale up if sustained high usage
SQL Activity:
- Queries per second (QPS)
- Statement latency percentiles (P50, P90, P99)
- Identify traffic patterns and spikes
Storage:
- Used storage across cluster
- Storage capacity (Advanced only)
- Growth trends
Connections:
- Active SQL connections
- New connection attempts
- Monitor for connection pool issues
Request Units:
- RU consumption rate
- Breakdown by resource type
- Track against provisioned capacity
SQL Metrics
Detailed query performance:
Statement Execution:
- Total statements executed
- Breakdown by statement type (SELECT, INSERT, UPDATE, DELETE)
- Execution latency by percentile
Transactions:
- Transaction count
- Transaction latency
- Contention events
- Retries and errors
Connections:
- Active connections
- Idle connections
- Connection rate
Changefeed Metrics
Monitor change data capture:
Throughput:
- Messages emitted per second
- Bytes emitted per second
Latency:
- Commit-to-emit latency
- Queue processing time
Status:
- Running changefeeds
- Failed changefeeds
- Checkpoint progress
Export Metrics
Standard and Advanced clusters can export metrics to external monitoring platforms.
Supported Platforms
- Datadog: Full-featured monitoring and alerting
- Prometheus: Open-source monitoring and time-series database
- CloudWatch: AWS native monitoring service
Export to Datadog
Get Datadog API Key
- Log in to Datadog
- Navigate to Organization Settings > API Keys
- Create or copy an API key
Configure Export
In CockroachDB Cloud Console:
- Go to cluster Monitoring > Metrics Export
- Click Add Integration
- Select Datadog
- Enter API key and select region
- Click Save
Exported metrics include:
- crdb.capacity.available
- crdb.capacity.used
- crdb.sql.conns
- crdb.sql.query.count
- crdb.sql.query.latency
- Plus 100+ additional metrics
Export to Prometheus
Enable Metrics Export
- Navigate to Monitoring > Metrics Export
- Click Add Integration
- Select Prometheus
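Once the integration is enabled, a Prometheus server scrapes metrics in the Prometheus text exposition format. As a minimal sketch for spot-checking a scrape by hand, the parser below reads simple exposition lines into a dictionary. The sample metric names and values are made up for illustration, and the parser assumes lines without timestamps or spaces inside label values.

```python
def parse_exposition(text):
    """Parse simple Prometheus text-format lines into {series: value}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical scrape output; real metric names will differ.
SAMPLE = """\
# HELP sql_conns Number of open SQL connections
# TYPE sql_conns gauge
sql_conns{node="1"} 42
sql_conns{node="2"} 17
capacity_used 1.5e+09
"""

if __name__ == "__main__":
    for series, value in parse_exposition(SAMPLE).items():
        print(series, value)
```

For anything beyond a quick check, an existing Prometheus client library is likely a better fit than a hand-written parser.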
Export to CloudWatch
For Advanced clusters on AWS:
Configure Export
- Go to Monitoring > Metrics Export
- Select CloudWatch
- Enter role ARN and log group name
- Click Save
SQL Activity Monitoring
Monitor individual query performance and identify issues.
Statements Page
View and analyze SQL statement performance:
Review Statements
View statements sorted by:
- Execution count
- Rows processed
- Bytes read
- Latency (P50, P90, P99)
- Contention time
- Execution Count: How often the query runs
- Rows Read: Amount of data scanned
- Latency: Query response time
- Contention: Lock wait time
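The same prioritization can be sketched offline against exported statement statistics. The example below assumes a hypothetical record shape (`fingerprint`, `execution_count`, `rows_read`, `p99_latency_ms`, `contention_ms`); these field names and sample rows are illustrative and are not the Console's export format.

```python
from dataclasses import dataclass


@dataclass
class StatementStats:
    # Field names are hypothetical; map them to whatever your export provides.
    fingerprint: str
    execution_count: int
    rows_read: int
    p99_latency_ms: float
    contention_ms: float


def top_statements(stats, key, limit=3):
    """Return the `limit` statements with the largest value of attribute `key`."""
    return sorted(stats, key=lambda s: getattr(s, key), reverse=True)[:limit]


# Made-up sample rows for illustration only.
SAMPLE = [
    StatementStats("SELECT * FROM orders WHERE _", 12000, 480000, 35.0, 2.0),
    StatementStats("UPDATE accounts SET _ WHERE _", 900, 900, 210.0, 140.0),
    StatementStats("INSERT INTO events VALUES _", 50000, 0, 8.0, 0.5),
]

if __name__ == "__main__":
    for metric in ("execution_count", "p99_latency_ms", "contention_ms"):
        worst = top_statements(SAMPLE, metric, limit=1)[0]
        print(f"{metric}: {worst.fingerprint}")
```

Sorting by one metric at a time mirrors the sort options on the Statements page; a statement that ranks high on several metrics is usually the first one worth examining.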
Transactions Page
Monitor transaction-level performance:
- Transaction count and rate
- Transaction latency percentiles
- Retry counts
- Contention events
- Transaction breakdown by statement
Insights Page
Automatic performance recommendations:
Available Insights:
- High retry counts
- Queries with sub-optimal indexes
- Schema design issues
- Transaction contention
- Performance bottlenecks
Review Recommendations
Each insight includes:
- Problem description
- Affected queries
- Recommended solution
- Estimated impact
Alerts and Notifications
Configure automated alerts for cluster issues.
Built-in Alerts
Organization Admins automatically receive alerts for:
- Planned Maintenance: Upcoming updates and maintenance
- Performance Issues: High CPU, memory, or storage usage
- Cluster Problems: Node failures, replication issues
- Backup Failures: Failed backup jobs
Configure Alert Recipients
Alert Types
| Alert Type | Description | Severity |
|---|---|---|
| Cluster unavailable | Cluster not responding | Critical |
| High CPU | CPU usage >80% for 30+ min | Warning |
| Low storage | Storage >85% full | Warning |
| Backup failed | Backup job failed | Warning |
| Node down | Node unreachable | Critical |
| High memory | Memory usage >90% | Warning |
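A condition such as "CPU usage >80% for 30+ min" depends on duration as well as level. The sketch below shows one way to test for a sustained breach in your own tooling; it assumes samples arrive as `(timestamp_minutes, value)` pairs sorted by time and does not describe how the built-in alerts are evaluated.

```python
def sustained_breach(samples, threshold, min_minutes):
    """Return True if values stay above `threshold` for at least `min_minutes`.

    `samples` is a list of (timestamp_minutes, value) pairs sorted by time.
    """
    run_start = None
    for ts, value in samples:
        if value > threshold:
            # Remember when the current run of high readings began.
            if run_start is None:
                run_start = ts
            if ts - run_start >= min_minutes:
                return True
        else:
            # Any reading at or below the threshold resets the run.
            run_start = None
    return False


if __name__ == "__main__":
    steady = [(m, 85.0) for m in range(0, 41)]
    spike = [(m, 95.0 if 10 <= m < 20 else 40.0) for m in range(0, 41)]
    print(sustained_breach(steady, 80, 30))
    print(sustained_breach(spike, 80, 30))
```

Resetting on a single low reading keeps short spikes from triggering the alert; a more tolerant variant could allow a small number of dips within the window.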
External Alerting
Integrate with external alerting platforms:
Via Exported Metrics:
- Set up alerts in Datadog, Prometheus, or CloudWatch
- Define custom thresholds and notification rules
- Combine with application metrics
- Poll cluster status endpoints
- Implement custom alerting logic
- Integrate with PagerDuty, Opsgenie, etc.
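Polling with custom alerting logic can be sketched as a small loop. In the example below, `fetch_state` and `notify` are hypothetical caller-supplied callables: the first would wrap whichever status endpoint you poll, the second would forward a message to PagerDuty, Opsgenie, or a similar tool. The state names `OK` and `DEGRADED` are placeholders, not values returned by any CockroachDB API.

```python
import time


def poll_status(fetch_state, notify, healthy_states, attempts, interval_s=0.0):
    """Call `fetch_state` up to `attempts` times and notify on unhealthy results.

    Returns the number of notifications sent.
    """
    alerts = 0
    for _ in range(attempts):
        try:
            state = fetch_state()
        except Exception as exc:
            # Treat an unreachable endpoint as an unhealthy result.
            state = f"UNREACHABLE ({exc})"
        if state not in healthy_states:
            notify(f"cluster state is {state}")
            alerts += 1
        time.sleep(interval_s)
    return alerts


if __name__ == "__main__":
    # Simulated responses stand in for a real status endpoint.
    states = iter(["OK", "DEGRADED", "OK"])
    sent = []
    poll_status(lambda: next(states), sent.append, {"OK"}, attempts=3)
    print(sent)
```

Passing the fetcher and notifier in as arguments keeps the loop testable without network access; in production you would likely add deduplication so a long outage does not page on every poll.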
Essential Metrics by Plan
Basic Cluster Metrics
Focus on these metrics for Basic clusters:
- Request Units: Monitor RU consumption and spend limit
- SQL Statements: Track query volume and latency
- Storage: Monitor data growth
- Connections: Ensure proper connection pooling
Standard Cluster Metrics
Key metrics for Standard clusters:
- Provisioned Capacity: Monitor against actual CPU usage
- Request Units: Track RU consumption by resource type
- SQL Performance: Query latency and throughput
- Storage: Monitor usage and growth rate
- Cross-Region Traffic: Optimize for cost
Advanced Cluster Metrics
Important metrics for Advanced clusters:
- Node Health: Individual node CPU, memory, storage
- Replication: Replica distribution and health
- SQL Performance: Query and transaction latency
- Storage IOPS: I/O performance (AWS)
- Network: Inter-node and cross-region traffic
DB Console (Advanced Only)
Advanced clusters have access to the DB Console for detailed monitoring.
Access DB Console
DB Console Features
Overview:
- Cluster topology visualization
- Node status and health
- Live traffic metrics
- 100+ detailed performance metrics
- Customizable time ranges
- Per-node breakdowns
- Live statement execution
- Transaction details
- Contention analysis
- Database and table details
- Index usage statistics
- Schema information
- Running and completed jobs
- Backup/restore progress
- Changefeed status
- Range distribution
- Raft status
- Node logs
- Cluster events
Monitoring Best Practices
Establish Baselines
Identify Patterns
Document typical values for:
- Peak and off-peak hours
- Daily/weekly patterns
- Normal CPU and memory usage
- Typical query latency
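A baseline can be as simple as a handful of summary statistics per metric. The sketch below computes a mean and nearest-rank percentiles from a list of samples, for example query latency in milliseconds collected during a normal week. The function names are illustrative, and nearest-rank is only one of several percentile definitions, so results may differ slightly from what a monitoring platform reports.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(values):
    """Summarize a series of samples (for example, query latency in ms)."""
    return {
        "mean": mean(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }


if __name__ == "__main__":
    # Synthetic samples: 1 ms through 100 ms.
    print(baseline(list(range(1, 101))))
```

Recording separate baselines for peak and off-peak hours makes later comparisons more meaningful than a single all-day figure.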
Monitor Trends
Track changes over time:
- Storage Growth: Plan capacity increases
- Query Volume: Anticipate scaling needs
- Latency Trends: Identify degradation early
- Error Rates: Catch issues before they escalate
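For storage growth, a rough projection helps turn a trend into a planning date. The sketch below fits a least-squares line to daily storage-usage readings and estimates how many days remain until a threshold is reached. The 85% default mirrors the low-storage alert level mentioned in this guide; the linear-growth assumption is a simplification and will not hold for bursty workloads.

```python
def days_until_threshold(daily_usage_pct, threshold_pct=85.0):
    """Estimate days until usage reaches `threshold_pct` using a least-squares line.

    `daily_usage_pct` holds one storage-usage reading (percent) per day,
    oldest first. Returns None if usage is flat or shrinking.
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two readings")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage_pct) / n
    # Slope of the best-fit line, in percentage points per day.
    slope_num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage_pct))
    slope_den = sum((x - x_mean) ** 2 for x in xs)
    slope = slope_num / slope_den
    if slope <= 0:
        return None
    latest = daily_usage_pct[-1]
    if latest >= threshold_pct:
        return 0.0
    return (threshold_pct - latest) / slope


if __name__ == "__main__":
    print(days_until_threshold([60, 61, 62, 63, 64]))
```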
Regular Reviews
Schedule monitoring reviews:
- Daily: Check for alerts and anomalies
- Weekly: Review performance trends
- Monthly: Analyze capacity and optimization opportunities
- Quarterly: Audit monitoring coverage and alerts
Key Performance Indicators
Track these KPIs:
| KPI | Target | Action Threshold |
|---|---|---|
| CPU Utilization | Under 70% | Over 80% for 30 min |
| Query P99 Latency | Under 100ms | Over 200ms |
| Error Rate | Under 0.1% | Over 1% |
| Storage Usage | Under 80% | Over 85% |
| Connection Count | Under 500 | Over 1000 |
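The table can be encoded as a small lookup for use in scripts or dashboards. The sketch below classifies a single reading as `ok`, `watch`, or `act`; the key names and the three-band labels are illustrative choices, and the 30-minute duration on the CPU threshold is not modeled here.

```python
# (target, action threshold) pairs taken from the KPI table above.
KPI_LIMITS = {
    "cpu_utilization_pct": (70, 80),
    "query_p99_latency_ms": (100, 200),
    "error_rate_pct": (0.1, 1),
    "storage_usage_pct": (80, 85),
    "connection_count": (500, 1000),
}


def classify(kpi, value):
    """Return "ok", "watch", or "act" for one KPI reading."""
    target, action = KPI_LIMITS[kpi]
    if value > action:
        return "act"      # past the action threshold
    if value >= target:
        return "watch"    # off target but not yet actionable
    return "ok"


if __name__ == "__main__":
    readings = {"cpu_utilization_pct": 75, "storage_usage_pct": 86}
    for kpi, value in readings.items():
        print(kpi, classify(kpi, value))
```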
Troubleshooting with Metrics
High CPU Usage
Investigate:
- Check SQL Statements for expensive queries
- Review Insights for optimization opportunities
- Look for query volume spikes
Resolve:
- Optimize expensive queries
- Add indexes
- Scale up capacity
High Latency
Investigate:
- Check transaction contention
- Review query execution plans
- Analyze network latency (multi-region)
Resolve:
- Reduce transaction scope
- Optimize queries
- Adjust table localities
Storage Growth
Investigate:
- Review database and table sizes
- Check for data retention policies
- Look for unexpected data growth
Resolve:
- Implement TTL for old data
- Archive historical data
- Compress large columns
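As one illustration of implementing TTL for old data, the helper below builds an `ALTER TABLE` statement that sets the `ttl_expire_after` storage parameter. This is a sketch: the table name `events` and the 90-day interval are hypothetical, the validation patterns are deliberately narrow, and the exact TTL options should be checked against the row-level TTL documentation for your CockroachDB version.

```python
import re

# Narrow allow-lists so arbitrary input is never interpolated into SQL.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_INTERVAL = re.compile(r"\d+ (minutes?|hours?|days?|months?|years?)")


def ttl_statement(table, expire_after):
    """Build an ALTER TABLE statement that enables row-level TTL on `table`."""
    if not _IDENT.fullmatch(table):
        raise ValueError(f"unsupported table name: {table!r}")
    if not _INTERVAL.fullmatch(expire_after):
        raise ValueError(f"unsupported interval: {expire_after!r}")
    return f"ALTER TABLE {table} SET (ttl_expire_after = '{expire_after}');"


if __name__ == "__main__":
    # Hypothetical table and retention period.
    print(ttl_statement("events", "90 days"))
```

Generating the statement rather than executing it leaves room for review before it is applied; expired rows are then removed by a background job whose progress appears under the Row-Level TTL metrics tab.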
Next Steps
- Performance Tuning: Optimize cluster performance
- SQL Activity: Analyze query performance
- Scaling: Learn when and how to scale
- Alerting: Configure alerts