Monitoring
Datum Cloud provides comprehensive monitoring through Prometheus integration, exposing metrics for controllers, resources, and operations.
Overview
Datum exposes metrics in Prometheus format for:
- Controller health and performance
- Resource reconciliation
- API request rates and latency
- Quota usage
- Error rates
Enabling Metrics
Built-in Metrics Endpoint
Metrics are enabled by default on port 8443 (HTTPS). From config/manager/manager.yaml:71:
Access Metrics Locally
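Once the endpoint's output is available locally (for example, saved from a port-forwarded request), it can be inspected programmatically. The following is a minimal sketch of a parser for the Prometheus text format, not Datum's own tooling. The function name `parse_metrics`, the sample text, and the controller label value `project` are hypothetical, and the parser deliberately ignores edge cases such as `}` inside label values.

```python
import re
from typing import Dict, Tuple

# Matches a sample line: metric name, optional {labels}, then the value.
# Any trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text: str) -> Dict[Tuple[str, str], float]:
    """Map (metric name, raw label string) to the sample value."""
    samples: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


# Hypothetical scrape output used only for illustration.
sample_text = """# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="project",result="success"} 42
controller_runtime_reconcile_errors_total{controller="project"} 3
"""

for (name, labels), value in parse_metrics(sample_text).items():
    print(name, labels, value)
```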
Configure Metrics Service
The metrics service is defined in config/default/metrics_service.yaml:
Prometheus Integration
Install Prometheus Operator
If not already installed:
Enable ServiceMonitor
Datum includes a ServiceMonitor resource in config/prometheus/:
Verify Prometheus Discovery
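One way to confirm discovery is to read the response of the Prometheus HTTP API endpoint `/api/v1/targets` and check that the Datum scrape target reports `up`. The sketch below only parses a response that has already been fetched. The function name `target_health` and the job name `datum-controller-manager-metrics` in the sample are assumptions for illustration; the actual job label depends on how the ServiceMonitor is configured.

```python
import json
from typing import Dict


def target_health(targets_json: str, job_substring: str) -> Dict[str, str]:
    """Return {scrape URL: health} for active targets whose job label matches.

    An empty result means Prometheus has not discovered a matching target.
    """
    payload = json.loads(targets_json)
    health: Dict[str, str] = {}
    for target in payload["data"]["activeTargets"]:
        job = target.get("labels", {}).get("job", "")
        if job_substring in job:
            health[target.get("scrapeUrl", "")] = target.get("health", "unknown")
    return health


# Hypothetical, trimmed /api/v1/targets response used only for illustration.
sample_response = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "datum-controller-manager-metrics"},
         "scrapeUrl": "https://10.0.0.5:8443/metrics", "health": "up"},
        {"labels": {"job": "node-exporter"},
         "scrapeUrl": "http://10.0.0.9:9100/metrics", "health": "down"},
    ]},
})

print(target_health(sample_response, "datum"))
```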
Available Metrics
Controller Metrics
- Reconciliation
- Work Queue
- Go Runtime
- Leader Election
- controller_runtime_reconcile_total: Total number of reconciliations per controller.
- controller_runtime_reconcile_errors_total: Total number of reconciliation errors.
- controller_runtime_reconcile_time_seconds: Time spent in reconciliation.
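The two counters above are most useful in combination: the share of reconciliations that failed over an interval. The sketch below computes that ratio from two snapshots of each counter. The function names are hypothetical, and the counter-reset handling is a simplification of what Prometheus does. In PromQL, a comparable ratio could be expressed as `rate(controller_runtime_reconcile_errors_total[5m]) / rate(controller_runtime_reconcile_total[5m])`.

```python
def counter_increase(before: float, now: float) -> float:
    """Increase of a counter between two snapshots.

    If the counter went down, assume the process restarted and the counter
    reset to zero, so the current value is the increase.
    """
    return now if now < before else now - before


def reconcile_error_ratio(total_before: float, total_now: float,
                          errors_before: float, errors_now: float) -> float:
    """Fraction of reconciliations in the interval that ended in an error."""
    total = counter_increase(total_before, total_now)
    if total == 0:
        return 0.0  # no reconciliations ran, so there is nothing to report
    return counter_increase(errors_before, errors_now) / total


# Illustrative values: 50 reconciliations in the interval, 5 of them failed.
print(reconcile_error_ratio(100, 150, 4, 9))
```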
Grafana Dashboards
Controller Performance Dashboard
Resource Status Dashboard
Import Pre-built Dashboard
You can create custom Grafana dashboards using the metrics exposed by Datum controllers. Import them through the Grafana UI or use the JSON import feature.
Alerting
Prometheus Alert Rules
Create alert rules for common issues:
Verify Alerts
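After rules are loaded, the Prometheus HTTP API endpoint `/api/v1/alerts` reports which alerts are pending or firing. The following sketch lists the firing ones from a response that has already been fetched. The function name `firing_alerts` and the alert name `DatumReconcileErrors` are hypothetical and are not rules shipped with Datum.

```python
import json
from typing import List, Tuple


def firing_alerts(alerts_json: str) -> List[Tuple[str, str]]:
    """Return (alertname, severity) for every alert in the 'firing' state."""
    payload = json.loads(alerts_json)
    firing: List[Tuple[str, str]] = []
    for alert in payload["data"]["alerts"]:
        if alert.get("state") == "firing":
            labels = alert.get("labels", {})
            firing.append((labels.get("alertname", ""), labels.get("severity", "")))
    return firing


# Hypothetical, trimmed /api/v1/alerts response used only for illustration.
sample_alerts = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "DatumReconcileErrors", "severity": "warning"},
         "state": "firing"},
        {"labels": {"alertname": "DatumControllerDown", "severity": "critical"},
         "state": "pending"},
    ]},
})

print(firing_alerts(sample_alerts))
```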
Logging
View Controller Logs
Log Aggregation
Integrate with log aggregation systems:
- Loki
- Elasticsearch
- Datadog
Health Checks
Liveness Probe
Controller liveness endpoint (config/manager/manager.yaml:116):
Readiness Probe
Controller readiness endpoint (config/manager/manager.yaml:121):
Health Check Script
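A health check script can poll both probe endpoints and report whether each returns HTTP 200. The sketch below is one possible shape, not the script shipped with Datum. The paths `/healthz` and `/readyz` are assumed from common controller-runtime defaults rather than taken from this page, and the function names and the injectable `fetch` parameter are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Sequence


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, and so on


def check_health(base_url: str,
                 paths: Sequence[str] = ("/healthz", "/readyz"),  # assumed paths
                 fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Probe each path under base_url and report whether it returned 200."""
    return {path: fetch(base_url + path) == 200 for path in paths}
```

With the probe port forwarded to the local machine, a call such as `check_health("http://localhost:8081")` would return a dictionary like `{"/healthz": True, "/readyz": False}`; the port 8081 here is likewise an assumption and should be replaced with the port configured in the manager manifest.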
Observability Best Practices
Enable metrics
Always enable Prometheus metrics in production.
Set up alerts
Configure alerts for critical issues.
Create dashboards
Build Grafana dashboards for visibility.
Aggregate logs
Send logs to a centralized logging system.
Monitor resources
Track CPU, memory, and disk usage.
Review regularly
Regularly review metrics and logs.
Troubleshooting
Metrics not appearing
High memory usage
Controller not reconciling
Next Steps
Security
Security best practices and RBAC
Managing Resources
Resource management workflows
Quota Management
Manage resource quotas
Configuration
Configure metrics and monitoring