
Monitoring

Datum Cloud provides comprehensive monitoring through Prometheus integration, exposing metrics for controllers, resources, and operations.

Overview

Datum exposes metrics in Prometheus format for:
  • Controller health and performance
  • Resource reconciliation
  • API request rates and latency
  • Quota usage
  • Error rates

Enabling Metrics

Built-in Metrics Endpoint

Metrics are enabled by default on port 8443 (HTTPS). From config/manager/manager.yaml:71:
env:
  - name: METRICS_BIND_ADDRESS
    value: "0"  # Disabled, use METRICS_SECURE instead
  - name: METRICS_SECURE
    value: "true"  # HTTPS metrics on port 8443

Access Metrics Locally

# Port-forward to metrics endpoint
kubectl port-forward -n datum-system deployment/datum-controller-manager 8443:8443

# Query metrics (requires TLS)
curl -k https://localhost:8443/metrics
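After a scrape, it is worth confirming that the payload actually contains the controller-runtime metric families the rest of this page queries. A minimal sketch, using a canned sample in place of the real curl output (metric names are the standard controller-runtime ones; values are illustrative):

```shell
#!/bin/sh
# Stand-in for: curl -k https://localhost:8443/metrics > metrics.txt
cat > metrics.txt <<'EOF'
# HELP controller_runtime_reconcile_total Total number of reconciliations per controller
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="project",result="success"} 42
workqueue_depth{name="project"} 0
EOF

# Report each expected metric family as found or missing
for family in controller_runtime_reconcile_total workqueue_depth; do
  if grep -q "^$family" metrics.txt; then
    echo "found: $family"
  else
    echo "missing: $family"
  fi
done
```

A "missing" line usually means the controller has not registered that metric yet, or you scraped the wrong port.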

Configure Metrics Service

Metrics service is defined in config/default/metrics_service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: controller-manager-metrics-service
  namespace: datum-system
spec:
  ports:
    - name: https
      port: 8443
      protocol: TCP
      targetPort: 8443
  selector:
    control-plane: controller-manager

Prometheus Integration

Install Prometheus Operator

If not already installed:
# Install prometheus-operator
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml

Enable ServiceMonitor

Datum includes a ServiceMonitor resource in config/prometheus/:
kubectl apply -k config/prometheus
ServiceMonitor definition:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: controller-manager-metrics-monitor
  namespace: datum-system
spec:
  endpoints:
    - path: /metrics
      port: https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: controller-manager

Verify Prometheus Discovery

# Check ServiceMonitor
kubectl get servicemonitor -n datum-system

# Check if Prometheus found the target
kubectl port-forward -n monitoring svc/prometheus-k8s 9090:9090
# Open http://localhost:9090/targets
# Look for "datum-system/controller-manager-metrics-monitor"
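The same check can be scripted against the Prometheus HTTP API instead of the UI. A sketch, with a trimmed sample response standing in for the live call (the job label value matches the ServiceMonitor above):

```shell
#!/bin/sh
# Stand-in for: curl -s http://localhost:9090/api/v1/targets > targets.json
cat > targets.json <<'EOF'
{"status":"success","data":{"activeTargets":[{"labels":{"job":"datum-system/controller-manager-metrics-monitor"},"health":"up"}]}}
EOF

# A scrape target is healthy when its "health" field is "up"
if grep -q '"health":"up"' targets.json; then
  echo "target up"
else
  echo "target down"
fi
```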

Available Metrics

Controller Metrics

controller_runtime_reconcile_total
Total number of reconciliations per controller.
# Reconciliation rate
rate(controller_runtime_reconcile_total[5m])

# By controller
sum by (controller) (rate(controller_runtime_reconcile_total[5m]))
controller_runtime_reconcile_errors_total
Total number of reconciliation errors.
# Error rate
rate(controller_runtime_reconcile_errors_total[5m])

# Error ratio
rate(controller_runtime_reconcile_errors_total[5m]) / 
rate(controller_runtime_reconcile_total[5m])
controller_runtime_reconcile_time_seconds
Time spent in reconciliation.
# Average reconciliation time
rate(controller_runtime_reconcile_time_seconds_sum[5m]) / 
rate(controller_runtime_reconcile_time_seconds_count[5m])

# 95th percentile
histogram_quantile(0.95, 
  rate(controller_runtime_reconcile_time_seconds_bucket[5m])
)
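The error ratio can also be spot-checked straight from a raw scrape, without Prometheus. A sketch over sample exposition text (controller name and counter values are illustrative):

```shell
#!/bin/sh
# Stand-in for: curl -k https://localhost:8443/metrics > reconcile.txt
cat > reconcile.txt <<'EOF'
controller_runtime_reconcile_total{controller="project",result="success"} 180
controller_runtime_reconcile_total{controller="project",result="error"} 20
controller_runtime_reconcile_errors_total{controller="project"} 20
EOF

# errors_total / reconcile_total, summing reconcile_total over its result label
awk '
  $1 ~ /^controller_runtime_reconcile_errors_total/ { errors += $2; next }
  $1 ~ /^controller_runtime_reconcile_total/        { total  += $2 }
  END { printf "error ratio: %.2f\n", errors / total }
' reconcile.txt
```

For the sample above this prints `error ratio: 0.10` (20 errors out of 200 reconciliations).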

Grafana Dashboards

Controller Performance Dashboard

{
  "dashboard": {
    "title": "Datum Controller Performance",
    "panels": [
      {
        "title": "Reconciliation Rate",
        "targets": [
          {
            "expr": "sum by (controller) (rate(controller_runtime_reconcile_total[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))"
          }
        ]
      },
      {
        "title": "Reconciliation Duration P95",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (controller, le) (rate(controller_runtime_reconcile_time_seconds_bucket[5m])))"
          }
        ]
      },
      {
        "title": "Work Queue Depth",
        "targets": [
          {
            "expr": "workqueue_depth"
          }
        ]
      }
    ]
  }
}

Resource Status Dashboard

{
  "dashboard": {
    "title": "Datum Resources",
    "panels": [
      {
        "title": "Organizations",
        "targets": [
          {
            "expr": "count(kube_customresource_info{customresource_group='resourcemanager.miloapis.com', customresource_kind='Organization'})"
          }
        ]
      },
      {
        "title": "Projects",
        "targets": [
          {
            "expr": "count(kube_customresource_info{customresource_group='resourcemanager.miloapis.com', customresource_kind='Project'})"
          }
        ]
      },
      {
        "title": "Workloads",
        "targets": [
          {
            "expr": "count(kube_customresource_info{customresource_group='compute.datumapis.com', customresource_kind='Workload'})"
          }
        ]
      }
    ]
  }
}

Custom Dashboards

You can build custom Grafana dashboards, like the two above, from the metrics exposed by the Datum controllers. Import dashboard JSON through the Grafana UI or its JSON import feature.

Alerting

Prometheus Alert Rules

Create alert rules for common issues:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: datum-alerts
  namespace: datum-system
spec:
  groups:
    - name: datum.rules
      interval: 30s
      rules:
        # Controller not reconciling
        - alert: DatumControllerStalled
          expr: |
            rate(controller_runtime_reconcile_total[5m]) == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Datum controller {{ $labels.controller }} is stalled"
            description: "No reconciliations in the last 5 minutes."
        
        # High error rate
        - alert: DatumHighErrorRate
          expr: |
            (
              rate(controller_runtime_reconcile_errors_total[5m]) /
              rate(controller_runtime_reconcile_total[5m])
            ) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Datum controller {{ $labels.controller }} has high error rate"
            description: "Error rate is {{ $value | humanizePercentage }}."
        
        # Work queue backed up
        - alert: DatumWorkQueueDepth
          expr: workqueue_depth > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Datum work queue {{ $labels.name }} is backed up"
            description: "Queue depth is {{ $value }}."
        
        # Controller pod down
        - alert: DatumControllerDown
          expr: |
            absent(up{job="datum-controller-manager-metrics"})
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Datum controller is down"
            description: "No metrics from Datum controller in the last 5 minutes."
        
        # Memory usage high
        - alert: DatumHighMemoryUsage
          expr: |
            go_memstats_alloc_bytes{job="datum-controller-manager-metrics"} > 1e9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Datum controller memory usage is high"
            description: "Memory usage is {{ $value | humanize1024 }}."
Save the rule as datum-alerts.yaml and apply it:
kubectl apply -f datum-alerts.yaml
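Alert expressions can be unit-tested before they reach the cluster with `promtool test rules`. A sketch, assuming the `groups:` block above has been extracted from the PrometheusRule into a plain Prometheus rules file named datum-rules.yaml (series and values are synthetic; the 20% error rate trips the 10% threshold):

```yaml
# datum-alerts-test.yaml -- run with: promtool test rules datum-alerts-test.yaml
rule_files:
  - datum-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'controller_runtime_reconcile_total{controller="project"}'
        values: '0+10x20'    # 10 reconciliations per minute
      - series: 'controller_runtime_reconcile_errors_total{controller="project"}'
        values: '0+2x20'     # 2 errors per minute -> 20% error rate
    alert_rule_test:
      - eval_time: 15m
        alertname: DatumHighErrorRate
        exp_alerts:
          - exp_labels:
              severity: warning
              controller: project
```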

Verify Alerts

# Check PrometheusRule
kubectl get prometheusrules -n datum-system

# View in Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-k8s 9090:9090
# Open http://localhost:9090/alerts

Logging

View Controller Logs

# Recent logs
kubectl logs -n datum-system -l control-plane=controller-manager --tail=100

# Follow logs
kubectl logs -n datum-system -l control-plane=controller-manager -f

# Since timestamp
kubectl logs -n datum-system -l control-plane=controller-manager --since=1h

# Specific pod
kubectl logs -n datum-system datum-controller-manager-xxx-yyy
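Controller-runtime logs are structured (JSON via zap by default), so error entries can be filtered by level rather than by grepping free text. A sketch over a sample log stream (field names assume the default zap encoder):

```shell
#!/bin/sh
# Stand-in for: kubectl logs -n datum-system -l control-plane=controller-manager > logs.jsonl
cat > logs.jsonl <<'EOF'
{"level":"info","logger":"controller.project","msg":"reconcile complete"}
{"level":"error","logger":"controller.workload","msg":"reconcile failed","error":"quota exceeded"}
EOF

# Keep only error-level entries
grep '"level":"error"' logs.jsonl
```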

Log Aggregation

Integrate with log aggregation systems:
# Promtail configuration
scrape_configs:
  - job_name: datum
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - datum-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_control_plane]
        regex: controller-manager
        action: keep

Health Checks

Liveness Probe

Controller liveness endpoint:
curl http://localhost:8081/healthz
From config/manager/manager.yaml:116:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20

Readiness Probe

Controller readiness endpoint:
curl http://localhost:8081/readyz
From config/manager/manager.yaml:121:
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
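In scripts it helps to poll these endpoints with a bounded retry rather than a single request, since the controller needs a few seconds after a restart before /readyz succeeds. A minimal sketch (`wait_healthy` is a hypothetical helper; `true` stands in for `curl -fsS http://localhost:8081/readyz`):

```shell
#!/bin/sh
# Poll a command until it succeeds or the attempt budget runs out.
wait_healthy() {
  attempts=$1; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "unhealthy after $attempts attempt(s)"
  return 1
}

# Replace `true` with: curl -fsS http://localhost:8081/readyz
wait_healthy 3 true
```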

Health Check Script

#!/bin/bash
# health-check.sh

set -e

echo "Checking Datum controller health..."

# Check if pod is running
POD=$(kubectl get pod -n datum-system -l control-plane=controller-manager -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
if [ -z "$POD" ]; then
  echo "ERROR: No controller pod found"
  exit 1
fi

STATUS=$(kubectl get pod -n datum-system "$POD" -o jsonpath='{.status.phase}')
if [ "$STATUS" != "Running" ]; then
  echo "ERROR: Controller pod is not running (status: $STATUS)"
  exit 1
fi

# Check liveness (with set -e, test the command directly rather than $?)
if ! kubectl exec -n datum-system "$POD" -- wget -q -O- http://localhost:8081/healthz >/dev/null; then
  echo "ERROR: Liveness check failed"
  exit 1
fi

# Check readiness
if ! kubectl exec -n datum-system "$POD" -- wget -q -O- http://localhost:8081/readyz >/dev/null; then
  echo "ERROR: Readiness check failed"
  exit 1
fi

echo "✓ Datum controller is healthy"

Observability Best Practices

Enable metrics

Always enable Prometheus metrics in production.

Set up alerts

Configure alerts for critical issues.

Create dashboards

Build Grafana dashboards for visibility.

Aggregate logs

Send logs to a centralized logging system.

Monitor resources

Track CPU, memory, and disk usage.

Review regularly

Regularly review metrics and logs.

Troubleshooting

Metrics not appearing

# Check if metrics service exists
kubectl get svc -n datum-system controller-manager-metrics-service

# Check if metrics endpoint is accessible
kubectl exec -n datum-system datum-controller-manager-xxx -c datum-controller-manager -- wget --no-check-certificate -q -O- https://localhost:8443/metrics

# Check ServiceMonitor
kubectl get servicemonitor -n datum-system

# Check Prometheus logs
kubectl logs -n monitoring prometheus-k8s-0 -c prometheus

High memory usage

# Check current memory
kubectl top pod -n datum-system

# Check limits
kubectl get deployment datum-controller-manager -n datum-system -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Increase limits
kubectl edit deployment datum-controller-manager -n datum-system

Controller not reconciling

# Check if leader
kubectl logs -n datum-system -l control-plane=controller-manager | grep leader

# Check work queue depth (requires the metrics port-forward from above)
curl -k https://localhost:8443/metrics | grep workqueue_depth

# Check for errors
kubectl logs -n datum-system -l control-plane=controller-manager | grep ERROR

Next Steps

Security

Security best practices and RBAC

Managing Resources

Resource management workflows

Quota Management

Manage resource quotas

Configuration

Configure metrics and monitoring
