Monitoring and Observability - GovTech Multicloud Platform

Overview

The GovTech platform implements comprehensive monitoring using Prometheus, Grafana, and AWS CloudWatch to ensure visibility into infrastructure and application performance.

Monitoring Stack

Architecture

Application Metrics → Prometheus → Grafana Dashboards
       ↓                  ↓              ↓
  ServiceMonitor    Alert Rules    Visualizations
       ↓                  ↓
  /metrics          Alertmanager → Notifications
       ↓
 AWS CloudWatch Container Insights

Components

Prometheus

Time-series metrics collection and storage with 15-day retention

Grafana

Visualization dashboards with pre-configured GovTech overview

Alertmanager

Alert routing and notification management

CloudWatch

AWS-native monitoring for EKS cluster and infrastructure

Prometheus Setup

Installation

The platform uses the kube-prometheus-stack Helm chart for complete observability:

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f monitoring/prometheus/prometheus.yaml \
  --namespace monitoring \
  --create-namespace

Configuration

Key Prometheus settings from monitoring/prometheus/prometheus.yaml:

prometheus:
  prometheusSpec:
    retention: 15d  # Keep metrics for 15 days
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

ServiceMonitor for Backend

Prometheus discovers application metrics through ServiceMonitor resources:

monitoring/prometheus/backend-servicemonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend-monitor
  namespace: govtech
  labels:
    release: prometheus  # Required for discovery
spec:
  selector:
    matchLabels:
      app: backend
  namespaceSelector:
    matchNames:
      - govtech
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

The backend application must expose metrics at GET /metrics in Prometheus format using libraries like prom-client or express-prometheus-middleware.

Alert Rules

Kubernetes Alerts

Critical alerts for pod and cluster health:

PodCrashLooping - Critical

Trigger: Pod restarts more than 3 times in 15 minutes

rate(kube_pod_container_status_restarts_total{
  namespace="govtech"
}[15m]) * 60 * 15 > 3

Action: Check pod logs for crash reason

kubectl logs <pod> -n govtech --previous
kubectl describe pod <pod> -n govtech

PodPendingTooLong - Warning

Trigger: Pod in Pending state for more than 10 minutes

kube_pod_status_phase{
  namespace="govtech",
  phase="Pending"
} == 1

Action: Check resource availability

kubectl describe pod <pod> -n govtech
kubectl top nodes

HPAAtMaxReplicas - Warning

Trigger: HPA at maximum replicas for 15 minutes

kube_horizontalpodautoscaler_status_current_replicas{namespace="govtech"}
==
kube_horizontalpodautoscaler_spec_max_replicas{namespace="govtech"}

Action: Consider increasing maxReplicas in HPA configuration

Node Alerts

NodeCPUHigh - Warning

Trigger: Node CPU usage above 85% for 10 minutes

100 - (avg by (node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85

NodeMemoryLow - Critical

Trigger: Node available memory below 10%

(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10

NodeDiskAlmostFull - Warning

Trigger: Node disk usage above 85%

(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} /
 node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 15

Application Alerts

PostgresDown - Critical

Trigger: PostgreSQL pod not ready for 2 minutes

kube_pod_status_ready{
  namespace="govtech",
  pod=~"postgres-.*"
} == 0

HighErrorRate - Critical

Trigger: HTTP 5xx error rate exceeds 5%

rate(http_requests_total{namespace="govtech", status=~"5.."}[5m])
/
rate(http_requests_total{namespace="govtech"}[5m])
> 0.05

HighLatency - Warning

Trigger: P99 latency exceeds 2 seconds

histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{namespace="govtech"}[5m])
) > 2

Grafana Dashboards

Access Grafana

Grafana runs as a ClusterIP service. Access it via port-forward:

kubectl port-forward service/grafana-service 3000:3000 -n monitoring

Then open: http://localhost:3000

Default credentials: admin / CHANGE_ME_IN_PRODUCTIONChange the password immediately in production environments.

GovTech Overview Dashboard

The main dashboard (govtech-overview.json) displays:

Cluster Summary

CPU Usage: Cluster-wide average CPU utilization
Memory Usage: Cluster-wide memory consumption
Pods Ready: Count of healthy pods in govtech namespace
Pods Failed: Count of failed pods (alerts if > 0)

Application Metrics

Requests per Second: HTTP traffic by method and route
Latency Percentiles: P50, P95, P99 response times

Dashboard Queries

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

AWS CloudWatch Integration

Container Insights

CloudWatch Container Insights provides AWS-native monitoring for EKS:

# Enable Container Insights on EKS cluster
aws eks update-cluster-config \
  --name govtech-prod \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}' \
  --region us-east-1

CloudWatch Metrics

Key metrics available in CloudWatch:

cluster_node_count: Number of nodes in the cluster
cluster_failed_node_count: Failed nodes
namespace_number_of_running_pods: Running pods per namespace
pod_cpu_utilization: CPU usage per pod
pod_memory_utilization: Memory usage per pod

CloudWatch Dashboards

Access pre-built Container Insights dashboards:

Go to CloudWatch Console
Navigate to Container Insights
Select cluster: govtech-prod
View:
- Cluster performance
- Pod performance
- Node performance
- Namespace performance

Alertmanager Configuration

Alert Routing

Alerts are routed based on severity:

alertmanager config

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'null'
  
  routes:
    - matchers:
        - severity = critical
      receiver: 'critical-alerts'

Production Setup

For production, configure real notification channels:

receivers:
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#govtech-alerts'
        title: 'GovTech Critical Alert'

Monitoring Best Practices

Set appropriate retention

15 days for Prometheus, longer periods in CloudWatch for compliance

Define SLOs

Availability: 99.9% (8.7 hours downtime/year)
Latency: P99 < 2 seconds
Error rate: < 1% for 5xx errors

Configure actionable alerts

Every alert should have:

Clear description
Runbook link
Severity level
Auto-remediation when possible

Regular dashboard reviews

Weekly review of dashboards to identify trends and capacity needs

Useful Commands

# View Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets

# Check ServiceMonitor resources
kubectl get servicemonitor -n govtech

# View alert rules
kubectl get prometheusrule -n monitoring

# Check Alertmanager status
kubectl port-forward -n monitoring svc/prometheus-alertmanager 9093:9093
# Open http://localhost:9093

# View logs from monitoring components
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

Get Started

Architecture

Deployment

Multi-Cloud

Security

Operations

​Overview

​Monitoring Stack

​Architecture

​Components

Prometheus

Grafana

Alertmanager

CloudWatch

​Prometheus Setup

​Installation

​Configuration

​ServiceMonitor for Backend

​Alert Rules

​Kubernetes Alerts

​Node Alerts

​Application Alerts

​Grafana Dashboards

​Access Grafana

​GovTech Overview Dashboard

​Cluster Summary

​Application Metrics

​Dashboard Queries

​AWS CloudWatch Integration

​Container Insights

​CloudWatch Metrics

​CloudWatch Dashboards

​Alertmanager Configuration

​Alert Routing

​Production Setup

​Monitoring Best Practices

​Useful Commands

​Next Steps

Disaster Recovery

Cost Optimization

Build docs developers (and LLMs) love

Overview

Monitoring Stack

Architecture

Components

Prometheus Setup

Installation

Configuration

ServiceMonitor for Backend

Alert Rules

Kubernetes Alerts

Node Alerts

Application Alerts

Grafana Dashboards

Access Grafana

GovTech Overview Dashboard

Cluster Summary

Application Metrics

Dashboard Queries

AWS CloudWatch Integration

Container Insights

CloudWatch Metrics

CloudWatch Dashboards

Alertmanager Configuration

Alert Routing

Production Setup

Monitoring Best Practices

Useful Commands

Next Steps