
Overview

The GovTech platform implements comprehensive monitoring using Prometheus, Grafana, and AWS CloudWatch to ensure visibility into infrastructure and application performance.

Monitoring Stack

Architecture

Application Metrics → Prometheus → Grafana Dashboards
       ↓                  ↓              ↓
  ServiceMonitor    Alert Rules    Visualizations
       ↓                  ↓
  /metrics          Alertmanager → Notifications

In parallel, AWS CloudWatch Container Insights collects cluster- and node-level metrics outside this pipeline.

Components

Prometheus

Time-series metrics collection and storage with 15-day retention

Grafana

Visualization dashboards with pre-configured GovTech overview

Alertmanager

Alert routing and notification management

CloudWatch

AWS-native monitoring for EKS cluster and infrastructure

Prometheus Setup

Installation

The platform uses the kube-prometheus-stack Helm chart for complete observability:
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f monitoring/prometheus/prometheus.yaml \
  --namespace monitoring \
  --create-namespace

Configuration

Key Prometheus settings from monitoring/prometheus/prometheus.yaml:
prometheus:
  prometheusSpec:
    retention: 15d  # Keep metrics for 15 days
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

ServiceMonitor for Backend

Prometheus discovers application metrics through ServiceMonitor resources:
monitoring/prometheus/backend-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend-monitor
  namespace: govtech
  labels:
    release: prometheus  # Required for discovery
spec:
  selector:
    matchLabels:
      app: backend
  namespaceSelector:
    matchNames:
      - govtech
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
The backend application must expose metrics at GET /metrics in Prometheus format using libraries like prom-client or express-prometheus-middleware.
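For the ServiceMonitor above to find anything, the backend Service must carry the `app: backend` label and expose a port named `http`. A minimal sketch (port numbers are assumptions, not taken from the platform manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend
  namespace: govtech
  labels:
    app: backend  # matched by the ServiceMonitor selector
spec:
  selector:
    app: backend
  ports:
    - name: http  # must match the ServiceMonitor endpoint port
      port: 80
      targetPort: 3000
```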

Alert Rules

Kubernetes Alerts

Critical alerts for pod and cluster health:
Trigger: Pod restarts more than 3 times in 15 minutes
rate(kube_pod_container_status_restarts_total{
  namespace="govtech"
}[15m]) * 60 * 15 > 3
Action: Check pod logs for crash reason
kubectl logs <pod> -n govtech --previous
kubectl describe pod <pod> -n govtech
Trigger: Pod in Pending state for more than 10 minutes
kube_pod_status_phase{
  namespace="govtech",
  phase="Pending"
} == 1
Action: Check resource availability
kubectl describe pod <pod> -n govtech
kubectl top nodes
Trigger: HPA at maximum replicas for 15 minutes
kube_horizontalpodautoscaler_status_current_replicas{namespace="govtech"}
==
kube_horizontalpodautoscaler_spec_max_replicas{namespace="govtech"}
Action: Consider increasing maxReplicas in HPA configuration
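As an illustration, the pod-restart trigger above can be packaged as a PrometheusRule resource that the operator discovers automatically; the resource name, alert name, and runbook URL here are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: govtech-kubernetes-alerts  # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus  # required for discovery, as with the ServiceMonitor
spec:
  groups:
    - name: kubernetes-pods
      rules:
        - alert: PodRestartingFrequently
          expr: |
            rate(kube_pod_container_status_restarts_total{namespace="govtech"}[15m]) * 60 * 15 > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} restarted more than 3 times in 15 minutes"
            runbook_url: "https://example.internal/runbooks/pod-restarts"  # hypothetical
```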

Node Alerts

Trigger: Node CPU usage above 85% for 10 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
Trigger: Node available memory below 10%
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
Trigger: Node disk usage above 85%
(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} /
 node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 15

Application Alerts

Trigger: PostgreSQL pod not ready for 2 minutes
kube_pod_status_ready{
  namespace="govtech",
  pod=~"postgres-.*"
} == 0
Trigger: HTTP 5xx error rate exceeds 5%
sum(rate(http_requests_total{namespace="govtech", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{namespace="govtech"}[5m]))
> 0.05
Trigger: P99 latency exceeds 2 seconds
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{namespace="govtech"}[5m]))
) > 2

Grafana Dashboards

Access Grafana

Grafana runs as a ClusterIP service. Access it via port-forward:
kubectl port-forward service/prometheus-grafana 3000:80 -n monitoring
Then open: http://localhost:3000
Default credentials: admin / CHANGE_ME_IN_PRODUCTION. Change the password immediately in production environments.

GovTech Overview Dashboard

The main dashboard (govtech-overview.json) displays:

Cluster Summary

  • CPU Usage: Cluster-wide average CPU utilization
  • Memory Usage: Cluster-wide memory consumption
  • Pods Ready: Count of healthy pods in govtech namespace
  • Pods Failed: Count of failed pods (alerts if > 0)

Application Metrics

  • Requests per Second: HTTP traffic by method and route
  • Latency Percentiles: P50, P95, P99 response times

Dashboard Queries

Cluster CPU usage, as shown in the Cluster Summary panel:
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
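The application panels map to queries along these lines. These are representative examples built from the metric names used in the alert rules above, not the exact expressions in the dashboard JSON:

```promql
# Requests per Second, by method and route
sum by (method, route) (rate(http_requests_total{namespace="govtech"}[5m]))

# P95 latency
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{namespace="govtech"}[5m])))
```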

AWS CloudWatch Integration

Container Insights

CloudWatch Container Insights provides AWS-native monitoring for EKS. Control plane logging is enabled on the cluster, and the CloudWatch Observability add-on ships container metrics:
# Enable control plane logging on the EKS cluster
aws eks update-cluster-config \
  --name govtech-prod \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}' \
  --region us-east-1

# Install the CloudWatch Observability add-on to enable Container Insights
aws eks create-addon \
  --cluster-name govtech-prod \
  --addon-name amazon-cloudwatch-observability \
  --region us-east-1

CloudWatch Metrics

Key metrics available in CloudWatch:
  • cluster_node_count: Number of nodes in the cluster
  • cluster_failed_node_count: Failed nodes
  • namespace_number_of_running_pods: Running pods per namespace
  • pod_cpu_utilization: CPU usage per pod
  • pod_memory_utilization: Memory usage per pod
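These metrics can be queried in the CloudWatch console with Metrics Insights. A representative query (the schema and dimension names follow Container Insights conventions; the cluster name is the one used elsewhere on this page):

```sql
SELECT AVG(pod_cpu_utilization)
FROM SCHEMA("ContainerInsights", ClusterName, Namespace, PodName)
WHERE ClusterName = 'govtech-prod' AND Namespace = 'govtech'
GROUP BY PodName
```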

CloudWatch Dashboards

Access pre-built Container Insights dashboards:
  1. Go to CloudWatch Console
  2. Navigate to Container Insights
  3. Select cluster: govtech-prod
  4. View:
    • Cluster performance
    • Pod performance
    • Node performance
    • Namespace performance

Alertmanager Configuration

Alert Routing

Alerts are routed based on severity:
Alertmanager route configuration:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'null'  # default receiver: unmatched alerts are discarded
  
  routes:
    - matchers:
        - severity = critical
      receiver: 'critical-alerts'

Production Setup

For production, configure real notification channels:
receivers:
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#govtech-alerts'
        title: 'GovTech Critical Alert'
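The same pattern extends to lower severities. For example, a warning-level route could notify by email; the receiver name and address below are hypothetical:

```yaml
route:
  routes:
    - matchers:
        - severity = warning
      receiver: 'warning-alerts'
receivers:
  - name: 'warning-alerts'
    email_configs:
      - to: 'platform-team@example.gov'
        send_resolved: true
```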

Monitoring Best Practices

1. Set appropriate retention
   15 days for Prometheus; longer periods in CloudWatch for compliance.

2. Define SLOs
   • Availability: 99.9% (8.7 hours downtime/year)
   • Latency: P99 < 2 seconds
   • Error rate: < 1% for 5xx errors

3. Configure actionable alerts
   Every alert should have:
   • Clear description
   • Runbook link
   • Severity level
   • Auto-remediation when possible

4. Regular dashboard reviews
   Weekly review of dashboards to identify trends and capacity needs.
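The error-rate SLO above can be checked directly in Prometheus. A sketch using the metric names from the alert rules earlier on this page:

```promql
# Fraction of requests returning 5xx over 30 days; SLO target is < 0.01
sum(rate(http_requests_total{namespace="govtech", status=~"5.."}[30d]))
/
sum(rate(http_requests_total{namespace="govtech"}[30d]))
```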

Useful Commands

# View Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets

# Check ServiceMonitor resources
kubectl get servicemonitor -n govtech

# View alert rules
kubectl get prometheusrule -n monitoring

# Check Alertmanager status
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# Open http://localhost:9093

# View logs from monitoring components
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

Next Steps

Disaster Recovery

Learn about DR procedures and RTO/RPO targets

Cost Optimization

Monitor and optimize infrastructure costs
