Overview
The GovTech platform implements comprehensive monitoring using Prometheus, Grafana, and AWS CloudWatch to ensure visibility into infrastructure and application performance.
Monitoring Stack
Architecture
Components
Prometheus
Time-series metrics collection and storage with 15-day retention
Grafana
Visualization dashboards with pre-configured GovTech overview
Alertmanager
Alert routing and notification management
CloudWatch
AWS-native monitoring for EKS cluster and infrastructure
Prometheus Setup
Installation
The platform uses the kube-prometheus-stack Helm chart for complete observability:
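The original install command is not shown on this page; a typical invocation looks like the following, where the release name, the monitoring namespace, and the values file path are assumptions rather than verified repository details:

```shell
# Add the prometheus-community chart repository (one-time)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install or upgrade the full stack (Prometheus, Grafana, Alertmanager)
# into a dedicated "monitoring" namespace
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f monitoring/prometheus/values.yaml  # hypothetical values file path
```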
Configuration
Key Prometheus settings from monitoring/prometheus/prometheus.yaml:
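The file contents were not reproduced here. A minimal sketch of the settings this page references (15-day retention, ServiceMonitor discovery), expressed as kube-prometheus-stack values, might look like this; confirm the exact fields against the actual file:

```yaml
# Sketch only -- field names follow kube-prometheus-stack conventions
prometheus:
  prometheusSpec:
    retention: 15d                  # matches the 15-day retention stated above
    scrapeInterval: 30s             # assumed scrape cadence, not from the source
    serviceMonitorSelectorNilUsesHelmValues: false  # discover all ServiceMonitors
```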
ServiceMonitor for Backend
Prometheus discovers application metrics through ServiceMonitor resources.
monitoring/prometheus/backend-servicemonitor.yaml
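The ServiceMonitor itself is missing from this page. A sketch of what backend-servicemonitor.yaml plausibly contains follows; the labels, namespace, and port name are assumptions that must match the backend Service definition:

```yaml
# Sketch of a ServiceMonitor for the backend (assumed names)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  namespace: govtech
spec:
  selector:
    matchLabels:
      app: backend        # must match the backend Service's labels
  endpoints:
    - port: http          # named port on the backend Service
      path: /metrics
      interval: 30s
```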
The backend application must expose metrics at GET /metrics in Prometheus format, using libraries like prom-client or express-prometheus-middleware.
Alert Rules
Kubernetes Alerts
Critical alerts for pod and cluster health:
PodCrashLooping - Critical
Trigger: Pod restarts more than 3 times in 15 minutes
Action: Check pod logs for crash reason
PodPendingTooLong - Warning
Trigger: Pod in Pending state for more than 10 minutes
Action: Check resource availability
HPAAtMaxReplicas - Warning
Trigger: HPA at maximum replicas for 15 minutes
Action: Consider increasing maxReplicas in HPA configuration
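The underlying rule definitions are not shown on this page. As one illustration, the PodCrashLooping alert above could be expressed as a PrometheusRule like the following; the resource name, namespace, and exact expression are assumptions mirroring the stated trigger:

```yaml
# Sketch of a PrometheusRule for PodCrashLooping (assumed names)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: govtech-kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes
      rules:
        - alert: PodCrashLooping
          # More than 3 restarts within a 15-minute window
          expr: increase(kube_pod_container_status_restarts_total{namespace="govtech"}[15m]) > 3
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
```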
Node Alerts
NodeCPUHigh - Warning
Trigger: Node CPU usage above 85% for 10 minutes
NodeMemoryLow - Critical
Trigger: Node available memory below 10%
NodeDiskAlmostFull - Warning
Trigger: Node disk usage above 85%
Application Alerts
PostgresDown - Critical
Trigger: PostgreSQL pod not ready for 2 minutes
HighErrorRate - Critical
Trigger: HTTP 5xx error rate exceeds 5%
HighLatency - Warning
Trigger: P99 latency exceeds 2 seconds
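As a sketch of how the last two triggers could be encoded, the rule fragments below assume the backend exports a standard `http_requests_total` counter and an `http_request_duration_seconds` histogram; the metric names are assumptions, not confirmed by this page:

```yaml
# Sketch of application alert rules (assumed metric names)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{namespace="govtech", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{namespace="govtech"}[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket{namespace="govtech"}[5m])) by (le)) > 2
  for: 10m
  labels:
    severity: warning
```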
Grafana Dashboards
Access Grafana
Grafana runs as a ClusterIP service and is accessed via port-forward.
GovTech Overview Dashboard
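Before opening the dashboard, forward the Grafana port; the service and secret names below assume the kube-prometheus-stack chart defaults and may differ in this deployment:

```shell
# Forward local port 3000 to the in-cluster Grafana service
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80

# Then browse to http://localhost:3000; the admin password can be read with:
kubectl -n monitoring get secret kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d
```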
The main dashboard (govtech-overview.json) displays:
Cluster Summary
- CPU Usage: Cluster-wide average CPU utilization
- Memory Usage: Cluster-wide memory consumption
- Pods Ready: Count of healthy pods in govtech namespace
- Pods Failed: Count of failed pods (alerts if > 0)
Application Metrics
- Requests per Second: HTTP traffic by method and route
- Latency Percentiles: P50, P95, P99 response times
Dashboard Queries
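The dashboard JSON is not reproduced here. Representative PromQL for the panels described above could look like the following; the HTTP metric names are assumptions consistent with the prom-client-style instrumentation mentioned earlier:

```promql
# Requests per second by method and route (assumes http_requests_total)
sum(rate(http_requests_total{namespace="govtech"}[5m])) by (method, route)

# P99 latency (assumes an http_request_duration_seconds histogram)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{namespace="govtech"}[5m])) by (le))

# Pods ready in the govtech namespace
sum(kube_pod_status_ready{namespace="govtech", condition="true"})
```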
AWS CloudWatch Integration
Container Insights
CloudWatch Container Insights provides AWS-native monitoring for EKS.
CloudWatch Metrics
Key metrics available in CloudWatch:
- cluster_node_count: Number of nodes in the cluster
- cluster_failed_node_count: Failed nodes
- namespace_number_of_running_pods: Running pods per namespace
- pod_cpu_utilization: CPU usage per pod
- pod_memory_utilization: Memory usage per pod
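To verify these metrics are actually being published, the AWS CLI can list them; `ContainerInsights` is the standard CloudWatch namespace for EKS Container Insights:

```shell
# List Container Insights metrics for the govtech-prod cluster
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=govtech-prod
```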
CloudWatch Dashboards
Access pre-built Container Insights dashboards:
- Go to the CloudWatch Console
- Navigate to Container Insights
- Select cluster: govtech-prod
- View:
  - Cluster performance
  - Pod performance
  - Node performance
  - Namespace performance
Alertmanager Configuration
Alert Routing
Alerts are routed based on severity.
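The Alertmanager config referenced here did not survive extraction. A sketch of a severity-based route tree follows; the receiver names and their PagerDuty/Slack destinations are assumptions:

```yaml
# Sketch of an Alertmanager route tree split by severity (assumed receivers)
route:
  receiver: default
  group_by: [alertname, namespace]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty      # page on-call for critical alerts
    - matchers: ['severity="warning"']
      receiver: slack          # post warnings to a channel
receivers:
  - name: default
  - name: pagerduty
  - name: slack
```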
Production Setup
Monitoring Best Practices
Define SLOs
- Availability: 99.9% (8.76 hours downtime/year)
- Latency: P99 < 2 seconds
- Error rate: < 1% for 5xx errors
Configure actionable alerts
Every alert should have:
- Clear description
- Runbook link
- Severity level
- Auto-remediation when possible
Useful Commands
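The command list was lost in extraction. A few commands commonly useful with this stack are sketched below; service names assume the kube-prometheus-stack defaults used above:

```shell
# Reach the Prometheus UI locally to inspect targets and rules
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090

# Reach Alertmanager locally to see firing alerts and silences
kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093

# List the alert rules loaded into the cluster
kubectl -n monitoring get prometheusrules

# Watch pod health in the application namespace
kubectl -n govtech get pods -w
```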
Next Steps
Disaster Recovery
Learn about DR procedures and RTO/RPO targets
Cost Optimization
Monitor and optimize infrastructure costs