Overview
This guide covers common operational issues and their solutions for the GovTech platform, organized by component.

Kubernetes Issues
Pod stuck in Pending state
Symptoms: Pod remains in Pending status for more than 5 minutes.

Common Causes:
- Insufficient cluster resources (CPU/memory)
- PersistentVolumeClaim not bound
- Node selector/affinity not matching any nodes
- Image pull issues
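A minimal diagnosis sequence for a Pending pod; pod and namespace names are placeholders:

```shell
# The Events section at the bottom usually names the cause, e.g.
# "0/3 nodes are available: insufficient memory"
# or "pod has unbound PersistentVolumeClaims"
kubectl describe pod <pod-name> -n <namespace>

# Check node capacity if the cause is insufficient resources
kubectl top nodes

# Check that PVCs are Bound, not Pending
kubectl get pvc -n <namespace>
```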
Pod in CrashLoopBackOff
Symptoms: Pod repeatedly starts and crashes.

Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Database connection failed | Verify DB_HOST in ConfigMap, check RDS security group |
| Missing environment variables | Check ConfigMap and Secrets are applied |
| Application error on startup | Review logs, fix code, redeploy |
| Health check failing too quickly | Increase initialDelaySeconds in probe |
| OOM (Out of Memory) | Increase memory limits in deployment |
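For the health-check timing row above, a hedged probe sketch: the endpoint, port, and timings are illustrative, not the platform's actual values.

```yaml
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080              # assumed container port
  initialDelaySeconds: 30   # give the app time to boot before the first probe
  periodSeconds: 10
  failureThreshold: 3       # consecutive failures before a restart
```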
Service not accessible
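The usual checks for an unreachable Service, sketched with placeholder names: confirm the Service has endpoints, compare its selector with the pod labels, then bypass the Service entirely.

```shell
# An empty endpoints list usually means the Service selector
# does not match any pod labels
kubectl get endpoints <service-name> -n <namespace>

# Compare the Service selector with the actual pod labels
kubectl describe service <service-name> -n <namespace>
kubectl get pods -n <namespace> --show-labels

# Bypass the Service to test the pod directly
kubectl port-forward pod/<pod-name> 8080:8080 -n <namespace>
```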
High pod restarts
Symptoms: Pod restart count increasing over time.

Common Causes:
- OOM Kills: Memory limit too low
- Liveness Probe Failing: Application temporarily slow
- Application Bugs: Memory leaks, uncaught exceptions
  - Review application logs
  - Profile memory usage
  - Update dependencies
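The diagnosis step can be sketched as follows; names are placeholders:

```shell
# Restart counts per pod
kubectl get pods -n <namespace>

# Why did the last container die? Look for
# "Last State: Terminated, Reason: OOMKilled"
kubectl describe pod <pod-name> -n <namespace>

# Logs from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous

# Current memory usage vs. limits
kubectl top pod <pod-name> -n <namespace>
```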
HPA not scaling
Symptoms: Horizontal Pod Autoscaler not creating new pods under load.

Common Causes:
- Metrics Server Missing: the HPA cannot read CPU/memory usage unless metrics-server is installed
- Resource Requests Not Set: percentage-based CPU/memory targets require resources.requests on the containers
- Already at Max: the HPA never scales past maxReplicas; raise the limit if load justifies it
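To diagnose, describe the HPA and confirm metrics are flowing; names are placeholders:

```shell
# "<unknown>" targets usually mean metrics-server is missing
# or resource requests are not set
kubectl describe hpa <hpa-name> -n <namespace>

# Is metrics-server running?
kubectl get deployment metrics-server -n kube-system

# The raw metrics the HPA relies on
kubectl top pods -n <namespace>
```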
Database Issues
Cannot connect to database
Symptoms: Application logs show database connection errors.

Diagnosis Checklist:
- Verify database is running
- Check connection string
- Verify credentials
- Test connectivity from pod
- Check security groups (RDS)
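The connectivity check can be run from inside the cluster. This sketch assumes a PostgreSQL database; host, user, and database names are placeholders:

```shell
# Full connection test with psql from a throwaway pod
kubectl run db-test --rm -it --image=postgres:16 -n <namespace> -- \
  psql "host=<db-host> port=5432 user=<user> dbname=<db>" -c "SELECT 1;"

# Or a plain TCP check if psql credentials are not at hand
kubectl run net-test --rm -it --image=busybox -n <namespace> -- \
  nc -zv <db-host> 5432
```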
Database slow query performance
Symptoms: Requests timing out, high latency on database operations.

Solutions:
- Add missing indexes
- Increase RDS instance size
- Enable connection pooling: use PgBouncer or increase max_connections
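Finding the slow queries in the first place can be sketched as below, assuming PostgreSQL with the pg_stat_statements extension enabled (the column names are for PostgreSQL 13+):

```shell
psql -h <db-host> -U <user> <db> <<'SQL'
-- Top 10 queries by total execution time
SELECT round(total_exec_time::numeric, 1) AS total_ms,
       calls,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
SQL
```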
Database disk full
Symptoms: Database writes failing, application errors.

Long-term Solutions:
- Enable storage autoscaling
- Archive old data
- Run VACUUM to reclaim space
- Delete old backups in S3
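Checking disk usage and applying the immediate RDS fix can be sketched with the AWS CLI; the instance identifier and new storage size are placeholders:

```shell
# Free storage on the RDS instance over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=<db-instance-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Minimum

# Immediate fix: grow the allocated storage
aws rds modify-db-instance \
  --db-instance-identifier <db-instance-id> \
  --allocated-storage 200 \
  --apply-immediately
```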
Networking Issues
NetworkPolicy blocking legitimate traffic
Symptoms: Pods cannot communicate despite correct service configuration.
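Diagnosis comes down to verifying that the policy's label selectors match real pods, and, as a last resort, temporarily disabling the policy (debugging only). Names are placeholders:

```shell
# List policies in the namespace and inspect their selectors
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>

# Do the labels the policy selects actually exist on the pods?
kubectl get pods -n <namespace> --show-labels

# Temporarily disable the policy (debugging only -- re-apply afterwards!)
kubectl delete networkpolicy <policy-name> -n <namespace>
```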
ALB not created or not healthy
Symptoms: Ingress has no ADDRESS, ALB health checks failing.

Common Issues:
- Missing IAM permissions
  - Controller needs IAM role with ALB creation permissions
  - Check IRSA (IAM Roles for Service Accounts) configuration
- Incorrect Ingress annotations
- Subnet tags missing (public subnets need kubernetes.io/role/elb, private subnets kubernetes.io/role/internal-elb)
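Checking the ALB controller itself can be sketched as below, assuming it is installed under its default name in kube-system:

```shell
# Is the controller running, and what is it complaining about?
kubectl get deployment -n kube-system aws-load-balancer-controller
kubectl logs -n kube-system deployment/aws-load-balancer-controller | tail -50

# Ingress status and events (ADDRESS should populate once the ALB exists)
kubectl describe ingress <ingress-name> -n <namespace>
```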
DNS resolution failing
Symptoms: Pods cannot resolve external domains or internal services.
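Testing DNS and checking CoreDNS can be sketched as:

```shell
# Resolution from inside the cluster: one internal, one external name
kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default
kubectl run dns-test --rm -it --image=busybox -- nslookup example.com

# CoreDNS health and logs
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns | tail -50
```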
CI/CD Issues
GitHub Actions workflow failing
Common Failures & Solutions:
| Error | Cause | Solution |
|---|---|---|
| Error: No space left on device | Runner disk full | Clean up old Docker images, use larger runner |
| Permission denied (publickey) | SSH key issue | Check deploy keys in repo settings |
| Error: OIDC token expired | AWS authentication timeout | Increase timeout, check OIDC provider config |
| Trivy scan found vulnerabilities | Security issues in image | Update dependencies, review CVEs |
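Workflow logs can also be viewed from the terminal. This sketch assumes the GitHub CLI (gh) is installed and authenticated:

```shell
# Recent runs for the current repository
gh run list --limit 10

# Full logs for the failed jobs of a run (ID comes from the list above)
gh run view <run-id> --log-failed

# Re-run only the failed jobs
gh run rerun <run-id> --failed
```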
Docker image build failing
Common Issues:
- Base image not found
- Build context too large
  - Add a .dockerignore file: exclude node_modules, .git, etc.
- Multi-stage build failing
  - Check each stage independently
  - Verify COPY paths between stages
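A minimal .dockerignore sketch for the oversized-context case; the entries are illustrative and should be extended for your stack:

```
node_modules
.git
*.log
dist
coverage
.env
```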
Deployment rollout stuck
Symptoms: kubectl rollout status never completes.

Common Causes:
- New image has errors (CrashLoopBackOff)
- Readiness probe failing
- Insufficient resources for new pods
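Diagnosis and rollback can be sketched as follows; deployment and namespace names are placeholders:

```shell
# What is the rollout waiting on?
kubectl rollout status deployment/<name> -n <namespace>
kubectl get pods -n <namespace>   # look for CrashLoopBackOff / Pending pods
kubectl describe deployment <name> -n <namespace>

# Roll back to the previous revision if the new image is broken
kubectl rollout undo deployment/<name> -n <namespace>
```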
Monitoring & Alerts
Prometheus not scraping metrics
Check Targets: open http://localhost:9090/targets and look for:
- Red targets: Scrape failing
- Unknown targets: ServiceMonitor not discovered

Application Must Expose /metrics:
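One way to confirm the /metrics endpoint exists, with pod name and port as placeholders:

```shell
# Forward the app port locally, then fetch the metrics endpoint
kubectl port-forward pod/<pod-name> 8080:8080 -n <namespace> &
curl -s http://localhost:8080/metrics | head
```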
Grafana dashboard shows no data
Troubleshooting Steps:
- Check Prometheus datasource
  - Grafana > Configuration > Data Sources
  - Test connection
  - Verify URL: http://prometheus-service.monitoring.svc.cluster.local:9090
- Test query in Prometheus
  - Run query directly in Prometheus UI
  - Ensure metrics exist before troubleshooting Grafana
- Check time range
  - Metrics may not exist in selected time range
  - Try "Last 5 minutes"
Alerts not firing or not received
Check Alert Rules: open http://localhost:9090/alerts to see:
- Inactive: Rule defined but condition not met
- Pending: Condition met, waiting for the `for` duration
- Firing: Alert active, sent to Alertmanager

Check Alertmanager: open http://localhost:9093 to verify:
- Alerts received
- Routing configuration
- Silences (may be muting alerts)
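The localhost URLs above assume port-forwards along these lines; the namespace and service names vary by installation and are placeholders here:

```shell
kubectl port-forward -n monitoring svc/prometheus-server 9090:9090 &
kubectl port-forward -n monitoring svc/alertmanager 9093:9093 &
```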
Security Issues
GuardDuty finding: Unusual API activity
Response Steps:
- Review CloudTrail logs
- Identify affected user/role
  - Check userIdentity in CloudTrail events
  - Verify if activity was legitimate
- If compromised, revoke credentials
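The first and last steps can be sketched with the AWS CLI. The user name and key ID are placeholders, and deactivating access keys this way assumes the affected identity is an IAM user rather than a role:

```shell
# Recent CloudTrail events for the flagged identity
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=<user-name> \
  --max-results 50

# If compromised: deactivate the user's access keys
aws iam list-access-keys --user-name <user-name>
aws iam update-access-key --user-name <user-name> \
  --access-key-id <key-id> --status Inactive
```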
Pod Security Standards violation
Symptoms: Pod creation rejected with security policy error.

Check Namespace Policy: inspect the namespace labels (e.g. `kubectl get namespace <name> --show-labels`) to see which Pod Security Standard is enforced.

Common Violations & Fixes:
| Violation | Solution |
|---|---|
| Running as root | Add securityContext.runAsNonRoot: true |
| Privileged container | Remove privileged: true |
| Host path mount | Use PersistentVolume instead |
| Host network | Remove hostNetwork: true |
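The first two fixes from the table, as a hedged pod-spec fragment; the UID and image are placeholders:

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000          # any non-root UID the image supports
  containers:
    - name: app
      image: <image>
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```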
Useful Debugging Commands
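A hedged starter set; pod, namespace, and deployment names are placeholders:

```shell
# Cluster state at a glance
kubectl get pods -A -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Drill into one pod
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl exec -it <pod-name> -n <namespace> -- sh

# Resource usage
kubectl top nodes
kubectl top pods -A
```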
Emergency Contacts
For critical production issues:

| Severity | Response Time | Contact |
|---|---|---|
| Critical (P0) | 15 minutes | On-call engineer (phone) |
| High (P1) | 1 hour | Slack #govtech-alerts |
| Medium (P2) | 4 hours | Slack #govtech-devops |
| Low (P3) | Next business day | Ticket system |
Related Resources
- Monitoring: dashboards and alerting setup
- Disaster Recovery: DR procedures and runbooks