Overview
The GovTech platform disaster recovery plan defines procedures to restore services after catastrophic failures. This document is based on platform/disaster-recovery/runbooks/DR_PLAN.md.
Recovery Objectives
RTO: 4 Hours
Recovery Time Objective: Maximum time to fully restore service after a disaster
RPO: 24 Hours
Recovery Point Objective: Maximum acceptable data loss (daily backups at 2am UTC)
Availability Target
| Metric | Target | Allowed Downtime / Notes |
|---|---|---|
| Availability | 99.9% | 8.76 hours/year |
| RTO | 4 hours | Complete service restoration |
| RPO | 24 hours | Last backup at 2am UTC |
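The downtime budget follows directly from the availability target. As a minimal sketch, assuming a 365-day year, a 99.9% target works out to roughly 8.76 hours per year. The function name is illustrative and not part of the platform code.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    budget = allowed_downtime_hours(99.9)
    print(f"99.9% availability allows about {budget:.2f} hours of downtime per year")
```

A single incident that consumes the full 4-hour RTO uses a little under half of that yearly budget.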
Disaster Scenarios
1. Availability Zone Failure
Probability: Low | Impact: High
Auto-Recovery: AWS Multi-AZ architecture handles this automatically
- RDS: Automatic failover to standby replica in another AZ (1-2 minutes)
- EKS: Pods redistribute to nodes in other AZs automatically
- ALB: Always distributes traffic across multiple AZs
2. Complete Region Failure
Probability: Very Low | Impact: Critical
Recovery Procedure: Complete Region Restoration
Estimated Time: 2-4 hours
3. Database Corruption or Accidental Deletion
Probability: Medium | Impact: High
RDS has automatic daily backups:
- Production: 30-day retention
- Development: 3-day retention
4. Security Breach
Probability: Low | Impact: Very High
Recovery Procedure: Security Incident Response
Immediate Action: 15 minutes to containment
5. EKS Cluster Failure
Probability: Low | Impact: High
The EKS control plane is managed by AWS (99.95% SLA). Worker node failures are handled automatically by Node Groups.
Action: If an entire Node Group fails, recreate it with Terraform.
6. Terraform State Deletion
Probability: Low | Impact: High
Terraform state in S3 has versioning enabled. Previous states can be recovered.
Recovery Procedures
Region Restoration
Automated Script:
disaster-recovery/scripts/restore-infrastructure.sh
Database Restoration
List Available Snapshots
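The original commands for this step are not reproduced here. As a hedged sketch, the listing step can be thought of as filtering and sorting snapshot records. The records below are assumed to have the shape of the `DBSnapshots` entries returned by the RDS DescribeDBSnapshots API (`DBSnapshotIdentifier`, `Status`, `SnapshotCreateTime`); the identifiers are hypothetical sample data.

```python
from datetime import datetime, timezone


def available_snapshots(snapshots: list[dict]) -> list[dict]:
    """Keep only snapshots in the 'available' state, newest first."""
    usable = [s for s in snapshots if s.get("Status") == "available"]
    return sorted(usable, key=lambda s: s["SnapshotCreateTime"], reverse=True)


if __name__ == "__main__":
    # Hypothetical records; real ones would come from DescribeDBSnapshots.
    sample = [
        {"DBSnapshotIdentifier": "rds:example-db-2024-05-01-02-00",
         "Status": "available",
         "SnapshotCreateTime": datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)},
        {"DBSnapshotIdentifier": "rds:example-db-2024-05-02-02-00",
         "Status": "creating",
         "SnapshotCreateTime": datetime(2024, 5, 2, 2, 0, tzinfo=timezone.utc)},
    ]
    for snap in available_snapshots(sample):
        print(snap["SnapshotCreateTime"].isoformat(), snap["DBSnapshotIdentifier"])
```

Filtering on `Status` matters because a snapshot that is still being created cannot be restored from.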
Restore from Latest Snapshot
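A minimal sketch of this step, assuming snapshot records shaped like DescribeDBSnapshots output: pick the newest available snapshot and build the two required parameters of the RDS RestoreDBInstanceFromDBSnapshot API (`DBInstanceIdentifier`, `DBSnapshotIdentifier`). The function name and the instance and snapshot identifiers are hypothetical; the platform's actual procedure may differ.

```python
from datetime import datetime, timezone


def latest_restore_request(snapshots: list[dict], new_instance_id: str) -> dict:
    """Build restore parameters that point at the newest available snapshot."""
    usable = [s for s in snapshots if s.get("Status") == "available"]
    if not usable:
        raise LookupError("no available snapshot to restore from")
    latest = max(usable, key=lambda s: s["SnapshotCreateTime"])
    return {
        # The restore creates a new instance; it does not overwrite the old one.
        "DBInstanceIdentifier": new_instance_id,
        "DBSnapshotIdentifier": latest["DBSnapshotIdentifier"],
    }


if __name__ == "__main__":
    # Hypothetical sample record.
    sample = [
        {"DBSnapshotIdentifier": "rds:example-db-2024-05-02-02-00",
         "Status": "available",
         "SnapshotCreateTime": datetime(2024, 5, 2, 2, 0, tzinfo=timezone.utc)},
    ]
    print(latest_restore_request(sample, "example-db-restored"))
```

The resulting dictionary could be passed as keyword arguments to an RDS client call; that call is left out so the sketch runs without AWS credentials.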
Restore from Specific Snapshot
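When the latest snapshot already contains corrupted data, an earlier one is needed. The sketch below shows one way to choose it, assuming the time of corruption is known: take the newest available snapshot created strictly before that time. The function name and sample data are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def last_snapshot_before(snapshots: list[dict], cutoff: datetime) -> Optional[dict]:
    """Return the newest available snapshot created strictly before `cutoff`."""
    candidates = [
        s for s in snapshots
        if s.get("Status") == "available" and s["SnapshotCreateTime"] < cutoff
    ]
    # `default=None` covers the case where no snapshot predates the cutoff.
    return max(candidates, key=lambda s: s["SnapshotCreateTime"], default=None)


if __name__ == "__main__":
    # Hypothetical daily snapshots taken at 2am UTC.
    sample = [
        {"DBSnapshotIdentifier": f"rds:example-db-2024-05-0{day}-02-00",
         "Status": "available",
         "SnapshotCreateTime": datetime(2024, 5, day, 2, 0, tzinfo=timezone.utc)}
        for day in (1, 2, 3)
    ]
    corrupted_at = datetime(2024, 5, 2, 14, 30, tzinfo=timezone.utc)
    chosen = last_snapshot_before(sample, corrupted_at)
    print(chosen["DBSnapshotIdentifier"] if chosen else "no snapshot before cutoff")
```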
Update Application Configuration
After restoration, update the application to use the new database:
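How the application stores its database settings is not specified in this document. As an illustration only, assuming the application reads a PostgreSQL connection URL, the sketch below swaps the hostname for the restored instance's endpoint while leaving credentials, port, and database name untouched. The URL and hostnames are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_new_host(database_url: str, new_host: str) -> str:
    """Return `database_url` with only the hostname replaced by `new_host`."""
    parts = urlsplit(database_url)
    if parts.hostname is None:
        raise ValueError("database URL has no host component")
    # Split off "user:password" (if present) so it is preserved verbatim.
    userinfo, sep, _old_hostport = parts.netloc.rpartition("@")
    port = f":{parts.port}" if parts.port is not None else ""
    return urlunsplit(parts._replace(netloc=f"{userinfo}{sep}{new_host}{port}"))


if __name__ == "__main__":
    # Hypothetical connection string and endpoint.
    old_url = "postgresql://app:secret@old-db.example.com:5432/govtech"
    print(with_new_host(old_url, "restored-db.example.com"))
```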
Security Response
Timeline for security breach response:
T+30 min: Scope Assessment
Determine extent of breach:
- Check CloudTrail logs
- Review GuardDuty findings
- Identify affected resources
Responsibility Matrix
| Scenario | Detection | Response | Approval |
|---|---|---|---|
| AZ Failure | CloudWatch (auto) | - | - |
| Region Failure | On-call / CloudWatch | Infrastructure + DevOps | Project Lead |
| DB Corruption | DevOps / Users | Infrastructure | Project Lead |
| Security Breach | DevOps / Security | Entire Team | Lead + Authorities |
DR Testing Schedule
Regular testing ensures the DR plan works when needed:
| Test | Frequency | Type |
|---|---|---|
| DB backup restoration (dev) | Monthly | Automated |
| RDS Multi-AZ failover | Quarterly | Simulated |
| EKS cluster recreation (dev) | Semi-annually | Manual |
| Full region disaster drill | Annually | Manual |
Test Backup Restoration
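One check a restoration test could include is whether the newest backup is recent enough to satisfy the 24-hour RPO. The sketch below is an assumption about how such a check might look, not the project's actual test script; the function name and timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=24)  # from the Recovery Objectives section


def within_rpo(latest_backup: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """Return True if the newest backup is no older than the RPO."""
    if latest_backup > now:
        raise ValueError("backup timestamp is in the future")
    return now - latest_backup <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps: backup at 2am UTC, check at 10am UTC the same day.
    backup_time = datetime(2024, 5, 2, 2, 0, tzinfo=timezone.utc)
    check_time = datetime(2024, 5, 2, 10, 0, tzinfo=timezone.utc)
    print("RPO satisfied" if within_rpo(backup_time, check_time) else "RPO violated")
```

With daily backups at 2am UTC, a missed backup run would make this check fail from 2am on the following day.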
Emergency Contacts
| Role | Responsibility | Contact |
|---|---|---|
| Project Lead | Final restoration approval | Phone (24/7) |
| Infrastructure | Terraform, VPC, RDS | Slack #govtech-infra |
| Deployment | Kubernetes, application | Slack #govtech-deploy |
| DevOps | CI/CD, monitoring | Slack #govtech-devops |
Recovery Checklist
Use this checklist for any disaster scenario:
- Identify and document incident (time, probable cause, scope)
- Notify team and project lead
- Determine disaster scenario (1-6 above)
- Execute corresponding procedure
- Verify service restoration: ./tests/e2e/test-deployment.sh
- Communicate to users if downtime occurred
- Document post-mortem (root cause, actions taken, prevention)
- Update DR plan if gaps found in procedure
Backup Locations
RDS Snapshots
- Location: AWS RDS automated backups
- Retention: 30 days (prod), 3 days (dev)
- Encryption: KMS encrypted
- Region: us-east-1 (same as primary)
Application Backups
- Location: S3 bucket govtech-{env}-app-storage-835960996869
- Prefix: backups/postgresql/
- Format: pg_dump custom format (compressed)
- Retention: 30 days
- Schedule: Daily at 2am UTC
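The naming and retention rules above can be sketched as follows. The bucket pattern, prefix, and 30-day retention come from the list above; the `env` values (`prod`, `dev`), the object keys, and the function names are assumptions. Object records are assumed to have the shape of the `Contents` entries returned by the S3 ListObjectsV2 API (`Key`, `LastModified`).

```python
from datetime import datetime, timedelta, timezone

BACKUP_PREFIX = "backups/postgresql/"
RETENTION = timedelta(days=30)


def app_backup_bucket(env: str) -> str:
    """Expand the documented bucket pattern for one environment."""
    return f"govtech-{env}-app-storage-835960996869"


def expired_backups(objects: list[dict], now: datetime) -> list[str]:
    """Return backup keys whose age exceeds the retention window."""
    return [
        o["Key"] for o in objects
        if o["Key"].startswith(BACKUP_PREFIX) and now - o["LastModified"] > RETENTION
    ]


if __name__ == "__main__":
    # Hypothetical object listing.
    now = datetime(2024, 6, 15, 3, 0, tzinfo=timezone.utc)
    listing = [
        {"Key": "backups/postgresql/2024-05-01.dump",
         "LastModified": datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)},
        {"Key": "backups/postgresql/2024-06-14.dump",
         "LastModified": datetime(2024, 6, 14, 2, 0, tzinfo=timezone.utc)},
    ]
    print(app_backup_bucket("prod"), expired_backups(listing, now))
```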
Terraform State
- Location: S3 bucket govtech-terraform-state-835960996869
- Versioning: Enabled
- Encryption: KMS encrypted
- Backup: All versions retained indefinitely
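Because every version is retained, recovering the state comes down to choosing which earlier version to restore. A minimal sketch, assuming version records shaped like the `Versions` entries returned by the S3 ListObjectVersions API (`Key`, `VersionId`, `IsLatest`, `LastModified`); the state key and version IDs are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def previous_state_version(versions: list[dict], key: str) -> Optional[str]:
    """Return the VersionId of the newest non-current version of `key`."""
    older = [
        v for v in versions
        if v["Key"] == key and not v.get("IsLatest", False)
    ]
    newest_older = max(older, key=lambda v: v["LastModified"], default=None)
    return newest_older["VersionId"] if newest_older else None


if __name__ == "__main__":
    # Hypothetical version listing for one state file.
    state_key = "example/terraform.tfstate"
    listing = [
        {"Key": state_key, "VersionId": "v3", "IsLatest": True,
         "LastModified": datetime(2024, 5, 3, tzinfo=timezone.utc)},
        {"Key": state_key, "VersionId": "v2", "IsLatest": False,
         "LastModified": datetime(2024, 5, 2, tzinfo=timezone.utc)},
    ]
    print(previous_state_version(listing, state_key))
```

If the state object was deleted rather than overwritten, the current entry is a delete marker, so the newest non-current version is the last real state.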
Related Resources
Backup & Restore
Detailed backup procedures and restoration steps
Troubleshooting
Common issues and solutions