
Overview

The GovTech platform disaster recovery plan defines procedures to restore services after catastrophic failures. This document is based on platform/disaster-recovery/runbooks/DR_PLAN.md.

Recovery Objectives

RTO: 4 Hours

Recovery Time Objective: Maximum time to fully restore service after a disaster

RPO: 24 Hours

Recovery Point Objective: Maximum acceptable data loss (daily backups at 2am UTC)

Availability Target

Metric | Target | Allowed Downtime
Availability | 99.9% | 8.76 hours/year
RTO | 4 hours | Complete service restoration
RPO | 24 hours | Last backup at 2am UTC
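
The allowed downtime follows directly from the availability target: the unavailable fraction multiplied by the hours in a year. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative only):

```shell
# Hours of downtime per year permitted by an availability percentage.
# Usage: allowed_downtime_hours 99.9
allowed_downtime_hours() {
  awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 365 * 24 }'
}

# allowed_downtime_hours 99.9   prints 8.76
```

Against this budget, a single incident that uses the full 4-hour RTO consumes roughly half of the yearly allowance.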

Disaster Scenarios

1. Availability Zone Failure

Probability: Low | Impact: High
Auto-Recovery: AWS Multi-AZ architecture handles this automatically
  • RDS: Automatic failover to standby replica in another AZ (1-2 minutes)
  • EKS: Pods redistribute to nodes in other AZs automatically
  • ALB: Always distributes traffic across multiple AZs
Action Required: None. Failover is automatic.
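
Automatic failover only works if the database is actually deployed Multi-AZ, so it can be worth confirming that ahead of time. The following is a sketch rather than part of the plan's tooling: `require_multi_az` is a hypothetical helper, and the instance identifier in the usage line is the one used elsewhere in this document.

```shell
# Succeed when the given RDS instance has a Multi-AZ standby, fail otherwise.
# $1 = DB instance identifier
require_multi_az() {
  status=$(aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].MultiAZ' \
    --output text) || return 2
  if [ "$status" = "True" ]; then
    echo "OK: $1 is Multi-AZ"
  else
    echo "WARNING: $1 is not Multi-AZ (MultiAZ=$status)" >&2
    return 1
  fi
}

# Usage:
#   require_multi_az govtech-prod-postgres
```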

2. Complete Region Failure

Probability: Very Low | Impact: Critical
This scenario requires manual recovery. The current architecture is single-region (us-east-1).
Recovery Procedure: Complete Region Restoration
Estimated Time: 2-4 hours

3. Database Corruption or Accidental Deletion

Probability: Medium | Impact: High
RDS has automatic daily backups:
  • Production: 30-day retention
  • Development: 3-day retention
Recovery Procedure: Database Restoration
Estimated Time: 20-45 minutes

4. Security Breach

Probability: Low | Impact: Very High
Recovery Procedure: Security Incident Response
Immediate Action: 15 minutes to containment

5. EKS Cluster Failure

Probability: Low | Impact: High
EKS control plane is managed by AWS (99.95% SLA). Worker node failures are handled automatically by Node Groups.
Action: If the entire Node Group fails, recreate it with Terraform.
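
Before recreating a Node Group, it helps to confirm that no workers are still healthy. A minimal sketch, assuming `kubectl` is already pointed at the affected cluster; `ready_node_count` is a hypothetical helper name:

```shell
# Count worker nodes whose STATUS column is exactly "Ready".
# Nodes reported as "NotReady" or "Ready,SchedulingDisabled" are not counted.
ready_node_count() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage:
#   if [ "$(ready_node_count)" -eq 0 ]; then echo "No Ready nodes"; fi
```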

6. Terraform State Deletion

Probability: Low | Impact: High
Terraform state in S3 has versioning enabled, so previous states can be recovered.
# List versions of state file
aws s3api list-object-versions \
  --bucket govtech-terraform-state-835960996869 \
  --prefix prod/terraform.tfstate

# Download a previous version to a local file (this does not overwrite the live state)
aws s3api get-object \
  --bucket govtech-terraform-state-835960996869 \
  --key prod/terraform.tfstate \
  --version-id <VERSION_ID> \
  terraform.tfstate.restored
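
When the state object was deleted rather than overwritten, a versioned bucket records a delete marker, and removing that marker makes the previous version current again. The sketch below illustrates this under stated assumptions: the function and variable names are hypothetical, and it assumes the caller has permission to delete object versions in the state bucket.

```shell
STATE_BUCKET="govtech-terraform-state-835960996869"
STATE_KEY="prod/terraform.tfstate"

# Print the version ID of the current delete marker for STATE_KEY,
# or "None" if the object has not been deleted.
current_delete_marker() {
  aws s3api list-object-versions \
    --bucket "$STATE_BUCKET" \
    --prefix "$STATE_KEY" \
    --query "DeleteMarkers[?Key=='$STATE_KEY' && IsLatest].VersionId | [0]" \
    --output text
}

# Remove the delete marker so the last real version becomes current again.
undelete_state() {
  marker=$(current_delete_marker) || return 2
  if [ -z "$marker" ] || [ "$marker" = "None" ]; then
    echo "No delete marker found for s3://$STATE_BUCKET/$STATE_KEY" >&2
    return 1
  fi
  aws s3api delete-object \
    --bucket "$STATE_BUCKET" --key "$STATE_KEY" --version-id "$marker"
}
```

Deleting a specific version ID removes only the marker; the state versions themselves are untouched.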

Recovery Procedures

Region Restoration

Phase 1: Infrastructure (45-90 min)

Deploy infrastructure in alternate region using Terraform
cd terraform/environments/prod

# Update region in terraform.tfvars
sed -i 's/us-east-1/us-east-2/g' terraform.tfvars

# Apply infrastructure
terraform init
terraform plan
terraform apply

Phase 2: Database (30-60 min)

Restore RDS from snapshot (copy snapshot cross-region first)
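# Copy snapshot to new region
# (KMS-encrypted snapshots need a KMS key that exists in the target region)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:ACCOUNT:snapshot:NAME \
  --target-db-snapshot-identifier govtech-prod-restore \
  --source-region us-east-1 \
  --kms-key-id <TARGET_REGION_KMS_KEY_ID> \
  --region us-east-2

# The copy is asynchronous: wait until the snapshot is available before restoring
aws rds wait db-snapshot-available \
  --db-snapshot-identifier govtech-prod-restore \
  --region us-east-2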

# Restore from snapshot
./disaster-recovery/scripts/restore-infrastructure.sh \
  --region us-east-2 \
  --environment prod

Phase 3: Application (15-30 min)

Deploy application to new EKS cluster
# Connect to new cluster
aws eks update-kubeconfig --name govtech-prod --region us-east-2

# Deploy application
kubectl apply -f kubernetes/
kubectl rollout status deployment/backend -n govtech
kubectl rollout status deployment/frontend -n govtech

Phase 4: DNS (5-15 min)

Update Route53 to point to new ALB
# Get new ALB DNS
kubectl get ingress govtech-ingress -n govtech \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Update Route53 record (manually or via Terraform)
# Point govtech.example.com to new ALB

Automated Script: disaster-recovery/scripts/restore-infrastructure.sh
chmod +x disaster-recovery/scripts/restore-infrastructure.sh
./disaster-recovery/scripts/restore-infrastructure.sh \
  --region us-east-2 \
  --environment prod
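
The Phase 4 DNS update can also be scripted instead of done by hand. The sketch below assumes the record is a plain CNAME; if the existing record is an alias A record, the change batch would need a different shape. `HOSTED_ZONE_ID` and both function names are placeholders that do not come from the DR plan.

```shell
# Build a Route 53 change batch that UPSERTs a CNAME record.
# $1 = record name, $2 = target hostname (the new ALB DNS name)
cname_change_batch() {
  cat <<EOF
{
  "Comment": "DR failover: repoint $1",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "$1",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "$2" }]
    }
  }]
}
EOF
}

# Apply the change. HOSTED_ZONE_ID must be set to the zone that owns the record.
repoint_dns() {
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$HOSTED_ZONE_ID" \
    --change-batch "$(cname_change_batch "$1" "$2")"
}

# Usage (ALB_HOSTNAME taken from the kubectl get ingress command in Phase 4):
#   HOSTED_ZONE_ID=Z_PLACEHOLDER repoint_dns govtech.example.com "$ALB_HOSTNAME"
```

A short TTL keeps the cutover window small, at the cost of more DNS queries while it is in effect.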

Database Restoration

# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier govtech-prod-postgres \
  --query 'DBSnapshots[*].{Date:SnapshotCreateTime,ID:DBSnapshotIdentifier}' \
  --output table

# Restore the most recent snapshot
chmod +x disaster-recovery/scripts/restore-database.sh
./disaster-recovery/scripts/restore-database.sh \
  --snapshot latest \
  --environment prod

Production Approval Required: The script requires an approval code in the format RESTORE-PROD-YYYYMMDD. Example: RESTORE-PROD-20260303

# Or restore a specific snapshot by identifier
./disaster-recovery/scripts/restore-database.sh \
  --snapshot rds:govtech-prod-postgres-2026-03-01-02-00 \
  --environment prod
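
The approval code has a fixed, date-based format, so it can be generated rather than typed. This is only a sketch: it assumes the date part is the current UTC date (the script may use local time instead), and how the code is passed to the script is not specified here.

```shell
# Build an approval code in the documented RESTORE-PROD-YYYYMMDD format.
# Assumption: the date component is today's date in UTC.
approval_code() {
  printf 'RESTORE-PROD-%s\n' "$(date -u +%Y%m%d)"
}

# Usage:
#   echo "Approval code for today: $(approval_code)"
```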
After restoration, update the application to use the new database:
# Get new database endpoint
NEW_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier govtech-prod-postgres-restored-TIMESTAMP \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

# Update ConfigMap: set DB_HOST to the new endpoint
kubectl patch configmap govtech-config -n govtech \
  --type merge \
  -p "{\"data\":{\"DB_HOST\":\"${NEW_ENDPOINT}\"}}"

# Restart backend
kubectl rollout restart deployment/backend -n govtech

# Verify
./tests/e2e/test-deployment.sh

Security Response

Timeline for security breach response:
1. T+0 min: Detection

Detection via CloudWatch alarm, GuardDuty, or security report

2. T+5 min: Notification

Alert on-call team (phone call, not just Slack message)

3. T+15 min: Containment

Revoke compromised credentials immediately
chmod +x disaster-recovery/scripts/security-response.sh
./disaster-recovery/scripts/security-response.sh \
  --user <compromised-user>

4. T+30 min: Scope Assessment

Determine extent of breach:
  • Check CloudTrail logs
  • Review GuardDuty findings
  • Identify affected resources

5. T+60 min: Remediation Decision

Decide: restore from clean backup OR patch in place

6. T+120 min: Notification

If data compromised, notify affected parties per regulations
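
For the scope assessment step (T+30 min), the CloudTrail check can be narrowed to the compromised identity. The sketch below is an illustration, not the contents of security-response.sh: both function names are hypothetical, and `lookup-events` only covers management events from roughly the last 90 days.

```shell
# List recent CloudTrail management events performed by one IAM user.
# $1 = user name, $2 = ISO 8601 start time (e.g. 2026-03-03T00:00:00Z)
events_for_user() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=Username,AttributeValue=$1" \
    --start-time "$2" \
    --query 'Events[].[EventTime,EventSource,EventName]' \
    --output text
}

# Summarise which API calls were made, most frequent first.
# Reads "time source name" rows on stdin, prints "count source:name".
summarize_events() {
  awk '{ count[$2 ":" $3]++ } END { for (k in count) print count[k], k }' | sort -rn
}

# Usage:
#   events_for_user "$COMPROMISED_USER" 2026-03-03T00:00:00Z | summarize_events
```

The summary gives a quick view of which services the identity touched, which feeds the "identify affected resources" item.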

Responsibility Matrix

Scenario | Detection | Response | Approval
AZ Failure | CloudWatch (auto) | - | -
Region Failure | On-call / CloudWatch | Infrastructure + DevOps | Project Lead
DB Corruption | DevOps / Users | Infrastructure | Project Lead
Security Breach | DevOps / Security | Entire Team | Lead + Authorities

DR Testing Schedule

Regular testing ensures the DR plan works when needed:
Test | Frequency | Type
DB backup restoration (dev) | Monthly | Automated
RDS Multi-AZ failover | Quarterly | Simulated
EKS cluster recreation (dev) | Semi-annually | Manual
Full region disaster drill | Annually | Manual
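
The quarterly Multi-AZ failover test can be simulated by rebooting the instance with a forced failover and timing how long it takes to become available again. A minimal sketch, assuming the target instance is already Multi-AZ (the flag is rejected otherwise); the function name and the instance identifier in the usage line are placeholders.

```shell
# Force a Multi-AZ failover, wait for the instance to report "available",
# and print how long the whole operation took.
# $1 = DB instance identifier
simulate_failover() {
  start=$(date +%s)
  aws rds reboot-db-instance \
    --db-instance-identifier "$1" \
    --force-failover >/dev/null || return 1
  aws rds wait db-instance-available --db-instance-identifier "$1" || return 1
  echo "Failover of $1 completed in $(( $(date +%s) - start ))s"
}

# Usage (identifier is a placeholder):
#   simulate_failover <db-instance-identifier>
```

The measured time includes the waiter's polling interval, so it is an upper bound to compare against the 1-2 minute failover expectation.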

Test Backup Restoration

# Verify backup without actual restoration
./disaster-recovery/scripts/restore-database.sh \
  --snapshot latest \
  --environment dev \
  --verify-only

Emergency Contacts

Role | Responsibility | Contact
Project Lead | Final restoration approval | Phone (24/7)
Infrastructure | Terraform, VPC, RDS | Slack #govtech-infra
Deployment | Kubernetes, application | Slack #govtech-deploy
DevOps | CI/CD, monitoring | Slack #govtech-devops

Escalation Policy: If no progress after 30 minutes, escalate to the next level immediately.

Recovery Checklist

Use this checklist for any disaster scenario:
  • Identify and document incident (time, probable cause, scope)
  • Notify team and project lead
  • Determine disaster scenario (1-6 above)
  • Execute corresponding procedure
  • Verify service restoration: ./tests/e2e/test-deployment.sh
  • Communicate to users if downtime occurred
  • Document post-mortem (root cause, actions taken, prevention)
  • Update DR plan if gaps found in procedure

Backup Locations

RDS Snapshots

  • Location: AWS RDS automated backups
  • Retention: 30 days (prod), 3 days (dev)
  • Encryption: KMS encrypted
  • Region: us-east-1 (same as primary)
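
Because the RPO depends on a snapshot existing within the last 24 hours, the age of the newest snapshot is worth checking automatically. The sketch below shows one way to do that; the helper names are hypothetical, and `hours_since` relies on GNU `date` for parsing ISO 8601 timestamps.

```shell
# Creation time of the newest available snapshot for an instance.
latest_snapshot_time() {
  aws rds describe-db-snapshots \
    --db-instance-identifier "$1" \
    --query "sort_by(DBSnapshots[?Status=='available'], &SnapshotCreateTime)[-1].SnapshotCreateTime" \
    --output text
}

# Whole hours between an ISO 8601 timestamp ($1) and a reference epoch
# ($2, defaults to now).
hours_since() {
  then_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  echo $(( (now_epoch - then_epoch) / 3600 ))
}

# Fail when the newest snapshot is older than the 24-hour RPO.
check_rpo() {
  age=$(hours_since "$(latest_snapshot_time "$1")") || return 2
  if [ "$age" -le 24 ]; then
    echo "OK: newest snapshot is ${age}h old"
  else
    echo "RPO BREACH: newest snapshot is ${age}h old" >&2
    return 1
  fi
}

# Usage:
#   check_rpo govtech-prod-postgres
```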

Application Backups

  • Location: S3 bucket govtech-{env}-app-storage-835960996869
  • Prefix: backups/postgresql/
  • Format: pg_dump custom format (compressed)
  • Retention: 30 days
  • Schedule: Daily at 2am UTC
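
Since these dumps use the pg_dump custom format, they are read with `pg_restore`, and its `--list` option prints the archive's table of contents without touching a database. The sketch below assumes `{env}` resolves to `prod` and that the newest object under the prefix is the latest dump; the variable and function names are hypothetical.

```shell
BACKUP_BUCKET="govtech-prod-app-storage-835960996869"   # assumes {env} = prod
BACKUP_PREFIX="backups/postgresql/"

# Key of the most recently modified object under the backup prefix.
latest_backup_key() {
  aws s3api list-objects-v2 \
    --bucket "$BACKUP_BUCKET" --prefix "$BACKUP_PREFIX" \
    --query 'sort_by(Contents, &LastModified)[-1].Key' \
    --output text
}

# Download the newest dump and list its contents without restoring it.
inspect_latest_backup() {
  key=$(latest_backup_key) || return 2
  if [ -z "$key" ] || [ "$key" = "None" ]; then
    echo "No backups found under $BACKUP_PREFIX" >&2
    return 1
  fi
  aws s3 cp "s3://$BACKUP_BUCKET/$key" ./latest.dump || return 2
  pg_restore --list ./latest.dump
}
```

A dump that lists cleanly is not proof that it restores, but a failure here is an early signal that the backup is unusable.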

Terraform State

  • Location: S3 bucket govtech-terraform-state-835960996869
  • Versioning: Enabled
  • Encryption: KMS encrypted
  • Backup: All versions retained indefinitely
