Overview

The GovTech platform implements automated daily backups for all critical data, with separate retention policies for production and development environments.

Backup Strategy

RDS Automated Backups

AWS-managed automated snapshots with point-in-time recovery

PostgreSQL Dumps

Custom pg_dump backups stored in S3 for additional safety

Terraform State

Versioned infrastructure state in S3

Application Data

User uploads and files in S3 with versioning

Automated PostgreSQL Backup

Backup Schedule

Environment    Frequency   Time (UTC)   Retention
Production     Daily       2:00 AM      30 days
Staging        Daily       2:00 AM      14 days
Development    Daily       2:00 AM      7 days

Ansible Playbook

The backup process is automated using Ansible: ansible/playbooks/backup.yml

What it does:
  1. Connects to PostgreSQL pod in Kubernetes
  2. Executes pg_dump to create complete database backup
  3. Compresses backup with gzip (level 9)
  4. Uploads to S3 with date-stamped filename
  5. Verifies backup integrity
  6. Cleans up temporary files
  7. Deletes backups older than retention period
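The naming used in steps 2 and 4 can be sketched as follows (variable names are illustrative assumptions, not the playbook's actual variables; the filename pattern matches the examples later on this page):

```shell
# Build the date-stamped dump filename and its S3 key
STAMP=$(date -u +%Y%m%d_%H%M)             # e.g. 20260301_0200
DUMP_FILE="govtech_${STAMP}.dump"
S3_KEY="backups/postgresql/${DUMP_FILE}"
echo "$S3_KEY"
```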

Manual Backup Execution

# Run backup manually
ansible-playbook -i ansible/inventory/hosts.yml \
  ansible/playbooks/backup.yml

# Override retention period
ansible-playbook ansible/playbooks/backup.yml \
  -e "retention_days=30 namespace=govtech"

Scheduled Execution

Set up automated backups using cron or AWS EventBridge:
# Edit crontab
crontab -e

# Add this line for daily backup at 2am UTC
0 2 * * * ansible-playbook /opt/govtech/ansible/playbooks/backup.yml >> /var/log/govtech-backup.log 2>&1
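If a slow dump could overlap with the next scheduled run, the cron entry can be wrapped in flock so a second invocation fails fast instead of colliding. This is an optional hardening step, not part of the shipped playbook:

```shell
# Optional: serialize runs with flock (-n = give up immediately if locked)
0 2 * * * flock -n /var/lock/govtech-backup.lock ansible-playbook /opt/govtech/ansible/playbooks/backup.yml >> /var/log/govtech-backup.log 2>&1
```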

Backup Process Details

Step 1: Database Dump

The playbook uses PostgreSQL’s pg_dump with custom format:
pg_dump \
  --username=govtech_admin \
  --dbname=govtech \
  --format=custom \
  --compress=9 \
  --file=/tmp/govtech_YYYYMMDD_HHMM.dump
Why custom format?
  • More efficient than SQL plain text
  • Enables parallel restoration
  • Allows selective table restoration
  • Built-in compression

Step 2: Copy to Local

kubectl cp \
  govtech/postgres-0:/tmp/govtech_YYYYMMDD_HHMM.dump \
  /tmp/govtech-backups/govtech_YYYYMMDD_HHMM.dump

Step 3: Integrity Verification

# Verify backup is valid without restoring
pg_restore --list /tmp/govtech_YYYYMMDD_HHMM.dump

# Count tables included
echo "$RESTORE_OUTPUT" | grep -c 'TABLE DATA'
The backup process validates the dump file before uploading to S3, ensuring you never store corrupted backups.
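The validation step can be wrapped in a small helper. This is a hypothetical sketch (the function name and threshold are assumptions); it reads `pg_restore --list` output from stdin:

```shell
# Hypothetical helper: succeed only if the listing contains at least
# min_tables 'TABLE DATA' entries (i.e. the dump is not empty or corrupt)
verify_dump_listing() {
  local min_tables=$1
  local count
  count=$(grep -c 'TABLE DATA' || true)   # grep exits 1 on zero matches
  [ "$count" -ge "$min_tables" ]
}

# usage:
# pg_restore --list /tmp/govtech_YYYYMMDD_HHMM.dump | verify_dump_listing 5
```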

Step 4: Upload to S3

aws s3 cp /tmp/govtech_YYYYMMDD_HHMM.dump \
  s3://govtech-prod-app-storage-835960996869/backups/postgresql/ \
  --storage-class STANDARD_IA \
  --region us-east-1
Storage Class: STANDARD_IA (Infrequent Access)
  • Lower cost than STANDARD
  • Suitable for backups (rarely accessed)
  • Same durability (99.999999999%)

Step 5: Cleanup and Rotation

# Delete backups older than retention_days
aws s3api list-objects-v2 \
  --bucket govtech-prod-app-storage-835960996869 \
  --prefix backups/postgresql/ \
  --query "Contents[?LastModified<='RETENTION_DATE'].Key" \
  --output text | xargs -I {} aws s3 rm s3://BUCKET/{}
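The RETENTION_DATE placeholder can be computed from the retention period; a sketch assuming GNU date is available on the host running the cleanup:

```shell
# Cutoff timestamp: anything with LastModified before this is expired
RETENTION_DAYS=30
RETENTION_DATE=$(date -u -d "${RETENTION_DAYS} days ago" +%Y-%m-%dT%H:%M:%SZ)
echo "$RETENTION_DATE"
```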

RDS Automated Backups

Configuration

RDS provides automated snapshots configured via Terraform:
terraform/modules/database/aws.tf
resource "aws_db_instance" "govtech" {
  backup_retention_period = 30             # Production: 30 days
  backup_window           = "02:00-03:00"  # UTC
  
  # Take a final snapshot before the instance is ever destroyed
  skip_final_snapshot = false
  final_snapshot_identifier = "govtech-prod-final-snapshot"
}

List RDS Snapshots

# List automated snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier govtech-prod-postgres \
  --snapshot-type automated \
  --query 'DBSnapshots[*].{Date:SnapshotCreateTime,ID:DBSnapshotIdentifier,Size:AllocatedStorage}' \
  --output table

# List manual snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier govtech-prod-postgres \
  --snapshot-type manual \
  --output table

Create Manual Snapshot

# Create manual snapshot (kept until manually deleted)
aws rds create-db-snapshot \
  --db-instance-identifier govtech-prod-postgres \
  --db-snapshot-identifier govtech-prod-manual-$(date +%Y%m%d-%H%M)

Restoration Procedures

Restore from RDS Snapshot

Use the automated restoration script:
disaster-recovery/scripts/restore-database.sh
chmod +x disaster-recovery/scripts/restore-database.sh

# Restore from latest snapshot
./disaster-recovery/scripts/restore-database.sh \
  --snapshot latest \
  --environment prod

# Restore from specific snapshot
./disaster-recovery/scripts/restore-database.sh \
  --snapshot rds:govtech-prod-postgres-2026-03-01-02-00 \
  --environment prod
Production Restoration Approval

The script requires an approval code for production restores: RESTORE-PROD-YYYYMMDD. This prevents accidental production restores.

Script Workflow

Step 1: Identify Snapshot

If --snapshot latest, automatically finds most recent snapshot
aws rds describe-db-snapshots \
  --db-instance-identifier govtech-prod-postgres \
  --snapshot-type automated \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier'
Step 2: Validate Snapshot

Verify snapshot exists and is in ‘available’ state
aws rds describe-db-snapshots \
  --db-snapshot-identifier $SNAPSHOT_ID \
  --query 'DBSnapshots[0].{Date:SnapshotCreateTime,Size:AllocatedStorage,Status:Status}'
Step 3: Request Approval (Production Only)

For production environments, require approval code
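A minimal sketch of what such a guard can look like (the function is hypothetical; only the RESTORE-PROD-YYYYMMDD code format comes from this page):

```shell
# Hypothetical guard: the approval code must embed today's UTC date,
# so a code noted down on a previous day is rejected
check_approval() {
  [ "$1" = "RESTORE-PROD-$(date -u +%Y%m%d)" ]
}

# usage:
# check_approval "$APPROVAL_CODE" || { echo "approval rejected"; exit 1; }
```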
Step 4: Restore to New Instance

Creates new RDS instance from snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier govtech-prod-postgres-restored-TIMESTAMP \
  --db-snapshot-identifier $SNAPSHOT_ID \
  --db-instance-class db.t3.micro \
  --no-publicly-accessible
The script creates a NEW instance rather than overwriting the existing one. This allows rollback if needed.
Step 5: Wait for Availability

aws rds wait db-instance-available \
  --db-instance-identifier govtech-prod-postgres-restored-TIMESTAMP
Typical time: 10-30 minutes depending on database size
Step 6: Get New Endpoint

NEW_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier govtech-prod-postgres-restored-TIMESTAMP \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

Update Application

After restoration, point the application to the new database:
# Update Kubernetes ConfigMap
kubectl edit configmap govtech-config -n govtech
# Change DB_HOST to new endpoint

# Restart backend pods
kubectl rollout restart deployment/backend -n govtech

# Verify connectivity
kubectl exec -it deploy/backend -n govtech -- \
  psql -h $NEW_ENDPOINT -U govtech_admin -d govtech -c "SELECT version();"

# Run E2E tests
./tests/e2e/test-deployment.sh

Restore from S3 Backup (pg_dump)

If RDS snapshots are unavailable, restore from S3 pg_dump backups:
Step 1: Download Backup from S3

# List available backups
aws s3 ls s3://govtech-prod-app-storage-835960996869/backups/postgresql/

# Download specific backup
aws s3 cp \
  s3://govtech-prod-app-storage-835960996869/backups/postgresql/govtech_20260301_0200.dump \
  /tmp/restore.dump
Step 2: Copy to PostgreSQL Pod

kubectl cp /tmp/restore.dump \
  govtech/postgres-0:/tmp/restore.dump
Step 3: Drop and Recreate Database

This will delete all current data! Ensure you have a backup of the current state.
kubectl exec -it postgres-0 -n govtech -- bash

# Inside the pod
psql -U postgres
DROP DATABASE govtech;
CREATE DATABASE govtech;
\q
Step 4: Restore from Dump

pg_restore \
  --username=postgres \
  --dbname=govtech \
  --verbose \
  --no-owner \
  --no-acl \
  /tmp/restore.dump
Step 5: Verify Restoration

psql -U postgres -d govtech

# Check tables
\dt

# Check row counts
SELECT count(*) FROM users;
SELECT count(*) FROM documents;

Backup Monitoring

CloudWatch Alarms

Set up alarms for backup failures:
# Create SNS topic for backup alerts
aws sns create-topic --name govtech-backup-alerts

# Subscribe email to topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT:govtech-backup-alerts \
  --protocol email \
  --notification-endpoint [email protected]

Verify Backup Success

# Check latest backup in S3
aws s3 ls s3://govtech-prod-app-storage-835960996869/backups/postgresql/ \
  --recursive --human-readable --summarize | tail -10

# Verify backup age (should be < 24 hours)
LATEST=$(aws s3api list-objects-v2 \
  --bucket govtech-prod-app-storage-835960996869 \
  --prefix backups/postgresql/ \
  --query 'sort_by(Contents, &LastModified)[-1].LastModified' \
  --output text)

echo "Latest backup: $LATEST"
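To turn the timestamp into an actionable check, a helper along these lines can flag a stale backup (GNU date assumed for the -d flag; the 24-hour threshold mirrors the comment above):

```shell
# Backup age in whole hours, from an ISO-8601 LastModified timestamp
backup_age_hours() {
  local ts=$1
  echo $(( ( $(date -u +%s) - $(date -u -d "$ts" +%s) ) / 3600 ))
}

# usage, with LATEST from the previous command:
# AGE=$(backup_age_hours "$LATEST")
# [ "$AGE" -lt 24 ] || echo "WARNING: latest backup is ${AGE}h old"
```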

Backup Best Practices

1. 3-2-1 Backup Rule

  • 3 copies of data (original + 2 backups)
  • 2 different storage types (RDS snapshots + S3 dumps)
  • 1 off-site copy (enable S3 cross-region replication for production)
2. Test Restorations Monthly

# Test restore in dev environment
./disaster-recovery/scripts/restore-database.sh \
  --snapshot latest \
  --environment dev \
  --verify-only
3. Document Restoration Time

Track actual restoration times to verify RTO compliance:
  • Dev: ~15 minutes
  • Staging: ~25 minutes
  • Production: ~45 minutes (larger database)
4. Encrypt All Backups

  • RDS snapshots: Encrypted with KMS
  • S3 backups: Server-side encryption enabled
  • In-transit: TLS for all transfers
5. Automate Verification

Run automated tests after backup:
  • File size > 0
  • pg_restore --list succeeds
  • Table count matches expected

Storage Costs

RDS Snapshots

  • Cost: $0.095/GB-month (us-east-1)
  • Estimated: ~$5-10/month for typical production database
  • Retention: 30 days automated, manual snapshots until deleted

S3 Backups (STANDARD_IA)

  • Storage: $0.0125/GB-month
  • Retrieval: $0.01/GB (infrequent)
  • Estimated: ~$2-5/month for compressed dumps

Terraform State Backups

Terraform state is automatically versioned in S3:
# List all versions of state file
aws s3api list-object-versions \
  --bucket govtech-terraform-state-835960996869 \
  --prefix prod/terraform.tfstate \
  --query 'Versions[*].{Date:LastModified,VersionId:VersionId,Size:Size}' \
  --output table

# Restore specific version
aws s3api get-object \
  --bucket govtech-terraform-state-835960996869 \
  --key prod/terraform.tfstate \
  --version-id <VERSION_ID> \
  terraform.tfstate.restored

Disaster Recovery

Complete DR plan with RTO/RPO targets

Troubleshooting

Common backup and restore issues
