Disaster recovery and backup strategies for Microsoft 365 configurations managed with Terraform
This guide covers disaster recovery strategies, backup approaches, and recovery procedures for Microsoft 365 configurations managed with Terraform. Learn how to protect your infrastructure-as-code and rapidly recover from configuration loss, tenant corruption, or accidental deletion.
When managing Microsoft 365 with Terraform, your disaster recovery strategy must address:
Terraform state backup and recovery - Protecting your source of truth
Configuration code versioning - Version control for .tf files
Resource configuration backup - Point-in-time snapshots of deployed resources
Credential recovery - Secure backup of authentication credentials
Multi-region resilience - Geographic distribution for business continuity
Microsoft 365 tenants do not have native “restore to point in time” capabilities for all configuration types. Terraform becomes your primary disaster recovery mechanism for configuration-as-code.
Tenant settings: Some global tenant settings are not exposed via the Graph API.
For comprehensive Microsoft 365 backup including user data, consider third-party backup solutions (Veeam, AvePoint, Commvault) alongside your Terraform disaster recovery strategy.
The Terraform state file is your most critical asset for disaster recovery. Loss of state means loss of resource tracking and potential configuration drift.
# List the resources tracked in the current state
terraform state list
# Roll back to a previous state version via the UI or API
# Terraform Cloud UI → Workspace → States → Select version → Restore
Schedule regular state file backups to secondary storage:
backup-state.sh
#!/bin/bash
# Schedule via cron: 0 */6 * * * /path/to/backup-state.sh
set -euo pipefail

BACKUP_DIR="/secure/terraform-backups/m365"
DATE=$(date +%Y%m%d-%H%M%S)
# The recipient address is redacted in the source; supply your own key ID or email.
GPG_RECIPIENT="${GPG_RECIPIENT:?set GPG_RECIPIENT to the backup key ID or email}"

# Pull current state
terraform state pull > "${BACKUP_DIR}/terraform-${DATE}.tfstate"

# Compress and encrypt
tar -czf - -C "${BACKUP_DIR}" "terraform-${DATE}.tfstate" | \
  gpg --encrypt --recipient "${GPG_RECIPIENT}" > \
  "${BACKUP_DIR}/terraform-${DATE}.tfstate.tar.gz.gpg"

# Remove unencrypted backup
rm "${BACKUP_DIR}/terraform-${DATE}.tfstate"

# Retain last 30 days of backups
find "${BACKUP_DIR}" -name "terraform-*.tfstate.tar.gz.gpg" -mtime +30 -delete

echo "State backup completed: ${DATE}"
Best practice: Store state backups in a different geographic region and different cloud provider than your primary state backend for maximum resilience.
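A backup is only useful if it can be restored. The following is a minimal sketch of the reverse path, assuming backups follow the naming used by backup-state.sh above. The function names (`latest_backup`, `looks_like_state`, `restore_latest`) are hypothetical and not part of any Terraform tooling.

```shell
#!/bin/bash
# Hypothetical helpers for restoring a state backup created by backup-state.sh.

# Print the newest backup in directory $1; return non-zero if none exist.
# The YYYYMMDD-HHMMSS timestamps sort lexicographically, and glob results
# are sorted, so the last match is the most recent backup.
latest_backup() {
  latest=""
  for f in "$1"/terraform-*.tfstate.tar.gz.gpg; do
    if [ -e "$f" ]; then
      latest="$f"
    fi
  done
  if [ -z "$latest" ]; then
    return 1
  fi
  printf '%s\n' "$latest"
}

# Cheap sanity check: a Terraform state file is JSON that contains
# "lineage" and "serial" keys. This does not validate the full schema.
looks_like_state() {
  grep -q '"lineage"' "$1" && grep -q '"serial"' "$1"
}

# Decrypt the newest backup in directory $1 and push it to the
# configured backend. Requires gpg, tar, and terraform on PATH.
restore_latest() {
  backup=$(latest_backup "$1") || { echo "no backups found in $1" >&2; return 1; }
  workdir=$(mktemp -d)
  gpg --decrypt "$backup" | tar -xzf - -C "$workdir"
  # Locate the extracted file regardless of the path stored in the archive.
  state=$(find "$workdir" -type f -name 'terraform-*.tfstate' | head -n 1)
  if ! looks_like_state "$state"; then
    echo "extracted file does not look like a Terraform state" >&2
    rm -rf "$workdir"
    return 1
  fi
  rc=0
  terraform state push "$state" || rc=$?
  # The decrypted state may contain secrets; do not leave it on disk.
  rm -rf "$workdir"
  return "$rc"
}
```

Calling `restore_latest /secure/terraform-backups/m365` from an initialized working directory would attempt the restore. Note that `terraform state push` may refuse the upload when the backup's lineage differs from the remote state or its serial is older; review the message before deciding whether an override is appropriate.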
Symptom: A critical conditional access policy was accidentally deleted via the Azure portal
1. Identify deleted resource
# Run terraform plan to detect drift
terraform plan -parallelism=1
# Output shows resource missing from tenant:
# Plan: 1 to add, 0 to change, 0 to destroy.
2. Restore from Terraform
# Re-apply configuration to recreate resource
terraform apply -parallelism=1
# Terraform will recreate the deleted resource with same configuration
3. Verify restoration
# Confirm resource restored
terraform plan -parallelism=1
# Output: No changes. Your infrastructure matches the configuration.
Prevention: Enable Azure AD audit logs and set up alerts for policy deletions.
# Remove the local working directory (cached backend settings and providers);
# the state itself is backed up separately
rm -rf .terraform/
# Initialize with new backend
terraform init
4. Deploy from code
# Preview deployment
terraform plan -parallelism=1
# Deploy all resources to new tenant
terraform apply -parallelism=1 -auto-approve
# Expected duration: 30-90 minutes for typical deployment
5. Verify and test
Test conditional access policies
Verify group memberships
Validate Intune configurations
Test user sign-in flows
Recovery time objective (RTO): 2-4 hours for complete tenant rebuild from Terraform code
# Detect drift
terraform plan -refresh-only -parallelism=1
# Review detected changes
# Terraform will show all configuration differences

# Option 1: Revert to Terraform-defined state
terraform apply -parallelism=1

# Option 2: Accept drift and update Terraform code
# Edit .tf files to match current state, then:
terraform plan -parallelism=1  # Should show no changes

# Option 3: Update state to match the tenant without changing any resources
terraform apply -refresh-only -parallelism=1
Prevention: Implement continuous drift detection via scheduled CI/CD jobs.
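One way to sketch such a scheduled check is with the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The function names below (`classify_plan_exit`, `check_drift`) are hypothetical, and how the job is scheduled and who is notified is left to your CI/CD system.

```shell
#!/bin/bash
# Hypothetical drift check intended to run from a scheduled CI/CD job.

# Map the exit code of `terraform plan -detailed-exitcode` to a status word.
#   0 = plan succeeded, no changes
#   2 = plan succeeded, changes present (the tenant differs from the code)
#   anything else = the plan itself failed
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a plan and report the result. Returns non-zero on drift or error
# so that the surrounding job fails and can trigger a notification.
check_drift() {
  rc=0
  terraform plan -detailed-exitcode -input=false -parallelism=1 || rc=$?
  status=$(classify_plan_exit "$rc")
  echo "drift check: $status"
  [ "$status" = "in-sync" ]
}
```

A scheduled job would call `check_drift` from an initialized working directory. Treating drift and plan errors separately lets a pipeline alert on real configuration changes without confusing them with authentication or backend failures.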