Disaster recovery and backup

This guide covers disaster recovery strategies, backup approaches, and recovery procedures for Microsoft 365 configurations managed with Terraform. Learn how to protect your infrastructure-as-code and rapidly recover from configuration loss, tenant corruption, or accidental deletion.

Overview

When managing Microsoft 365 with Terraform, your disaster recovery strategy must address:

Terraform state backup and recovery - Protecting your source of truth
Configuration code versioning - Version control for .tf files
Resource configuration backup - Point-in-time snapshots of deployed resources
Credential recovery - Secure backup of authentication credentials
Multi-region resilience - Geographic distribution for business continuity

Microsoft 365 tenants do not have native “restore to point in time” capabilities for all configuration types. Terraform becomes your primary disaster recovery mechanism for configuration-as-code.

Understanding the disaster recovery scope

What Terraform protects

Configuration policies

Conditional Access policies
Compliance policies
Device configuration profiles
App protection policies

Identity resources

Group definitions and memberships
Named locations
Authentication strength policies
Custom security attributes

Intune resources

Device enrollment configurations
App assignments
Update rings
PowerShell scripts

Organizational structure

Administrative units
Role assignments
Settings catalog policies
Application registrations

What Terraform does NOT protect

Terraform manages configuration, not data. The following require separate backup strategies:

User data: Email, OneDrive files, SharePoint content
Historical data: Audit logs, sign-in logs, compliance reports
License assignments: User-to-license mappings (unless managed via Terraform)
External identities: B2B guest users, external access configurations
Tenant settings: Some global tenant settings are not exposed via the Microsoft Graph API

For comprehensive Microsoft 365 backup including user data, consider third-party backup solutions (Veeam, AvePoint, Commvault) alongside your Terraform disaster recovery strategy.

State file backup strategies

The Terraform state file is your most critical asset for disaster recovery. Loss of state means loss of resource tracking and potential configuration drift.

Strategy 1: Remote state with versioning (Recommended)

Three backend options are shown: Azure Storage, Terraform Cloud, and S3 with versioning.

Azure Storage

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "tfstateproduction"
    container_name       = "tfstate"
    key                  = "m365.terraform.tfstate"
    
    # Enable versioning and soft delete in Azure Storage Account
  }
}

Enable versioning in Azure Storage:

# Enable blob versioning
az storage account blob-service-properties update \
  --account-name tfstateproduction \
  --enable-versioning true

# Enable soft delete (retain deleted blobs for 30 days)
az storage account blob-service-properties update \
  --account-name tfstateproduction \
  --enable-delete-retention true \
  --delete-retention-days 30
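
Restoring an earlier state later requires the version ID of the state blob. The following is a minimal sketch, assuming the account, container, and blob names used above and an already authenticated Azure CLI session. The helper name `list_state_versions` is hypothetical, and the `versionId` field in the query is an assumption about the CLI's JSON output that is worth checking against your CLI version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list stored versions of the state blob.
# Assumes blob versioning is enabled and the CLI is already authenticated.
list_state_versions() {
  local account="${1:-tfstateproduction}"
  local container="${2:-tfstate}"
  local blob="${3:-m365.terraform.tfstate}"

  # --include v asks the service to return blob versions as well
  az storage blob list \
    --account-name "$account" \
    --container-name "$container" \
    --prefix "$blob" \
    --include v \
    --query "[].{name:name, versionId:versionId}" \
    --output table
}
```

Calling `list_state_versions` with no arguments uses the names from the backend block; the version ID it reports is the value to pass to `--version-id` when downloading a specific version.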

Terraform Cloud

terraform {
  cloud {
    organization = "my-organization"
    
    workspaces {
      name = "m365-production"
    }
  }
}

Benefits:

Automatic state versioning (full history)
Point-in-time state recovery
Built-in state locking
Encrypted at rest
No manual backup configuration required

Recovery:

# List the resources tracked in the current state
# (this command does not list state versions)
terraform state list

# View and roll back state versions via the UI or API
# Terraform Cloud UI → Workspace → States → Select version → Restore

S3 with Versioning

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "m365/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    
    # Requires versioning enabled on S3 bucket
  }
}

Enable versioning on S3 bucket:

aws s3api put-bucket-versioning \
  --bucket my-terraform-state \
  --versioning-configuration Status=Enabled

# Enable lifecycle policy to retain versions
# Delete noncurrent versions after 90 days
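
The lifecycle comment above can be made concrete. This is a minimal sketch, assuming the bucket name from the backend block and a 90-day retention window; the function name `apply_state_lifecycle` and the rule ID are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expire noncurrent state versions after N days.
# Usage: apply_state_lifecycle [bucket] [days]
apply_state_lifecycle() {
  local bucket="${1:-my-terraform-state}"
  local days="${2:-90}"
  local policy
  policy="$(mktemp)"

  # The empty Filter makes the rule apply to every object in the bucket
  cat > "$policy" <<EOF
{
  "Rules": [
    {
      "ID": "expire-noncurrent-state-versions",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionExpiration": { "NoncurrentDays": ${days} }
    }
  ]
}
EOF

  aws s3api put-bucket-lifecycle-configuration \
    --bucket "$bucket" \
    --lifecycle-configuration "file://${policy}"
  local rc=$?
  rm -f "$policy"
  return "$rc"
}
```

Note that `put-bucket-lifecycle-configuration` replaces the bucket's whole lifecycle configuration, so any existing rules need to be included in the same document.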

Strategy 2: Automated state backups

Schedule regular state file backups to secondary storage:

backup-state.sh

#!/bin/bash
# Schedule via cron: 0 */6 * * * /path/to/backup-state.sh
# Run from the Terraform working directory so the backend can be found.
set -euo pipefail

BACKUP_DIR="/secure/terraform-backups/m365"
DATE=$(date +%Y%m%d-%H%M%S)
STATE_FILE="terraform-${DATE}.tfstate"
# Placeholder: set GPG_RECIPIENT to the key ID or email of your backup key
GPG_RECIPIENT="${GPG_RECIPIENT:?GPG_RECIPIENT must be set}"

mkdir -p "${BACKUP_DIR}"

# Pull current state
terraform state pull > "${BACKUP_DIR}/${STATE_FILE}"

# Compress and encrypt (the archive stores the bare file name, not the absolute path)
tar -czf - -C "${BACKUP_DIR}" "${STATE_FILE}" | \
  gpg --encrypt --recipient "${GPG_RECIPIENT}" > \
  "${BACKUP_DIR}/${STATE_FILE}.tar.gz.gpg"

# Remove unencrypted backup
rm "${BACKUP_DIR}/${STATE_FILE}"

# Retain last 30 days of backups
find "${BACKUP_DIR}" -name "terraform-*.tfstate.tar.gz.gpg" -mtime +30 -delete

echo "State backup completed: ${DATE}"

Best practice: Store state backups in a different geographic region and different cloud provider than your primary state backend for maximum resilience.
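
A backup is only useful if it can be restored. The following is a minimal sketch of the reverse path for the encrypted archives produced by the script above, assuming the matching GPG private key is available on the machine; the function name `restore_state_backup` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decrypt and unpack one encrypted state backup.
# Usage: restore_state_backup <archive.tfstate.tar.gz.gpg> <output-dir>
restore_state_backup() {
  local archive="$1"
  local out_dir="$2"

  mkdir -p "$out_dir" || return 1

  # Decrypt to stdout, then unpack the tar stream into the output directory
  gpg --decrypt "$archive" | tar -xzf - -C "$out_dir" || return 1

  # Print the path of each restored state file
  find "$out_dir" -type f -name '*.tfstate'
}
```

The printed path can be inspected before anything is pushed back to the backend with `terraform state push`.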

Configuration code backup

Your Terraform configuration files (.tf, plus any .tfvars files that contain no secrets) must be version controlled and backed up.

Version control best practices

Use Git with remote repositories

# Initialize Git repository
git init
git add *.tf *.md
git commit -m "Initial Microsoft 365 Terraform configuration"

# Push to remote (GitHub, GitLab, Azure DevOps)
git remote add origin https://github.com/org/m365-terraform.git
git push -u origin main

Implement branch protection

Require pull request reviews before merge
Enforce status checks (terraform validate, terraform plan)
Prevent force pushes to main/production branches
Require signed commits

Tag releases for rollback points

# Tag stable configurations
git tag -a v1.0.0 -m "Production release - 2024-03-01"
git push origin v1.0.0

# Rollback to previous version (checks out the tag in detached HEAD state)
git checkout v1.0.0
terraform plan   # Review what the rollback will change
terraform apply

Mirror repositories across providers

# Primary: GitHub
git remote add github https://github.com/org/m365-terraform.git

# Mirror: GitLab
git remote add gitlab https://gitlab.com/org/m365-terraform.git

# Mirror: Azure DevOps
git remote add azure https://dev.azure.com/org/project/_git/m365-terraform

# Push all branches to each remote (git push accepts one remote per invocation)
for remote in github gitlab azure; do
  git push --all "$remote"
done

.gitignore for security

.gitignore

# Terraform state files (use remote backend instead)
*.tfstate
*.tfstate.*
terraform.tfstate.backup

# Sensitive variable files
*.tfvars
!terraform.tfvars.example
secrets.auto.tfvars

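# Terraform directories
.terraform/

# Do not ignore .terraform.lock.hcl: commit it so every run
# (including a recovery) selects the same provider versions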

# Crash logs
crash.log
crash.*.log

# Override files
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# CLI configuration files
.terraformrc
terraform.rc

# Environment files
.env
.env.local

Recovery scenarios and procedures

Scenario 1: Accidental resource deletion

Symptom: A critical Conditional Access policy was accidentally deleted via the Azure portal

Identify deleted resource

# Run terraform plan to detect drift
terraform plan -parallelism=1

# Output shows resource missing from tenant:
# Plan: 1 to add, 0 to change, 0 to destroy.

Restore from Terraform

# Re-apply configuration to recreate resource
terraform apply -parallelism=1

# Terraform will recreate the deleted resource with same configuration

Verify restoration

# Confirm resource restored
terraform plan -parallelism=1
# Output: No changes. Your infrastructure matches the configuration.

Prevention: Enable Microsoft Entra ID (formerly Azure AD) audit logs and set up alerts for policy deletions.

Scenario 2: Corrupted Terraform state

Symptom: Terraform state file corrupted, showing resources that don’t exist or missing known resources

Stop all Terraform operations

Prevent concurrent operations from worsening corruption

Restore from backup

# For Terraform Cloud
# UI → Workspace → States → Select previous version → Restore

# For Azure Storage with versioning
az storage blob download \
  --account-name tfstateproduction \
  --container-name tfstate \
  --name m365.terraform.tfstate \
  --version-id <previous-version-id> \
  --file terraform.tfstate.backup

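# Push restored state
terraform state push terraform.tfstate.backup

# If the push is rejected because the backup's serial is lower than the
# remote state's, confirm the backup is the version you want, then re-run:
# terraform state push -force terraform.tfstate.backup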

Verify state integrity

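# Validate configuration syntax (this does not inspect the state file)
terraform validate

# Confirm the restored state is readable and lists the expected resources
terraform state list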

# Check plan
terraform plan -parallelism=1

# Ensure no unexpected changes
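
Before pushing a restored file, it can help to compare its `serial` and `lineage` with the current state, since `terraform state push` checks both. This is a minimal sketch, assuming both files are in the pretty-printed JSON layout written by `terraform state pull`; the function names are hypothetical, and `sed` is used only to avoid a dependency on a JSON tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a state file before pushing it.

# Print the first "serial" value found in a state file.
state_serial() {
  sed -n 's/^ *"serial": *\([0-9][0-9]*\),\{0,1\} *$/\1/p' "$1" | head -n 1
}

# Print the first "lineage" value found in a state file.
state_lineage() {
  sed -n 's/^ *"lineage": *"\([^"]*\)",\{0,1\} *$/\1/p' "$1" | head -n 1
}

# Report whether a backup can be pushed over the current state as-is.
# Usage: check_restore_candidate <current.tfstate> <backup.tfstate>
check_restore_candidate() {
  local current="$1"
  local backup="$2"

  if [ "$(state_lineage "$current")" != "$(state_lineage "$backup")" ]; then
    echo "lineage-mismatch"
  elif [ "$(state_serial "$backup")" -lt "$(state_serial "$current")" ]; then
    echo "older-serial"
  else
    echo "ok"
  fi
}
```

A result of `older-serial` is expected when restoring a previous version and means the push needs `-force`; `lineage-mismatch` suggests the backup belongs to a different state and deserves a closer look before it is pushed.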

Scenario 3: Complete tenant loss (catastrophic)

Symptom: Microsoft 365 tenant deleted, corrupted, or requires complete rebuild

Provision new tenant

Create new Microsoft 365 tenant
Configure basic tenant settings
Create service principal for Terraform

Update Terraform configuration

# Update tenant_id in variables
variable "tenant_id" {
  default = "<new-tenant-id>"
}

# Update provider configuration
provider "microsoft365" {
  tenant_id = var.tenant_id
  # ... rest of config
}

Initialize new state

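# Remove the local working directory (cached providers and backend settings).
# This does not delete remote state. Keep the old state backed up separately and
# point the backend at a new key or workspace so old resource IDs are not reused.
rm -rf .terraform/

# Initialize with new backend
terraform init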
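
The rebuilt tenant needs an empty state, because the old state records resource IDs that do not exist in the new tenant. One way to get there is a separate state key per tenant. This is a minimal sketch, assuming a backend that accepts a `key` setting (such as the Azure Storage or S3 backends shown earlier); the function name and the key naming scheme are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: initialise the working directory against a fresh
# state key named after the new tenant.
# Usage: init_for_new_tenant <new-tenant-id>
init_for_new_tenant() {
  local tenant_id="$1"

  if [ -z "$tenant_id" ]; then
    echo "usage: init_for_new_tenant <new-tenant-id>" >&2
    return 1
  fi

  # -reconfigure discards the previously stored backend settings;
  # -backend-config overrides the key from the backend block
  terraform init -reconfigure -input=false \
    -backend-config="key=m365-${tenant_id}.terraform.tfstate"
}
```

The old state stays untouched under its original key, which keeps it available for reference during the rebuild.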

Deploy from code

# Preview deployment
terraform plan -parallelism=1

# Deploy all resources to new tenant
terraform apply -parallelism=1 -auto-approve

# Expected duration: 30-90 minutes for typical deployment

Verify and test

Test conditional access policies
Verify group memberships
Validate Intune configurations
Test user sign-in flows

Recovery time objective (RTO): 2-4 hours for complete tenant rebuild from Terraform code

Scenario 4: Configuration drift detection and correction

Symptom: Resources modified outside Terraform (via Azure portal, PowerShell, Graph API)

# Detect drift
terraform plan -refresh-only -parallelism=1

# Review detected changes
# Terraform will show all configuration differences

# Option 1: Revert to Terraform-defined state
terraform apply -parallelism=1

# Option 2: Accept drift and update Terraform code
# Edit .tf files to match current state, then:
terraform plan -parallelism=1  # Should show no changes

# Option 3: Refresh state without applying changes
terraform apply -refresh-only -parallelism=1

Prevention: Implement continuous drift detection via scheduled CI/CD jobs.
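
For scheduled checks, the exit status of `terraform plan -detailed-exitcode` is easier to act on than the plan text: 0 means no changes, 1 means an error, and 2 means the plan contains changes. A minimal sketch follows; the function names and the labels they print are hypothetical.

```shell
#!/usr/bin/env bash
# Map the exit status of `terraform plan -detailed-exitcode` to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical wrapper: run a plan, keep its output, and print the label.
check_drift() {
  local rc=0
  terraform plan -detailed-exitcode -parallelism=1 -input=false \
    > plan_output.txt 2>&1 || rc=$?
  classify_plan_exit "$rc"
}
```

Exit status 2 also covers configuration changes that have not been applied yet, so a `drift` result is a prompt to read `plan_output.txt` rather than proof of an out-of-band change.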

Backup automation with CI/CD

GitHub Actions scheduled backup

.github/workflows/backup-state.yml

name: Backup Terraform State

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:  # Manual trigger

jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.14.0
          terraform_wrapper: false  # Keep wrapper output out of the redirected state file

      - name: Pull current state
        run: |
          # The backend must be initialized before state can be pulled
          terraform init -input=false
          terraform state pull > "terraform-$(date +%Y%m%d-%H%M%S).tfstate"
        env:
          ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          ARM_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
          ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      
      - name: Upload to artifact storage
        uses: actions/upload-artifact@v4
        with:
          name: terraform-state-backup-${{ github.run_number }}
          path: terraform-*.tfstate
          retention-days: 90
      
      - name: Upload to secondary storage
        run: |
          # Upload to S3, Azure Blob, etc.
          # Assumes the runner is already authenticated to the backup storage account
          STATE_FILE="$(ls terraform-*.tfstate | head -n 1)"
          az storage blob upload \
            --account-name backupstorage \
            --container-name terraform-backups \
            --name "m365/${STATE_FILE}" \
            --file "${STATE_FILE}"

Continuous drift detection

.github/workflows/drift-detection.yml

name: Detect Configuration Drift

on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM UTC
  workflow_dispatch:

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - uses: hashicorp/setup-terraform@v3
      
      - name: Terraform Init
        run: terraform init
      
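      - name: Detect Drift
        id: plan
        run: |
          # The default shell stops at the first failing command, so disable that
          # and record the exit code (0 = no changes, 1 = error, 2 = changes present)
          set +e
          terraform plan -detailed-exitcode -parallelism=1 -no-color > plan_output.txt
          echo "exit_code=$?" >> "$GITHUB_OUTPUT"
        continue-on-error: true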
      
      - name: Post drift alert
        if: steps.plan.outputs.exit_code == '2'
        run: |
          # Send alert to Slack, Teams, email, etc.
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text":"Configuration drift detected in M365 Terraform!"}'
      
      - name: Upload drift report
        if: steps.plan.outputs.exit_code == '2'
        uses: actions/upload-artifact@v4
        with:
          name: drift-report-${{ github.run_number }}
          path: plan_output.txt

Recovery testing

A disaster recovery plan is only as good as its last successful test. Schedule regular recovery drills.

Quarterly recovery drill checklist

State recovery test

Restore state from 7-day-old backup
Run terraform plan to verify consistency
Document time to restore: ___ minutes

Configuration rollback test

Checkout Git tag from 30 days ago
Deploy to test tenant
Verify all resources created successfully
Document deployment time: ___ minutes

Tenant rebuild simulation

Provision new test tenant
Deploy full Terraform configuration from scratch
Verify functionality of all components
Document total recovery time: ___ hours

Credential recovery test

Retrieve credentials from backup vault
Authenticate to test tenant
Run Terraform operations
Document credential recovery time: ___ minutes
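
The checklist asks for a recorded duration per drill. This is a minimal sketch of a timing wrapper that can be put in front of any drill command; the function name `time_drill_step` and the output format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one drill step and report how long it took.
# Usage: time_drill_step "<label>" <command> [args...]
time_drill_step() {
  local label="$1"
  shift
  local start
  local rc=0
  start="$(date +%s)"

  # Run the step; remember a failure instead of aborting
  "$@" || rc=$?

  local elapsed=$(( $(date +%s) - start ))
  printf '%s: %d min %d s (exit %d)\n' \
    "$label" $(( elapsed / 60 )) $(( elapsed % 60 )) "$rc"
  return "$rc"
}
```

For example, `time_drill_step "state restore" terraform state push restored.tfstate` prints one line with the label, elapsed time, and exit status that can be pasted into the drill record; `restored.tfstate` here is a placeholder file name.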

Best practices summary

The 3-2-1 backup rule

3 copies of state and configuration (production + 2 backups)
2 different storage types (cloud storage + Git repositories)
1 offsite/geographic copy (different region/cloud provider)

Automate everything

Scheduled state backups (every 6 hours minimum)
Continuous drift detection (daily)
Automated alerts for changes or drift
Backup verification tests (weekly)

Document recovery procedures

Maintain runbooks for each recovery scenario
Include step-by-step instructions
Document RTO/RPO for each scenario
Update after each recovery test

Secure credential storage

Never commit credentials to Git
Use separate secret management (Key Vault, Vault, etc.)
Rotate credentials regularly
Backup credentials to encrypted offline storage

Test recovery regularly

Quarterly full recovery drills
Monthly state restoration tests
Annual tenant rebuild simulations
Document results and improve procedures

Related resources

Terraform state - Official Terraform state documentation
Remote backends - Configuring remote state backends
Multi-tenant management - Managing multiple M365 tenants
Workspace design - Workspace architecture patterns
