Backup strategies, disaster recovery procedures, and cluster restoration for Talos Linux.
Disaster recovery planning ensures you can restore your Talos cluster after a catastrophic failure. This guide covers machine configuration backups, recovery from a single control plane node failure, full cluster restoration from an etcd snapshot, and testing your recovery procedure.
#!/bin/bash
# Backup all machine configurations
BACKUP_DIR="/backups/configs"
DATE=$(date +%Y%m%d)
mkdir -p "${BACKUP_DIR}/${DATE}"

# Backup control plane configs
for i in 1 2 3; do
  talosctl --nodes 10.0.0.$((i+1)) get machineconfig -o yaml > \
    "${BACKUP_DIR}/${DATE}/controlplane-${i}.yaml"
done

# Backup worker configs
for i in 1 2; do
  talosctl --nodes 10.0.0.$((i+4)) get machineconfig -o yaml > \
    "${BACKUP_DIR}/${DATE}/worker-${i}.yaml"
done

echo "Configuration backup complete: ${BACKUP_DIR}/${DATE}"
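Because the shell creates the redirect target before talosctl runs, a failed command in the backup script can leave an empty file behind. The following is a minimal sketch of a check you could run after the backup; `verify_config_backup` is a hypothetical helper name, not part of talosctl.

```shell
#!/bin/bash
# Hypothetical helper: fail if a backup directory holds no .yaml files
# or if any of them is empty.
verify_config_backup() {
  local dir="$1" file count=0
  for file in "${dir}"/*.yaml; do
    # An unmatched glob stays literal, so skip entries that do not exist.
    [ -e "${file}" ] || continue
    count=$((count + 1))
    if [ ! -s "${file}" ]; then
      echo "empty backup file: ${file}" >&2
      return 1
    fi
  done
  if [ "${count}" -eq 0 ]; then
    echo "no .yaml files found in ${dir}" >&2
    return 1
  fi
  echo "verified ${count} configuration file(s) in ${dir}"
}
```

A call such as `verify_config_backup "${BACKUP_DIR}/${DATE}"` at the end of the backup script would make the script fail visibly instead of reporting a complete backup with empty files.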
If one control plane node fails but others are healthy:
1. Verify cluster health
# Check remaining control plane nodes
talosctl --nodes 10.0.0.2,10.0.0.4 etcd members
kubectl get nodes
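This procedure only applies while etcd still has quorum, meaning a strict majority of members is healthy. The arithmetic can be sketched as follows; the function names are hypothetical and the member counts are supplied by you, not read from the cluster.

```shell
#!/bin/bash
# etcd needs a strict majority of members: floor(n / 2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds when the healthy member count still reaches quorum.
# Usage: has_quorum <total_members> <healthy_members>
has_quorum() {
  local total="$1" healthy="$2"
  [ "${healthy}" -ge "$(etcd_quorum "${total}")" ]
}
```

For the three-node control plane used in this guide, `has_quorum 3 2` succeeds, so a single failure can be repaired in place. `has_quorum 3 1` fails, which is the point where recovery from an etcd snapshot becomes necessary.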
2. Remove failed member
# Get member ID of failed node
talosctl --nodes 10.0.0.2 etcd members

# Remove the failed member
talosctl --nodes 10.0.0.2 etcd remove-member <member-id>
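If you script this step, the member ID has to be picked out of the `etcd members` table. The sketch below assumes the ID is the second whitespace-separated column and the hostname the third; that layout is an assumption, so check it against the output of your talosctl version. `member_id_for_host` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: read `talosctl etcd members` output on stdin and print
# the ID of the member whose hostname matches the first argument.
# ASSUMPTION: column 2 is the member ID and column 3 is the hostname.
member_id_for_host() {
  awk -v host="$1" 'NR > 1 && $3 == host { print $2 }'
}
```

It would be used as `talosctl --nodes 10.0.0.2 etcd members | member_id_for_host <hostname>`. An empty result means no member matched, and nothing should be removed.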
3. Replace the node
# Apply configuration to new node
talosctl apply-config --insecure \
  --nodes 10.0.0.3 \
  --file controlplane.yaml

# Wait for node to join
talosctl --nodes 10.0.0.3 health --wait-timeout 10m
If all control plane nodes are lost or etcd data is corrupted:
1. Prepare for recovery
Ensure you have:
- Recent etcd snapshot
- Machine configurations
- talosconfig file
ls -lh etcd-backup.db
ls -lh *.yaml
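The same prerequisites can be checked in a script before any node is touched. This is a minimal sketch: `preflight_check` is a hypothetical helper, and the file names passed to it are whatever paths you actually use.

```shell
#!/bin/bash
# Hypothetical pre-flight check: every file passed in must exist and be
# non-empty. All problems are reported before the function returns.
preflight_check() {
  local file missing=0
  for file in "$@"; do
    if [ -s "${file}" ]; then
      echo "ok: ${file}"
    else
      echo "missing or empty: ${file}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

For example, `preflight_check etcd-backup.db controlplane.yaml worker.yaml talosconfig` returns a non-zero status if any of the four files is absent, assuming those are the names your backups use.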
2. Deploy first control plane node
# Apply configuration to first control plane node
talosctl apply-config --insecure \
  --nodes 10.0.0.2 \
  --file controlplane.yaml
3. Bootstrap from snapshot
# Restore etcd from backup
talosctl --nodes 10.0.0.2 bootstrap --recover-from=etcd-backup.db
Example output:
recovering from snapshot "etcd-backup.db": hash 12ab34cd, revision 123456, total keys 5234, total size 25165824
4. Wait for cluster recovery
# Wait for Kubernetes API to be available
kubectl get nodes

# Verify cluster state
kubectl get pods -A
talosctl --nodes 10.0.0.2 etcd members
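A single `kubectl get nodes` fails immediately if the API server is not up yet, so in a script you may want to poll instead. The following is a generic retry sketch; `retry_until` is a hypothetical helper, and the attempt count and delay are values you would tune.

```shell
#!/bin/bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
retry_until() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "gave up after ${attempt} attempt(s): $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${delay}"
  done
}
```

For instance, `retry_until 60 10 kubectl get nodes` would try for roughly ten minutes before giving up.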
5. Add additional control plane nodes
# Add second control plane node
talosctl apply-config --insecure \
  --nodes 10.0.0.3 \
  --file controlplane.yaml

# Add third control plane node
talosctl apply-config --insecure \
  --nodes 10.0.0.4 \
  --file controlplane.yaml

# Verify all members joined
talosctl --nodes 10.0.0.2 etcd members
6. Re-add worker nodes
# Apply worker configurations
for i in 5 6; do
  talosctl apply-config --insecure \
    --nodes 10.0.0.$i \
    --file worker.yaml
done

# Verify cluster health
kubectl get nodes
# Use production backup to create test cluster
# This validates your backup and recovery procedures

# 1. Deploy test control plane
talosctl apply-config --insecure --nodes 10.1.0.2 --file test-controlplane.yaml

# 2. Restore from production backup
talosctl --nodes 10.1.0.2 bootstrap --recover-from=prod-etcd-backup.db

# 3. Verify cluster state
kubectl get nodes
kubectl get pods -A

# 4. Document any issues or timing
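To make the timing from a drill easy to record, each step can be wrapped in a small timer. This is one possible sketch; `timed_step` and the `DRILL_LOG` variable are hypothetical names, and the log format is arbitrary.

```shell
#!/bin/bash
# ASSUMPTION: DRILL_LOG is a variable name chosen for this example.
DRILL_LOG="${DRILL_LOG:-recovery-drill.log}"

# Hypothetical helper: run one drill step, then append its duration in
# seconds and its exit status to the log file.
# Usage: timed_step <step_name> <command> [args...]
timed_step() {
  local name="$1" start end status=0
  shift
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  echo "${name}: $((end - start))s, exit ${status}" >> "${DRILL_LOG}"
  return "${status}"
}
```

A drill step would then look like `timed_step "restore etcd" talosctl --nodes 10.1.0.2 bootstrap --recover-from=prod-etcd-backup.db`, leaving one line per step in the log for later review.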