etcd is the distributed key-value store that serves as the backing store for all Kubernetes cluster data. Proper etcd management is critical for cluster reliability and disaster recovery.
etcd Cluster Overview
In Talos, etcd runs on control plane nodes and is automatically configured and managed by the system. Each control plane node runs an etcd member that participates in the etcd cluster.
Viewing etcd Members
List all etcd cluster members:
talosctl --nodes 10.0.0.2 etcd members
Example output:
NODE       ID                 HOSTNAME         PEER URLS               CLIENT URLS             LEARNER
10.0.0.2   6457a4e8ecba5c61   controlplane-1   https://10.0.0.2:2380   https://10.0.0.2:2379   false
10.0.0.3   7d3c4c7e8f9a1b2c   controlplane-2   https://10.0.0.3:2380   https://10.0.0.3:2379   false
10.0.0.4   8e4d5d8f9a0b2c3d   controlplane-3   https://10.0.0.4:2380   https://10.0.0.4:2379   false
Checking etcd Status
View detailed status of etcd members:
talosctl --nodes 10.0.0.2,10.0.0.3,10.0.0.4 etcd status
Example output:
NODE       MEMBER             DB SIZE   IN USE           LEADER             RAFT INDEX   RAFT TERM   LEARNER
10.0.0.2   6457a4e8ecba5c61   25 MB     18 MB (72.00%)   6457a4e8ecba5c61   123456       12          false
10.0.0.3   7d3c4c7e8f9a1b2c   25 MB     18 MB (72.00%)   6457a4e8ecba5c61   123456       12          false
10.0.0.4   8e4d5d8f9a0b2c3d   25 MB     18 MB (72.00%)   6457a4e8ecba5c61   123456       12          false
The status shows:
- DB SIZE: Total database size
- IN USE: Actual data size (percentage used)
- LEADER: Current etcd leader member ID
- RAFT INDEX: Current Raft log index
- RAFT TERM: Current Raft term
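A quick health check that can be built on this output is confirming that every member agrees on the same leader. The sketch below runs against sample lines captured from `etcd status` (the field positions assume the column layout shown above); in practice you would pipe the live command output through the same awk program:

```shell
# Three data lines captured from `talosctl etcd status` (sample values from this guide)
status_output='10.0.0.2   6457a4e8ecba5c61   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false
10.0.0.3   7d3c4c7e8f9a1b2c   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false
10.0.0.4   8e4d5d8f9a0b2c3d   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false'

# Field 8 is the LEADER column; a healthy cluster reports exactly one distinct value
leaders=$(printf '%s\n' "$status_output" | awk '{print $8}' | sort -u | wc -l)
echo "distinct leaders: ${leaders}"
```

A result other than 1 indicates members disagree about leadership, which usually points to a network partition.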
etcd Backup
Regular etcd backups are essential for disaster recovery. Talos provides a built-in snapshot mechanism.
Creating a Snapshot
Create an etcd snapshot and save it locally:
talosctl --nodes 10.0.0.2 etcd snapshot db.snapshot
Example output:
etcd snapshot saved to "db.snapshot" (25165824 bytes)
snapshot info: hash 12ab34cd, revision 123456, total keys 5234, total size 25165824
Take snapshots from a single etcd member. All members have the same data, so there’s no need to snapshot all nodes.
Automated Backup Script
Create a script to automate regular backups:
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
FILENAME="etcd-snapshot-${DATE}.db"

mkdir -p "${BACKUP_DIR}"

if talosctl --nodes 10.0.0.2 etcd snapshot "${BACKUP_DIR}/${FILENAME}"; then
  echo "Backup successful: ${FILENAME}"
  # Keep only the last 7 days of backups
  find "${BACKUP_DIR}" -name "etcd-snapshot-*.db" -mtime +7 -delete
else
  echo "Backup failed!" >&2
  exit 1
fi
Run this script via cron for automated backups:
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/etcd-backup.sh
Verifying Snapshot Integrity
The snapshot command automatically verifies the checksum. The output includes the hash which can be used to verify integrity later.
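For snapshots stored long-term, it can also be worth detecting corruption at rest. One simple approach, using standard tooling rather than anything Talos-specific, is to record a SHA-256 checksum next to the file at backup time and verify it before a restore. In this sketch, `db.snapshot` is a stand-in file; in practice it would be the file produced by `talosctl etcd snapshot`:

```shell
workdir=$(mktemp -d)
# db.snapshot here stands in for the file produced by `talosctl etcd snapshot`
echo "example snapshot bytes" > "${workdir}/db.snapshot"

# Record a checksum alongside the snapshot at backup time
( cd "${workdir}" && sha256sum db.snapshot > db.snapshot.sha256 )

# Before restoring, confirm the file still matches the recorded checksum
( cd "${workdir}" && sha256sum --check --quiet db.snapshot.sha256 ) && echo "snapshot intact"
```

Keeping the `.sha256` file with the off-cluster copy means the verification can run anywhere the backup lands.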
etcd Restore
Restore etcd from a snapshot during disaster recovery.
Restoring etcd will replace all current cluster data. Only perform this operation if you need to recover from a catastrophic failure.
Bootstrap from Snapshot
Restore etcd cluster from a snapshot:
Prepare the snapshot
Ensure you have a valid etcd snapshot file.
Bootstrap with recovery
Bootstrap the first control plane node with the snapshot:
talosctl --nodes 10.0.0.2 bootstrap --recover-from=db.snapshot
Example output:
recovering from snapshot "db.snapshot": hash 12ab34cd, revision 123456, total keys 5234, total size 25165824
Wait for cluster recovery
Monitor the recovery process:
talosctl --nodes 10.0.0.2 etcd members
kubectl get nodes
Other control plane nodes will automatically rejoin the recovered etcd cluster.
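Recovery can take a while, so rather than re-running the commands by hand, a small polling loop is convenient. The `retry` helper below is a hypothetical convenience function, not part of talosctl or kubectl:

```shell
# retry: run a command until it succeeds or the attempt budget is exhausted
# (hypothetical helper, not part of talosctl)
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: poll every 5 seconds for up to ~5 minutes until the API server responds
# retry 60 5 kubectl get nodes
```

The same helper works for waiting on `talosctl etcd members` to list all expected members again.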
Skip Hash Check (Advanced)
If recovering from a data directory copy instead of a proper snapshot:
talosctl --nodes 10.0.0.2 bootstrap \
--recover-from=db.snapshot \
--recover-skip-hash-check
Only use --recover-skip-hash-check when recovering from a data directory backup. This skips integrity verification.
etcd Maintenance
Defragmentation
etcd databases can become fragmented over time, wasting disk space. Defragmentation compacts the database.
talosctl --nodes 10.0.0.2 etcd defrag
Defragmentation is a resource-intensive operation. Only defragment one node at a time and during low-traffic periods.
Best practices for defragmentation:
- Check current database usage:
talosctl --nodes 10.0.0.2 etcd status
- Defragment if the IN USE percentage is low (e.g., below 60%):
# Defragment each node sequentially
talosctl --nodes 10.0.0.2 etcd defrag
sleep 60
talosctl --nodes 10.0.0.3 etcd defrag
sleep 60
talosctl --nodes 10.0.0.4 etcd defrag
- Verify database size reduction:
talosctl --nodes 10.0.0.2,10.0.0.3,10.0.0.4 etcd status
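The "defragment when IN USE is low" rule lends itself to automation. This sketch parses one status line (sample values from this guide; field 7 is the IN USE percentage in the column layout shown earlier) and reports whether a defrag would be worthwhile, using 60% as the threshold:

```shell
# One data line captured from `talosctl etcd status` (sample from this guide)
status_line='10.0.0.2   6457a4e8ecba5c61   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false'

# Field 7 is "(72.00%)"; strip the parentheses and percent sign, truncate to an integer
in_use=$(printf '%s\n' "$status_line" | awk '{gsub(/[()%]/, "", $7); printf "%d\n", $7}')

if [ "$in_use" -lt 60 ]; then
  echo "defrag recommended (IN USE ${in_use}%)"
else
  echo "no defrag needed (IN USE ${in_use}%)"
fi
```

Wired into a cron job, this check would only trigger the sequential defrag commands above when fragmentation actually warrants it.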
Managing etcd Alarms
etcd can trigger alarms when issues occur (e.g., storage quota exceeded).
List active alarms:
talosctl --nodes 10.0.0.2 etcd alarm list
Example output:
NODE MEMBER ALARM
10.0.0.2 6457a4e8ecba5c61 NOSPACE
Disarm alarms (after resolving the underlying issue):
talosctl --nodes 10.0.0.2 etcd alarm disarm
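For monitoring jobs, the alarm check can be scripted by scanning the command output for alarm names. The sketch below runs against captured sample output (the text shown above); a real job would pipe `talosctl etcd alarm list` directly into the same grep:

```shell
# Sample output captured from `talosctl etcd alarm list` (from this guide's example)
alarm_output='NODE       MEMBER             ALARM
10.0.0.2   6457a4e8ecba5c61   NOSPACE'

# Flag an active NOSPACE alarm so the job can alert before writes start failing
if printf '%s\n' "$alarm_output" | grep -q 'NOSPACE'; then
  echo "NOSPACE alarm active - defragment and disarm"
fi
```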
Handling NOSPACE Alarm
If etcd runs out of space:
Identify the issue
talosctl --nodes 10.0.0.2 etcd status
talosctl --nodes 10.0.0.2 etcd alarm list
Free up space
Defragment the database:
talosctl --nodes 10.0.0.2 etcd defrag
Disarm the alarm
talosctl --nodes 10.0.0.2 etcd alarm disarm
Verify recovery
talosctl --nodes 10.0.0.2 etcd status
etcd Cluster Operations
Removing a Member
Gracefully remove a control plane node from etcd:
talosctl --nodes 10.0.0.3 etcd leave
This command:
- Removes the node from etcd cluster
- Notifies other members
- Shuts down the local etcd instance
Force Remove a Member
If a node is completely unreachable, force remove it:
# Get member ID from etcd members list
talosctl --nodes 10.0.0.2 etcd members
# Remove by member ID
talosctl --nodes 10.0.0.2 etcd remove-member 7d3c4c7e8f9a1b2c
Only use remove-member when the node is permanently lost or cannot run etcd leave itself. Always prefer etcd leave over force removal.
Forfeit Leadership
Force an etcd member to give up leadership:
talosctl --nodes 10.0.0.2 etcd forfeit-leadership
Useful when:
- Performing maintenance on the leader node
- Testing leader election
- Troubleshooting leader-specific issues
Monitoring etcd Health
Regular health monitoring helps prevent issues:
Health Check Script
#!/bin/bash
echo "=== etcd Member Status ==="
talosctl --nodes 10.0.0.2,10.0.0.3,10.0.0.4 etcd status
echo ""
echo "=== etcd Alarms ==="
talosctl --nodes 10.0.0.2 etcd alarm list
echo ""
echo "=== etcd Members ==="
talosctl --nodes 10.0.0.2 etcd members
# Check that all expected members are registered (assumes a 3-node control plane)
MEMBER_COUNT=$(talosctl --nodes 10.0.0.2 etcd members | tail -n +2 | wc -l)
if [ "${MEMBER_COUNT}" -ne 3 ]; then
  echo "WARNING: expected 3 etcd members, found ${MEMBER_COUNT}!"
  exit 1
fi
echo "etcd cluster is healthy"
Key Metrics to Monitor
- DB Size: Should grow predictably; sudden increases indicate issues
- IN USE %: Values < 50% suggest fragmentation; consider defragmentation
- Leader stability: Frequent leader changes indicate network or performance issues
- Raft index lag: Members should have similar Raft indices; large differences indicate sync issues
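The Raft index lag mentioned above can be computed straight from the status output. This sketch uses sample lines based on this guide's example, with one member's index lowered here to show a nonzero lag (field 9 is RAFT INDEX in the column layout shown earlier):

```shell
# Sample `talosctl etcd status` lines; 10.0.0.3's index is altered to demonstrate lag
status_output='10.0.0.2   6457a4e8ecba5c61   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false
10.0.0.3   7d3c4c7e8f9a1b2c   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123450   12   false
10.0.0.4   8e4d5d8f9a0b2c3d   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false'

# Field 9 is RAFT INDEX; the gap between highest and lowest index is the lag
lag=$(printf '%s\n' "$status_output" | awk '
  NR == 1 { min = max = $9 }
  { if ($9 < min) min = $9; if ($9 > max) max = $9 }
  END { print max - min }')
echo "raft index lag: ${lag}"
```

Small transient gaps are normal; a lag that grows over successive checks indicates a member failing to keep up.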
Database Size Management
Keep etcd database size under control:
- Regular defragmentation: Run monthly or when IN USE < 60%
- Compact history: Kubernetes automatically compacts old revisions
- Limit object sizes: Avoid storing large objects in Kubernetes
Monitoring Best Practices
- Take daily snapshots and store them off-cluster
- Monitor database size trends
- Set up alerts for:
- etcd alarms
- Database size > 8GB (consider resizing)
- Defragmentation needed (IN USE < 50%)
- Leader election changes
Troubleshooting etcd
Split Brain Scenario
If etcd members can’t reach consensus:
- Check network connectivity between control plane nodes
- Verify etcd member status
- Check for conflicting member IDs
- Restore from backup if necessary
Member Not Syncing
If a member falls behind:
# Check if member is learner or having issues
talosctl --nodes 10.0.0.3 etcd status
# Check logs for errors
talosctl --nodes 10.0.0.3 logs etcd
# If needed, remove and re-add the member
talosctl --nodes 10.0.0.3 etcd leave
# Re-apply the machine configuration
talosctl apply-config --nodes 10.0.0.3 --file controlplane.yaml