etcd is the distributed key-value store that serves as the backing store for all Kubernetes cluster data. Proper etcd management is critical for cluster reliability and disaster recovery.
etcd Cluster Overview
In Talos, etcd runs on control plane nodes and is automatically configured and managed by the system. Each control plane node runs an etcd member that participates in the etcd cluster.
Viewing etcd Members
List all etcd cluster members:
talosctl --nodes 10.0.0.2 etcd members
Example output:
NODE       ID                 HOSTNAME         PEER URLS               CLIENT URLS             LEARNER
10.0.0.2   6457a4e8ecba5c61   controlplane-1   https://10.0.0.2:2380   https://10.0.0.2:2379   false
10.0.0.3   7d3c4c7e8f9a1b2c   controlplane-2   https://10.0.0.3:2380   https://10.0.0.3:2379   false
10.0.0.4   8e4d5d8f9a0b2c3d   controlplane-3   https://10.0.0.4:2380   https://10.0.0.4:2379   false
Checking etcd Status
View detailed status of etcd members:
talosctl --nodes 10.0.0.2,10.0.0.3,10.0.0.4 etcd status
Example output:
NODE       MEMBER             DB SIZE   IN USE           LEADER             RAFT INDEX   RAFT TERM   LEARNER
10.0.0.2   6457a4e8ecba5c61   25 MB     18 MB (72.00%)   6457a4e8ecba5c61   123456       12          false
10.0.0.3   7d3c4c7e8f9a1b2c   25 MB     18 MB (72.00%)   6457a4e8ecba5c61   123456       12          false
10.0.0.4   8e4d5d8f9a0b2c3d   25 MB     18 MB (72.00%)   6457a4e8ecba5c61   123456       12          false
The status shows:
- DB SIZE: Total database size
- IN USE: Actual data size (percentage used)
- LEADER: Current etcd leader member ID
- RAFT INDEX: Current Raft log index
- RAFT TERM: Current Raft term
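A quick health check that can be built on this output is confirming that every member agrees on the same leader. The sketch below runs against sample lines captured from `etcd status` (the field positions assume the column layout shown above); in practice you would pipe the live command output through the same awk program:

```shell
# Three data lines captured from `talosctl etcd status` (sample values from this guide)
status_output='10.0.0.2   6457a4e8ecba5c61   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false
10.0.0.3   7d3c4c7e8f9a1b2c   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false
10.0.0.4   8e4d5d8f9a0b2c3d   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false'

# Field 8 is the LEADER column; a healthy cluster reports exactly one distinct value
leaders=$(printf '%s\n' "$status_output" | awk '{print $8}' | sort -u | wc -l)
echo "distinct leaders: ${leaders}"
```

A result other than 1 indicates members disagree about leadership, which usually points to a network partition.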
etcd Backup
Regular etcd backups are essential for disaster recovery. Talos provides a built-in snapshot mechanism.
Creating a Snapshot
Create an etcd snapshot and save it locally:
talosctl --nodes 10.0.0.2 etcd snapshot db.snapshot
Example output:
etcd snapshot saved to "db.snapshot" (25165824 bytes)
snapshot info: hash 12ab34cd, revision 123456, total keys 5234, total size 25165824
Take snapshots from a single etcd member. All members have the same data, so there’s no need to snapshot all nodes.
Automated Backup Script
Create a script to automate regular backups:
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
FILENAME="etcd-snapshot-${DATE}.db"

mkdir -p "${BACKUP_DIR}"

if talosctl --nodes 10.0.0.2 etcd snapshot "${BACKUP_DIR}/${FILENAME}"; then
  echo "Backup successful: ${FILENAME}"
  # Keep only the last 7 days of backups
  find "${BACKUP_DIR}" -name "etcd-snapshot-*.db" -mtime +7 -delete
else
  echo "Backup failed!" >&2
  exit 1
fi
Run this script via cron for automated backups:
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/etcd-backup.sh
Verifying Snapshot Integrity
The snapshot command automatically verifies the checksum. The output includes the hash which can be used to verify integrity later.
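For snapshots stored long-term, it can also be worth detecting corruption at rest. One simple approach, using standard tooling rather than anything Talos-specific, is to record a SHA-256 checksum next to the file at backup time and verify it before a restore. In this sketch, `db.snapshot` is a stand-in file; in practice it would be the file produced by `talosctl etcd snapshot`:

```shell
workdir=$(mktemp -d)
# db.snapshot here stands in for the file produced by `talosctl etcd snapshot`
echo "example snapshot bytes" > "${workdir}/db.snapshot"

# Record a checksum alongside the snapshot at backup time
( cd "${workdir}" && sha256sum db.snapshot > db.snapshot.sha256 )

# Before restoring, confirm the file still matches the recorded checksum
( cd "${workdir}" && sha256sum --check --quiet db.snapshot.sha256 ) && echo "snapshot intact"
```

Keeping the `.sha256` file with the off-cluster copy means the verification can run anywhere the backup lands.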
etcd Restore
Restore etcd from a snapshot during disaster recovery.
Restoring etcd will replace all current cluster data. Only perform this operation if you need to recover from a catastrophic failure.
Bootstrap from Snapshot
Restore etcd cluster from a snapshot:
Prepare the snapshot
Ensure you have a valid etcd snapshot file.
Bootstrap with recovery
Bootstrap the first control plane node with the snapshot:
talosctl --nodes 10.0.0.2 bootstrap --recover-from=db.snapshot
Example output:
recovering from snapshot "db.snapshot": hash 12ab34cd, revision 123456, total keys 5234, total size 25165824
Wait for cluster recovery
Monitor the recovery process:
talosctl --nodes 10.0.0.2 etcd members
kubectl get nodes
Other control plane nodes will automatically rejoin the recovered etcd cluster.
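Recovery can take a while, so rather than re-running the commands by hand, a small polling loop is convenient. The `retry` helper below is a hypothetical convenience function, not part of talosctl or kubectl:

```shell
# retry: run a command until it succeeds or the attempt budget is exhausted
# (hypothetical helper, not part of talosctl)
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: poll every 5 seconds for up to ~5 minutes until the API server responds
# retry 60 5 kubectl get nodes
```

The same helper works for waiting on `talosctl etcd members` to list all expected members again.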
Skip Hash Check (Advanced)
If recovering from a data directory copy instead of a proper snapshot:
talosctl --nodes 10.0.0.2 bootstrap \
--recover-from=db.snapshot \
--recover-skip-hash-check
Only use --recover-skip-hash-check when recovering from a data directory backup. This skips integrity verification.
etcd Maintenance
Defragmentation
etcd databases can become fragmented over time, wasting disk space. Defragmentation compacts the database.
talosctl --nodes 10.0.0.2 etcd defrag
Defragmentation is a resource-intensive operation. Only defragment one node at a time and during low-traffic periods.
Best practices for defragmentation:
- Check current database usage:
talosctl --nodes 10.0.0.2 etcd status
- Defragment if the IN USE percentage is low (e.g., below 60%):
# Defragment each node sequentially
talosctl --nodes 10.0.0.2 etcd defrag
sleep 60
talosctl --nodes 10.0.0.3 etcd defrag
sleep 60
talosctl --nodes 10.0.0.4 etcd defrag
- Verify database size reduction:
talosctl --nodes 10.0.0.2,10.0.0.3,10.0.0.4 etcd status
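The "defragment when IN USE is low" rule lends itself to automation. This sketch parses one status line (sample values from this guide; field 7 is the IN USE percentage in the column layout shown earlier) and reports whether a defrag would be worthwhile, using 60% as the threshold:

```shell
# One data line captured from `talosctl etcd status` (sample from this guide)
status_line='10.0.0.2   6457a4e8ecba5c61   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false'

# Field 7 is "(72.00%)"; strip the parentheses and percent sign, truncate to an integer
in_use=$(printf '%s\n' "$status_line" | awk '{gsub(/[()%]/, "", $7); printf "%d\n", $7}')

if [ "$in_use" -lt 60 ]; then
  echo "defrag recommended (IN USE ${in_use}%)"
else
  echo "no defrag needed (IN USE ${in_use}%)"
fi
```

Wired into a cron job, this check would only trigger the sequential defrag commands above when fragmentation actually warrants it.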
Managing etcd Alarms
etcd can trigger alarms when issues occur (e.g., storage quota exceeded).
List active alarms:
talosctl --nodes 10.0.0.2 etcd alarm list
Example output:
NODE MEMBER ALARM
10.0.0.2 6457a4e8ecba5c61 NOSPACE
Disarm alarms (after resolving the underlying issue):
talosctl --nodes 10.0.0.2 etcd alarm disarm
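For monitoring jobs, the alarm check can be scripted by scanning the command output for alarm names. The sketch below runs against captured sample output (the text shown above); a real job would pipe `talosctl etcd alarm list` directly into the same grep:

```shell
# Sample output captured from `talosctl etcd alarm list` (from this guide's example)
alarm_output='NODE       MEMBER             ALARM
10.0.0.2   6457a4e8ecba5c61   NOSPACE'

# Flag an active NOSPACE alarm so the job can alert before writes start failing
if printf '%s\n' "$alarm_output" | grep -q 'NOSPACE'; then
  echo "NOSPACE alarm active - defragment and disarm"
fi
```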
Handling NOSPACE Alarm
If etcd runs out of space:
Identify the issue
talosctl --nodes 10.0.0.2 etcd status
talosctl --nodes 10.0.0.2 etcd alarm list
Free up space
Defragment the database:
talosctl --nodes 10.0.0.2 etcd defrag
Disarm the alarm
talosctl --nodes 10.0.0.2 etcd alarm disarm
Verify recovery
talosctl --nodes 10.0.0.2 etcd status
etcd Cluster Operations
Removing a Member
Gracefully remove a control plane node from etcd:
talosctl --nodes 10.0.0.3 etcd leave
This command:
- Removes the node from etcd cluster
- Notifies other members
- Shuts down the local etcd instance
Force Remove a Member
If a node is completely unreachable, force remove it:
# Get member ID from etcd members list
talosctl --nodes 10.0.0.2 etcd members
# Remove by member ID
talosctl --nodes 10.0.0.2 etcd remove-member 7d3c4c7e8f9a1b2c
Only use remove-member when the node is permanently lost or cannot run etcd leave itself. Always prefer etcd leave over force removal.
Forfeit Leadership
Force an etcd member to give up leadership:
talosctl --nodes 10.0.0.2 etcd forfeit-leadership
Useful when:
- Performing maintenance on the leader node
- Testing leader election
- Troubleshooting leader-specific issues
Monitoring etcd Health
Regular health monitoring helps prevent issues:
Health Check Script
#!/bin/bash
echo "=== etcd Member Status ==="
talosctl --nodes 10.0.0.2,10.0.0.3,10.0.0.4 etcd status
echo ""
echo "=== etcd Alarms ==="
talosctl --nodes 10.0.0.2 etcd alarm list
echo ""
echo "=== etcd Members ==="
talosctl --nodes 10.0.0.2 etcd members
# Check that all expected members are registered (assumes a 3-node control plane)
MEMBER_COUNT=$(talosctl --nodes 10.0.0.2 etcd members | tail -n +2 | wc -l)
if [ "${MEMBER_COUNT}" -ne 3 ]; then
  echo "WARNING: expected 3 etcd members, found ${MEMBER_COUNT}!"
  exit 1
fi
echo "etcd cluster is healthy"
Key Metrics to Monitor
- DB Size: Should grow predictably; sudden increases indicate issues
- IN USE %: Values < 50% suggest fragmentation; consider defragmentation
- Leader stability: Frequent leader changes indicate network or performance issues
- Raft index lag: Members should have similar Raft indices; large differences indicate sync issues
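The Raft index lag mentioned above can be computed straight from the status output. This sketch uses sample lines based on this guide's example, with one member's index lowered here to show a nonzero lag (field 9 is RAFT INDEX in the column layout shown earlier):

```shell
# Sample `talosctl etcd status` lines; 10.0.0.3's index is altered to demonstrate lag
status_output='10.0.0.2   6457a4e8ecba5c61   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false
10.0.0.3   7d3c4c7e8f9a1b2c   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123450   12   false
10.0.0.4   8e4d5d8f9a0b2c3d   25 MB   18 MB (72.00%)   6457a4e8ecba5c61   123456   12   false'

# Field 9 is RAFT INDEX; the gap between highest and lowest index is the lag
lag=$(printf '%s\n' "$status_output" | awk '
  NR == 1 { min = max = $9 }
  { if ($9 < min) min = $9; if ($9 > max) max = $9 }
  END { print max - min }')
echo "raft index lag: ${lag}"
```

Small transient gaps are normal; a lag that grows over successive checks indicates a member failing to keep up.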
Database Size Management
Keep etcd database size under control:
- Regular defragmentation: Run monthly or when IN USE < 60%
- Compact history: Kubernetes automatically compacts old revisions
- Limit object sizes: Avoid storing large objects in Kubernetes
Monitoring Best Practices
- Take daily snapshots and store them off-cluster
- Monitor database size trends
- Set up alerts for:
- etcd alarms
- Database size > 8GB (consider resizing)
- Defragmentation needed (IN USE < 50%)
- Leader election changes
Troubleshooting etcd
Split Brain Scenario
If etcd members can’t reach consensus:
- Check network connectivity between control plane nodes
- Verify etcd member status
- Check for conflicting member IDs
- Restore from backup if necessary
Member Not Syncing
If a member falls behind:
# Check if member is learner or having issues
talosctl --nodes 10.0.0.3 etcd status
# Check logs for errors
talosctl --nodes 10.0.0.3 logs etcd
# If needed, remove and re-add the member
talosctl --nodes 10.0.0.3 etcd leave
# Re-apply the machine configuration
talosctl apply-config --nodes 10.0.0.3 --file controlplane.yaml