Overview
Fluxer implements automated, encrypted backups for all critical data stores, ensuring data durability and enabling disaster recovery. The backup strategy includes:
Automated snapshots - Hourly Cassandra snapshots with retention policies
Encryption - Age public-key encryption for backup security
Off-site storage - Backblaze B2 object storage for geographic redundancy
Point-in-time recovery - Restore to any hourly snapshot within 7 days
Backups are only useful if you can restore them. Test your restore procedures regularly!
Backup Architecture
Data Stores
Cassandra
What's backed up:
All keyspaces and tables (except system tables)
Schema definitions (CQL)
Cluster topology metadata
Backup method: Snapshot-based (nodetool snapshot)
Frequency: Hourly
Retention: 168 backups (7 days)

Valkey/Redis
What's backed up:
RDB snapshots
AOF (Append-Only File) if enabled
Backup method: Copy RDB file
Frequency: Configurable (default: every 60 seconds if 1+ key changed)
Retention: Local volume only (ephemeral cache)

NATS JetStream
What's backed up:
Stream data and metadata
Consumer state
Backup method: Volume snapshots
Frequency: Daily
Retention: 7 days

Configuration
What's backed up:
Docker configs and secrets
Service configuration files
TLS certificates
Backup method: Git repository + encrypted archive
Frequency: On change
Retention: Indefinite (Git history)
Cassandra Backup System
Automated Backup Service
The cassandra-backup container runs hourly backups:
Create Snapshot
nodetool snapshot -t backup-20260304-103000
Creates immutable point-in-time snapshots of all SSTables.
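The timestamped tag can be generated with standard tooling. This is a sketch of the naming scheme inferred from the examples, not the exact command from backup.sh:

```shell
# Generate a snapshot tag matching the backup-YYYYMMDD-HHMMSS scheme
# shown above, using UTC so tags sort chronologically.
SNAPSHOT_TAG="backup-$(date -u +%Y%m%d-%H%M%S)"
echo "$SNAPSHOT_TAG"   # e.g. backup-20260304-103000
# Then, on the Cassandra node:
# nodetool snapshot -t "$SNAPSHOT_TAG"
```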
Collect Snapshot Files
# Find all snapshot directories
find /var/lib/cassandra/data -type d -name "backup-20260304-103000"
# Copy to temporary directory
cp -r snapshots/* /tmp/cassandra-backup-20260304-103000/
Export Schema
cqlsh -e "DESC SCHEMA;" > /tmp/cassandra-backup-20260304-103000/schema.cql
Saves table definitions, indexes, and materialized views.
Save Cluster Metadata
nodetool describecluster > cluster_topology.txt
nodetool status > cluster_status.txt
Compress and Encrypt
tar -cf - cassandra-backup-20260304-103000 | \
age -r age1xxxxxx... -o backup.tar.age
Uses age for public-key encryption.
Upload to B2
aws s3 cp backup.tar.age \
s3://fluxer-cassandra-backups/cassandra-backup-20260304-103000.tar.age \
--endpoint-url=https://s3.us-west-002.backblazeb2.com
Cleanup
# Remove local encrypted backup
rm -f backup.tar.age
# Clear snapshot from Cassandra
nodetool clearsnapshot -t backup-20260304-103000
# Purge old backups (keep 168)
# Deletes backups older than 7 days from B2
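The purge step can be sketched as follows. Backup names embed a UTC timestamp, so they sort lexicographically in chronological order; `select_purgeable` (a hypothetical helper, assuming GNU head) prints everything except the newest MAX_BACKUP_COUNT entries. The commented aws wiring mirrors the upload step and is an assumption about the real script:

```shell
# Keep the newest MAX_BACKUP_COUNT backups, select the rest for deletion.
MAX_BACKUP_COUNT="${MAX_BACKUP_COUNT:-168}"

select_purgeable() {
  # stdin: one backup name per line; stdout: all but the newest N
  # (head -n -N is a GNU coreutils extension)
  sort | head -n "-${MAX_BACKUP_COUNT}"
}

# aws s3 ls s3://fluxer-cassandra-backups/ --endpoint-url="$B2_ENDPOINT_URL" \
#   | awk '{print $4}' | select_purgeable | while read -r name; do
#       aws s3 rm "s3://fluxer-cassandra-backups/$name" --endpoint-url="$B2_ENDPOINT_URL"
#     done
```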
Backup Script
The backup process is automated via fluxer_devops/cassandra/backup.sh:
Manual trigger:
# Run backup manually
docker exec cassandra-backup /backup.sh
View backup logs:
docker logs -f cassandra-backup
List backups in B2:
aws s3 ls s3://fluxer-cassandra-backups/ --endpoint-url=https://s3.us-west-002.backblazeb2.com
Environment Configuration
Configure backup settings in .env:
# Age encryption (REQUIRED)
AGE_PUBLIC_KEY=age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AGE_PUBLIC_KEY_FILE=/tmp/age_public_key.txt
# Backblaze B2 (REQUIRED for off-site backups)
B2_KEY_ID=your-b2-key-id
B2_APPLICATION_KEY=your-b2-application-key
B2_BUCKET_NAME=fluxer-cassandra-backups
B2_ENDPOINT=s3.us-west-002.backblazeb2.com
B2_REGION=us-west-002
# Cassandra connection
CASSANDRA_HOST=cassandra
CASSANDRA_PASSWORD=your-cassandra-password
# Retention (optional, default: 168 = 7 days)
MAX_BACKUP_COUNT=168
Without age encryption configured, backups are stored locally but NOT uploaded to B2. This is useful for development but not recommended for production.
Encryption Setup
Generate Age Key Pair
Install age
# Debian/Ubuntu
apt install age
# macOS
brew install age
# Arch Linux
pacman -S age
Generate Keypair
age-keygen -o age_private_key.txt
Output:
# created: 2026-03-04T10:30:00Z
# public key: age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AGE-SECRET-KEY-1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
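The public key line can be pulled out of the key file mechanically for pasting into .env. `age_public_key` is a hypothetical convenience helper (not part of the age toolchain), relying on the file layout shown above:

```shell
# Print the public key embedded in an age-keygen key file.
age_public_key() {
  # The "# public key: " comment line is written by age-keygen.
  sed -n 's/^# public key: //p' "$1"
}

# Usage: age_public_key age_private_key.txt
```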
Store Private Key Securely
The private key is the ONLY way to decrypt backups. Losing it means permanent data loss!
Store in multiple secure locations:
Password manager (1Password, Bitwarden, etc.)
Hardware security key (YubiKey with age-plugin-yubikey)
Encrypted USB drive in safe deposit box
Printed paper backup in fireproof safe
Configure Public Key
Add to .env:
AGE_PUBLIC_KEY=age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Commit to Git (public keys are safe to store in version control).
Test Encryption
Verify encryption/decryption works:
# Create test file
echo "test data" > test.txt
# Encrypt with public key
age -r age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
-o test.txt.age test.txt
# Decrypt with private key
age -d -i age_private_key.txt test.txt.age
# Should output: test data
Backblaze B2 Setup
Create Bucket
Bucket Name: fluxer-cassandra-backups
Files in Bucket: Private
Encryption: Disabled (backups are already encrypted with age)
Object Lock: Disabled
Lifecycle Settings: None (handled by the backup script)
Create Application Key
Navigate to App Keys → Add a New Application Key:
Name: fluxer-cassandra-backup
Access: Read and Write
Bucket: fluxer-cassandra-backups
File name prefix: (leave empty)
Duration: (leave empty for no expiration)
Save the Key ID and Application Key (shown only once!).
Configure Credentials
Add to .env:
B2_KEY_ID=your_key_id_here
B2_APPLICATION_KEY=your_application_key_here
B2_BUCKET_NAME=fluxer-cassandra-backups
B2_ENDPOINT=s3.us-west-002.backblazeb2.com
B2_REGION=us-west-002
B2 Lifecycle Rules
Optional: Configure B2 lifecycle rules for additional retention control:
{
  "daysFromHidingToDeleting": 1,
  "daysFromUploadingToHiding": 7,
  "fileNamePrefix": "cassandra-backup-"
}
This hides backups 7 days after upload and permanently deletes them one day later, even if the backup script fails to purge them.
Restore Procedures
Local Development Restore
Restore a backup to a local Cassandra instance:
#!/bin/bash
set -eu
# 1. Create fresh Cassandra instance
docker volume create cassandra_data
docker run -d --name cass \
-v cassandra_data:/var/lib/cassandra \
-p 9042:9042 \
cassandra:5.0
echo "Waiting for Cassandra to start..."
sleep 30
# 2. Install age and copy backup
docker exec cass sh -c 'apt-get update -qq && apt-get install -y -qq age'
docker cp ~/Downloads/cassandra-backup-20260304-103000.tar.age cass:/tmp/backup.tar.age
docker cp ~/Downloads/age_private_key.txt cass:/tmp/key.txt
# 3. Decrypt and extract into the data volume, so the extracted files
#    are visible to the helper container used below (the container's
#    /tmp is not shared between containers)
docker exec cass sh -c \
  'age -d -i /tmp/key.txt /tmp/backup.tar.age | tar -C /var/lib/cassandra -xf -'
# 4. Apply schema
docker exec cass sh -c \
  'sed "/^WARNING:/d" /var/lib/cassandra/cassandra-backup-*/schema.cql | cqlsh'
echo "Schema applied. Stopping Cassandra to restore SSTables..."
# 5. Stop Cassandra and restore files
docker stop cass
docker run -d --name cass-util \
-v cassandra_data:/var/lib/cassandra \
--entrypoint sleep \
cassandra:5.0 infinity
# Copy SSTable files to data directories
# bash is required for the [[ ... =~ ... ]] tests below
docker exec cass-util bash -c '
BACKUP_DIR=$(ls -d /var/lib/cassandra/cassandra-backup-* | head -1)
DATA_DIR=/var/lib/cassandra/data
for keyspace_dir in "$BACKUP_DIR"/*/; do
keyspace=$(basename "$keyspace_dir")
[[ "$keyspace" =~ ^system ]] && continue
[ ! -d "$keyspace_dir" ] && continue
for snapshot_dir in "$keyspace_dir"/*/snapshots/backup-*/; do
[ ! -d "$snapshot_dir" ] && continue
table_with_uuid=$(basename $(dirname $(dirname "$snapshot_dir")))
table_name=$(echo "$table_with_uuid" | cut -d- -f1)
target_dir=$(ls -d "$DATA_DIR/$keyspace/${table_name}"-* 2>/dev/null | head -1)
if [ -n "$target_dir" ]; then
echo "Restoring $keyspace.$table_name"
cp "$snapshot_dir"/* "$target_dir"/ 2>/dev/null || true
fi
done
done
chown -R cassandra:cassandra "$DATA_DIR"
'
# 6. Restart Cassandra
docker rm -f cass-util
docker start cass
echo "Waiting for Cassandra to restart..."
sleep 30
# 7. Refresh tables to load restored SSTables
docker exec cass bash -c '
BACKUP_DIR=$(ls -d /var/lib/cassandra/cassandra-backup-* | head -1)
for keyspace_dir in "$BACKUP_DIR"/*/; do
keyspace=$(basename "$keyspace_dir")
[[ "$keyspace" =~ ^system ]] && continue
for snapshot_dir in "$keyspace_dir"/*/snapshots/backup-*/; do
[ ! -d "$snapshot_dir" ] && continue
table_with_uuid=$(basename $(dirname $(dirname "$snapshot_dir")))
table_name=$(echo "$table_with_uuid" | cut -d- -f1)
echo "Refreshing $keyspace.$table_name"
nodetool refresh -- "$keyspace" "$table_name" 2>&1 | grep -v deprecated || true
done
done
'
# 8. Verify
echo "Verifying restore..."
docker exec cass cqlsh -e "SELECT COUNT(*) FROM fluxer.users;"
docker exec cass cqlsh -e "SELECT COUNT(*) FROM fluxer.messages;"
echo "Restore complete!"
Production Restore from B2
Restore a production backup from Backblaze B2:
#!/bin/bash
set -eu
# Configuration
# Configuration
BACKUP_NAME="cassandra-backup-20260304-103000.tar.age"
CASSANDRA_CONTAINER="cassandra-prod"
AGE_PRIVATE_KEY_FILE="/secure/age_private_key.txt"
# B2 credentials
export AWS_ACCESS_KEY_ID="${B2_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${B2_APPLICATION_KEY}"
export AWS_DEFAULT_REGION="${B2_REGION}"
B2_ENDPOINT_URL="https://${B2_ENDPOINT}"
echo "[1/9] Downloading backup from B2..."
aws s3 cp "s3://${B2_BUCKET_NAME}/${BACKUP_NAME}" \
  "/tmp/${BACKUP_NAME}" \
  --endpoint-url="${B2_ENDPOINT_URL}"
echo "[2/9] Stopping Cassandra..."
docker stop "${CASSANDRA_CONTAINER}"
echo "[3/9] Starting helper container on the data volume..."
docker run -d --name cass-restore-util \
  --volumes-from "${CASSANDRA_CONTAINER}" \
  --entrypoint sleep \
  cassandra:5.0 infinity
echo "[4/9] Copying backup and key into helper container..."
docker cp "/tmp/${BACKUP_NAME}" cass-restore-util:/tmp/
docker cp "${AGE_PRIVATE_KEY_FILE}" cass-restore-util:/tmp/key.txt
echo "[5/9] Installing age and extracting backup..."
docker exec cass-restore-util sh -c \
  'apt-get update -qq && apt-get install -y -qq age'
docker exec cass-restore-util sh -c \
  "age -d -i /tmp/key.txt /tmp/${BACKUP_NAME} | tar -C /tmp -xf -"
echo "[6/9] Restoring SSTable files..."
docker exec cass-restore-util bash -c '
BACKUP_DIR=$(ls -d /tmp/cassandra-backup-* | head -1)
DATA_DIR=/var/lib/cassandra/data
# Remove existing data (DESTRUCTIVE!)
echo "WARNING: Removing existing data..."
rm -rf "$DATA_DIR/fluxer"
# Restore from backup
for keyspace_dir in "$BACKUP_DIR"/*/; do
keyspace=$(basename "$keyspace_dir")
[[ "$keyspace" =~ ^system ]] && continue
for table_dir in "$keyspace_dir"/*/; do
[ ! -d "$table_dir" ] && continue
table_with_uuid=$(basename "$table_dir")
mkdir -p "$DATA_DIR/$keyspace/$table_with_uuid"
for snapshot_dir in "$table_dir"/snapshots/backup-*/; do
[ ! -d "$snapshot_dir" ] && continue
cp -v "$snapshot_dir"/* "$DATA_DIR/$keyspace/$table_with_uuid/" || true
done
done
done
chown -R cassandra:cassandra "$DATA_DIR"
'
echo "[7/9] Restarting Cassandra..."
docker rm -f cass-restore-util
docker start "${CASSANDRA_CONTAINER}"
sleep 60
echo "[8/9] Refreshing tables..."
docker exec "${CASSANDRA_CONTAINER}" bash -c '
# DESC KEYSPACES prints names space-separated on one line, so split
# onto separate lines before filtering out system keyspaces
for keyspace in $(cqlsh -e "DESC KEYSPACES" | tr " " "\n" | grep -v "^system" | grep -v "^$"); do
  for table in $(cqlsh -k "$keyspace" -e "DESC TABLES"); do
    echo "Refreshing $keyspace.$table"
    nodetool refresh -- "$keyspace" "$table" 2>&1 | grep -v deprecated || true
  done
done
'
echo "[9/9] Verifying restore..."
docker exec "${CASSANDRA_CONTAINER}" cqlsh -e \
  "SELECT keyspace_name, COUNT(*) FROM system_schema.tables GROUP BY keyspace_name;"
echo "Cleanup..."
rm -f "/tmp/${BACKUP_NAME}"
echo "Restore complete!"
Production restores are destructive and will delete existing data. Always test in a staging environment first!
Disaster Recovery Plan
Recovery Time Objective (RTO)
Target: 4 hours from disaster declaration to service restoration
Declare Incident (T+0)
Notify team via on-call system
Create incident channel
Assign incident commander
Assess Damage (T+30min)
Identify affected services
Determine data loss extent
Select recovery strategy
Provision Infrastructure (T+1h)
Deploy new servers if needed
Restore network configuration
Configure firewall rules
Restore Data (T+2.5h)
Download latest backup from B2
Decrypt and verify backup integrity
Restore to new Cassandra cluster
Verify and Test (T+3.5h)
Run data integrity checks
Test critical user flows
Perform smoke tests
Resume Service (T+4h)
Update DNS records
Announce restoration
Monitor for issues
Recovery Point Objective (RPO)
Target: 1 hour maximum data loss
Hourly Cassandra backups
Latest backup is at most 59 minutes old
Additional protection via replication (if multi-node cluster)
Backup Monitoring
Alerts
Set up monitoring for backup health:
alerts:
  - name: Backup Failed
    condition: time_since_last_backup > 2h
    severity: critical
    notification: pagerduty
  - name: Backup Size Anomaly
    condition: backup_size > (avg_backup_size * 1.5)
    severity: warning
    notification: slack
  - name: B2 Upload Failed
    condition: b2_upload_error_count > 0
    severity: critical
    notification: pagerduty
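The time_since_last_backup metric can be derived from the timestamp embedded in the newest object's name. A sketch, assuming GNU date and bash; backup_age_seconds is a hypothetical helper, not part of the shipped monitoring stack:

```shell
# Age (in seconds) of a backup, parsed from its backup-YYYYMMDD-HHMMSS name.
backup_age_seconds() {
  name=$1   # e.g. cassandra-backup-20260304-103000.tar.age
  now=$2    # current epoch seconds
  d=$(echo "$name" | sed -E 's/.*backup-([0-9]{8})-[0-9]{6}.*/\1/')
  t=$(echo "$name" | sed -E 's/.*backup-[0-9]{8}-([0-9]{6}).*/\1/')
  epoch=$(date -u -d "${d:0:4}-${d:4:2}-${d:6:2} ${t:0:2}:${t:2:2}:${t:4:2}" +%s)
  echo $(( now - epoch ))
}

# Fire the "Backup Failed" alert when the newest backup exceeds 2h (7200 s):
# [ "$(backup_age_seconds "$newest" "$(date -u +%s)")" -gt 7200 ] && page_oncall
```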
Verification
Regularly verify backups are restorable:
#!/bin/bash
# Weekly backup verification job
# 1. Pick a random backup from the 10 most recent
BACKUP=$(aws s3 ls s3://fluxer-cassandra-backups/ | \
  grep cassandra-backup | \
  tail -10 | \
  shuf -n 1 | \
  awk '{print $4}')
# 2. Attempt restore to test environment
./restore-to-test.sh "$BACKUP"
# 3. Run verification queries and report results
if docker exec cassandra-test cqlsh -e "SELECT COUNT(*) FROM fluxer.users;"; then
  echo "✓ Backup verification passed: $BACKUP"
else
  echo "✗ Backup verification FAILED: $BACKUP" | \
    slack-send --channel '#alerts'
fi
# 4. Cleanup
docker stop cassandra-test
docker rm cassandra-test
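To run this automatically each week, the script can be scheduled via cron (the script path below is an assumption):

```
# Run the backup verification job every Sunday at 03:00
0 3 * * 0  /opt/fluxer/verify-backup.sh >> /var/log/backup-verify.log 2>&1
```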
Best Practices
Test Restores Regularly: Schedule monthly disaster recovery drills to practice restore procedures and verify backup integrity.
Store Keys Securely: Keep multiple copies of the age private key in geographically distributed secure locations.
Monitor Backup Size: Track backup size trends to detect data growth or corruption early.
Document Procedures: Keep runbooks updated with screenshots and recent examples for on-call engineers.
See Also