
Overview

Fluxer implements automated, encrypted backups for all critical data stores, ensuring data durability and enabling disaster recovery. The backup strategy includes:
  • Automated snapshots - Hourly Cassandra snapshots with retention policies
  • Encryption - Age public-key encryption for backup security
  • Off-site storage - Backblaze B2 object storage for geographic redundancy
  • Point-in-time recovery - Restore to any hourly snapshot within 7 days
Backups are only useful if you can restore them. Test your restore procedures regularly!

Backup Architecture

Data Stores

What’s backed up:
  • All keyspaces and tables (except system tables)
  • Schema definitions (CQL)
  • Cluster topology metadata
Backup method: Snapshot-based (nodetool snapshot)
Frequency: Hourly
Retention: 168 backups (7 days)

Cassandra Backup System

Automated Backup Service

The cassandra-backup container runs hourly backups:
1

Create Snapshot

nodetool snapshot -t backup-20260304-103000
Creates immutable point-in-time snapshots of all SSTables.
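The snapshot tag encodes the UTC timestamp of the run. A sketch of how such a tag can be generated (the exact convention lives in backup.sh; this naming is an assumption based on the examples on this page):

```shell
# Build a snapshot tag from the current UTC time, e.g. backup-20260304-103000
# (assumed naming convention; check backup.sh for the authoritative format)
SNAPSHOT_TAG="backup-$(date -u +%Y%m%d-%H%M%S)"
echo "$SNAPSHOT_TAG"

# The tag is then passed to nodetool:
# nodetool snapshot -t "$SNAPSHOT_TAG"
```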
2

Collect Snapshot Files

# Find all snapshot directories
find /var/lib/cassandra/data -type d -name "backup-20260304-103000"

# Copy snapshot contents to a staging directory,
# preserving the keyspace/table directory layout
cp -r snapshots/* /tmp/cassandra-backup-20260304-103000/
3

Export Schema

cqlsh -e "DESC SCHEMA;" > /tmp/cassandra-backup-20260304-103000/schema.cql
Saves table definitions, indexes, and materialized views.
4

Save Cluster Metadata

nodetool describecluster > cluster_topology.txt
nodetool status > cluster_status.txt
5

Compress and Encrypt

tar -cf - cassandra-backup-20260304-103000 | \
  age -r age1xxxxxx... -o backup.tar.age
Uses age for public-key encryption.
6

Upload to B2

aws s3 cp backup.tar.age \
  s3://fluxer-cassandra-backups/cassandra-backup-20260304-103000.tar.age \
  --endpoint-url=https://s3.us-west-002.backblazeb2.com
7

Cleanup

# Remove local encrypted backup
rm -f backup.tar.age

# Clear snapshot from Cassandra
nodetool clearsnapshot -t backup-20260304-103000

# Purge old backups (keep 168)
# Deletes backups older than 7 days from B2
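The purge step is only summarized above; one way it might be implemented (a sketch, not the actual backup.sh logic — `head -n -N` is GNU-specific, and backup names sort chronologically because of the embedded timestamp):

```shell
# select_purgeable: read backup object names on stdin and print all but the
# newest $1 of them (names sort chronologically thanks to the timestamp)
select_purgeable() {
  grep '^cassandra-backup-' | sort | head -n "-$1"
}

# Delete everything older than the newest MAX_BACKUP_COUNT (default 168)
if [ -n "${B2_BUCKET_NAME:-}" ]; then
  aws s3 ls "s3://${B2_BUCKET_NAME}/" --endpoint-url="https://${B2_ENDPOINT}" |
    awk '{print $4}' |
    select_purgeable "${MAX_BACKUP_COUNT:-168}" |
    while read -r old; do
      aws s3 rm "s3://${B2_BUCKET_NAME}/${old}" --endpoint-url="https://${B2_ENDPOINT}"
    done
fi
```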

Backup Script

The backup process is automated via fluxer_devops/cassandra/backup.sh:
# Run backup manually
docker exec cassandra-backup /backup.sh
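How the container schedules the hourly runs is not shown here. A minimal entrypoint loop could look like the following (a sketch; the real container may use cron or a supervisor instead):

```shell
# seconds_until_next_hour: given epoch seconds, how long until the next :00
seconds_until_next_hour() {
  echo $(( 3600 - $1 % 3600 ))
}

# Hypothetical entrypoint loop (illustrative only, not the actual backup.sh):
# while true; do
#   sleep "$(seconds_until_next_hour "$(date +%s)")"
#   /backup.sh
# done
```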

Environment Configuration

Configure backup settings in .env:
.env
# Age encryption (REQUIRED)
AGE_PUBLIC_KEY=age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AGE_PUBLIC_KEY_FILE=/tmp/age_public_key.txt

# Backblaze B2 (REQUIRED for off-site backups)
B2_KEY_ID=your-b2-key-id
B2_APPLICATION_KEY=your-b2-application-key
B2_BUCKET_NAME=fluxer-cassandra-backups
B2_ENDPOINT=s3.us-west-002.backblazeb2.com
B2_REGION=us-west-002

# Cassandra connection
CASSANDRA_HOST=cassandra
CASSANDRA_PASSWORD=your-cassandra-password

# Retention (optional, default: 168 = 7 days)
MAX_BACKUP_COUNT=168
Without age encryption configured, backups are stored locally but NOT uploaded to B2. This is useful for development but not recommended for production.
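The behavior above can be sketched as a guard at the top of a backup script (hypothetical — backup.sh's actual check may differ):

```shell
# should_upload: push to B2 only when an age recipient is configured
# (an empty argument means encryption is not set up)
should_upload() {
  [ -n "${1:-}" ]
}

if should_upload "${AGE_PUBLIC_KEY:-}"; then
  echo "encrypting with age and uploading to B2"
else
  echo "AGE_PUBLIC_KEY unset: keeping unencrypted local backup only" >&2
fi
```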

Encryption Setup

Generate Age Key Pair

1

Install age

# Debian/Ubuntu
apt install age

# macOS
brew install age

# Arch Linux
pacman -S age
2

Generate Keypair

age-keygen -o age_private_key.txt
Output:
# created: 2026-03-04T10:30:00Z
# public key: age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AGE-SECRET-KEY-1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
3

Store Private Key Securely

The private key is the ONLY way to decrypt backups. Losing it means permanent data loss!
Store in multiple secure locations:
  • Password manager (1Password, Bitwarden, etc.)
  • Hardware security key (YubiKey with age-plugin-yubikey)
  • Encrypted USB drive in safe deposit box
  • Printed paper backup in fireproof safe
4

Configure Public Key

Add to .env:
AGE_PUBLIC_KEY=age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Safe to commit to Git (public keys can live in version control; never commit the private key).

Test Encryption

Verify encryption/decryption works:
# Create test file
echo "test data" > test.txt

# Encrypt with public key
age -r age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  -o test.txt.age test.txt

# Decrypt with private key
age -d -i age_private_key.txt test.txt.age

# Should output: test data

Backblaze B2 Setup

1

Create B2 Account

Sign up at backblaze.com/b2
2

Create Bucket

Bucket Name: fluxer-cassandra-backups
Files in Bucket: Private
Encryption: Disabled (backups are already encrypted with age)
Object Lock: Disabled
Lifecycle Settings: None (handled by backup script)
3

Create Application Key

Navigate to App Keys → Add a New Application Key:
Name: fluxer-cassandra-backup
Access: Read and Write
Bucket: fluxer-cassandra-backups
File name prefix: (leave empty)
Duration: (leave empty for no expiration)
Save the Key ID and Application Key (shown only once!).
4

Configure Credentials

Add to .env:
B2_KEY_ID=your_key_id_here
B2_APPLICATION_KEY=your_application_key_here
B2_BUCKET_NAME=fluxer-cassandra-backups
B2_ENDPOINT=s3.us-west-002.backblazeb2.com
B2_REGION=us-west-002

B2 Lifecycle Rules

Optional: Configure B2 lifecycle rules for additional retention control:
{
  "daysFromHidingToDeleting": 1,
  "daysFromUploadingToHiding": 7,
  "fileNamePrefix": "cassandra-backup-"
}
This ensures backups are automatically deleted 7 days after upload, even if the backup script fails to purge them.

Restore Procedures

Local Development Restore

Restore a backup to a local Cassandra instance:
#!/bin/bash
set -eu

# 1. Create fresh Cassandra instance
docker volume create cassandra_data
docker run -d --name cass \
  -v cassandra_data:/var/lib/cassandra \
  -p 9042:9042 \
  cassandra:5.0

echo "Waiting for Cassandra to start..."
sleep 30

# 2. Install age and copy backup
docker exec cass sh -c 'apt-get update -qq && apt-get install -y -qq age'
docker cp ~/Downloads/cassandra-backup-20260304-103000.tar.age cass:/tmp/backup.tar.age
docker cp ~/Downloads/age_private_key.txt cass:/tmp/key.txt

# 3. Decrypt and extract
# Extract onto the mounted volume so the files remain visible to the
# utility container used below (/tmp is container-local)
docker exec cass sh -c \
  'age -d -i /tmp/key.txt /tmp/backup.tar.age | tar -C /var/lib/cassandra -xf -'

# 4. Apply schema
docker exec cass sh -c \
  'sed "/^WARNING:/d" /var/lib/cassandra/cassandra-backup-*/schema.cql | cqlsh'

echo "Schema applied. Stopping Cassandra to restore SSTables..."

# 5. Stop Cassandra and restore files
docker stop cass

docker run -d --name cass-util \
  -v cassandra_data:/var/lib/cassandra \
  --entrypoint sleep \
  cassandra:5.0 infinity

# Copy SSTable files to data directories
# bash (not sh): the loop below relies on [[ =~ ]]
docker exec cass-util bash -c '
  BACKUP_DIR=$(ls -d /var/lib/cassandra/cassandra-backup-* | head -1)
  DATA_DIR=/var/lib/cassandra/data
  
  for keyspace_dir in "$BACKUP_DIR"/*/; do
    keyspace=$(basename "$keyspace_dir")
    [[ "$keyspace" =~ ^system ]] && continue
    [ ! -d "$keyspace_dir" ] && continue
    
    for snapshot_dir in "$keyspace_dir"/*/snapshots/backup-*/; do
      [ ! -d "$snapshot_dir" ] && continue
      table_with_uuid=$(basename $(dirname $(dirname "$snapshot_dir")))
      table_name=$(echo "$table_with_uuid" | cut -d- -f1)
      target_dir=$(ls -d "$DATA_DIR/$keyspace/${table_name}"-* 2>/dev/null | head -1)
      
      if [ -n "$target_dir" ]; then
        echo "Restoring $keyspace.$table_name"
        cp "$snapshot_dir"/* "$target_dir"/ 2>/dev/null || true
      fi
    done
  done
  
  chown -R cassandra:cassandra "$DATA_DIR"
'

# 6. Restart Cassandra
docker rm -f cass-util
docker start cass
echo "Waiting for Cassandra to restart..."
sleep 30

# 7. Refresh tables to load restored SSTables
# bash (not sh): the loop below relies on [[ =~ ]]
docker exec cass bash -c '
  BACKUP_DIR=$(ls -d /var/lib/cassandra/cassandra-backup-* | head -1)
  
  for keyspace_dir in "$BACKUP_DIR"/*/; do
    keyspace=$(basename "$keyspace_dir")
    [[ "$keyspace" =~ ^system ]] && continue
    
    for snapshot_dir in "$keyspace_dir"/*/snapshots/backup-*/; do
      [ ! -d "$snapshot_dir" ] && continue
      table_with_uuid=$(basename $(dirname $(dirname "$snapshot_dir")))
      table_name=$(echo "$table_with_uuid" | cut -d- -f1)
      echo "Refreshing $keyspace.$table_name"
      nodetool refresh -- "$keyspace" "$table_name" 2>&1 | grep -v deprecated || true
    done
  done
'

# 8. Verify
echo "Verifying restore..."
docker exec cass cqlsh -e "SELECT COUNT(*) FROM fluxer.users;"
docker exec cass cqlsh -e "SELECT COUNT(*) FROM fluxer.messages;"

echo "Restore complete!"

Production Restore from B2

Restore a production backup from Backblaze B2:
#!/bin/bash
set -eu

# Configuration
BACKUP_NAME="cassandra-backup-20260304-103000.tar.age"
CASSANDRA_CONTAINER="cassandra-prod"
AGE_PRIVATE_KEY_FILE="/secure/age_private_key.txt"

# B2 credentials
export AWS_ACCESS_KEY_ID="${B2_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${B2_APPLICATION_KEY}"
export AWS_DEFAULT_REGION="${B2_REGION}"
B2_ENDPOINT_URL="https://${B2_ENDPOINT}"

echo "[1/9] Downloading backup from B2..."
aws s3 cp "s3://${B2_BUCKET_NAME}/${BACKUP_NAME}" \
  "/tmp/${BACKUP_NAME}" \
  --endpoint-url="${B2_ENDPOINT_URL}"

echo "[2/9] Stopping Cassandra..."
docker stop ${CASSANDRA_CONTAINER}

echo "[3/9] Starting restore utility container..."
docker run -d --name cass-restore-util \
  --volumes-from ${CASSANDRA_CONTAINER} \
  --entrypoint sleep \
  cassandra:5.0 infinity

echo "[4/9] Copying backup and key into utility container..."
# /tmp is container-local (not on the shared volume), so the files must go
# into the container that performs the decryption
docker cp "/tmp/${BACKUP_NAME}" cass-restore-util:/tmp/
docker cp "${AGE_PRIVATE_KEY_FILE}" cass-restore-util:/tmp/key.txt

echo "[5/9] Installing age and extracting backup..."
docker exec cass-restore-util sh -c \
  'apt-get update -qq && apt-get install -y -qq age'
docker exec cass-restore-util sh -c \
  "age -d -i /tmp/key.txt /tmp/${BACKUP_NAME} | tar -C /tmp -xf -"

echo "[6/9] Restoring SSTable files..."
# bash (not sh): the loop below relies on [[ =~ ]]
docker exec cass-restore-util bash -c '
  BACKUP_DIR=$(ls -d /tmp/cassandra-backup-* | head -1)
  DATA_DIR=/var/lib/cassandra/data
  
  # Remove existing data (DESTRUCTIVE!)
  echo "WARNING: Removing existing data..."
  rm -rf "$DATA_DIR/fluxer"
  
  # Restore from backup
  for keyspace_dir in "$BACKUP_DIR"/*/; do
    keyspace=$(basename "$keyspace_dir")
    [[ "$keyspace" =~ ^system ]] && continue
    
    for table_dir in "$keyspace_dir"/*/; do
      [ ! -d "$table_dir" ] && continue
      table_with_uuid=$(basename "$table_dir")
      
      mkdir -p "$DATA_DIR/$keyspace/$table_with_uuid"
      
      for snapshot_dir in "$table_dir"/snapshots/backup-*/; do
        [ ! -d "$snapshot_dir" ] && continue
        cp -v "$snapshot_dir"/* "$DATA_DIR/$keyspace/$table_with_uuid/" || true
      done
    done
  done
  
  chown -R cassandra:cassandra "$DATA_DIR"
'

echo "[7/9] Restarting Cassandra..."
docker rm -f cass-restore-util
docker start ${CASSANDRA_CONTAINER}
sleep 60

echo "[8/9] Refreshing tables..."
docker exec ${CASSANDRA_CONTAINER} sh -c '
  # cqlsh prints names space-separated; split onto lines before filtering
  for keyspace in $(echo "DESC KEYSPACES;" | cqlsh | tr -s " " "\n" | grep -v "^system" | grep -v "^$"); do
    for table in $(echo "DESC TABLES;" | cqlsh -k "$keyspace" | tr -s " " "\n" | grep -v "^$"); do
      echo "Refreshing $keyspace.$table"
      nodetool refresh -- "$keyspace" "$table" 2>&1 | grep -v deprecated || true
    done
  done
'

echo "[9/9] Verifying restore..."
docker exec ${CASSANDRA_CONTAINER} cqlsh -e \
  "SELECT keyspace_name, COUNT(*) FROM system_schema.tables GROUP BY keyspace_name;"

echo "Cleanup..."
rm -f "/tmp/${BACKUP_NAME}"

echo "Restore complete!"
Production restores are destructive and will delete existing data. Always test in a staging environment first!

Disaster Recovery Plan

Recovery Time Objective (RTO)

Target: 4 hours from disaster declaration to service restoration
1

Declare Incident (T+0)

  • Notify team via on-call system
  • Create incident channel
  • Assign incident commander
2

Assess Damage (T+30min)

  • Identify affected services
  • Determine data loss extent
  • Select recovery strategy
3

Provision Infrastructure (T+1h)

  • Deploy new servers if needed
  • Restore network configuration
  • Configure firewall rules
4

Restore Data (T+2.5h)

  • Download latest backup from B2
  • Decrypt and verify backup integrity
  • Restore to new Cassandra cluster
5

Verify and Test (T+3.5h)

  • Run data integrity checks
  • Test critical user flows
  • Perform smoke tests
6

Resume Service (T+4h)

  • Update DNS records
  • Announce restoration
  • Monitor for issues

Recovery Point Objective (RPO)

Target: 1 hour maximum data loss
  • Hourly Cassandra backups
  • Latest backup is at most 59 minutes old
  • Additional protection via replication (if multi-node cluster)

Backup Monitoring

Alerts

Set up monitoring for backup health:
alerts:
  - name: Backup Failed
    condition: time_since_last_backup > 2h
    severity: critical
    notification: pagerduty
  
  - name: Backup Size Anomaly
    condition: backup_size > (avg_backup_size * 1.5) OR backup_size < (avg_backup_size * 0.5)
    severity: warning
    notification: slack
  
  - name: B2 Upload Failed
    condition: b2_upload_error_count > 0
    severity: critical
    notification: pagerduty
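The `time_since_last_backup` metric can be derived from the newest object name in the bucket. A sketch assuming GNU `date` and the naming scheme used on this page (the function name is illustrative, not from any Fluxer script):

```shell
# backup_age_seconds NAME NOW_EPOCH
# Parse the UTC timestamp out of a backup object name, e.g.
# cassandra-backup-20260304-103000.tar.age, and print its age in seconds.
backup_age_seconds() {
  name=$1
  now=$2
  day=$(echo "$name" | sed 's/^cassandra-backup-\([0-9]\{8\}\)-.*/\1/')
  hms=$(echo "$name" | sed 's/^cassandra-backup-[0-9]\{8\}-\([0-9]\{6\}\).*/\1/')
  hh=$(echo "$hms" | cut -c1-2)
  mm=$(echo "$hms" | cut -c3-4)
  ss=$(echo "$hms" | cut -c5-6)
  stamp=$(date -u -d "$day $hh:$mm:$ss" +%s)   # GNU date required for -d
  echo $(( now - stamp ))
}
```

With this, the "Backup Failed" alert corresponds to the returned age exceeding 7200 seconds (2 hours).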

Verification

Regularly verify backups are restorable:
#!/bin/bash
# Weekly backup verification job

# 1. Pick a random recent backup (B2 requires the custom endpoint)
BACKUP=$(aws s3 ls s3://fluxer-cassandra-backups/ \
    --endpoint-url=https://s3.us-west-002.backblazeb2.com | \
  grep cassandra-backup | \
  tail -10 | \
  shuf -n 1 | \
  awk '{print $4}')

# 2. Attempt restore to test environment
# 3. Run verification queries
# 4. Report results (a restore or query failure marks the backup bad)
if ./restore-to-test.sh "$BACKUP" && \
   docker exec cassandra-test cqlsh -e "SELECT COUNT(*) FROM fluxer.users;"; then
  echo "✓ Backup verification passed: $BACKUP"
else
  echo "✗ Backup verification FAILED: $BACKUP" | \
    slack-send --channel '#alerts'
fi

# 5. Cleanup
docker stop cassandra-test
docker rm cassandra-test

Best Practices

Test Restores Regularly

Schedule monthly disaster recovery drills to practice restore procedures and verify backup integrity.

Store Keys Securely

Keep multiple copies of age private key in geographically distributed secure locations.

Monitor Backup Size

Track backup size trends to detect data growth or corruption early.

Document Procedures

Keep runbooks updated with screenshots and recent examples for on-call engineers.
