Backup & Restore - vCluster

Overview

Backup and restore operations ensure business continuity and data protection for your virtual clusters. vCluster provides snapshot-based backup and restore capabilities that capture the complete state of your virtual environment.

Understanding Backup vs Snapshot

While closely related, these terms have distinct meanings:

Snapshot

A point-in-time capture of cluster state, typically short-term and used for quick recovery or cloning.

Backup

A snapshot intended for long-term retention, disaster recovery, and compliance purposes.

In practice, both use the same snapshot mechanism; they differ only in retention policy and purpose.

Creating Backups

Full Backup (Resources Only)

Create a backup of all Kubernetes resources:

vcluster snapshot create my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-full-$(date +%Y%m%d-%H%M%S) \
  --namespace production

This captures:

All Kubernetes resources (Deployments, StatefulSets, Services, etc.)
ConfigMaps and Secrets
Custom Resource Definitions and custom resources
RBAC policies and service accounts
Network policies

Full Backup (Including Volumes)

Create a complete backup including persistent volume data:

vcluster snapshot create my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-complete-$(date +%Y%m%d-%H%M%S) \
  --namespace production \
  --include-volumes

Volume snapshots require a CSI driver that supports snapshots and the VolumeSnapshot CRDs installed in the host cluster.
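Before running a backup with volumes, it can help to confirm that the VolumeSnapshot CRDs are present. The following is a minimal sketch, not part of vCluster: the function name `missing_snapshot_crds` is hypothetical, and the three CRD names are the ones shipped by the upstream Kubernetes external-snapshotter project. The function reads installed CRD names on stdin and prints any that are missing.

```shell
#!/bin/sh
# Sketch: report which VolumeSnapshot CRDs are missing from the host cluster.
# Reads installed CRD names (one per line) on stdin, prints the missing ones.
# missing_snapshot_crds is a hypothetical helper, not a vCluster command.
missing_snapshot_crds() {
  local installed required crd
  installed=$(cat)
  required="volumesnapshotclasses.snapshot.storage.k8s.io
volumesnapshotcontents.snapshot.storage.k8s.io
volumesnapshots.snapshot.storage.k8s.io"
  for crd in $required; do
    # -x matches the whole line, -F treats the name as a fixed string
    if ! printf '%s\n' "$installed" | grep -qxF "$crd"; then
      echo "$crd"
    fi
  done
}

# Example (run against the host cluster, not the virtual cluster):
#   kubectl get crd -o name | sed 's|.*/||' | missing_snapshot_crds
```

An empty result means all three CRDs exist. It does not prove that the CSI driver itself supports snapshots, which still has to be checked against the driver's documentation.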

Restoring from Backup

Basic Restore

Restore a virtual cluster from a backup:

vcluster restore my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-full-20240115-140530 \
  --namespace production

Prepare Virtual Cluster

Ensure the target virtual cluster exists and is running:

vcluster list --namespace production

If it doesn’t exist, create it:

vcluster create my-vcluster --namespace production

Execute Restore

Run the restore command:

vcluster restore my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-full-20240115-140530 \
  --namespace production

Verify Restoration

Connect to the virtual cluster and verify resources:

vcluster connect my-vcluster --namespace production
kubectl get all --all-namespaces

Restore with Volumes

Restore including persistent volume data:

vcluster restore my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-complete-20240115-140530 \
  --namespace production \
  --restore-volumes

Important considerations:

Existing data in PVCs will be overwritten
Volume restoration can take significant time for large volumes
Storage class must support volume snapshots

Backup Strategies

Strategy 1: Scheduled Backups

Implement automated daily backups using CronJobs:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: vcluster-daily-backup
  namespace: backup-system
spec:
  # Run daily at 2 AM
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vcluster-backup-sa
          containers:
          - name: backup
            image: loftsh/vcluster-cli:latest
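            command:
            - /bin/sh
            - -c
            - |
              # Stop at the first failed command so the Job is marked failed
              set -e
              # Compute the timestamp in the shell; Kubernetes does not run
              # shell command substitution in env values
              TIMESTAMP=$(date +%Y%m%d-%H%M%S)
              echo "Starting backup at $(date)"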
              
              # Backup production vcluster
              vcluster snapshot create production-vcluster \
                oci://ghcr.io/my-org/backups:production-${TIMESTAMP} \
                --namespace production
              
              # Backup staging vcluster
              vcluster snapshot create staging-vcluster \
                oci://ghcr.io/my-org/backups:staging-${TIMESTAMP} \
                --namespace staging
              
              echo "Backup completed at $(date)"
          restartPolicy: OnFailure
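When one Job backs up several virtual clusters, a failure in the first backup should not prevent the second from being attempted, yet the Job should still end as failed. A minimal sketch of that pattern follows; `backup_all` and `snapshot_one` are hypothetical names, and the per-cluster command is passed in so the loop can be exercised without a cluster.

```shell
#!/bin/sh
# Sketch: run a backup command for several virtual clusters, attempt all of
# them, and return non-zero if any failed. backup_all is a hypothetical helper.
# $1: command to run per cluster (called with <name> <namespace>)
# remaining arguments: "name:namespace" pairs
backup_all() {
  local backup_cmd="$1" failed=0 pair
  shift
  for pair in "$@"; do
    # ${pair%%:*} is the text before the first colon, ${pair#*:} the text after it
    if ! "$backup_cmd" "${pair%%:*}" "${pair#*:}"; then
      echo "backup failed: $pair" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example wiring for the CronJob script (snapshot_one is hypothetical):
#   snapshot_one() {
#     vcluster snapshot create "$1" \
#       "oci://ghcr.io/my-org/backups:$1-${TIMESTAMP}" --namespace "$2"
#   }
#   backup_all snapshot_one production-vcluster:production staging-vcluster:staging
```

The non-zero return value is what lets the Job controller and any failure alert notice a partial failure.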

Strategy 2: Pre-Change Backups

Create backups before significant changes:

#!/bin/bash
# pre-deployment-backup.sh

VCLUSTER_NAME="production-vcluster"
NAMESPACE="production"
BACKUP_REPO="oci://ghcr.io/my-org/backups"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

echo "Creating pre-deployment backup..."
if vcluster snapshot create "${VCLUSTER_NAME}" \
  "${BACKUP_REPO}:${VCLUSTER_NAME}-pre-deploy-${TIMESTAMP}" \
  --namespace "${NAMESPACE}"; then
  echo "Backup successful: ${VCLUSTER_NAME}-pre-deploy-${TIMESTAMP}"
  echo "Proceeding with deployment..."
  # Run deployment
  kubectl apply -f deployment.yaml
else
  echo "Backup failed! Aborting deployment."
  exit 1
fi
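Because the script assembles the tag from variables, a malformed value (for example a name containing a colon) only surfaces when the registry rejects the push. A small pre-check, sketched below, validates the tag against the tag grammar in the OCI distribution specification. The function name `valid_oci_tag` is hypothetical.

```shell
#!/bin/sh
# Sketch: validate a tag against the OCI tag grammar
# [a-zA-Z0-9_][a-zA-Z0-9._-]{0,127} before using it in a snapshot URL.
# valid_oci_tag is a hypothetical helper.
valid_oci_tag() {
  local tag="$1"
  # At most 128 characters in total
  [ "${#tag}" -le 128 ] || return 1
  # First character: letter, digit, or underscore; then letters, digits, ".", "_", "-"
  printf '%s\n' "$tag" | grep -Eq '^[A-Za-z0-9_][A-Za-z0-9._-]*$'
}

# Example:
#   TAG="${VCLUSTER_NAME}-pre-deploy-${TIMESTAMP}"
#   valid_oci_tag "$TAG" || { echo "invalid tag: $TAG" >&2; exit 1; }
```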

Strategy 3: 3-2-1 Backup Strategy

Implement the industry-standard 3-2-1 rule:

3 copies of your data
2 different storage types
1 offsite copy

#!/bin/bash
# 3-2-1-backup.sh

VCLUSTER_NAME="production-vcluster"
NAMESPACE="production"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

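# Note: each command below takes its own snapshot, so the three copies are
# taken moments apart and may differ slightly.
# Copy 1: Primary storage (OCI registry)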
vcluster snapshot create ${VCLUSTER_NAME} \
  oci://ghcr.io/my-org/backups:${VCLUSTER_NAME}-${TIMESTAMP} \
  --namespace ${NAMESPACE}

# Copy 2: Secondary storage (S3)
vcluster snapshot create ${VCLUSTER_NAME} \
  s3://my-backup-bucket/vclusters/${VCLUSTER_NAME}-${TIMESTAMP}.tar.gz \
  --namespace ${NAMESPACE}

# Copy 3: Offsite (different region or provider)
vcluster snapshot create ${VCLUSTER_NAME} \
  s3://my-dr-bucket/vclusters/${VCLUSTER_NAME}-${TIMESTAMP}.tar.gz \
  --namespace ${NAMESPACE}
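A destination list can be sanity-checked against the counting part of the 3-2-1 rule before any snapshot runs. The sketch below treats the URL scheme (`oci`, `s3`, and so on) as a stand-in for "storage type", which is an assumption made for illustration. Whether a copy is truly offsite cannot be derived from a URL and has to be confirmed separately. The function name `check_321` is hypothetical.

```shell
#!/bin/sh
# Sketch: check backup destinations against the countable parts of 3-2-1:
# at least 3 destinations and at least 2 distinct URL schemes.
# check_321 is a hypothetical helper; "offsite" is NOT verified here.
check_321() {
  local count schemes
  count=$#
  # Strip everything from "://" onward, then count the distinct schemes
  schemes=$(printf '%s\n' "$@" | sed 's|://.*||' | sort -u | wc -l | tr -d ' ')
  if [ "$count" -ge 3 ] && [ "$schemes" -ge 2 ]; then
    echo "ok: $count destinations, $schemes storage types"
  else
    echo "insufficient: $count destinations, $schemes storage types"
    return 1
  fi
}

# Example:
#   check_321 oci://ghcr.io/my-org/backups:tag \
#     s3://my-backup-bucket/vclusters/tag.tar.gz \
#     s3://my-dr-bucket/vclusters/tag.tar.gz
```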

Recovery Scenarios

Scenario 1: Accidental Deletion

Problem: A critical deployment was accidentally deleted.

Solution: Targeted Restore

List recent backups:

# List the tags in your registry with its own CLI or UI.
# Note: docker images shows only images already pulled to the local machine.
docker images ghcr.io/my-org/backups

Find the most recent backup before the deletion

Restore to a temporary cluster:

vcluster create recovery-temp --namespace recovery
vcluster restore recovery-temp \
  oci://ghcr.io/my-org/backups:production-20240115-140530 \
  --namespace recovery

Extract just the needed resources:

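vcluster connect recovery-temp --namespace recovery
kubectl get deployment critical-app -o yaml > critical-app.yaml
# Switch back to the host cluster context
vcluster disconnect
# Review critical-app.yaml and remove cluster-specific fields
# (status, metadata.uid, metadata.resourceVersion) before applying it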

Restore to production:

vcluster connect production-vcluster --namespace production
kubectl apply -f critical-app.yaml

Clean up temporary cluster:

vcluster delete recovery-temp --namespace recovery
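The "find the most recent backup before the deletion" step in this scenario can be scripted once you have a list of tags. The sketch below assumes the `<prefix>-YYYYMMDD-HHMMSS` naming used in the scheduled backup examples. The function name `latest_tag_before` is hypothetical, and how you obtain the tag list depends on your registry.

```shell
#!/bin/sh
# Sketch: pick the newest backup tag created before a given moment.
# Reads tags on stdin; assumes "<prefix>-YYYYMMDD-HHMMSS" tag names.
# latest_tag_before is a hypothetical helper.
# $1: tag prefix (e.g. "production"), $2: cutoff as YYYYMMDD-HHMMSS
latest_tag_before() {
  local prefix="$1" cutoff="$2"
  # Fixed-width timestamps sort chronologically as plain strings
  awk -v p="$prefix-" -v c="$cutoff" '
    index($0, p) == 1 {
      ts = substr($0, length(p) + 1)
      # Keep only well-formed timestamps that are older than the cutoff
      if (length(ts) == 15 && ts ~ /^[0-9]+-[0-9]+$/ && ts < c && ts > best) {
        best = ts
        tag = $0
      }
    }
    END { if (tag != "") print tag }
  '
}

# Example, with tags.txt holding one tag per line:
#   latest_tag_before production 20240115-140530 < tags.txt
```

Tags that do not follow the naming scheme, such as a floating `latest` tag, are ignored, and the function prints nothing when no backup predates the cutoff.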

Scenario 2: Complete Cluster Loss

Problem: The entire host cluster or virtual cluster is lost.

Solution: Full Restore

Set up new host cluster (if needed)

Add the vCluster Helm chart repository (the vcluster CLI must also be installed on your workstation):

helm repo add loft https://charts.loft.sh
helm repo update

Create new virtual cluster:

vcluster create production-vcluster \
  --namespace production \
  --create-namespace

Restore from most recent backup:

# production-latest is a placeholder; use the tag of your most recent backup
vcluster restore production-vcluster \
  oci://ghcr.io/my-org/backups:production-latest \
  --namespace production \
  --restore-volumes

Verify all services are running:

vcluster connect production-vcluster --namespace production
kubectl get all --all-namespaces
kubectl get pvc --all-namespaces

Update DNS and external access if needed

Scenario 3: Rollback After Failed Update

Problem: A Kubernetes version upgrade or major change caused issues.

Solution: Version Rollback

Identify the pre-upgrade backup:

# Should have been created before upgrade
BACKUP_TAG="production-pre-upgrade-20240115"

Pause the problematic cluster:

vcluster pause production-vcluster --namespace production

Restore from pre-upgrade backup:

vcluster resume production-vcluster --namespace production
vcluster restore production-vcluster \
  oci://ghcr.io/my-org/backups:${BACKUP_TAG} \
  --namespace production

Verify the rollback:

vcluster connect production-vcluster --namespace production
kubectl version
kubectl get nodes

Scenario 4: Cross-Cluster Migration

Problem: You need to move a virtual cluster to different infrastructure.

Solution: Backup and Restore Migration

Create final backup on source cluster:

kubectl config use-context source-cluster
vcluster snapshot create my-vcluster \
  oci://ghcr.io/my-org/backups:migration-final-$(date +%Y%m%d) \
  --namespace production

Prepare target cluster:

kubectl config use-context target-cluster
kubectl create namespace production

Create virtual cluster on target:

vcluster create my-vcluster \
  --namespace production \
  --values vcluster-config.yaml

Restore data:

vcluster restore my-vcluster \
  oci://ghcr.io/my-org/backups:migration-final-20240115 \
  --namespace production \
  --restore-volumes

Update external references (DNS, ingress, etc.)

Verify migration:

vcluster connect my-vcluster --namespace production
# Run your test suite

Decommission source cluster after validation period

Backup Retention and Cleanup

Manual Cleanup

Remove old backups manually:

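# For OCI registries: delete the remote tag with your registry's own CLI, UI, or API.
# Note: docker rmi removes only a locally pulled copy, not the tag in the registry.
docker rmi ghcr.io/my-org/backups:production-20231215-140530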

# For S3
aws s3 rm s3://my-backup-bucket/vclusters/production-20231215-140530.tar.gz

Automated Retention Policy

Implement retention policies in your backup script:

#!/bin/bash
# backup-with-retention.sh

VCLUSTER_NAME="production-vcluster"
NAMESPACE="production"
BACKUP_REPO="oci://ghcr.io/my-org/backups"
RETENTION_DAYS=30

# Create new backup; stop here on failure so old backups are not pruned
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
vcluster snapshot create "${VCLUSTER_NAME}" \
  "${BACKUP_REPO}:${VCLUSTER_NAME}-${TIMESTAMP}" \
  --namespace "${NAMESPACE}" || exit 1

# Delete backups older than retention period (date -d requires GNU date)
CUTOFF_DATE=$(date -d "${RETENTION_DAYS} days ago" +%Y%m%d)

# List and filter old backups.
# Note: docker images and docker rmi act only on locally pulled images; to
# prune the remote registry, substitute your registry's list and delete commands.
for tag in $(docker images ghcr.io/my-org/backups --format "{{.Tag}}" | grep "^${VCLUSTER_NAME}-"); do
  BACKUP_DATE=$(echo "$tag" | grep -oP '\d{8}' | head -1)
  if [ -n "$BACKUP_DATE" ] && [ "$BACKUP_DATE" -lt "$CUTOFF_DATE" ]; then
    echo "Removing old backup: $tag"
    docker rmi "ghcr.io/my-org/backups:$tag"
  fi
done
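Separating "which tags are expired" from "delete them" makes it possible to preview a retention run before anything is removed. The sketch below covers only the selection step and assumes tags end in `-YYYYMMDD-HHMMSS`, as in the examples on this page. The function name `expired_tags` is hypothetical.

```shell
#!/bin/sh
# Sketch: print the tags whose embedded date is older than a cutoff, without
# deleting anything. Reads tags on stdin. expired_tags is a hypothetical helper.
# $1: cutoff date as YYYYMMDD
expired_tags() {
  local cutoff="$1" tag day
  while IFS= read -r tag; do
    # Extract YYYYMMDD from a trailing "-YYYYMMDD-HHMMSS"
    day=$(printf '%s\n' "$tag" | sed -n 's/.*-\([0-9]\{8\}\)-[0-9]\{6\}$/\1/p')
    # Skip tags that do not follow the naming scheme (for example "latest")
    [ -n "$day" ] || continue
    if [ "$day" -lt "$cutoff" ]; then
      printf '%s\n' "$tag"
    fi
  done
}

# Example dry run, with tags.txt holding one tag per line:
#   expired_tags 20240101 < tags.txt
```

Once the printed list looks right, it can be fed to whatever delete command your registry provides.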

Monitoring and Verification

Verify Backup Integrity

Test backups regularly to ensure they can be restored:

#!/bin/bash
# verify-backup.sh

BACKUP_TAG="production-20240115-140530"
TEST_NAMESPACE="backup-verification"

# Create temporary cluster
vcluster create backup-test --namespace ${TEST_NAMESPACE} --create-namespace

# Attempt restore
if vcluster restore backup-test \
  oci://ghcr.io/my-org/backups:${BACKUP_TAG} \
  --namespace ${TEST_NAMESPACE}; then
  echo "✓ Backup ${BACKUP_TAG} verified successfully"
  
  # Optional: Run smoke tests
  vcluster connect backup-test --namespace ${TEST_NAMESPACE}
  # Run your test suite here
  # Return to the host cluster context before cleaning up
  vcluster disconnect
  
  # Cleanup
  vcluster delete backup-test --namespace ${TEST_NAMESPACE}
  kubectl delete namespace ${TEST_NAMESPACE}
else
  echo "✗ Backup ${BACKUP_TAG} verification FAILED"
  exit 1
fi

Backup Monitoring Dashboard

Define Prometheus alerting rules for backup failures (the Job metrics below are exposed by kube-state-metrics):

apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-alerts
  namespace: monitoring
data:
  alerts.yaml: |
    groups:
    - name: vcluster-backup
      rules:
      - alert: BackupJobFailed
        expr: kube_job_status_failed{job_name=~"vcluster-.*-backup.*"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "vCluster backup job failed"
          description: "Backup job {{ $labels.job_name }} has failed"
      
      - alert: BackupNotRunning
        # max() selects the newest completion so older retained Jobs do not fire the alert
        expr: time() - max(kube_job_status_completion_time{job_name=~"vcluster-.*-backup.*"}) > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "vCluster backup has not run in 24 hours"
          description: "No successful backup for more than 24 hours"
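The 24-hour staleness rule behind the BackupNotRunning alert can also be applied in a script, for example as a gate before a risky change. The following is a minimal sketch with a hypothetical function name, `backup_is_stale`. How you obtain the last completion time is left open.

```shell
#!/bin/sh
# Sketch: the staleness rule from the BackupNotRunning alert as a shell check.
# backup_is_stale is a hypothetical helper.
# $1: completion time of the last successful backup (Unix seconds)
# $2: current time (Unix seconds)
# $3: maximum allowed age in seconds (default 86400, i.e. 24 hours)
backup_is_stale() {
  local last="$1" now="$2" max_age="${3:-86400}"
  # Succeeds (exit 0) when the backup is older than max_age
  [ $((now - last)) -gt "$max_age" ]
}

# Example, with LAST_BACKUP_EPOCH supplied by your own tooling:
#   if backup_is_stale "$LAST_BACKUP_EPOCH" "$(date +%s)"; then
#     echo "last backup is older than 24 hours" >&2
#   fi
```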

Troubleshooting

Restore Fails with Namespace Conflicts

Error: Resources already exist in the virtual cluster.

Solutions:

Option A: Delete existing resources first:

vcluster connect my-vcluster --namespace production
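# Warning: this deletes workloads in every namespace of the virtual cluster
kubectl delete all --all --all-namespaces
vcluster disconnect

# Then restore
vcluster restore my-vcluster oci://ghcr.io/my-org/backups:backup-tag \
  --namespace production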

Option B: Restore to a new virtual cluster:

vcluster create my-vcluster-restored --namespace production-restored
vcluster restore my-vcluster-restored \
  oci://ghcr.io/my-org/backups:backup-tag \
  --namespace production-restored

Backup Storage Quota Exceeded

Problem: Cannot create new backups due to storage limits.

Solutions:

Clean up old backups:

# List backups by size
docker images ghcr.io/my-org/backups --format "table {{.Repository}}:{{.Tag}}\t{{.Size}}"

Implement retention policy (see above)
Use backup compression (enabled by default)
Consider tiered storage (frequent backups on fast storage, archives on cheaper storage)

Partial Restore Needed

Problem: Only specific resources need to be restored.

Solution:

Restore to temporary cluster:

vcluster create temp-restore --namespace temp
vcluster restore temp-restore oci://ghcr.io/my-org/backups:backup-tag \
  --namespace temp

Extract specific resources:

vcluster connect temp-restore --namespace temp
kubectl get deployment,service,configmap -n target-namespace -o yaml > resources.yaml
# Switch back to the host cluster context
vcluster disconnect

Apply to production:

vcluster connect production-vcluster --namespace production
kubectl apply -f resources.yaml

Clean up:

vcluster delete temp-restore --namespace temp

Best Practices

Test Restores Regularly

Schedule quarterly disaster recovery drills to verify backup integrity and team readiness.

Multiple Storage Locations

Store backups in multiple locations and regions to protect against regional failures.

Document Procedures

Maintain runbooks for common recovery scenarios with step-by-step instructions.

Automate Everything

Use CronJobs and operators to automate backup creation, rotation, and verification.

Monitor Backup Health

Set up alerts for backup failures and monitor storage capacity.

Secure Backup Data

Encrypt backups, use IAM roles, and restrict access with RBAC.

Next Steps

Upgrading

Learn safe upgrade procedures using backups

Monitoring

Monitor backup operations and virtual cluster health
