Overview
Backup and restore operations ensure business continuity and data protection for your virtual clusters. vCluster provides snapshot-based backup and restore capabilities that capture the complete state of your virtual environment.
Understanding Backup vs Snapshot
While closely related, these terms have distinct meanings:
Snapshot: A point-in-time capture of cluster state, typically short-term and used for quick recovery or cloning.
Backup: A snapshot intended for long-term retention, disaster recovery, and compliance purposes.
In practice, both use the same snapshot mechanism; the difference lies in retention policy and purpose.
Creating Backups
Full Backup (Resources Only)
Create a backup of all Kubernetes resources:
vcluster snapshot create my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-full-$(date +%Y%m%d-%H%M%S) \
  --namespace production
This captures:
All Kubernetes resources (Deployments, StatefulSets, Services, etc.)
ConfigMaps and Secrets
Custom Resource Definitions and custom resources
RBAC policies and service accounts
Network policies
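Because the commands on this page embed a timestamp in the snapshot tag, it can help to build the reference in one place. The sketch below is an illustration only: `snapshot_ref` is a hypothetical helper (not part of the vCluster CLI), and the default repository simply mirrors the example registry used here. It also checks the tag against the OCI tag grammar (at most 128 characters drawn from letters, digits, `_`, `.`, and `-`, not starting with `.` or `-`).

```shell
#!/bin/sh
# Hypothetical helper (not part of the vCluster CLI): build a snapshot
# reference of the form <repo>:<name>-<kind>-<timestamp>.

BACKUP_REPO="${BACKUP_REPO:-oci://ghcr.io/my-org/backups}"

snapshot_ref() {
  name="$1"
  kind="$2"
  # Optional third argument pins the timestamp (useful for testing).
  ts="${3:-$(date +%Y%m%d-%H%M%S)}"
  tag="${name}-${kind}-${ts}"

  # OCI tags: first char is a letter, digit, or "_"; up to 127 more
  # characters that may also include "." and "-".
  if ! printf '%s' "$tag" | grep -Eq '^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$'; then
    echo "invalid tag: ${tag}" >&2
    return 1
  fi
  printf '%s:%s\n' "$BACKUP_REPO" "$tag"
}
```

A call such as `vcluster snapshot create my-vcluster "$(snapshot_ref my-vcluster full)" --namespace production` would then produce the same kind of reference as the command above.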
Full Backup (Including Volumes)
Create a complete backup including persistent volume data:
vcluster snapshot create my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-complete-$(date +%Y%m%d-%H%M%S) \
  --namespace production \
  --include-volumes
Volume snapshots require CSI driver support and VolumeSnapshot CRDs installed in the host cluster.
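One way to check the CRD prerequisite before requesting a volume snapshot is sketched below. It assumes `kubectl` is pointed at the host cluster (not the virtual cluster), and the function name `check_volume_snapshot_crds` is made up for this example. It only checks that the three standard VolumeSnapshot CRDs exist; it does not verify that your CSI driver actually supports snapshots.

```shell
#!/bin/sh
# Sketch: confirm the VolumeSnapshot CRDs exist in the host cluster before
# running a snapshot with --include-volumes.

check_volume_snapshot_crds() {
  missing=0
  for crd in \
    volumesnapshots.snapshot.storage.k8s.io \
    volumesnapshotclasses.snapshot.storage.k8s.io \
    volumesnapshotcontents.snapshot.storage.k8s.io
  do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "found:   $crd"
    else
      echo "missing: $crd"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every CRD was found.
  [ "$missing" -eq 0 ]
}
```

For example, `check_volume_snapshot_crds || exit 1` at the top of a backup script stops the run early when a CRD is absent.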
Restoring from Backup
Basic Restore
Restore a virtual cluster from a backup:
vcluster restore my-vcluster \
oci://ghcr.io/my-org/backups:my-vcluster-full-20240115-140530 \
--namespace production
Prepare Virtual Cluster
Ensure the target virtual cluster exists and is running:
vcluster list --namespace production
If it doesn't exist, create it:
vcluster create my-vcluster --namespace production
Execute Restore
Run the restore command:
vcluster restore my-vcluster \
  oci://ghcr.io/my-org/backups:my-vcluster-full-20240115-140530 \
  --namespace production
Verify Restoration
Connect to the virtual cluster and verify resources:
vcluster connect my-vcluster --namespace production
kubectl get all --all-namespaces
Restore with Volumes
Restore including persistent volume data:
vcluster restore my-vcluster \
oci://ghcr.io/my-org/backups:my-vcluster-complete-20240115-140530 \
--namespace production \
--restore-volumes
Important considerations:
Existing data in PVCs will be overwritten
Volume restoration can take significant time for large volumes
Storage class must support volume snapshots
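Since a volume restore overwrites existing PVC data, a script that automates it may benefit from an explicit acknowledgement step. The guard below is a minimal sketch: both the function `confirm_volume_restore` and the `CONFIRM_OVERWRITE` variable are conventions invented for this example, not vCluster features.

```shell
#!/bin/sh
# Sketch: refuse to continue unless the operator has named the target
# cluster in CONFIRM_OVERWRITE (a hypothetical convention).

confirm_volume_restore() {
  target="$1"
  if [ "${CONFIRM_OVERWRITE:-}" != "$target" ]; then
    echo "Refusing to restore volumes into '$target'." >&2
    echo "Re-run with CONFIRM_OVERWRITE=$target to acknowledge data loss." >&2
    return 1
  fi
}
```

A wrapper script could call `confirm_volume_restore my-vcluster || exit 1` immediately before the `vcluster restore ... --restore-volumes` command.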
Backup Strategies
Strategy 1: Scheduled Backups
Implement automated daily backups using CronJobs:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vcluster-daily-backup
  namespace: backup-system
spec:
  # Run daily at 2 AM
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vcluster-backup-sa
          containers:
            - name: backup
              image: loftsh/vcluster-cli:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Compute the timestamp in the shell: Kubernetes does not
                  # run command substitution inside env values.
                  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                  echo "Starting backup at $(date)"
                  # Backup production vcluster
                  vcluster snapshot create production-vcluster \
                    oci://ghcr.io/my-org/backups:production-${TIMESTAMP} \
                    --namespace production
                  # Backup staging vcluster
                  vcluster snapshot create staging-vcluster \
                    oci://ghcr.io/my-org/backups:staging-${TIMESTAMP} \
                    --namespace staging
                  echo "Backup completed at $(date)"
          restartPolicy: OnFailure
Strategy 2: Pre-Change Backups
Create backups before significant changes:
#!/bin/bash
# pre-deployment-backup.sh
VCLUSTER_NAME="production-vcluster"
NAMESPACE="production"
BACKUP_REPO="oci://ghcr.io/my-org/backups"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
echo "Creating pre-deployment backup..."
vcluster snapshot create "${VCLUSTER_NAME}" \
  "${BACKUP_REPO}:${VCLUSTER_NAME}-pre-deploy-${TIMESTAMP}" \
  --namespace "${NAMESPACE}"
if [ $? -eq 0 ]; then
  echo "Backup successful: ${VCLUSTER_NAME}-pre-deploy-${TIMESTAMP}"
  echo "Proceeding with deployment..."
  # Run deployment
  kubectl apply -f deployment.yaml
else
  echo "Backup failed! Aborting deployment."
  exit 1
fi
Strategy 3: 3-2-1 Backup Strategy
Implement the industry-standard 3-2-1 rule:
3 copies of your data
2 different storage types
1 offsite copy
#!/bin/bash
# 3-2-1-backup.sh
VCLUSTER_NAME="production-vcluster"
NAMESPACE="production"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# Copy 1: Primary storage (OCI registry)
vcluster snapshot create "${VCLUSTER_NAME}" \
  "oci://ghcr.io/my-org/backups:${VCLUSTER_NAME}-${TIMESTAMP}" \
  --namespace "${NAMESPACE}"
# Copy 2: Secondary storage (S3)
vcluster snapshot create "${VCLUSTER_NAME}" \
  "s3://my-backup-bucket/vclusters/${VCLUSTER_NAME}-${TIMESTAMP}.tar.gz" \
  --namespace "${NAMESPACE}"
# Copy 3: Offsite (different region or provider)
vcluster snapshot create "${VCLUSTER_NAME}" \
  "s3://my-dr-bucket/vclusters/${VCLUSTER_NAME}-${TIMESTAMP}.tar.gz" \
  --namespace "${NAMESPACE}"
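The script above runs its three copies one after another without checking any result, so a failed copy could go unnoticed. A possible variant, sketched below, loops over the destinations and reports how many copies were written. The function name `backup_all_copies` is hypothetical; the `vcluster snapshot create` invocation is the same one used throughout this page.

```shell
#!/bin/sh
# Sketch: write the same snapshot to several destinations and fail if any
# copy could not be created.

backup_all_copies() {
  name="$1"
  namespace="$2"
  shift 2
  ok=0
  total=0
  for dest in "$@"; do
    total=$((total + 1))
    if vcluster snapshot create "$name" "$dest" --namespace "$namespace"; then
      ok=$((ok + 1))
    else
      echo "copy failed: $dest" >&2
    fi
  done
  echo "${ok}/${total} copies written"
  # Non-zero exit status when at least one destination failed.
  [ "$ok" -eq "$total" ]
}
```

It would be called with the cluster name, the namespace, and then one destination per copy, for example the OCI reference and the two S3 paths from the script above.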
Recovery Scenarios
Scenario 1: Accidental Deletion
Problem: A critical deployment was accidentally deleted.
Solution: Targeted Restore
List recent backups:
# Use your registry's CLI or API to list backup tags.
# Note: docker images only shows images already present on the local machine.
docker images ghcr.io/my-org/backups
Find the most recent backup before the deletion
Restore to a temporary cluster:
vcluster create recovery-temp --namespace recovery
vcluster restore recovery-temp \
oci://ghcr.io/my-org/backups:production-20240115-140530 \
--namespace recovery
Extract just the needed resources:
vcluster connect recovery-temp --namespace recovery
kubectl get deployment critical-app -o yaml > critical-app.yaml
Restore to production:
vcluster connect production-vcluster --namespace production
kubectl apply -f critical-app.yaml
Clean up temporary cluster:
vcluster delete recovery-temp --namespace recovery
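Objects exported with `kubectl get -o yaml` carry server-populated fields such as `status`, `metadata.uid`, and `metadata.resourceVersion`, and applying them unchanged to another cluster can be rejected or produce confusing diffs. One way to strip them is sketched below. It assumes `jq` is installed (it is not used elsewhere on this page) and exports JSON rather than YAML; `strip_server_fields` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: remove cluster-specific metadata from a single exported object
# read on stdin. Requires jq (an assumption of this example).

strip_server_fields() {
  jq 'del(
        .status,
        .metadata.uid,
        .metadata.resourceVersion,
        .metadata.creationTimestamp,
        .metadata.generation,
        .metadata.managedFields
      )'
}
```

In the extraction step above this might look like `kubectl get deployment critical-app -o json | strip_server_fields > critical-app.json`, followed by `kubectl apply -f critical-app.json` against production.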
Scenario 2: Complete Cluster Loss
Problem: The entire host cluster or virtual cluster is lost.
Solution: Full Restore
Set up new host cluster (if needed)
Install vCluster:
helm repo add loft https://charts.loft.sh
helm repo update
Create new virtual cluster:
vcluster create production-vcluster \
--namespace production \
--create-namespace
Restore from most recent backup:
vcluster restore production-vcluster \
oci://ghcr.io/my-org/backups:production-latest \
--namespace production \
--restore-volumes
Verify all services are running:
vcluster connect production-vcluster --namespace production
kubectl get all --all-namespaces
kubectl get pvc --all-namespaces
Update DNS and external access if needed
Scenario 3: Rollback After Failed Update
Problem: A Kubernetes version upgrade or major change caused issues.
Solution: Version Rollback
Identify the pre-upgrade backup:
# Should have been created before upgrade
BACKUP_TAG="production-pre-upgrade-20240115"
Pause the problematic cluster:
vcluster pause production-vcluster --namespace production
Restore from pre-upgrade backup:
vcluster resume production-vcluster --namespace production
vcluster restore production-vcluster \
  "oci://ghcr.io/my-org/backups:${BACKUP_TAG}" \
  --namespace production
Verify the rollback:
vcluster connect production-vcluster --namespace production
kubectl version
kubectl get nodes
Scenario 4: Cross-Cluster Migration
Problem: You need to move a virtual cluster to different infrastructure.
Solution: Backup and Restore Migration
Create final backup on source cluster:
kubectl config use-context source-cluster
vcluster snapshot create my-vcluster \
  oci://ghcr.io/my-org/backups:migration-final-$(date +%Y%m%d) \
  --namespace production
Prepare target cluster:
kubectl config use-context target-cluster
kubectl create namespace production
Create virtual cluster on target:
vcluster create my-vcluster \
--namespace production \
--values vcluster-config.yaml
Restore data:
vcluster restore my-vcluster \
oci://ghcr.io/my-org/backups:migration-final-20240115 \
--namespace production \
--restore-volumes
Update external references (DNS, ingress, etc.)
Verify migration:
vcluster connect my-vcluster --namespace production
# Run your test suite
Decommission source cluster after validation period
Backup Retention and Cleanup
Manual Cleanup
Remove old backups manually:
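# For OCI registries
# Note: docker rmi removes only the local copy; delete the tag in the registry with your registry's CLI or API
docker rmi ghcr.io/my-org/backups:production-20231215-140530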
# For S3
aws s3 rm s3://my-backup-bucket/vclusters/production-20231215-140530.tar.gz
Automated Retention Policy
Implement retention policies in your backup script:
#!/bin/bash
# backup-with-retention.sh
VCLUSTER_NAME="production-vcluster"
NAMESPACE="production"
BACKUP_REPO="oci://ghcr.io/my-org/backups"
RETENTION_DAYS=30
# Create new backup
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
vcluster snapshot create "${VCLUSTER_NAME}" \
  "${BACKUP_REPO}:${VCLUSTER_NAME}-${TIMESTAMP}" \
  --namespace "${NAMESPACE}"
# Delete backups older than retention period (date -d requires GNU date)
CUTOFF_DATE=$(date -d "${RETENTION_DAYS} days ago" +%Y%m%d)
# List and filter old backups
# Note: docker images and docker rmi operate on the local image store only
for tag in $(docker images ghcr.io/my-org/backups --format "{{.Tag}}" | grep "^${VCLUSTER_NAME}-"); do
  # grep -P requires GNU grep
  BACKUP_DATE=$(echo "$tag" | grep -oP '\d{8}' | head -1)
  if [ "$BACKUP_DATE" -lt "$CUTOFF_DATE" ]; then
    echo "Removing old backup: $tag"
    docker rmi "ghcr.io/my-org/backups:$tag"
  fi
done
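The date comparison at the heart of the retention loop can be isolated into small functions that are easier to test and do not depend on GNU-only `grep -P`. The following is a sketch under that assumption; `tag_date` and `is_expired` are hypothetical names, and the cutoff is passed in as a `YYYYMMDD` string.

```shell
#!/bin/sh
# Sketch: decide whether a backup tag is older than a cutoff date.

# Print the first 8-digit run in a tag (the YYYYMMDD date), or nothing.
tag_date() {
  printf '%s\n' "$1" | grep -oE '[0-9]{8}' | head -n 1
}

# Succeed when the tag's date is strictly before the cutoff (YYYYMMDD).
is_expired() {
  d="$(tag_date "$1")"
  # Tags without a recognizable date are never treated as expired.
  [ -n "$d" ] && [ "$d" -lt "$2" ]
}
```

Inside the loop, the condition would become `if is_expired "$tag" "$CUTOFF_DATE"; then ...`, which also skips tags such as `production-latest` that carry no date.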
Monitoring and Verification
Verify Backup Integrity
Test backups regularly to ensure they can be restored:
#!/bin/bash
# verify-backup.sh
BACKUP_TAG="production-20240115-140530"
TEST_NAMESPACE="backup-verification"
# Create temporary cluster
vcluster create backup-test --namespace "${TEST_NAMESPACE}" --create-namespace
# Attempt restore
if vcluster restore backup-test \
  "oci://ghcr.io/my-org/backups:${BACKUP_TAG}" \
  --namespace "${TEST_NAMESPACE}"; then
  echo "✓ Backup ${BACKUP_TAG} verified successfully"
  # Optional: Run smoke tests
  vcluster connect backup-test --namespace "${TEST_NAMESPACE}"
  # Run your test suite here
  # Cleanup
  vcluster delete backup-test --namespace "${TEST_NAMESPACE}"
  kubectl delete namespace "${TEST_NAMESPACE}"
else
  echo "✗ Backup ${BACKUP_TAG} verification FAILED"
  exit 1
fi
Backup Monitoring Dashboard
Create alerts for backup failures:
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-alerts
  namespace: monitoring
data:
  alerts.yaml: |
    groups:
      - name: vcluster-backup
        rules:
          - alert: BackupJobFailed
            expr: kube_job_status_failed{job_name=~"vcluster-.*-backup.*"} > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "vCluster backup job failed"
              description: "Backup job {{ $labels.job_name }} has failed"
          - alert: BackupNotRunning
            expr: time() - kube_job_status_completion_time{job_name=~"vcluster-.*-backup.*"} > 86400
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "vCluster backup has not run in 24 hours"
              description: "No successful backup for more than 24 hours"
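The `BackupNotRunning` expression subtracts the last completion time from the current time and compares the result with 86400 seconds (24 hours). The same arithmetic can be reproduced in a script for ad-hoc checks, as in the sketch below. `backup_is_stale` is a hypothetical helper, and both timestamps are assumed to be Unix epoch seconds.

```shell
#!/bin/sh
# Sketch: mirror the alert expression "time() - completion_time > 86400".

backup_is_stale() {
  last_completion="$1"      # epoch seconds of the last completed backup
  now="$2"                  # epoch seconds of the current time
  max_age="${3:-86400}"     # threshold in seconds, defaulting to 24 hours
  [ $((now - last_completion)) -gt "$max_age" ]
}
```

For instance, `backup_is_stale "$LAST" "$(date +%s)" && echo "backup overdue"` prints a warning once the last completion time stored in `LAST` is more than a day old.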
Troubleshooting
Restore Fails with Namespace Conflicts
Error: Resources already exist in the virtual cluster.
Solutions:
Option A: Delete existing resources first:
vcluster connect my-vcluster --namespace production
kubectl delete all --all --all-namespaces
vcluster disconnect
# Then restore
vcluster restore my-vcluster oci://ghcr.io/my-org/backups:backup-tag --namespace production
Option B: Restore to a new virtual cluster:
vcluster create my-vcluster-restored --namespace production-restored
vcluster restore my-vcluster-restored \
oci://ghcr.io/my-org/backups:backup-tag \
--namespace production-restored
Backup Storage Quota Exceeded
Problem: Cannot create new backups due to storage limits.
Solutions:
Clean up old backups:
# List backups by size
docker images ghcr.io/my-org/backups --format "table {{.Repository}}:{{.Tag}}\t{{.Size}}"
Implement retention policy (see above)
Use backup compression (enabled by default)
Consider tiered storage (frequent backups on fast storage, archives on cheaper storage)
Problem: Only specific resources need to be restored.
Solution:
Restore to temporary cluster:
vcluster create temp-restore --namespace temp
vcluster restore temp-restore oci://ghcr.io/my-org/backups:backup-tag --namespace temp
Extract specific resources:
vcluster connect temp-restore --namespace temp
kubectl get deployment,service,configmap -n target-namespace -o yaml > resources.yaml
Apply to production:
vcluster connect production-vcluster --namespace production
kubectl apply -f resources.yaml
Clean up:
vcluster delete temp-restore --namespace temp
Best Practices
Test Restores Regularly: Schedule quarterly disaster recovery drills to verify backup integrity and team readiness.
Multiple Storage Locations: Store backups in multiple locations and regions to protect against regional failures.
Document Procedures: Maintain runbooks for common recovery scenarios with step-by-step instructions.
Automate Everything: Use CronJobs and operators to automate backup creation, rotation, and verification.
Monitor Backup Health: Set up alerts for backup failures and monitor storage capacity.
Secure Backup Data: Encrypt backups, use IAM roles, and restrict access with RBAC.
Next Steps
Upgrading: Learn safe upgrade procedures using backups.
Monitoring: Monitor backup operations and virtual cluster health.