OpenCLIP supports remote training workflows that allow you to resume from remote checkpoints and automatically back up training progress to cloud storage. This is particularly useful for large-scale training on cloud infrastructure or when using shared storage systems.

Overview

Remote training features in OpenCLIP:
  1. Resume from remote paths: Load checkpoints directly from S3 or other remote storage
  2. Automatic backup: Continuously sync training checkpoints to remote storage
  3. fsspec support: Work with any filesystem supported by fsspec
  4. Checkpoint management: Automatically clean up old checkpoints to save space

Resuming from Remote Checkpoints

You can resume training directly from a remote checkpoint without downloading it first.

Resume from S3

Use the S3 URI directly in the --resume flag:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --resume s3://my-bucket/checkpoints/epoch_10.pt \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 32

Resume from Other Remote Storage

OpenCLIP uses fsspec to support various storage backends:
# Google Cloud Storage
--resume gs://my-bucket/checkpoints/epoch_10.pt

# Azure Blob Storage
--resume az://my-container/checkpoints/epoch_10.pt

# HTTP/HTTPS
--resume https://my-server.com/checkpoints/epoch_10.pt
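Whether a --resume path points at local disk or remote storage can be inferred from its URI scheme. A minimal stdlib-only sketch of that dispatch (illustrative only; `resume_backend` is a hypothetical helper, not OpenCLIP's actual implementation):

```python
from urllib.parse import urlparse

# Schemes this sketch treats as remote; fsspec supports many more.
REMOTE_SCHEMES = {"s3", "gs", "az", "http", "https"}

def resume_backend(path: str) -> str:
    """Return the remote scheme for a --resume path, or 'local' for a filesystem path."""
    scheme = urlparse(path).scheme
    return scheme if scheme in REMOTE_SCHEMES else "local"

print(resume_backend("s3://my-bucket/checkpoints/epoch_10.pt"))  # s3
print(resume_backend("/data/checkpoints/epoch_10.pt"))           # local
```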

Complete Resume Example

python -m open_clip_train.main \
    --train-data "/data/laion400m/train-{0000..4000}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 8 \
    --model ViT-L-14 \
    --resume s3://my-training-bucket/runs/vitl14-run1/checkpoints/epoch_15.pt \
    --epochs 32 \
    --lr 1e-3 \
    --warmup 2000

Automatic Remote Backup

Continuously back up training checkpoints to remote storage during training. This prevents data loss and enables easy resume from any point.

Basic Remote Sync Setup

Use --remote-sync to specify the remote destination:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/training \
    --remote-sync s3://my-bucket/training-runs \
    --name my-experiment
This will:
  1. Save checkpoints locally to /scratch/training/my-experiment/
  2. Sync to s3://my-bucket/training-runs/my-experiment/
  3. Run sync in background every 5 minutes (default)
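The directory layout implied by --logs, --remote-sync, and --name can be sketched as a small helper (`sync_paths` is a hypothetical name for illustration; the training script derives these paths internally):

```python
import posixpath

def sync_paths(logs_dir: str, remote_sync: str, name: str):
    """Map --logs/--name to the local run directory and its remote mirror under --remote-sync."""
    local = posixpath.join(logs_dir, name)
    remote = posixpath.join(remote_sync, name)
    return local, remote

local, remote = sync_paths("/scratch/training", "s3://my-bucket/training-runs", "my-experiment")
print(local)   # /scratch/training/my-experiment
print(remote)  # s3://my-bucket/training-runs/my-experiment
```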

Remote Sync Parameters

--remote-sync

Specify the remote path for backup:
--remote-sync s3://my-bucket/training-runs
Supported formats:
  • S3: s3://bucket-name/path (credentials resolved via the standard AWS credential chain)
  • Other fsspec backends: gs://, az://, etc.

--remote-sync-frequency

How often to sync (in seconds):
--remote-sync-frequency 300  # Sync every 5 minutes (default)
--remote-sync-frequency 600  # Sync every 10 minutes
--remote-sync-frequency 1800 # Sync every 30 minutes
Recommendations:
  • Fast storage: 300 seconds (5 minutes)
  • Slow storage: 900-1800 seconds (15-30 minutes)
  • Large checkpoints: 600+ seconds
  • Small checkpoints: 300 seconds

--remote-sync-protocol

Specify the sync protocol:
--remote-sync-protocol s3      # Use S3 protocol (default, recommended)
--remote-sync-protocol fsspec  # Use fsspec (experimental, slow)
Note: The fsspec protocol is currently experimental and very slow. Use s3 for production workloads.

Complete Remote Training Examples

Example 1: S3 Training with Backup

Train with local SSD and automatic S3 backup:
python -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data "/data/cc12m/train-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 8 \
    --model ViT-B-32 \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --logs /scratch/openclip \
    --remote-sync s3://my-training-bucket/experiments \
    --remote-sync-frequency 300 \
    --name vitb32-cc12m \
    --imagenet-val /data/imagenet/validation/
The sync process:
  1. Local checkpoints: /scratch/openclip/vitb32-cc12m/checkpoints/
  2. Remote backup: s3://my-training-bucket/experiments/vitb32-cc12m/checkpoints/
  3. Syncs every 5 minutes
  4. Final sync when training completes

Example 2: Multi-GPU with Remote Sync

torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/laion400m/train-{0000..4000}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --batch-size 256 \
    --precision amp \
    --workers 8 \
    --model ViT-L-14 \
    --grad-checkpointing \
    --local-loss \
    --gather-with-grad \
    --warmup 2000 \
    --lr 1e-3 \
    --epochs 32 \
    --logs /scratch/training \
    --remote-sync s3://my-bucket/large-runs \
    --remote-sync-frequency 600 \
    --delete-previous-checkpoint \
    --name vitl14-laion400m

Example 3: Resume from S3 and Continue Syncing

python -m open_clip_train.main \
    --model ViT-B-32 \
    --resume s3://my-bucket/experiments/run1/checkpoints/epoch_10.pt \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/training \
    --remote-sync s3://my-bucket/experiments \
    --name run1

Checkpoint Management

Delete Previous Checkpoints

Save disk space by automatically deleting old checkpoints:
--delete-previous-checkpoint
This will:
  • Keep only the most recent checkpoint locally
  • Delete the previous checkpoint after saving a new one
  • Leave remote backups untouched (sync uploads files but never deletes them remotely)
  • Useful when local storage is limited
Example:
python -m open_clip_train.main \
    --model ViT-L-14 \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/training \
    --remote-sync s3://my-bucket/runs \
    --delete-previous-checkpoint \
    --name vitl14-experiment
Result:
  • Local: Only epoch_latest.pt is kept
  • Remote: All epochs synced to S3 (epoch_1.pt, epoch_2.pt, etc.)
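The local-cleanup behavior can be sketched as a toy checkpoint step (`save_and_prune` is a hypothetical stand-in for the training loop; plain bytes replace torch.save):

```python
import os
import tempfile

def save_and_prune(ckpt_dir: str, epoch: int, delete_previous: bool = True) -> str:
    """Write epoch_<n>.pt and, if delete_previous, remove the prior epoch's file."""
    path = os.path.join(ckpt_dir, f"epoch_{epoch}.pt")
    with open(path, "wb") as f:
        f.write(b"checkpoint-bytes")  # stand-in for torch.save(state, path)
    if delete_previous:
        prev = os.path.join(ckpt_dir, f"epoch_{epoch - 1}.pt")
        if os.path.exists(prev):
            os.remove(prev)
    return path

with tempfile.TemporaryDirectory() as d:
    for epoch in (1, 2, 3):
        save_and_prune(d, epoch)
    print(sorted(os.listdir(d)))  # ['epoch_3.pt']
```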

Resume Latest from Remote

When using --resume latest with remote sync:
python -m open_clip_train.main \
    --model ViT-B-32 \
    --resume latest \
    --train-data "/data/train.tar" \
    --dataset-type webdataset \
    --batch-size 256 \
    --epochs 32 \
    --logs /scratch/training \
    --remote-sync s3://my-bucket/runs \
    --remote-sync-protocol s3 \
    --name my-experiment
Important limitations:
  • Only works with --remote-sync-protocol s3
  • Does not work with --save-most-recent
  • Checks remote storage for latest checkpoint
  • May not find checkpoint if sync is still in progress
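Selecting the latest checkpoint from a remote listing boils down to picking the highest epoch number among epoch_<n>.pt keys. A minimal sketch of that selection (`latest_checkpoint` is a hypothetical helper operating on an already-fetched key list, not OpenCLIP's actual lookup code):

```python
import re

def latest_checkpoint(keys):
    """Return the key with the highest epoch_<n>.pt suffix, or None if none match."""
    epoch_re = re.compile(r"epoch_(\d+)\.pt$")
    numbered = [(int(m.group(1)), k) for k in keys if (m := epoch_re.search(k))]
    return max(numbered)[1] if numbered else None

keys = [
    "runs/my-experiment/checkpoints/epoch_2.pt",
    "runs/my-experiment/checkpoints/epoch_10.pt",
    "runs/my-experiment/out.log",
]
print(latest_checkpoint(keys))  # runs/my-experiment/checkpoints/epoch_10.pt
```

Note the numeric parse: a plain string sort would rank epoch_2.pt after epoch_10.pt.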

AWS S3 Configuration

AWS Credentials

Ensure AWS credentials are configured:
# Option 1: AWS CLI configuration
aws configure

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2

# Option 3: IAM role (on EC2)
# Credentials automatically available

S3 Bucket Setup

# Create bucket
aws s3 mb s3://my-training-bucket

# Set lifecycle policy to manage old checkpoints
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-training-bucket \
    --lifecycle-configuration file://lifecycle.json
lifecycle.json:
{
  "Rules": [
    {
      "Id": "DeleteOldCheckpoints",
      "Status": "Enabled",
      "Prefix": "experiments/",
      "Expiration": {
        "Days": 90
      }
    }
  ]
}

Workflow Patterns

Pattern 1: Fast Local Storage + S3 Backup

Best for: Training on cloud instances with local SSDs
# Use local SSD for speed
--logs /scratch/training

# Back up to S3 for durability
--remote-sync s3://my-bucket/runs

# Clean up local storage
--delete-previous-checkpoint

Pattern 2: Resume After Interruption

Best for: Spot instances, preemptible VMs
# Initial training
python -m open_clip_train.main \
    --model ViT-B-32 \
    --logs /scratch/training \
    --remote-sync s3://my-bucket/runs \
    --name experiment-1 \
    ...

# After interruption, resume
python -m open_clip_train.main \
    --model ViT-B-32 \
    --resume s3://my-bucket/runs/experiment-1/checkpoints/epoch_latest.pt \
    --logs /scratch/training \
    --remote-sync s3://my-bucket/runs \
    --name experiment-1 \
    ...

Pattern 3: Centralized Checkpoint Storage

Best for: Team collaboration, multiple training nodes
# All nodes sync to same bucket
--remote-sync s3://team-bucket/shared-experiments

# Each experiment has unique name
--name ${USER}-${MODEL}-${DATE}

# Team members can resume from any checkpoint
--resume s3://team-bucket/shared-experiments/alice-vitb32-2024/checkpoints/epoch_10.pt

Sync Process Details

How Sync Works

  1. Training starts, background sync process launched
  2. Every --remote-sync-frequency seconds:
    • Sync all files from local logs directory to remote
    • Only uploads changed/new files
    • Sync happens in background, doesn’t block training
  3. When training completes:
    • Final sync ensures all checkpoints uploaded
    • Process waits for final sync to complete
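The loop above can be sketched with a daemon thread (a simplified stand-in: the real script launches a separate sync process that invokes aws s3 sync or fsspec, and `start_background_sync` is a hypothetical name):

```python
import threading
import time

def start_background_sync(sync_fn, frequency_s: float, stop_event: threading.Event) -> threading.Thread:
    """Call sync_fn every frequency_s seconds until stop_event is set, then once more
    as the final sync. sync_fn stands in for 'aws s3 sync <local> <remote>'."""
    def loop():
        while not stop_event.wait(frequency_s):
            sync_fn()            # periodic sync; does not block training
        sync_fn()                # final sync after training signals completion
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

calls = []
stop = threading.Event()
t = start_background_sync(lambda: calls.append(1), 0.01, stop)
time.sleep(0.05)
stop.set()
t.join(timeout=1)
print(len(calls) >= 1)  # True: periodic syncs plus the guaranteed final sync
```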

What Gets Synced

All files in the local logs directory:
  • Checkpoints (epoch_*.pt)
  • TensorBoard logs
  • Training logs
  • Configuration files
  • Any other files in the directory

Sync Command (S3)

Under the hood, the s3 protocol shells out to the AWS CLI:
aws s3 sync /local/path s3://bucket/remote/path --exact-timestamps

Performance Considerations

Sync Frequency

Too frequent:
  • Wastes bandwidth
  • May impact training performance
  • Unnecessary for large checkpoints
Too infrequent:
  • Risk losing more progress on failure
  • Longer wait for final sync
Recommended:
  • Small models: 300-600 seconds
  • Large models: 600-1800 seconds
  • Fast networks: 300 seconds
  • Slow networks: 900+ seconds
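A rough way to sanity-check a chosen frequency: estimate the per-sync upload time from checkpoint size and effective throughput, and keep it well below the sync interval. The numbers below are illustrative assumptions, not measurements:

```python
def upload_seconds(checkpoint_gb: float, bandwidth_gb_per_s: float) -> float:
    """Rough per-sync upload time: checkpoint size divided by effective throughput."""
    return checkpoint_gb / bandwidth_gb_per_s

# Assumed: a ~5 GB checkpoint over ~0.1 GB/s effective throughput
t = upload_seconds(5.0, 0.1)
print(round(t))  # 50 -> a 600 s interval leaves plenty of headroom
```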

Network Impact

  • Sync runs in background process
  • Minimal impact on training throughput
  • May affect data loading if sharing bandwidth
  • Use local data loading when possible

Troubleshooting

Sync Failing

Check AWS credentials:
aws s3 ls s3://my-bucket/
Check bucket permissions:
aws s3api get-bucket-acl --bucket my-bucket
Test manual sync:
aws s3 sync /local/path s3://my-bucket/test-path

Resume from S3 Failing

Verify checkpoint exists:
aws s3 ls s3://my-bucket/path/to/checkpoint.pt
Check file size:
aws s3 ls s3://my-bucket/path/to/checkpoint.pt --human-readable
Test download manually:
aws s3 cp s3://my-bucket/path/to/checkpoint.pt /tmp/test.pt

Slow Sync

  • Use --remote-sync-protocol s3 (not fsspec)
  • Increase --remote-sync-frequency
  • Check network bandwidth
  • Consider using S3 Transfer Acceleration
  • Reduce checkpoint size if possible

“Resume latest” Not Finding Checkpoint

  • Ensure sync completed before trying to resume
  • Check remote path matches expectations
  • Use explicit checkpoint path instead of “latest”
  • Verify --remote-sync-protocol s3 is set

Best Practices

  1. Always use remote sync for long training runs
    • Prevents data loss from hardware failures
    • Enables easy resume from any point
  2. Use local fast storage with remote backup
    • Local SSD for training speed
    • S3 for durability and sharing
  3. Delete old local checkpoints
    • Use --delete-previous-checkpoint
    • Keep all checkpoints in remote storage
  4. Set appropriate sync frequency
    • Balance between safety and performance
    • Consider checkpoint size and network speed
  5. Test sync before long runs
    • Verify credentials and permissions
    • Test manual sync first
    • Monitor first few syncs
  6. Use unique experiment names
    • Prevents conflicts in shared storage
    • Makes checkpoints easy to find
    • Include timestamp or identifier
  7. Monitor sync process
    • Check logs for sync errors
    • Verify files appearing in remote storage
    • Test resume before needing it
