## Overview
Instead of updating model weights after every batch, gradient accumulation:

- Computes gradients for multiple small batches
- Accumulates (sums) these gradients
- Updates the model weights once after processing all accumulated batches

The effective global batch size becomes `batch_size × accum_freq × num_gpus`.
## Basic Usage

Use the `--accum-freq` flag to specify how many batches to accumulate. For example:
- Per-GPU batch size: 128
- Accumulation frequency: 4
- Effective batch size per GPU: 128 × 4 = 512
- With 8 GPUs: Total effective batch size = 512 × 8 = 4,096
## How It Works

Gradient accumulation modifies the training loop. Without gradient accumulation (`accum-freq = 1`), the optimizer steps after every batch. With gradient accumulation (`accum-freq = 4`), gradients from four consecutive batches are summed before a single optimizer step.
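A minimal, framework-free sketch of the two variants. The scalar model, the `grad` helper, and the numbers are illustrative stand-ins for a real model and `loss.backward()`, not the library's actual code:

```python
def grad(w, batch):
    """Mean gradient of the per-sample loss (w - x)^2 over a batch: 2 * mean(w - x)."""
    return 2.0 * sum(w - x for x in batch) / len(batch)

def train_no_accum(w, batches, lr=0.1):
    # accum-freq = 1: update the weights after every batch
    for batch in batches:
        w -= lr * grad(w, batch)
    return w

def train_with_accum(w, batches, accum_freq=4, lr=0.1):
    # accum-freq = 4: sum gradients over accum_freq batches, then update once
    acc = 0.0
    for i, batch in enumerate(batches, start=1):
        acc += grad(w, batch) / accum_freq  # normalize so the sum acts as a mean
        if i % accum_freq == 0:
            w -= lr * acc
            acc = 0.0
    return w

batches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w_accum = train_with_accum(0.0, batches)           # one update from 4 small batches
w_big = train_no_accum(0.0, [sum(batches, [])])    # one update on the merged batch
```

With the `1 / accum_freq` normalization, one accumulated update matches one update on the concatenated batch, which is the point of the technique.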
## Effective Batch Size Calculation

The effective batch size is `batch_size × accum_freq × num_gpus`:

| Per-GPU Batch | Accum Freq | GPUs | Effective Batch Size |
|---|---|---|---|
| 128 | 1 | 8 | 1,024 |
| 128 | 2 | 8 | 2,048 |
| 128 | 4 | 8 | 4,096 |
| 64 | 8 | 8 | 4,096 |
| 256 | 1 | 4 | 1,024 |
| 256 | 4 | 4 | 4,096 |
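The table rows can be reproduced with a one-line helper (the function name is illustrative):

```python
def effective_batch_size(per_gpu_batch: int, accum_freq: int, num_gpus: int) -> int:
    """Effective batch size = batch_size × accum_freq × num_gpus."""
    return per_gpu_batch * accum_freq * num_gpus

# Rows from the table above:
assert effective_batch_size(128, 4, 8) == 4096
assert effective_batch_size(64, 8, 8) == 4096
assert effective_batch_size(256, 1, 4) == 1024
```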
## Memory vs Speed Tradeoffs

### Memory Considerations

Advantages:

- Reduces per-step memory usage for model activations
- Enables training larger models on limited hardware
- Allows simulation of large batch sizes

Costs:

- Features from all accumulated batches are stored in memory
- Additional memory needed for intermediate loss computations
- Each batch’s features are cached until the update step
### Speed Considerations

Impact on training speed:

- ~2× forward passes per example (one with gradients, one without)
- Time per update step increases proportionally with `accum_freq`
- Overall throughput (samples/second) stays approximately constant
## When to Use Gradient Accumulation

Use Gradient Accumulation When:

1. **GPU Memory is Limited**
   - Cannot fit desired batch size in memory
   - Training large models (ViT-L, ViT-H, ViT-g)
   - Using high-resolution images

2. **Constrained GPU Resources**
   - Limited number of GPUs available
   - Need to match batch sizes from papers
   - Simulating larger-scale training

3. **After Trying Other Techniques**
   - Already using `--grad-checkpointing`
   - Already using `--local-loss` and `--gather-with-grad`
   - Already optimized per-GPU batch size

Avoid When:

- **Memory is Sufficient**: If you can fit larger batches, do so directly
- **Using Distillation**: Distillation requires `--accum-freq 1`
- **Training is Already Slow**: Gradient accumulation adds overhead
## Recommended Workflow

Follow this sequence to optimize batch size: first maximize the per-GPU `batch_size` that fits in memory, then enable `--grad-checkpointing` if needed, and only then increase `--accum-freq` to reach the target effective batch size.

## Examples
### Single GPU Training
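A hedged sketch of a single-GPU run; the `open_clip_train.main` entry point, model name, and data path are assumptions, so adapt them to your setup:

```shell
# 128 per GPU × accum-freq 8 → effective batch size of 1,024 on one GPU
python -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-B-32 \
    --batch-size 128 \
    --accum-freq 8 \
    --precision amp
```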
This simulates a large effective batch size on a single GPU.

### Multi-GPU Training
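A sketch of a distributed launch; again, the entry point and paths are assumptions rather than prescriptions:

```shell
# 128 per GPU × accum-freq 4 × 8 GPUs → effective batch size of 4,096
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-B-32 \
    --batch-size 128 \
    --accum-freq 4 \
    --local-loss \
    --gather-with-grad
```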
This scales to very large effective batch sizes across GPUs.

### Large Model Training
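A sketch pairing accumulation with gradient checkpointing for a large model (entry point, model choice, and paths are illustrative assumptions):

```shell
# A smaller per-GPU batch plus checkpointing fits the model; accumulation
# restores the effective batch size: 64 × 8 × 8 GPUs → 4,096
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-L-14 \
    --batch-size 64 \
    --accum-freq 8 \
    --grad-checkpointing \
    --precision amp
```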
This trains a huge model by trading per-GPU batch size for accumulation steps.

### High Resolution Images
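A sketch for higher-resolution inputs; the `--force-image-size` flag and other specifics here are assumptions to adapt:

```shell
# Larger images use more activation memory, so shrink the batch and accumulate
python -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-B-16 \
    --force-image-size 336 \
    --batch-size 32 \
    --accum-freq 8 \
    --grad-checkpointing
```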
Larger image sizes increase activation memory; a smaller per-GPU batch with more accumulation steps compensates.

### Learning Rate Adjustment
When using gradient accumulation, the effective batch size changes and the number of optimizer steps per epoch changes with it, exactly as it would with a genuinely larger batch. Generally, no learning rate adjustment is needed when only changing `--accum-freq`.

However, if you’re matching a specific training recipe that used a different batch size, scale the learning rate with the effective batch size (the linear scaling rule).
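A sketch of the linear scaling rule; the function name and the recipe numbers are placeholders, and the rule itself is an assumption about the recipe you are matching:

```python
def scaled_lr(base_lr: float, base_batch: int, per_gpu_batch: int,
              accum_freq: int, num_gpus: int) -> float:
    """Scale a recipe's learning rate linearly with effective batch size."""
    effective = per_gpu_batch * accum_freq * num_gpus
    return base_lr * effective / base_batch

# Recipe used lr=5e-4 at batch 1,024; we train at 128 × 4 × 8 = 4,096,
# so the scaled learning rate is 4× larger.
lr = scaled_lr(5e-4, 1024, 128, 4, 8)
```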
## Implementation Details

### Forward Passes

With gradient accumulation, there are two forward passes per sample:

- First pass (with gradients): computes loss and gradients
- Second pass (with `torch.no_grad()`): computes features for the contrastive loss
### Loss Computation

The loss is computed `accum_freq` times before each weight update:

- Each accumulated batch computes its own loss
- Gradients are accumulated across all batches
- Final gradient is the sum (effectively the mean, due to normalization)
### Memory Usage

Memory is used for:

- Model weights and optimizer states
- Gradients (accumulated across batches)
- Features from all `accum_freq` batches
- Current batch activations
## Monitoring Training

Key metrics to watch when using gradient accumulation include samples per second, time per optimizer step, and peak GPU memory usage.

## Compatibility
Works With:

- Mixed precision training (`--precision amp`)
- Gradient checkpointing (`--grad-checkpointing`)
- Local loss (`--local-loss`)
- Gather with gradients (`--gather-with-grad`)
- Distributed training (multi-GPU)
- All model architectures

Does Not Work With:

- Model distillation (`--distill-model`), which requires `--accum-freq 1`
## Best Practices

- **Start Small**: Test with `--accum-freq 2` before using larger values
- **Power of 2**: Use powers of 2 for `accum_freq` (2, 4, 8) for better memory alignment
- **Balance**: Find the sweet spot between `batch_size` and `accum_freq`
- **Memory First**: Maximize `batch_size` before increasing `accum_freq`
- **Monitor**: Watch memory usage and training speed to find optimal settings
- **Document**: Record your effective batch size for reproducibility
## Troubleshooting

### Still Running Out of Memory

Lower the per-GPU `batch_size` and raise `--accum-freq` to keep the same effective batch size, and enable `--grad-checkpointing` if you have not already.

### Training is Too Slow

Reduce `--accum-freq`; accumulation adds overhead, so use the smallest value that reaches your target effective batch size.

### Unstable Training

Check that your learning rate matches the effective batch size you are actually training with, especially after changing `batch_size`, `accum_freq`, or the GPU count.
## References

For more information on gradient accumulation for contrastive learning:

- Don’t Use Large Mini-Batches, Use Local SGD (Lin et al.)
- Gradient Accumulation for Large-Scale Training (Pham et al.)
