## Overview
Instead of updating model weights after every batch, gradient accumulation:

- Computes gradients for multiple small batches
- Accumulates (sums) these gradients
- Updates the model weights once after processing all accumulated batches

The effective global batch size becomes `batch_size × accum_freq × num_gpus`.
## Basic Usage

Use the `--accum-freq` flag to specify how many batches to accumulate. For example:
- Per-GPU batch size: 128
- Accumulation frequency: 4
- Effective batch size per GPU: 128 × 4 = 512
- With 8 GPUs: Total effective batch size = 512 × 8 = 4,096
## How It Works

Gradient accumulation modifies the training loop. Without gradient accumulation (`accum-freq = 1`), the optimizer steps after every batch. With gradient accumulation (`accum-freq = 4`), gradients from four consecutive batches are summed before a single optimizer step.
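A minimal, framework-free sketch of the two variants. The scalar model, the `grad` helper, and the numbers are illustrative stand-ins for a real model and `loss.backward()`, not the library's actual code:

```python
def grad(w, batch):
    """Mean gradient of the per-sample loss (w - x)^2 over a batch: 2 * mean(w - x)."""
    return 2.0 * sum(w - x for x in batch) / len(batch)

def train_no_accum(w, batches, lr=0.1):
    # accum-freq = 1: update the weights after every batch
    for batch in batches:
        w -= lr * grad(w, batch)
    return w

def train_with_accum(w, batches, accum_freq=4, lr=0.1):
    # accum-freq = 4: sum gradients over accum_freq batches, then update once
    acc = 0.0
    for i, batch in enumerate(batches, start=1):
        acc += grad(w, batch) / accum_freq  # normalize so the sum acts as a mean
        if i % accum_freq == 0:
            w -= lr * acc
            acc = 0.0
    return w

batches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w_accum = train_with_accum(0.0, batches)           # one update from 4 small batches
w_big = train_no_accum(0.0, [sum(batches, [])])    # one update on the merged batch
```

With the `1 / accum_freq` normalization, one accumulated update matches one update on the concatenated batch, which is the point of the technique.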
## Effective Batch Size Calculation

The effective batch size is `batch_size × accum_freq × num_gpus`:

| Per-GPU Batch | Accum Freq | GPUs | Effective Batch Size |
|---|---|---|---|
| 128 | 1 | 8 | 1,024 |
| 128 | 2 | 8 | 2,048 |
| 128 | 4 | 8 | 4,096 |
| 64 | 8 | 8 | 4,096 |
| 256 | 1 | 4 | 1,024 |
| 256 | 4 | 4 | 4,096 |
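The table rows can be reproduced with a one-line helper (the function name is illustrative):

```python
def effective_batch_size(per_gpu_batch: int, accum_freq: int, num_gpus: int) -> int:
    """Effective batch size = batch_size × accum_freq × num_gpus."""
    return per_gpu_batch * accum_freq * num_gpus

# Rows from the table above:
assert effective_batch_size(128, 4, 8) == 4096
assert effective_batch_size(64, 8, 8) == 4096
assert effective_batch_size(256, 1, 4) == 1024
```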
## Memory vs Speed Tradeoffs

### Memory Considerations

Advantages:

- Reduces per-step memory usage for model activations
- Enables training larger models on limited hardware
- Allows simulation of large batch sizes

Costs:

- Features from all accumulated batches are stored in memory
- Additional memory needed for intermediate loss computations
- Each batch’s features are cached until the update step
### Speed Considerations

Impact on training speed:

- ~2× forward passes per example (one with gradients, one without)
- Time per update step increases proportionally with `accum_freq`
- Overall throughput (samples/second) stays approximately constant
## When to Use Gradient Accumulation

Use Gradient Accumulation When:

1. **GPU Memory is Limited**
   - Cannot fit desired batch size in memory
   - Training large models (ViT-L, ViT-H, ViT-g)
   - Using high-resolution images

2. **Constrained GPU Resources**
   - Limited number of GPUs available
   - Need to match batch sizes from papers
   - Simulating larger-scale training

3. **After Trying Other Techniques**
   - Already using `--grad-checkpointing`
   - Already using `--local-loss` and `--gather-with-grad`
   - Already optimized per-GPU batch size

Avoid When:

- **Memory is Sufficient**: If you can fit larger batches, do so directly
- **Using Distillation**: Distillation requires `--accum-freq 1`
- **Training is Already Slow**: Gradient accumulation adds overhead
## Recommended Workflow

Follow this sequence to optimize batch size: first maximize the per-GPU `batch_size` that fits in memory, then enable `--grad-checkpointing` if needed, and only then increase `--accum-freq` to reach the target effective batch size.

## Examples
### Single GPU Training
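A hedged sketch of a single-GPU run; the `open_clip_train.main` entry point, model name, and data path are assumptions, so adapt them to your setup:

```shell
# 128 per GPU × accum-freq 8 → effective batch size of 1,024 on one GPU
python -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-B-32 \
    --batch-size 128 \
    --accum-freq 8 \
    --precision amp
```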
This simulates a large effective batch size on a single GPU.

### Multi-GPU Training
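A sketch of a distributed launch; again, the entry point and paths are assumptions rather than prescriptions:

```shell
# 128 per GPU × accum-freq 4 × 8 GPUs → effective batch size of 4,096
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-B-32 \
    --batch-size 128 \
    --accum-freq 4 \
    --local-loss \
    --gather-with-grad
```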
This scales to very large effective batch sizes across GPUs.

### Large Model Training
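A sketch pairing accumulation with gradient checkpointing for a large model (entry point, model choice, and paths are illustrative assumptions):

```shell
# A smaller per-GPU batch plus checkpointing fits the model; accumulation
# restores the effective batch size: 64 × 8 × 8 GPUs → 4,096
torchrun --nproc_per_node 8 -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-L-14 \
    --batch-size 64 \
    --accum-freq 8 \
    --grad-checkpointing \
    --precision amp
```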
This trains a huge model by trading per-GPU batch size for accumulation steps.

### High Resolution Images
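A sketch for higher-resolution inputs; the `--force-image-size` flag and other specifics here are assumptions to adapt:

```shell
# Larger images use more activation memory, so shrink the batch and accumulate
python -m open_clip_train.main \
    --train-data "/data/train.tar" \
    --model ViT-B-16 \
    --force-image-size 336 \
    --batch-size 32 \
    --accum-freq 8 \
    --grad-checkpointing
```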
Larger image sizes increase activation memory; a smaller per-GPU batch with more accumulation steps compensates.

### Learning Rate Adjustment
When using gradient accumulation, the effective batch size changes and the number of optimizer steps per epoch changes with it, exactly as it would with a genuinely larger batch. Generally, no learning rate adjustment is needed when only changing `--accum-freq`.

However, if you’re matching a specific training recipe that used a different batch size, scale the learning rate with the effective batch size (the linear scaling rule).
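A sketch of the linear scaling rule; the function name and the recipe numbers are placeholders, and the rule itself is an assumption about the recipe you are matching:

```python
def scaled_lr(base_lr: float, base_batch: int, per_gpu_batch: int,
              accum_freq: int, num_gpus: int) -> float:
    """Scale a recipe's learning rate linearly with effective batch size."""
    effective = per_gpu_batch * accum_freq * num_gpus
    return base_lr * effective / base_batch

# Recipe used lr=5e-4 at batch 1,024; we train at 128 × 4 × 8 = 4,096,
# so the scaled learning rate is 4× larger.
lr = scaled_lr(5e-4, 1024, 128, 4, 8)
```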
## Implementation Details

### Forward Passes

With gradient accumulation, there are two forward passes per sample:

- First pass (with gradients): computes loss and gradients
- Second pass (with `torch.no_grad()`): computes features for the contrastive loss
### Loss Computation

The loss is computed `accum_freq` times before each weight update:

- Each accumulated batch computes its own loss
- Gradients are accumulated across all batches
- Final gradient is the sum (effectively the mean, due to normalization)
### Memory Usage

Memory is used for:

- Model weights and optimizer states
- Gradients (accumulated across batches)
- Features from all `accum_freq` batches
- Current batch activations
## Monitoring Training

Key metrics to watch when using gradient accumulation include samples per second, time per optimizer step, and peak GPU memory usage.

## Compatibility
Works With:

- Mixed precision training (`--precision amp`)
- Gradient checkpointing (`--grad-checkpointing`)
- Local loss (`--local-loss`)
- Gather with gradients (`--gather-with-grad`)
- Distributed training (multi-GPU)
- All model architectures

Does Not Work With:

- Model distillation (`--distill-model`), which requires `--accum-freq 1`
## Best Practices

- **Start Small**: Test with `--accum-freq 2` before using larger values
- **Power of 2**: Use powers of 2 for `accum_freq` (2, 4, 8) for better memory alignment
- **Balance**: Find the sweet spot between `batch_size` and `accum_freq`
- **Memory First**: Maximize `batch_size` before increasing `accum_freq`
- **Monitor**: Watch memory usage and training speed to find optimal settings
- **Document**: Record your effective batch size for reproducibility
## Troubleshooting

### Still Running Out of Memory

Lower the per-GPU `batch_size` and raise `--accum-freq` to keep the same effective batch size, and enable `--grad-checkpointing` if you have not already.

### Training is Too Slow

Reduce `--accum-freq`; accumulation adds overhead, so use the smallest value that reaches your target effective batch size.

### Unstable Training

Check that your learning rate matches the effective batch size you are actually training with, especially after changing `batch_size`, `accum_freq`, or the GPU count.
## References

For more information on gradient accumulation for contrastive learning:

- Don’t Use Large Mini-Batches, Use Local SGD (Lin et al.)
- Gradient Accumulation for Large-Scale Training (Pham et al.)
