Accelerate training and reduce memory usage with Automatic Mixed Precision (AMP)
Automatic Mixed Precision (AMP) allows you to train models faster and with less memory by automatically using lower precision (FP16/BF16) for operations that can tolerate it, while maintaining FP32 precision for operations that require it.
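To see this per-operation precision selection in action, here is a minimal sketch. The CPU/bfloat16 fallback is an assumption so the snippet runs without a GPU; the tiny `Linear` model is illustrative only:

```python
import torch

# Pick a device: FP16 on CUDA, BF16 on CPU (CPU autocast supports bfloat16)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

model = torch.nn.Linear(8, 4).to(device_type)
x = torch.randn(2, 8, device=device_type)

with torch.autocast(device_type=device_type, dtype=dtype):
    out = model(x)  # matmul-heavy op: runs in reduced precision

print(out.dtype)           # reduced precision (float16 or bfloat16)
print(model.weight.dtype)  # the parameters themselves stay float32
```

Note that autocast changes the dtype of eligible operations' outputs, not of the model's parameters, which remain in FP32.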
The GradScaler prevents gradient underflow by scaling the loss:
```python
from torch.amp import GradScaler, autocast

# Create the scaler once at the beginning of training
scaler = GradScaler(device='cuda')
```
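To see why scaling matters, note that many gradient values representable in FP32 underflow to zero in FP16 (whose smallest subnormal is about 6e-8). A quick illustration with an arbitrary small value:

```python
import torch

# A small FP32 value underflows to zero when cast to FP16...
small = torch.tensor(1e-8)
print(small.half().item())  # 0.0 -- the gradient signal is lost

# ...but after scaling by the default initial scale (2**16) it survives
print((small * 2.0**16).half().item())  # nonzero, representable in FP16
```

The scaler multiplies the loss (and hence, by the chain rule, all gradients) by this factor before backward, then divides it back out before the optimizer step.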
Wrap the Forward Pass with Autocast
Use autocast to automatically cast operations to FP16:
```python
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        optimizer.zero_grad()

        # Autocast wraps the forward pass
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(inputs)
            loss = criterion(output, targets)

        # Scale the loss and call backward
        scaler.scale(loss).backward()

        # Unscale gradients and step the optimizer
        scaler.step(optimizer)

        # Update the scale factor for the next iteration
        scaler.update()
```
```python
# FP16: best for NVIDIA GPUs (Volta and newer)
with autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)
```
BF16 (bfloat16) offers the same dynamic range as FP32 but with reduced precision. It’s more numerically stable than FP16 and is recommended when available.
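A rough sketch of both points (the device choice and toy model here are illustrative assumptions): BF16's extra dynamic range can be seen directly, and training under BF16 autocast typically skips the GradScaler entirely:

```python
import torch

# BF16 keeps FP32's exponent range: large values that overflow FP16 survive
print(torch.tensor(1e30).half())      # inf -- overflows FP16 (max ~65504)
print(torch.tensor(1e30).bfloat16())  # finite in BF16

# With BF16, loss scaling is generally unnecessary
device_type = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

optimizer.zero_grad()
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    loss = model(torch.randn(8, 16)).sum()
loss.backward()   # no scaler.scale(loss) needed
optimizer.step()
```

Because BF16 gradients rarely underflow, the GradScaler machinery can simply be left out, which also simplifies the training loop.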
```python
scaler = GradScaler(
    device='cuda',
    init_scale=2.**16,     # Initial scale factor
    growth_factor=2.0,     # Multiply scale by this if no inf/nan
    backoff_factor=0.5,    # Multiply scale by this if inf/nan found
    growth_interval=2000,  # Steps before increasing the scale
    enabled=True,          # Enable/disable the scaler
)
```
When using gradient clipping with AMP, unscale gradients first:
```python
for inputs, targets in dataloader:
    optimizer.zero_grad()

    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(inputs)
        loss = criterion(output, targets)

    scaler.scale(loss).backward()

    # Unscale gradients before clipping
    scaler.unscale_(optimizer)

    # Now clip gradients (in FP32)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Step the optimizer
    scaler.step(optimizer)
    scaler.update()
```
Always call scaler.unscale_() before gradient clipping. Clipping still-scaled gradients effectively shrinks the clip threshold by the scale factor, so almost every gradient gets clipped far below the intended max_norm.
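A toy demonstration of the failure mode, with made-up gradient values chosen for illustration:

```python
import torch

S = 2.0 ** 16  # typical loss scale
p = torch.nn.Parameter(torch.zeros(3))

# Wrong order: clip while gradients are still scaled by S.
# The true (unscaled) gradient norm is 5, but clipping sees norm 5*S,
# so the clip threshold of 1.0 acts like a threshold of 1.0/S.
p.grad = torch.tensor([3.0, 4.0, 0.0]) * S
torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print((p.grad / S).norm())  # tiny: ~1/S instead of 1.0

# Right order: unscale first, then clip.
p.grad = torch.tensor([3.0, 4.0, 0.0])  # unscaled, norm 5
torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(p.grad.norm())  # ~1.0, as intended
```

In the wrong ordering the optimizer ends up taking near-zero steps, which often looks like a mysteriously stalled loss rather than an outright error.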
Combine AMP with gradient accumulation for large effective batch sizes:
```python
scaler = GradScaler(device='cuda')
accumulation_steps = 4

for i, (inputs, targets) in enumerate(dataloader):
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(inputs)
        loss = criterion(output, targets)

    # Normalize the loss by the number of accumulation steps
    loss = loss / accumulation_steps
    scaler.scale(loss).backward()

    # Only step every accumulation_steps iterations
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```
```python
# Disable AMP with a flag
use_amp = True  # Set to False for debugging

with autocast(device_type='cuda', dtype=torch.float16, enabled=use_amp):
    output = model(input)
```
```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(input)
        loss = criterion(output, target)
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```
```python
# Enable anomaly detection
torch.autograd.set_detect_anomaly(True)

# Check for NaN/Inf in the loss
if torch.isnan(loss) or torch.isinf(loss):
    print("Loss is NaN or Inf!")
    # Try reducing the learning rate or switching to BF16
```