- `alpha` — Per-class weights derived from inverse class frequency in the training set (clipped to [0.5, 10.0]). Under-represented classes receive a higher multiplier.
- `gamma=2.0` — The (1 - pt)^gamma factor reduces the loss contribution of correctly-classified, high-confidence examples, so the model focuses gradient updates on hard or misclassified samples regardless of class.
- `smoothing=0.1` — Prevents the model from becoming overconfident by distributing 10% of the probability mass uniformly across the non-target classes.
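A quick back-of-envelope check of the last two settings (a 5-class case is assumed here purely for illustration):

```python
# Illustrative numbers only: how gamma=2 rescales loss contributions
# and how smoothing=0.1 redistributes probability mass.
gamma, smoothing, num_classes = 2.0, 0.1, 5

for pt in (0.95, 0.6, 0.2):  # model confidence on the true class
    print(f"pt={pt}: focal factor (1 - pt)^gamma = {(1 - pt) ** gamma:.4f}")
# 0.0025 vs 0.64: high-confidence examples are damped ~250x relative to hard ones.

target_prob = 1.0 - smoothing               # 0.9 on the true class
other_prob = smoothing / (num_classes - 1)  # 0.025 on each non-target class
print(f"smoothed target: {target_prob}, each other class: {other_prob}")
```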
```python
criterion = FocalLoss(
    alpha=self.class_weights.to(self.device),  # from WeightedRandomSampler weights
    gamma=2.0,
    smoothing=0.1,
)
```
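The combined loss can be sketched framework-agnostically; `focal_loss` below is an illustrative NumPy helper, not the project's `FocalLoss` class, and the `1e-12` log guard is an assumption:

```python
import numpy as np

def focal_loss(logits, target, alpha, gamma=2.0, smoothing=0.1):
    """Sketch of focal loss with label smoothing and per-class alpha weights.

    logits: (N, C) raw scores; target: (N,) integer class ids; alpha: (C,) weights.
    """
    n, c = logits.shape
    # Numerically stable softmax
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    # Smoothed targets: 1 - smoothing on the true class,
    # smoothing / (C - 1) spread over the others
    q = np.full((n, c), smoothing / (c - 1))
    q[np.arange(n), target] = 1.0 - smoothing
    # Focal modulation uses pt, the predicted probability of the true class
    pt = p[np.arange(n), target]
    ce = -(q * np.log(p + 1e-12)).sum(axis=1)  # smoothed cross-entropy
    return float((alpha[target] * (1.0 - pt) ** gamma * ce).mean())
```

A confident correct prediction contributes far less loss than an uncertain one, which is exactly the focusing behaviour described above.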
```python
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=20,        # first restart period (epochs)
    T_mult=2,      # period doubles after each restart
    eta_min=1e-6,  # minimum learning rate
)
```
The scheduler is stepped every batch rather than every epoch:

```python
if scheduler and not isinstance(scheduler, optim.lr_scheduler.ReduceLROnPlateau):
    scheduler.step()
```

This produces a smooth cosine decay within each cycle. Note that a bare `step()` advances the scheduler by one unit per call, so with per-batch stepping `T_0=20` is measured in optimizer steps; to keep restart periods in epoch units, pass the fractional epoch instead, e.g. `scheduler.step(epoch + batch_idx / len(train_loader))`. After each restart the cycle length doubles (20 → 40 → 80), allowing the model to escape local minima and re-explore the loss landscape with a temporarily higher learning rate.
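The resulting schedule can be sketched without PyTorch; `base_lr=1e-3` is an assumed initial learning rate, and the epoch-unit interpretation of `T_0` is taken from the configuration above:

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=1e-3, eta_min=1e-6, t0=20, t_mult=2):
    """Sketch of the CosineAnnealingWarmRestarts schedule at a given epoch."""
    # Locate the current cycle: lengths are t0, t0*t_mult, t0*t_mult^2, ...
    cycle_len, cycle_start = t0, 0
    while epoch >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len *= t_mult
    t_cur = epoch - cycle_start
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / cycle_len))

# Restarts occur at epochs 20 (20), 60 (20+40), 140 (20+40+80), ...
for e in (0, 19, 20, 59, 60):
    print(f"epoch {e}: lr = {cosine_warm_restart_lr(e):.2e}")
```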
Augmentation is applied at the feature sequence level inside EnhancedPreExtractedFeaturesDataset.__getitem__() — after CNN features are loaded from disk, before they enter the model.
```python
if self.augment and self.tta_mode is None:
    num_frames = features.shape[0]
    if num_frames > 8:
        # 1. Temporal subsampling (50% probability)
        if random.random() < 0.5:
            sample_ratio = random.uniform(0.7, 1.0)  # keep 70–100% of frames
            new_length = max(int(num_frames * sample_ratio), 8)
            indices = sorted(random.sample(range(num_frames), new_length))
            features = features[indices]

        # 2. Temporal shift (30% probability)
        if random.random() < 0.3:
            shift = random.randint(-3, 3)
            if shift != 0:
                features = torch.roll(features, shifts=shift, dims=0)

        # 3. Gaussian noise (20% probability)
        if random.random() < 0.2:
            noise = torch.randn_like(features) * 0.01
            features = features + noise
```
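The subsampling step can be exercised in isolation; `temporal_subsample` is a hypothetical helper mirroring step 1 above:

```python
import random

def temporal_subsample(num_frames, min_frames=8, lo=0.7, hi=1.0):
    """Pick a random 70–100% subset of frame indices, keeping temporal
    order (via sorted()) and retaining at least min_frames frames."""
    new_length = max(int(num_frames * random.uniform(lo, hi)), min_frames)
    return sorted(random.sample(range(num_frames), new_length))
```

Because `random.sample` draws without replacement and the result is sorted, frames are dropped but never duplicated or reordered.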
| Augmentation | Probability | Parameters |
|---|---|---|
| Temporal subsampling | 50% | Retain 70–100% of frames, randomly sampled |
| Temporal shift | 30% | Circular roll by ±3 frames |
| Gaussian noise | 20% | σ = 0.01 added to all feature dimensions |
Augmentation runs on CPU inside the DataLoader workers and operates on pre-extracted 1280-dim feature vectors — not on raw pixels. This makes it extremely cheap and introduces no GPU overhead.
The log is written incrementally to results/resource_utilization_log.json so data is preserved even if training is interrupted.
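Incremental, interruption-safe writing can be sketched as follows; `append_log_entry` is a hypothetical helper, and the write-temp-then-replace step is an assumption about how to avoid corrupting the existing log mid-write:

```python
import json
import os
import tempfile

def append_log_entry(entry, path="results/resource_utilization_log.json"):
    """Append one record and rewrite the whole log file, so everything
    written so far survives an interrupted training run."""
    log = []
    if os.path.exists(path):
        with open(path) as f:
            log = json.load(f)
    log.append(entry)
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file, then atomically replace, so a crash
    # mid-write cannot corrupt the previously saved log.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(log, f, indent=2)
    os.replace(tmp, path)
```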
On the A100 MIG partition (9.8 GB VRAM), gpu_reserved_gb will consistently exceed gpu_allocated_gb due to PyTorch’s caching allocator. Call torch.cuda.empty_cache() between ensemble runs to release reserved-but-unused memory back to the driver.
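A minimal sketch of the suggested cleanup, assuming a `release_cached_vram` helper called between ensemble runs (the helper name is hypothetical):

```python
import torch

def release_cached_vram():
    """Report allocator stats, then return reserved-but-unused CUDA
    memory from PyTorch's caching allocator back to the driver."""
    if torch.cuda.is_available():
        print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    torch.cuda.empty_cache()  # safe no-op when CUDA is unavailable
```

Note that `empty_cache()` does not free tensors that are still referenced; it only releases cached blocks the allocator is holding onto, which is why it helps between runs rather than during one.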