
Overview

Hyperparameters control the learning process and significantly impact model performance. This guide covers optimizers, learning rates, schedulers, regularization, and loss functions.

Optimizers

Optimizers determine how the model updates its weights during training. Source: app/training/optimizers.py:29-62
def create_optimizer(model: nn.Module, config: dict) -> torch.optim.Optimizer:
    """Create optimizer from training config."""
    optimizer_name = config.get("optimizer", "Adam")
    lr = config.get("learning_rate", 0.001)
    l2_decay = config.get("l2_decay", False)
    l2_lambda = config.get("l2_lambda", 0.0001) if l2_decay else 0

    if optimizer_name == "Adam":
        return torch.optim.Adam(
            model.parameters(),
            lr=lr,
            weight_decay=l2_lambda,
        )
    elif optimizer_name == "AdamW":
        return torch.optim.AdamW(
            model.parameters(),
            lr=lr,
            weight_decay=l2_lambda if l2_lambda > 0 else 0.01,
        )
    elif optimizer_name == "SGD with Momentum":
        return torch.optim.SGD(
            model.parameters(),
            lr=lr,
            momentum=0.9,
            weight_decay=l2_lambda,
        )
    elif optimizer_name == "RMSprop":
        return torch.optim.RMSprop(
            model.parameters(),
            lr=lr,
            weight_decay=l2_lambda,
        )
    raise ValueError(f"Unknown optimizer: {optimizer_name}")
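As a quick illustration, the AdamW branch above resolves a config like `{"optimizer": "AdamW", "learning_rate": 1e-3, "l2_decay": True, "l2_lambda": 0.01}` into the following direct call (the `nn.Linear` model is only a stand-in):

```python
import torch
from torch import nn

# Stand-in model for illustration; any nn.Module works the same way.
model = nn.Linear(16, 4)

# Equivalent to what create_optimizer returns for the config above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

print(optimizer.param_groups[0]["lr"])            # 0.001
print(optimizer.param_groups[0]["weight_decay"])  # 0.01
```

The hyperparameters live in `optimizer.param_groups`, which is also where schedulers modify the learning rate during training.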

Optimizer Comparison

AdamW (Adam with Decoupled Weight Decay)
Pros:
  • Better generalization than Adam
  • Proper weight decay implementation
  • Excellent for transformers and ViT models
  • More stable training
Cons:
  • Slightly slower convergence than Adam
  • Requires tuning weight decay
When to use:
  • Training Vision Transformers
  • When regularization is important
  • Large models with many parameters
Recommended learning rate: 1e-4 to 1e-3
Recommended weight decay: 0.01 to 0.1
SGD with Momentum (Stochastic Gradient Descent)
Pros:
  • Often better final accuracy than Adam
  • Better generalization on large datasets
  • More predictable behavior
  • Standard for ResNet training
Cons:
  • Requires more hyperparameter tuning
  • Slower convergence
  • Learning rate is critical
When to use:
  • Training CNNs from scratch
  • Large datasets (5000+ per class)
  • When you need best possible accuracy
  • When you have time for LR tuning
Recommended learning rate: 1e-2 to 1e-1
Momentum: 0.9 (fixed)
RMSprop (Root Mean Square Propagation)
Pros:
  • Good for RNNs and some CNNs
  • Adapts learning rate per parameter
Cons:
  • Generally outperformed by Adam/AdamW
  • Less commonly used in modern architectures
When to use:
  • Rarely needed for malware classification
  • Consider if Adam doesn’t work
Recommended learning rate: 1e-3 to 1e-2

Quick Reference

| Optimizer | Default LR | Use Case | Training Speed | Final Accuracy |
| --- | --- | --- | --- | --- |
| Adam | 1e-3 | General purpose | Fast | Good |
| AdamW | 1e-3 | Transformers, regularization | Fast | Better |
| SGD+Momentum | 1e-2 | CNNs from scratch | Slower | Best |
| RMSprop | 1e-3 | Special cases | Medium | Good |

Learning Rate

The learning rate is the most critical hyperparameter. It controls how much to adjust weights during training.

Finding the Right Learning Rate

1. Start with Default

Use the recommended default for your optimizer:
  • Adam/AdamW: 1e-3 (0.001)
  • SGD: 1e-2 (0.01)

2. Observe Training

Watch the training loss:
  • Loss decreases steadily: Learning rate is good
  • Loss stays flat: Learning rate too low
  • Loss explodes/diverges: Learning rate too high
  • Loss oscillates wildly: Learning rate too high

3. Adjust if Needed

  • If too low: Multiply by 10 (1e-4 → 1e-3)
  • If too high: Divide by 10 (1e-3 → 1e-4)
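This heuristic can be written down directly (hypothetical helper, not part of the app):

```python
# Hypothetical helper illustrating the adjust-by-10x heuristic above.
def adjust_lr(lr: float, loss_behavior: str) -> float:
    if loss_behavior == "flat":                        # too low: speed up
        return lr * 10
    if loss_behavior in ("exploding", "oscillating"):  # too high: slow down
        return lr / 10
    return lr                                          # steady decrease: keep it

print(adjust_lr(1e-4, "flat"))       # 10x up
print(adjust_lr(1e-3, "exploding"))  # 10x down
```

One round of this adjustment, followed by another short observation run, is usually enough to land within a usable order of magnitude.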

Learning Rates for Transfer Learning

Transfer Learning Rule: When fine-tuning pre-trained models, use a 10x lower learning rate:
  • If you would use 1e-3 for training from scratch
  • Use 1e-4 for transfer learning
This prevents destroying pre-learned features.

Learning Rate Ranges by Model Type

| Model Type | Training Mode | Recommended LR (Adam) | Recommended LR (SGD) |
| --- | --- | --- | --- |
| Custom CNN | From scratch | 1e-3 to 1e-4 | 1e-2 to 1e-3 |
| Transfer (frozen) | Feature extraction | 1e-3 to 1e-4 | 1e-2 to 1e-3 |
| Transfer (unfrozen) | Fine-tuning | 1e-4 to 1e-5 | 1e-3 to 1e-4 |
| Transformer | From scratch | 3e-4 to 1e-4 | Not recommended |
| Vision Transformer | Fine-tuning | 1e-4 to 1e-5 | Not recommended |

Learning Rate Schedulers

Schedulers adjust the learning rate during training to improve convergence. Source: app/training/optimizers.py:65-102
def create_scheduler(
    optimizer: torch.optim.Optimizer,
    config: dict,
    steps_per_epoch: int,
) -> torch.optim.lr_scheduler.LRScheduler | None:
    """Create learning rate scheduler from training config."""
    lr_strategy = config.get("lr_strategy", "Constant")
    epochs = config.get("epochs", 100)

    if lr_strategy == "Constant":
        return None
    elif lr_strategy == "ReduceLROnPlateau":
        return torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer,
            mode="min",
            factor=0.5,
            patience=5,
            min_lr=1e-6,
        )
    elif lr_strategy == "Cosine Annealing":
        return torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=epochs,
            eta_min=1e-6,
        )
    elif lr_strategy == "Step Decay":
        return torch.optim.lr_scheduler.StepLR(
            optimizer,
            step_size=epochs // 3,
            gamma=0.1,
        )
    elif lr_strategy == "Exponential Decay":
        return torch.optim.lr_scheduler.ExponentialLR(
            optimizer,
            gamma=0.95,
        )
    return None

Scheduler Strategies

Constant
Fixed learning rate throughout training.
When to use:
  • Short training runs (< 30 epochs)
  • Transfer learning with frozen backbone
  • When learning rate is already optimal
  • Simplest approach, good starting point
Pros: Simple, predictable
Cons: May not reach optimal performance
ReduceLROnPlateau
Reduces LR when validation loss stops improving. With the configuration above, the LR is halved (factor=0.5) after 5 epochs without improvement (patience=5), down to a floor of 1e-6.
Pros: Adapts to actual validation performance; no need to fix total epochs in advance
Cons: Requires the validation loss at each step; only reacts after the plateau has lasted the full patience window
Cosine Annealing
Smoothly decreases LR following a cosine curve.
When to use:
  • Training from scratch
  • Fixed number of epochs known beforehand
  • With SGD optimizer
  • State-of-the-art training pipelines
Pros:
  • Smooth, gradual decay
  • Well-studied in literature
  • Often achieves best final accuracy
Cons:
  • Requires knowing total epochs in advance
  • Not adaptive to validation performance
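For reference, the cosine schedule has a simple closed form; this pure-Python sketch follows the same curve as `CosineAnnealingLR` with its default per-epoch stepping:

```python
import math

def cosine_lr(lr0: float, epoch: int, t_max: int, eta_min: float = 1e-6) -> float:
    """LR at a given epoch under cosine annealing from lr0 down to eta_min."""
    return eta_min + (lr0 - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

# Starts at lr0, crosses the midpoint halfway through, ends at eta_min.
for epoch in (0, 50, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(1e-3, epoch, 100):.2e}")
```

The smooth, monotone decay is the reason this scheduler pairs well with a fixed epoch budget: the curve is fully determined by `t_max` before training starts.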
Step Decay
Reduces LR by a fixed factor at fixed intervals.
Configuration:
  • step_size=epochs // 3: Drop LR every 1/3 of training
  • gamma=0.1: Multiply by 0.1 at each step
When to use:
  • Classic CNN training
  • When you know training should have distinct phases
  • Replicating research papers
Pros:
  • Simple, interpretable
  • Works well with SGD
Cons:
  • Requires manual tuning of step_size
  • Sudden drops can be disruptive
Exponential Decay
Gradually reduces LR by a constant factor every epoch.
Configuration:
  • gamma=0.95: Multiply by 0.95 every epoch
When to use:
  • Rarely needed in modern training
  • Smooth, gradual decay preferred
Pros:
  • Very smooth decay
Cons:
  • Can decay too fast or too slow
  • Harder to tune than other methods
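Both decay schedules have simple closed forms, sketched below. The last line also shows why exponential decay is easy to mis-tune: gamma=0.95 per epoch leaves under 1% of the initial LR by epoch 100.

```python
def step_lr(lr0: float, epoch: int, step_size: int, gamma: float = 0.1) -> float:
    """Step decay: multiply by gamma once per completed step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

def exp_lr(lr0: float, epoch: int, gamma: float = 0.95) -> float:
    """Exponential decay: multiply by gamma every epoch."""
    return lr0 * gamma ** epoch

# Step decay drops in discrete jumps; exponential decays a little every epoch.
print(step_lr(1e-2, 35, step_size=33))  # one drop has happened so far
print(exp_lr(1e-3, 100) / 1e-3)         # fraction of the initial LR remaining
```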

Scheduler Usage in Training Loop

Source: app/training/engine.py:231-238
# Scheduler step
if self.scheduler:
    if isinstance(
        self.scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau
    ):
        self.scheduler.step(val_metrics["val_loss"])
    else:
        self.scheduler.step()
ReduceLROnPlateau is special: it needs the validation loss passed to step(). All other schedulers step once per epoch with no arguments.

Regularization

Regularization prevents overfitting by constraining model complexity.

L2 Regularization (Weight Decay)

Penalizes large weights to prevent overfitting.
training_config = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "l2_decay": True,
    "l2_lambda": 0.0001
}
Recommended values:
  • Adam: 1e-4 to 1e-5
  • AdamW: 0.01 to 0.1 (AdamW handles weight decay differently)
  • SGD: 1e-4 to 1e-3
For AdamW, weight decay is decoupled from gradient updates. Use higher values (0.01-0.1) than with standard Adam.

Dropout

Randomly drops units during training to prevent co-adaptation. Configured in model architecture:
{
    "blocks": [
        {"filters": 64, "dropout": 0.25},  # Conv layer dropout
    ],
    "dense_layers": [
        {"units": 256, "dropout": 0.5}    # Dense layer dropout
    ]
}
Recommended values:
  • Conv layers: 0.25 - 0.3
  • Dense layers: 0.5 - 0.7
  • Transformers: 0.1 - 0.2

Early Stopping

Stops training when validation performance stops improving. Source: app/training/engine.py:269-275
# Early stopping check
if self.early_stopping_patience > 0:
    if self.epochs_without_improvement >= self.early_stopping_patience:
        print(
            f"\nEarly stopping triggered after {epoch + 1} epochs (patience: {self.early_stopping_patience})"
        )
        break
Configuration:
training_config = {
    "early_stopping": True,
    "es_patience": 10  # Stop after 10 epochs without improvement
}
Recommended patience:
  • Small models: 5-10 epochs
  • Large models: 10-20 epochs
  • Transfer learning: 10-15 epochs
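The patience counter the engine keeps (`epochs_without_improvement`) can be sketched in a few lines; this illustrative helper recomputes it from the validation-loss history:

```python
def should_stop(val_losses: list[float], patience: int) -> bool:
    """True once the best validation loss is at least `patience` epochs old."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    epochs_without_improvement = len(val_losses) - 1 - best_epoch
    return patience > 0 and epochs_without_improvement >= patience

# Best loss at epoch 1, then 2 epochs without improvement -> stop at patience=2.
print(should_stop([1.0, 0.8, 0.81, 0.82], patience=2))  # True
print(should_stop([1.0, 0.8, 0.7], patience=2))         # False
```

Note that patience counts epochs since the *best* result, not consecutive bad epochs: one brief improvement resets the counter.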

Loss Functions

The loss function measures how well the model is performing. Source: app/training/optimizers.py:105-122
def create_criterion(
    config: dict,
    class_weights: torch.Tensor | None = None,
    device: torch.device | None = None,
) -> nn.Module:
    """Create loss function from training config."""
    class_weight_method = config.get("class_weights", "None")

    if device and class_weights is not None:
        class_weights = class_weights.to(device)

    if class_weight_method == "Focal Loss":
        return FocalLoss(alpha=class_weights, gamma=2.0)
    elif class_weight_method == "Auto Class Weights" and class_weights is not None:
        return nn.CrossEntropyLoss(weight=class_weights)
    else:
        return nn.CrossEntropyLoss()

Loss Function Comparison

Cross-Entropy Loss
Standard loss for multi-class classification.
Formula: -log(p_correct_class)
When to use:
  • Balanced datasets
  • Standard classification tasks
  • Default choice
Pros:
  • Simple, well-understood
  • Works well in most scenarios
  • Fast computation
Cons:
  • Doesn’t handle class imbalance
Weighted Cross-Entropy
Cross-entropy with class-specific weights.
When to use:
  • Imbalanced datasets (some classes have more samples)
  • When you want to prioritize rare classes
  • Moderate imbalance (up to 1:10 ratio)
Configuration:
{"class_weights": "Auto Class Weights"}
Pros:
  • Simple extension of standard CE
  • Effective for moderate imbalance
  • Easy to interpret
Cons:
  • May not be enough for severe imbalance
  • Weights computed automatically may need tuning
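For intuition, automatic class weights are commonly computed as normalized inverse frequencies; the formula below matches scikit-learn's "balanced" scheme, and the app's exact formula may differ:

```python
from collections import Counter

def inverse_freq_weights(labels: list[int]) -> dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in sorted(counts)}

# Class 1 is 3x rarer than class 0, so it gets a proportionally larger weight.
print(inverse_freq_weights([0, 0, 0, 1]))  # {0: 0.666..., 1: 2.0}
```

These weights are what would be passed as the `weight` tensor to `nn.CrossEntropyLoss` in `create_criterion` above.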
Focal Loss
Addresses class imbalance and hard examples.
Formula: -(1-p)^gamma * log(p)
Source: app/training/optimizers.py:8-26
class FocalLoss(nn.Module):
    def __init__(self, alpha: torch.Tensor | None = None, gamma: float = 2.0, reduction: str = "mean"):
        super().__init__()
        self.alpha = alpha      # Class weights
        self.gamma = gamma      # Focusing parameter
        self.reduction = reduction

    def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        ce_loss = F.cross_entropy(inputs, targets, weight=self.alpha, reduction="none")
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss

        if self.reduction == "mean":
            return focal_loss.mean()
        if self.reduction == "sum":
            return focal_loss.sum()
        return focal_loss
When to use:
  • Severe class imbalance (1:100+ ratio)
  • Many hard-to-classify examples
  • When weighted CE doesn’t work well
Configuration:
{"class_weights": "Focal Loss"}
Parameters:
  • gamma=2.0: Focusing parameter (higher = more focus on hard examples)
  • alpha: Class weights (automatically computed from data)
Pros:
  • Excellent for severe imbalance
  • Focuses on hard examples
  • State-of-the-art for detection tasks
Cons:
  • More complex than CE
  • Requires tuning gamma parameter
  • Slower convergence sometimes
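To see why the focusing term matters, evaluate the formula by hand for one easy and one hard example; the `(1 - p)^gamma` factor nearly zeroes the loss on confident correct predictions:

```python
import math

def focal_term(p: float, gamma: float = 2.0) -> float:
    """Focal loss for one example with predicted probability p on the true class."""
    return ((1 - p) ** gamma) * -math.log(p)

easy = focal_term(0.95)  # confident and correct: loss almost vanishes
hard = focal_term(0.30)  # uncertain: loss stays large
print(f"easy: {easy:.5f}  hard: {hard:.5f}  ratio: {hard / easy:.0f}x")
```

With plain cross-entropy the same two examples differ by a factor of roughly 20; with gamma=2 the hard example dominates by orders of magnitude, which is the whole point of the focusing parameter.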

Loss Function Selection Guide

| Dataset Balance | Class Ratio | Recommended Loss |
| --- | --- | --- |
| Balanced | 1:1 to 1:2 | Cross-Entropy |
| Slight imbalance | 1:2 to 1:5 | Cross-Entropy or Weighted CE |
| Moderate imbalance | 1:5 to 1:10 | Weighted Cross-Entropy |
| Severe imbalance | 1:10+ | Focal Loss |

Batch Size

Batch size affects training speed, memory usage, and generalization.

Guidelines

| Batch Size | Memory | Training Speed | Generalization | Use Case |
| --- | --- | --- | --- | --- |
| 8-16 | Low | Slow | Better | Small GPU, large models |
| 32 | Medium | Good | Good | Default, balanced |
| 64-128 | High | Fast | Decent | Large GPU, small models |
| 256+ | Very High | Very Fast | Worse | Distributed training |

Effects of Batch Size

Small batches (8-16):
  • More noisy gradient estimates
  • Better generalization
  • Slower training
  • Lower memory usage
  • Better for limited GPU memory
Large batches (64-128):
  • More stable gradient estimates
  • Faster training per epoch
  • May generalize worse
  • Higher memory usage
  • Better GPU utilization
Linear scaling rule: when you multiply the batch size by N, multiply the learning rate by N (up to a point).
Example:
  • Batch 32, LR 1e-3
  • Batch 64 → LR 2e-3
  • Batch 128 → LR 4e-3 (but cap around 5e-3 for Adam)
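A minimal sketch of the rule, with the Adam cap from the example treated as a parameter:

```python
def scaled_lr(base_lr: float, base_bs: int, new_bs: int, cap: float = 5e-3) -> float:
    """Linear scaling rule: scale LR with batch size, capped at a safe maximum."""
    return min(base_lr * new_bs / base_bs, cap)

print(scaled_lr(1e-3, 32, 64))   # 0.002
print(scaled_lr(1e-3, 32, 128))  # 0.004
print(scaled_lr(1e-3, 32, 256))  # capped at 0.005
```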

Training Duration (Epochs)

| Model Type | Training Mode | Recommended Epochs |
| --- | --- | --- |
| Custom CNN | From scratch | 50-100 |
| Transfer Learning | Frozen backbone | 20-30 |
| Transfer Learning | Fine-tuning | 30-50 |
| Vision Transformer | From scratch | 100-200 |
| Vision Transformer | Fine-tuning | 30-50 |
Use early stopping with patience=10-15 to automatically stop when validation performance plateaus. This is more reliable than fixed epoch counts.

Complete Training Configuration Example

Balanced Dataset, Medium Size (2000/class)

training_config = {
    # Optimizer
    "optimizer": "Adam",
    "learning_rate": 0.001,
    
    # Regularization
    "l2_decay": True,
    "l2_lambda": 0.0001,
    
    # Learning rate schedule
    "lr_strategy": "ReduceLROnPlateau",
    
    # Loss function
    "class_weights": "None",
    
    # Training duration
    "epochs": 100,
    "batch_size": 32,
    
    # Early stopping
    "early_stopping": True,
    "es_patience": 10
}

Imbalanced Dataset, Transfer Learning

training_config = {
    # Optimizer (lower LR for fine-tuning)
    "optimizer": "AdamW",
    "learning_rate": 0.0001,
    
    # Regularization
    "l2_decay": True,
    "l2_lambda": 0.01,  # Higher for AdamW
    
    # Learning rate schedule
    "lr_strategy": "Cosine Annealing",
    
    # Loss function (handle imbalance)
    "class_weights": "Focal Loss",
    
    # Training duration
    "epochs": 50,
    "batch_size": 32,
    
    # Early stopping
    "early_stopping": True,
    "es_patience": 15
}

Small Dataset, Custom CNN

training_config = {
    # Optimizer
    "optimizer": "Adam",
    "learning_rate": 0.001,
    
    # Strong regularization (prevent overfitting)
    "l2_decay": True,
    "l2_lambda": 0.001,  # Higher L2
    
    # Learning rate schedule
    "lr_strategy": "ReduceLROnPlateau",
    
    # Loss function
    "class_weights": "Auto Class Weights",
    
    # Training duration (shorter to prevent overfit)
    "epochs": 50,
    "batch_size": 16,  # Smaller batch
    
    # Early stopping (lower patience)
    "early_stopping": True,
    "es_patience": 5
}

Hyperparameter Tuning Strategy

1. Start with Defaults

Use the recommended defaults from this guide.

2. Train Baseline

Train for 20-30 epochs and observe:
  • Is the model learning? (loss decreasing)
  • Is it overfitting? (train acc >> val acc)
  • Is it underfitting? (both accuracies low)

3. Adjust Learning Rate

If loss is flat or exploding, adjust LR:
  • Flat → increase by 10x
  • Exploding → decrease by 10x

4. Add Regularization

If overfitting:
  • Increase dropout
  • Enable/increase L2 decay
  • Add more data augmentation

5. Tune Scheduler

Try ReduceLROnPlateau if using Constant.

6. Fine-tune Loss

If the dataset is imbalanced, try Weighted CE or Focal Loss.

Common Issues

Loss not decreasing
Possible causes:
  • Learning rate too low → Increase by 10x
  • Learning rate too high → Decrease by 10x
  • Model too simple → Try larger architecture
  • Data preprocessing issues → Check normalization
Quick fix: Try LR = 1e-3 with Adam
Loss exploding or going NaN
Possible causes:
  • Learning rate way too high
  • Gradient explosion
Solutions:
  • Reduce learning rate by 10x (try 1e-4)
  • Switch from SGD to Adam
  • Enable gradient clipping (advanced)
Overfitting (train accuracy much higher than validation)
Solutions:
  • Increase dropout to 0.5-0.7
  • Enable L2 decay (0.0001-0.001)
  • Add more data augmentation
  • Reduce model size
  • Enable early stopping
  • Train for fewer epochs
Validation metrics unstable or oscillating
Solutions:
  • Learning rate may be too high → reduce by 5x
  • Try different optimizer (Adam → AdamW)
  • Enable learning rate scheduler
  • Check if early stopping patience is too low

Next Steps

Model Evaluation

Learn how to evaluate trained models and interpret metrics

Dataset Preparation

Optimize your dataset preparation pipeline
