
Overview

Hyperparameters control the learning process and significantly impact model performance. This guide covers optimizers, learning rates, schedulers, regularization, and loss functions.

Optimizers

Optimizers determine how the model updates its weights during training. Source: app/training/optimizers.py:29-62
def create_optimizer(model: nn.Module, config: dict) -> torch.optim.Optimizer:
    """Create optimizer from training config."""
    optimizer_name = config.get("optimizer", "Adam")
    lr = config.get("learning_rate", 0.001)
    l2_decay = config.get("l2_decay", False)
    l2_lambda = config.get("l2_lambda", 0.0001) if l2_decay else 0

    if optimizer_name == "Adam":
        return torch.optim.Adam(
            model.parameters(),
            lr=lr,
            weight_decay=l2_lambda,
        )
    elif optimizer_name == "AdamW":
        return torch.optim.AdamW(
            model.parameters(),
            lr=lr,
            weight_decay=l2_lambda if l2_lambda > 0 else 0.01,
        )
    elif optimizer_name == "SGD with Momentum":
        return torch.optim.SGD(
            model.parameters(),
            lr=lr,
            momentum=0.9,
            weight_decay=l2_lambda,
        )
    elif optimizer_name == "RMSprop":
        return torch.optim.RMSprop(
            model.parameters(),
            lr=lr,
            weight_decay=l2_lambda,
        )
    raise ValueError(f"Unknown optimizer: {optimizer_name}")
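As a quick illustration, the AdamW branch above resolves a config like `{"optimizer": "AdamW", "learning_rate": 1e-3, "l2_decay": True, "l2_lambda": 0.01}` into the following direct call (the `nn.Linear` model is only a stand-in):

```python
import torch
from torch import nn

# Stand-in model for illustration; any nn.Module works the same way.
model = nn.Linear(16, 4)

# Equivalent to what create_optimizer returns for the config above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

print(optimizer.param_groups[0]["lr"])            # 0.001
print(optimizer.param_groups[0]["weight_decay"])  # 0.01
```

The hyperparameters live in `optimizer.param_groups`, which is also where schedulers modify the learning rate during training.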

Optimizer Comparison

AdamW (Adam with Decoupled Weight Decay)
Pros:
  • Better generalization than Adam
  • Proper weight decay implementation
  • Excellent for transformers and ViT models
  • More stable training
Cons:
  • Slightly slower convergence than Adam
  • Requires tuning weight decay
When to use:
  • Training Vision Transformers
  • When regularization is important
  • Large models with many parameters
Recommended learning rate: 1e-4 to 1e-3
Recommended weight decay: 0.01 to 0.1
SGD with Momentum (Stochastic Gradient Descent)
Pros:
  • Often better final accuracy than Adam
  • Better generalization on large datasets
  • More predictable behavior
  • Standard for ResNet training
Cons:
  • Requires more hyperparameter tuning
  • Slower convergence
  • Learning rate is critical
When to use:
  • Training CNNs from scratch
  • Large datasets (5000+ per class)
  • When you need best possible accuracy
  • When you have time for LR tuning
Recommended learning rate: 1e-2 to 1e-1
Momentum: 0.9 (fixed)
RMSprop (Root Mean Square Propagation)
Pros:
  • Good for RNNs and some CNNs
  • Adapts learning rate per parameter
Cons:
  • Generally outperformed by Adam/AdamW
  • Less commonly used in modern architectures
When to use:
  • Rarely needed for malware classification
  • Consider if Adam doesn’t work
Recommended learning rate: 1e-3 to 1e-2

Quick Reference

| Optimizer | Default LR | Use Case | Training Speed | Final Accuracy |
| --- | --- | --- | --- | --- |
| Adam | 1e-3 | General purpose | Fast | Good |
| AdamW | 1e-3 | Transformers, regularization | Fast | Better |
| SGD+Momentum | 1e-2 | CNNs from scratch | Slower | Best |
| RMSprop | 1e-3 | Special cases | Medium | Good |

Learning Rate

The learning rate is the most critical hyperparameter. It controls how much to adjust weights during training.

Finding the Right Learning Rate

1. Start with Default

Use the recommended default for your optimizer:
  • Adam/AdamW: 1e-3 (0.001)
  • SGD: 1e-2 (0.01)

2. Observe Training

Watch the training loss:
  • Loss decreases steadily: Learning rate is good
  • Loss stays flat: Learning rate too low
  • Loss explodes/diverges: Learning rate too high
  • Loss oscillates wildly: Learning rate too high

3. Adjust if Needed

  • If too low: Multiply by 10 (1e-4 → 1e-3)
  • If too high: Divide by 10 (1e-3 → 1e-4)
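This heuristic can be written down directly (hypothetical helper, not part of the app):

```python
# Hypothetical helper illustrating the adjust-by-10x heuristic above.
def adjust_lr(lr: float, loss_behavior: str) -> float:
    if loss_behavior == "flat":                        # too low: speed up
        return lr * 10
    if loss_behavior in ("exploding", "oscillating"):  # too high: slow down
        return lr / 10
    return lr                                          # steady decrease: keep it

print(adjust_lr(1e-4, "flat"))       # 10x up
print(adjust_lr(1e-3, "exploding"))  # 10x down
```

One round of this adjustment, followed by another short observation run, is usually enough to land within a usable order of magnitude.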

Learning Rates for Transfer Learning

Transfer Learning Rule: When fine-tuning pre-trained models, use a 10x lower learning rate:
  • If you would use 1e-3 for training from scratch
  • Use 1e-4 for transfer learning
This prevents destroying pre-learned features.

Learning Rate Ranges by Model Type

| Model Type | Training Mode | Recommended LR (Adam) | Recommended LR (SGD) |
| --- | --- | --- | --- |
| Custom CNN | From scratch | 1e-3 to 1e-4 | 1e-2 to 1e-3 |
| Transfer (frozen) | Feature extraction | 1e-3 to 1e-4 | 1e-2 to 1e-3 |
| Transfer (unfrozen) | Fine-tuning | 1e-4 to 1e-5 | 1e-3 to 1e-4 |
| Transformer | From scratch | 3e-4 to 1e-4 | Not recommended |
| Vision Transformer | Fine-tuning | 1e-4 to 1e-5 | Not recommended |

Learning Rate Schedulers

Schedulers adjust the learning rate during training to improve convergence. Source: app/training/optimizers.py:65-102
def create_scheduler(
    optimizer: torch.optim.Optimizer,
    config: dict,
    steps_per_epoch: int,
) -> torch.optim.lr_scheduler.LRScheduler | None:
    """Create learning rate scheduler from training config."""
    lr_strategy = config.get("lr_strategy", "Constant")
    epochs = config.get("epochs", 100)

    if lr_strategy == "Constant":
        return None
    elif lr_strategy == "ReduceLROnPlateau":
        return torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer,
            mode="min",
            factor=0.5,
            patience=5,
            min_lr=1e-6,
        )
    elif lr_strategy == "Cosine Annealing":
        return torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=epochs,
            eta_min=1e-6,
        )
    elif lr_strategy == "Step Decay":
        return torch.optim.lr_scheduler.StepLR(
            optimizer,
            step_size=epochs // 3,
            gamma=0.1,
        )
    elif lr_strategy == "Exponential Decay":
        return torch.optim.lr_scheduler.ExponentialLR(
            optimizer,
            gamma=0.95,
        )
    return None

Scheduler Strategies

Constant
Fixed learning rate throughout training.
When to use:
  • Short training runs (< 30 epochs)
  • Transfer learning with frozen backbone
  • When learning rate is already optimal
  • Simplest approach, good starting point
Pros: Simple, predictable
Cons: May not reach optimal performance
ReduceLROnPlateau
Reduces LR when validation loss stops improving. With the configuration above, the LR is halved (factor=0.5) after 5 epochs without improvement (patience=5), down to a floor of 1e-6.
Pros: Adapts to actual validation performance; no need to fix total epochs in advance
Cons: Requires the validation loss at each step; only reacts after the plateau has lasted the full patience window
Cosine Annealing
Smoothly decreases LR following a cosine curve.
When to use:
  • Training from scratch
  • Fixed number of epochs known beforehand
  • With SGD optimizer
  • State-of-the-art training pipelines
Pros:
  • Smooth, gradual decay
  • Well-studied in literature
  • Often achieves best final accuracy
Cons:
  • Requires knowing total epochs in advance
  • Not adaptive to validation performance
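For reference, the cosine schedule has a simple closed form; this pure-Python sketch follows the same curve as `CosineAnnealingLR` with its default per-epoch stepping:

```python
import math

def cosine_lr(lr0: float, epoch: int, t_max: int, eta_min: float = 1e-6) -> float:
    """LR at a given epoch under cosine annealing from lr0 down to eta_min."""
    return eta_min + (lr0 - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

# Starts at lr0, crosses the midpoint halfway through, ends at eta_min.
for epoch in (0, 50, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(1e-3, epoch, 100):.2e}")
```

The smooth, monotone decay is the reason this scheduler pairs well with a fixed epoch budget: the curve is fully determined by `t_max` before training starts.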
Step Decay
Reduces LR by a fixed factor at fixed intervals.
Configuration:
  • step_size=epochs // 3: Drop LR every 1/3 of training
  • gamma=0.1: Multiply by 0.1 at each step
When to use:
  • Classic CNN training
  • When you know training should have distinct phases
  • Replicating research papers
Pros:
  • Simple, interpretable
  • Works well with SGD
Cons:
  • Requires manual tuning of step_size
  • Sudden drops can be disruptive
Exponential Decay
Gradually reduces LR by a constant factor every epoch.
Configuration:
  • gamma=0.95: Multiply by 0.95 every epoch
When to use:
  • Rarely needed in modern training
  • Smooth, gradual decay preferred
Pros:
  • Very smooth decay
Cons:
  • Can decay too fast or too slow
  • Harder to tune than other methods
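Both decay schedules have simple closed forms, sketched below. The last line also shows why exponential decay is easy to mis-tune: gamma=0.95 per epoch leaves under 1% of the initial LR by epoch 100.

```python
def step_lr(lr0: float, epoch: int, step_size: int, gamma: float = 0.1) -> float:
    """Step decay: multiply by gamma once per completed step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

def exp_lr(lr0: float, epoch: int, gamma: float = 0.95) -> float:
    """Exponential decay: multiply by gamma every epoch."""
    return lr0 * gamma ** epoch

# Step decay drops in discrete jumps; exponential decays a little every epoch.
print(step_lr(1e-2, 35, step_size=33))  # one drop has happened so far
print(exp_lr(1e-3, 100) / 1e-3)         # fraction of the initial LR remaining
```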

Scheduler Usage in Training Loop

Source: app/training/engine.py:231-238
# Scheduler step
if self.scheduler:
    if isinstance(
        self.scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau
    ):
        self.scheduler.step(val_metrics["val_loss"])
    else:
        self.scheduler.step()
ReduceLROnPlateau is special: it needs the validation loss passed to step(). All other schedulers step once per epoch with no arguments.

Regularization

Regularization prevents overfitting by constraining model complexity.

L2 Regularization (Weight Decay)

Penalizes large weights to prevent overfitting.
training_config = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "l2_decay": True,
    "l2_lambda": 0.0001
}
Recommended values:
  • Adam: 1e-4 to 1e-5
  • AdamW: 0.01 to 0.1 (AdamW handles weight decay differently)
  • SGD: 1e-4 to 1e-3
For AdamW, weight decay is decoupled from gradient updates. Use higher values (0.01-0.1) than with standard Adam.

Dropout

Randomly drops units during training to prevent co-adaptation. Configured in model architecture:
{
    "blocks": [
        {"filters": 64, "dropout": 0.25},  # Conv layer dropout
    ],
    "dense_layers": [
        {"units": 256, "dropout": 0.5}    # Dense layer dropout
    ]
}
Recommended values:
  • Conv layers: 0.25 - 0.3
  • Dense layers: 0.5 - 0.7
  • Transformers: 0.1 - 0.2

Early Stopping

Stops training when validation performance stops improving. Source: app/training/engine.py:269-275
# Early stopping check
if self.early_stopping_patience > 0:
    if self.epochs_without_improvement >= self.early_stopping_patience:
        print(
            f"\nEarly stopping triggered after {epoch + 1} epochs (patience: {self.early_stopping_patience})"
        )
        break
Configuration:
training_config = {
    "early_stopping": True,
    "es_patience": 10  # Stop after 10 epochs without improvement
}
Recommended patience:
  • Small models: 5-10 epochs
  • Large models: 10-20 epochs
  • Transfer learning: 10-15 epochs
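The patience counter the engine keeps (`epochs_without_improvement`) can be sketched in a few lines; this illustrative helper recomputes it from the validation-loss history:

```python
def should_stop(val_losses: list[float], patience: int) -> bool:
    """True once the best validation loss is at least `patience` epochs old."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    epochs_without_improvement = len(val_losses) - 1 - best_epoch
    return patience > 0 and epochs_without_improvement >= patience

# Best loss at epoch 1, then 2 epochs without improvement -> stop at patience=2.
print(should_stop([1.0, 0.8, 0.81, 0.82], patience=2))  # True
print(should_stop([1.0, 0.8, 0.7], patience=2))         # False
```

Note that patience counts epochs since the *best* result, not consecutive bad epochs: one brief improvement resets the counter.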

Loss Functions

The loss function measures how well the model is performing. Source: app/training/optimizers.py:105-122
def create_criterion(
    config: dict,
    class_weights: torch.Tensor | None = None,
    device: torch.device | None = None,
) -> nn.Module:
    """Create loss function from training config."""
    class_weight_method = config.get("class_weights", "None")

    if device and class_weights is not None:
        class_weights = class_weights.to(device)

    if class_weight_method == "Focal Loss":
        return FocalLoss(alpha=class_weights, gamma=2.0)
    elif class_weight_method == "Auto Class Weights" and class_weights is not None:
        return nn.CrossEntropyLoss(weight=class_weights)
    else:
        return nn.CrossEntropyLoss()

Loss Function Comparison

Cross-Entropy Loss
Standard loss for multi-class classification.
Formula: -log(p_correct_class)
When to use:
  • Balanced datasets
  • Standard classification tasks
  • Default choice
Pros:
  • Simple, well-understood
  • Works well in most scenarios
  • Fast computation
Cons:
  • Doesn’t handle class imbalance
Weighted Cross-Entropy
Cross-entropy with class-specific weights.
When to use:
  • Imbalanced datasets (some classes have more samples)
  • When you want to prioritize rare classes
  • Moderate imbalance (up to 1:10 ratio)
Configuration:
{"class_weights": "Auto Class Weights"}
Pros:
  • Simple extension of standard CE
  • Effective for moderate imbalance
  • Easy to interpret
Cons:
  • May not be enough for severe imbalance
  • Weights computed automatically may need tuning
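For intuition, automatic class weights are commonly computed as normalized inverse frequencies; the formula below matches scikit-learn's "balanced" scheme, and the app's exact formula may differ:

```python
from collections import Counter

def inverse_freq_weights(labels: list[int]) -> dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in sorted(counts)}

# Class 1 is 3x rarer than class 0, so it gets a proportionally larger weight.
print(inverse_freq_weights([0, 0, 0, 1]))  # {0: 0.666..., 1: 2.0}
```

These weights are what would be passed as the `weight` tensor to `nn.CrossEntropyLoss` in `create_criterion` above.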
Focal Loss
Addresses class imbalance and hard examples.
Formula: -(1-p)^gamma * log(p)
Source: app/training/optimizers.py:8-26
class FocalLoss(nn.Module):
    def __init__(self, alpha: torch.Tensor | None = None, gamma: float = 2.0, reduction: str = "mean"):
        super().__init__()
        self.alpha = alpha      # Class weights
        self.gamma = gamma      # Focusing parameter
        self.reduction = reduction

    def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        ce_loss = F.cross_entropy(inputs, targets, weight=self.alpha, reduction="none")
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss

        if self.reduction == "mean":
            return focal_loss.mean()
        if self.reduction == "sum":
            return focal_loss.sum()
        return focal_loss
When to use:
  • Severe class imbalance (1:100+ ratio)
  • Many hard-to-classify examples
  • When weighted CE doesn’t work well
Configuration:
{"class_weights": "Focal Loss"}
Parameters:
  • gamma=2.0: Focusing parameter (higher = more focus on hard examples)
  • alpha: Class weights (automatically computed from data)
Pros:
  • Excellent for severe imbalance
  • Focuses on hard examples
  • State-of-the-art for detection tasks
Cons:
  • More complex than CE
  • Requires tuning gamma parameter
  • Slower convergence sometimes
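To see why the focusing term matters, evaluate the formula by hand for one easy and one hard example; the `(1 - p)^gamma` factor nearly zeroes the loss on confident correct predictions:

```python
import math

def focal_term(p: float, gamma: float = 2.0) -> float:
    """Focal loss for one example with predicted probability p on the true class."""
    return ((1 - p) ** gamma) * -math.log(p)

easy = focal_term(0.95)  # confident and correct: loss almost vanishes
hard = focal_term(0.30)  # uncertain: loss stays large
print(f"easy: {easy:.5f}  hard: {hard:.5f}  ratio: {hard / easy:.0f}x")
```

With plain cross-entropy the same two examples differ by a factor of roughly 20; with gamma=2 the hard example dominates by orders of magnitude, which is the whole point of the focusing parameter.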

Loss Function Selection Guide

| Dataset Balance | Class Ratio | Recommended Loss |
| --- | --- | --- |
| Balanced | 1:1 to 1:2 | Cross-Entropy |
| Slight imbalance | 1:2 to 1:5 | Cross-Entropy or Weighted CE |
| Moderate imbalance | 1:5 to 1:10 | Weighted Cross-Entropy |
| Severe imbalance | 1:10+ | Focal Loss |

Batch Size

Batch size affects training speed, memory usage, and generalization.

Guidelines

| Batch Size | Memory | Training Speed | Generalization | Use Case |
| --- | --- | --- | --- | --- |
| 8-16 | Low | Slow | Better | Small GPU, large models |
| 32 | Medium | Good | Good | Default, balanced |
| 64-128 | High | Fast | Decent | Large GPU, small models |
| 256+ | Very High | Very Fast | Worse | Distributed training |

Effects of Batch Size

Small batches (8-16):
  • More noisy gradient estimates
  • Better generalization
  • Slower training
  • Lower memory usage
  • Better for limited GPU memory
Large batches (64-128):
  • More stable gradient estimates
  • Faster training per epoch
  • May generalize worse
  • Higher memory usage
  • Better GPU utilization
Linear scaling rule: when you multiply the batch size by N, multiply the learning rate by N (up to a point).
Example:
  • Batch 32, LR 1e-3
  • Batch 64 → LR 2e-3
  • Batch 128 → LR 4e-3 (but cap around 5e-3 for Adam)
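A minimal sketch of the rule, with the Adam cap from the example treated as a parameter:

```python
def scaled_lr(base_lr: float, base_bs: int, new_bs: int, cap: float = 5e-3) -> float:
    """Linear scaling rule: scale LR with batch size, capped at a safe maximum."""
    return min(base_lr * new_bs / base_bs, cap)

print(scaled_lr(1e-3, 32, 64))   # 0.002
print(scaled_lr(1e-3, 32, 128))  # 0.004
print(scaled_lr(1e-3, 32, 256))  # capped at 0.005
```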

Training Duration (Epochs)

| Model Type | Training Mode | Recommended Epochs |
| --- | --- | --- |
| Custom CNN | From scratch | 50-100 |
| Transfer Learning | Frozen backbone | 20-30 |
| Transfer Learning | Fine-tuning | 30-50 |
| Vision Transformer | From scratch | 100-200 |
| Vision Transformer | Fine-tuning | 30-50 |
Use early stopping with patience=10-15 to automatically stop when validation performance plateaus. This is more reliable than fixed epoch counts.

Complete Training Configuration Example

Balanced Dataset, Medium Size (2000/class)

training_config = {
    # Optimizer
    "optimizer": "Adam",
    "learning_rate": 0.001,
    
    # Regularization
    "l2_decay": True,
    "l2_lambda": 0.0001,
    
    # Learning rate schedule
    "lr_strategy": "ReduceLROnPlateau",
    
    # Loss function
    "class_weights": "None",
    
    # Training duration
    "epochs": 100,
    "batch_size": 32,
    
    # Early stopping
    "early_stopping": True,
    "es_patience": 10
}

Imbalanced Dataset, Transfer Learning

training_config = {
    # Optimizer (lower LR for fine-tuning)
    "optimizer": "AdamW",
    "learning_rate": 0.0001,
    
    # Regularization
    "l2_decay": True,
    "l2_lambda": 0.01,  # Higher for AdamW
    
    # Learning rate schedule
    "lr_strategy": "Cosine Annealing",
    
    # Loss function (handle imbalance)
    "class_weights": "Focal Loss",
    
    # Training duration
    "epochs": 50,
    "batch_size": 32,
    
    # Early stopping
    "early_stopping": True,
    "es_patience": 15
}

Small Dataset, Custom CNN

training_config = {
    # Optimizer
    "optimizer": "Adam",
    "learning_rate": 0.001,
    
    # Strong regularization (prevent overfitting)
    "l2_decay": True,
    "l2_lambda": 0.001,  # Higher L2
    
    # Learning rate schedule
    "lr_strategy": "ReduceLROnPlateau",
    
    # Loss function
    "class_weights": "Auto Class Weights",
    
    # Training duration (shorter to prevent overfit)
    "epochs": 50,
    "batch_size": 16,  # Smaller batch
    
    # Early stopping (lower patience)
    "early_stopping": True,
    "es_patience": 5
}

Hyperparameter Tuning Strategy

1. Start with Defaults

Use the recommended defaults from this guide.

2. Train Baseline

Train for 20-30 epochs and observe:
  • Is the model learning? (loss decreasing)
  • Is it overfitting? (train acc >> val acc)
  • Is it underfitting? (both accuracies low)

3. Adjust Learning Rate

If loss is flat or exploding, adjust LR:
  • Flat → increase by 10x
  • Exploding → decrease by 10x

4. Add Regularization

If overfitting:
  • Increase dropout
  • Enable/increase L2 decay
  • Add more data augmentation

5. Tune Scheduler

Try ReduceLROnPlateau if using Constant.

6. Fine-tune Loss

If the dataset is imbalanced, try Weighted CE or Focal Loss.

Common Issues

Loss not decreasing
Possible causes:
  • Learning rate too low → Increase by 10x
  • Learning rate too high → Decrease by 10x
  • Model too simple → Try larger architecture
  • Data preprocessing issues → Check normalization
Quick fix: Try LR = 1e-3 with Adam
Loss exploding or going NaN
Possible causes:
  • Learning rate way too high
  • Gradient explosion
Solutions:
  • Reduce learning rate by 10x (try 1e-4)
  • Switch from SGD to Adam
  • Enable gradient clipping (advanced)
Overfitting (train accuracy much higher than validation)
Solutions:
  • Increase dropout to 0.5-0.7
  • Enable L2 decay (0.0001-0.001)
  • Add more data augmentation
  • Reduce model size
  • Enable early stopping
  • Train for fewer epochs
Validation metrics unstable or oscillating
Solutions:
  • Learning rate may be too high → reduce by 5x
  • Try different optimizer (Adam → AdamW)
  • Enable learning rate scheduler
  • Check if early stopping patience is too low

Next Steps

Model Evaluation

Learn how to evaluate trained models and interpret metrics

Dataset Preparation

Optimize your dataset preparation pipeline
