Overview
Hyperparameters control the learning process and significantly impact model performance. This guide covers optimizers, learning rates, schedulers, regularization, and loss functions.

Optimizers
Optimizers determine how the model updates its weights during training. Source: app/training/optimizers.py:29-62
Optimizer Comparison
Adam (Recommended Default)
Adaptive Moment Estimation

Pros:
- Works well out-of-the-box with minimal tuning
- Adapts learning rate per parameter
- Fast convergence
- Good for most deep learning tasks
Cons:
- Can overfit on small datasets
- May not generalize as well as SGD on some tasks
Best for:
- Default choice for most experiments
- Transfer learning
- Complex architectures (Transformers)
AdamW (Best for Transformers)
Adam with Decoupled Weight Decay

Pros:
- Better generalization than Adam
- Proper weight decay implementation
- Excellent for transformers and ViT models
- More stable training
Cons:
- Slightly slower convergence than Adam
- Requires tuning weight decay
Best for:
- Training Vision Transformers
- When regularization is important
- Large models with many parameters
SGD with Momentum
Stochastic Gradient Descent with Momentum

Pros:
- Often better final accuracy than Adam
- Better generalization on large datasets
- More predictable behavior
- Standard for ResNet training
Cons:
- Requires more hyperparameter tuning
- Slower convergence
- Learning rate is critical
Best for:
- Training CNNs from scratch
- Large datasets (5000+ per class)
- When you need best possible accuracy
- When you have time for LR tuning
RMSprop
Root Mean Square Propagation

Pros:
- Good for RNNs and some CNNs
- Adapts learning rate per parameter
Cons:
- Generally outperformed by Adam/AdamW
- Less commonly used in modern architectures
When to use:
- Rarely needed for malware classification
- Consider it if Adam doesn't work well
Quick Reference
| Optimizer | Default LR | Use Case | Training Speed | Final Accuracy |
|---|---|---|---|---|
| Adam | 1e-3 | General purpose | Fast | Good |
| AdamW | 1e-3 | Transformers, regularization | Fast | Better |
| SGD+Momentum | 1e-2 | CNNs from scratch | Slower | Best |
| RMSprop | 1e-3 | Special cases | Medium | Good |
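These defaults can be sketched in PyTorch (which the `app/training/optimizers.py` references suggest this app uses); the model here is a stand-in for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # stand-in model for illustration

# Default starting points from the table above
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)
```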
Learning Rate
The learning rate is the most critical hyperparameter. It controls how much to adjust weights during training.

Finding the Right Learning Rate
Start with Default
Use recommended default for your optimizer:
- Adam/AdamW: 1e-3 (0.001)
- SGD: 1e-2 (0.01)
Observe Training
Watch the training loss:
- Loss decreases steadily: Learning rate is good
- Loss stays flat: Learning rate too low
- Loss explodes/diverges: Learning rate too high
- Loss oscillates wildly: Learning rate too high
Transfer Learning Learning Rates
Transfer Learning Rule: When fine-tuning pre-trained models, use a 10x lower learning rate:
- If you would use 1e-3 for training from scratch
- Use 1e-4 for transfer learning
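In PyTorch this rule can be expressed with optimizer parameter groups. A sketch assuming a hypothetical backbone/head split (the layer shapes are illustrative, not the app's architecture):

```python
import torch
import torch.nn as nn

# Hypothetical split: pre-trained backbone plus a freshly initialized head
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten())
head = nn.Linear(8 * 32 * 32, 10)

# 10x lower LR for the pre-trained part, full LR for the new head
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-4},
    {"params": head.parameters(), "lr": 1e-3},
])
```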
Learning Rate Ranges by Model Type
| Model Type | Training Mode | Recommended LR (Adam) | Recommended LR (SGD) |
|---|---|---|---|
| Custom CNN | From scratch | 1e-3 to 1e-4 | 1e-2 to 1e-3 |
| Transfer (frozen) | Feature extraction | 1e-3 to 1e-4 | 1e-2 to 1e-3 |
| Transfer (unfrozen) | Fine-tuning | 1e-4 to 1e-5 | 1e-3 to 1e-4 |
| Transformer | From scratch | 3e-4 to 1e-4 | Not recommended |
| Vision Transformer | Fine-tuning | 1e-4 to 1e-5 | Not recommended |
Learning Rate Schedulers
Schedulers adjust the learning rate during training to improve convergence. Source: app/training/optimizers.py:65-102
Scheduler Strategies
Constant (Default)
Fixed learning rate throughout training

When to use:
- Short training runs (< 30 epochs)
- Transfer learning with frozen backbone
- When learning rate is already optimal
- Simplest approach, good starting point
ReduceLROnPlateau (Recommended)
Reduces LR when validation loss plateaus

Configuration:
- factor=0.5: Multiply LR by 0.5 when plateauing
- patience=5: Wait 5 epochs before reducing
- min_lr=1e-6: Don't go below this value
When to use:
- Most training scenarios
- When you want adaptive LR adjustment
- Long training runs (> 50 epochs)
Pros:
- Automatic adjustment based on performance
- Works well with Adam optimizer
- Good for finding local optima
Cons:
- Can be slow to react
- May reduce too early on noisy validation
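A minimal sketch of this configuration with PyTorch's built-in `ReduceLROnPlateau` (the model is a placeholder; the app's actual wiring lives in app/training/optimizers.py):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5, min_lr=1e-6
)

# Simulate a plateau: validation loss never improves
for epoch in range(8):
    scheduler.step(1.0)  # pass the validation loss each epoch

# After patience (5) epochs without improvement, the LR is halved
current_lr = optimizer.param_groups[0]["lr"]
```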
Cosine Annealing
Smoothly decreases LR following a cosine curve

When to use:
- Training from scratch
- Fixed number of epochs known beforehand
- With SGD optimizer
- State-of-the-art training pipelines
Pros:
- Smooth, gradual decay
- Well-studied in literature
- Often achieves best final accuracy
Cons:
- Requires knowing total epochs in advance
- Not adaptive to validation performance
Step Decay
Reduces LR by a factor at fixed intervals

Configuration:
- step_size=epochs // 3: Drop LR every 1/3 of training
- gamma=0.1: Multiply by 0.1 at each step
When to use:
- Classic CNN training
- When you know training should have distinct phases
- Replicating research papers
Pros:
- Simple, interpretable
- Works well with SGD
Cons:
- Requires manual tuning of step_size
- Sudden drops can be disruptive
Exponential Decay
Gradually reduces LR by a constant factor

Configuration:
- gamma=0.95: Multiply LR by 0.95 every epoch
When to use:
- Rarely needed in modern training
- When a smooth, gradual decay is preferred
Pros:
- Very smooth decay
Cons:
- Can decay too fast or too slow
- Harder to tune than other methods
Scheduler Usage in Training Loop
Source: app/training/engine.py:231-238
ReduceLROnPlateau is special: it requires validation loss as input. All other schedulers step automatically based on epoch count.
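One way to handle that special case in a generic training loop (a sketch of the pattern, not the engine.py code itself):

```python
import torch
import torch.nn as nn

def step_scheduler(scheduler, val_loss):
    """Step any scheduler correctly: ReduceLROnPlateau needs the metric."""
    if isinstance(scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
        scheduler.step(val_loss)
    else:
        scheduler.step()

model = nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
step_scheduler(scheduler, val_loss=0.42)  # works for any scheduler type
```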
Regularization
Regularization prevents overfitting by constraining model complexity.

L2 Regularization (Weight Decay)
Penalizes large weights to prevent overfitting. Typical weight decay values:
- Adam: 1e-4 to 1e-5
- AdamW: 0.01 to 0.1 (AdamW handles weight decay differently)
- SGD: 1e-4 to 1e-3
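The asymmetry matters: AdamW applies decay directly to the weights rather than through the gradient, which is why it takes much larger values. A sketch with illustrative settings (the model is a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # stand-in model

# Adam/SGD: L2 penalty folded into the gradient (small values)
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

# AdamW: decay applied directly to the weights (larger values)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```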
Dropout
Randomly drops units during training to prevent co-adaptation. Configured in the model architecture:
- Conv layers: 0.25 - 0.3
- Dense layers: 0.5 - 0.7
- Transformers: 0.1 - 0.2
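Sketched as a PyTorch module using rates from these ranges (the layer sizes are illustrative, not the app's architecture):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(0.25),   # conv layers: 0.25 - 0.3
    nn.Flatten(),
    nn.Linear(32 * 64 * 64, 128),
    nn.ReLU(),
    nn.Dropout(0.5),      # dense layers: 0.5 - 0.7
    nn.Linear(128, 10),
)
model.eval()  # dropout is active only in train() mode
out = model(torch.randn(2, 3, 64, 64))
```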
Early Stopping
Stops training when validation performance stops improving. Source: app/training/engine.py:269-275

Recommended patience:
- Small models: 5-10 epochs
- Large models: 10-20 epochs
- Transfer learning: 10-15 epochs
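The idea can be sketched in a few lines (a generic version, not the implementation at app/training/engine.py:269-275):

```python
class EarlyStopping:
    """Generic early-stopping sketch, not the app's implementation."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
for val_loss in [0.9, 0.8, 0.81, 0.82, 0.83]:  # plateaus after epoch 2
    if stopper.step(val_loss):
        break
```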
Loss Functions
The loss function measures how well the model is performing. Source: app/training/optimizers.py:105-122
Loss Function Comparison
Cross-Entropy Loss (Default)
Standard loss for multi-class classification

Formula: -log(p_correct_class)

When to use:
- Balanced datasets
- Standard classification tasks
- Default choice
Pros:
- Simple, well-understood
- Works well in most scenarios
- Fast computation
Cons:
- Doesn't handle class imbalance
Weighted Cross-Entropy
Cross-entropy with class-specific weights

When to use:
- Imbalanced datasets (some classes have more samples)
- When you want to prioritize rare classes
- Moderate imbalance (up to 1:10 ratio)
Pros:
- Simple extension of standard CE
- Effective for moderate imbalance
- Easy to interpret
Cons:
- May not be enough for severe imbalance
- Weights computed automatically may need tuning
Focal Loss
Addresses class imbalance and hard examples

Formula: -(1-p)^gamma * log(p)

Source: app/training/optimizers.py:8-26

When to use:
- Severe class imbalance (1:100+ ratio)
- Many hard-to-classify examples
- When weighted CE doesn’t work well
Parameters:
- gamma=2.0: Focusing parameter (higher = more focus on hard examples)
- alpha: Class weights (automatically computed from data)
Pros:
- Excellent for severe imbalance
- Focuses on hard examples
- State-of-the-art for detection tasks
Cons:
- More complex than CE
- Requires tuning gamma parameter
- Slower convergence sometimes
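A generic sketch of the formula above (the app's actual implementation lives at app/training/optimizers.py:8-26 and may differ in details):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Sketch of -(1 - p)^gamma * log(p) for multi-class logits."""
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, weight=alpha, reduction="none")
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([[2.0, 0.5], [0.1, 3.0]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)  # easy examples are down-weighted
```

With gamma=0 this reduces to plain cross-entropy; larger gamma shrinks the contribution of well-classified examples.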
Loss Function Selection Guide
| Dataset Balance | Class Ratio | Recommended Loss |
|---|---|---|
| Balanced | 1:1 to 1:2 | Cross-Entropy |
| Slight imbalance | 1:2 to 1:5 | Cross-Entropy or Weighted CE |
| Moderate imbalance | 1:5 to 1:10 | Weighted Cross-Entropy |
| Severe imbalance | 1:10+ | Focal Loss |
Batch Size
Batch size affects training speed, memory usage, and generalization.

Guidelines
| Batch Size | Memory | Training Speed | Generalization | Use Case |
|---|---|---|---|---|
| 8-16 | Low | Slow | Better | Small GPU, large models |
| 32 | Medium | Good | Good | Default, balanced |
| 64-128 | High | Fast | Decent | Large GPU, small models |
| 256+ | Very High | Very Fast | Worse | Distributed training |
Effects of Batch Size
Small batches (8-16):
- Noisier gradient estimates
- Better generalization
- Slower training
- Lower memory usage
- Better for limited GPU memory
Large batches (64+):
- More stable gradient estimates
- Faster training per epoch
- May generalize worse
- Higher memory usage
- Better GPU utilization
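In PyTorch the batch size is set on the DataLoader; a sketch with a synthetic stand-in dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in dataset: 256 RGB images, 10 classes
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# batch_size=32 is the balanced default from the table above
loader = DataLoader(dataset, batch_size=32, shuffle=True)
images, labels = next(iter(loader))
```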
Training Duration (Epochs)
Recommended Epochs by Model Type
| Model Type | Training Mode | Recommended Epochs |
|---|---|---|
| Custom CNN | From scratch | 50-100 |
| Transfer Learning | Frozen backbone | 20-30 |
| Transfer Learning | Fine-tuning | 30-50 |
| Vision Transformer | From scratch | 100-200 |
| Vision Transformer | Fine-tuning | 30-50 |
Use early stopping with patience=10-15 to automatically stop when validation performance plateaus. This is more reliable than fixed epoch counts.
Complete Training Configuration Example
Balanced Dataset, Medium Size (2000/class)
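A plausible starting point for this scenario, following the recommendations above (the field names are illustrative, not the app's actual configuration schema):

```python
# Hypothetical config for a balanced, medium-size dataset (2000/class)
config = {
    "optimizer": "adam",              # solid default for this regime
    "learning_rate": 1e-3,
    "scheduler": "reduce_on_plateau",
    "loss": "cross_entropy",          # classes are balanced
    "batch_size": 32,
    "epochs": 75,                     # within the 50-100 range for CNNs
    "early_stopping_patience": 10,
    "weight_decay": 1e-4,
    "dropout": 0.5,
}
```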
Imbalanced Dataset, Transfer Learning
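A plausible starting point for this scenario, following the recommendations above (field names are illustrative, not the app's actual schema):

```python
# Hypothetical config for an imbalanced dataset with transfer learning
config = {
    "optimizer": "adamw",             # better regularization for fine-tuning
    "learning_rate": 1e-4,            # 10x lower for a pre-trained model
    "weight_decay": 0.01,
    "scheduler": "reduce_on_plateau",
    "loss": "focal",                  # severe class imbalance
    "focal_gamma": 2.0,
    "batch_size": 32,
    "epochs": 40,
    "early_stopping_patience": 12,
}
```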
Small Dataset, Custom CNN
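A plausible starting point for this scenario, following the recommendations above (field names are illustrative, not the app's actual schema):

```python
# Hypothetical config for a small dataset and a custom CNN
config = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "scheduler": "cosine",            # fixed epoch budget known up front
    "loss": "cross_entropy",
    "batch_size": 16,                 # small batches generalize better
    "epochs": 100,
    "early_stopping_patience": 10,
    "dropout": 0.6,                   # heavier regularization for small data
    "weight_decay": 1e-4,
}
```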
Hyperparameter Tuning Strategy
Train Baseline
Train for 20-30 epochs and observe:
- Is the model learning? (loss decreasing)
- Is it overfitting? (train acc >> val acc)
- Is it underfitting? (both accuracies low)
Adjust Learning Rate
If loss is flat or exploding, adjust LR:
- Flat → increase by 10x
- Exploding → decrease by 10x
Add Regularization
If overfitting:
- Increase dropout
- Enable/increase L2 decay
- Add more data augmentation
Common Issues
Loss Not Decreasing
Possible causes:
- Learning rate too low → Increase by 10x
- Learning rate too high → Decrease by 10x
- Model too simple → Try larger architecture
- Data preprocessing issues → Check normalization
Loss Exploding (NaN)
Possible causes:
- Learning rate way too high
- Gradient explosion
Solutions:
- Reduce learning rate by 10x (try 1e-4)
- Switch from SGD to Adam
- Enable gradient clipping (advanced)
Overfitting (Train >> Val)
Solutions:
- Increase dropout to 0.5-0.7
- Enable L2 decay (0.0001-0.001)
- Add more data augmentation
- Reduce model size
- Enable early stopping
- Train for fewer epochs
Validation Loss Plateaus Early
Solutions:
- Learning rate may be too high → reduce by 5x
- Try different optimizer (Adam → AdamW)
- Enable learning rate scheduler
- Check if early stopping patience is too low
Next Steps
Model Evaluation
Learn how to evaluate trained models and interpret metrics
Dataset Preparation
Optimize your dataset preparation pipeline