Hardware

  • GPU — NVIDIA A100 MIG partition, 9.8 GB VRAM
  • RAM — 251 GB system memory
  • CPU — Intel Xeon Gold

Model Configuration

All four ensemble checkpoints share identical architecture hyperparameters:
{
  "feature_dim": 1280,
  "hidden_dim": 768,
  "num_classes": 4,
  "num_lstm_layers": 4,
  "num_attention_heads": 12,
  "dropout": 0.4,
  "bidirectional": true
}
and identical training hyperparameters:
{
  "learning_rate": 0.001,
  "batch_size": 48,
  "num_epochs": 150,
  "patience": 25
}

Architecture Summary

The SuperEnhancedTemporalModel is a five-stage sequence model:
  1. Input projection — Linear → LayerNorm → ReLU → Dropout(0.2). Projects 1280-dim CNN features to 768-dim hidden space.
  2. 4-layer bidirectional LSTM — processes the projected sequence, yielding a 1536-dim output (768 × 2 directions).
  3. Multi-head self-attention — 12 heads over the LSTM output with residual connection and LayerNorm.
  4. Attention pooling — learned scalar weights over the sequence, producing a single 1536-dim vector.
  5. Classifier head — four Linear layers (1536 → 768 → 512 → 256 → 4) with LayerNorm and Dropout between each.
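The five stages above can be sketched in PyTorch. This is a minimal reconstruction from the hyperparameters and stage descriptions, not the actual SuperEnhancedTemporalModel source; the class name and exact layer ordering are assumptions.

```python
import torch
import torch.nn as nn

class TemporalModelSketch(nn.Module):
    """Sketch of the five-stage architecture (hypothetical reconstruction)."""
    def __init__(self, feature_dim=1280, hidden_dim=768, num_classes=4,
                 num_lstm_layers=4, num_heads=12, dropout=0.4):
        super().__init__()
        # 1. Input projection: 1280 -> 768
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
        )
        # 2. 4-layer bidirectional LSTM: output is 2 * 768 = 1536-dim
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=num_lstm_layers,
                            batch_first=True, bidirectional=True, dropout=dropout)
        # 3. Multi-head self-attention with residual connection + LayerNorm
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(2 * hidden_dim)
        # 4. Attention pooling: one learned scalar weight per timestep
        self.pool = nn.Linear(2 * hidden_dim, 1)
        # 5. Classifier head: 1536 -> 768 -> 512 -> 256 -> 4
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, 512), nn.LayerNorm(512),
            nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.LayerNorm(256),
            nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):                       # x: (batch, seq_len, 1280)
        h = self.proj(x)                        # (batch, seq_len, 768)
        h, _ = self.lstm(h)                     # (batch, seq_len, 1536)
        a, _ = self.attn(h, h, h)
        h = self.norm(h + a)                    # residual + LayerNorm
        w = torch.softmax(self.pool(h), dim=1)  # (batch, seq_len, 1)
        pooled = (w * h).sum(dim=1)             # (batch, 1536)
        return self.head(pooled)                # (batch, 4)
```

Note that the real model packs padded sequences before the LSTM (see the training loop below); this sketch omits that for brevity.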

Setting Up the Trainer

from model_train_new import EnhancedTemporalModelTrainer

trainer = EnhancedTemporalModelTrainer(
    features_dir="data/features",
    output_dir="checkpoints",
    device="cuda"
)

Creating Dataloaders

train_loader, val_loader, test_loader = trainer.create_dataloaders(
    batch_size=48,
    num_workers=4,
    feature_file_suffix="_multiscale"  # loads *_features_multiscale.h5
)
create_dataloaders() does the following:
  • Loads train_features_multiscale.h5, val_features_multiscale.h5, and test_features_multiscale.h5.
  • Constructs a WeightedRandomSampler for the training split based on inverse class frequency.
  • Applies training-time augmentation (temporal subsampling, shift, and noise) only to the training dataset. See Optimization for augmentation details.
  • Uses a custom collate_features function that zero-pads variable-length sequences within a batch.
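A zero-padding collate like the one described can be sketched as follows. The function name and the assumption that each dataset item is a (features, label) pair are illustrative; the real collate_features may differ.

```python
import torch

def collate_features_sketch(batch):
    """Zero-pad variable-length (seq_len, feat_dim) tensors to the batch max.

    Sketch only: assumes each batch item is a (features, label) pair and
    returns padded features, true lengths, and labels.
    """
    feats, labels = zip(*batch)
    lengths = torch.tensor([f.shape[0] for f in feats])
    max_len = int(lengths.max())
    feat_dim = feats[0].shape[1]
    padded = torch.zeros(len(feats), max_len, feat_dim)
    for i, f in enumerate(feats):
        padded[i, : f.shape[0]] = f   # positions past the true length stay zero
    return padded, lengths, torch.tensor(labels)
```

The true lengths are returned alongside the padded batch so the model can later mask the padding (e.g. via pack_padded_sequence).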

Training a Single Model

model, results = trainer.train_single_model(
    num_epochs=150,
    batch_size=48,
    learning_rate=1e-3,
    model_name="ensemble_model_1",
    feature_file_suffix="_multiscale",
    results_dir="results"
)

Training Loop

1. Forward pass

Features and packed sequence lengths are fed to the model. Lengths allow the LSTM to ignore padding via pack_padded_sequence.
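The padding-aware LSTM call follows the standard PyTorch pattern, sketched here with toy dimensions (the project's exact forward code may differ):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = torch.nn.LSTM(8, 16, batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 8)        # batch of 2, zero-padded to length 5
lengths = torch.tensor([5, 3])  # true lengths before padding

# Pack so the LSTM never processes the zero padding
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
out_packed, _ = lstm(packed)
out, _ = pad_packed_sequence(out_packed, batch_first=True)
# out: (2, 5, 32); positions past each true length are zero-filled
```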
2. Loss computation

FocalLoss(gamma=2.0, smoothing=0.1) is evaluated against integer labels. Per-class alpha weights from the training set are passed at construction time.
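A focal loss combining label smoothing and per-class alpha weights can be sketched as below. The exact composition inside the project's FocalLoss (how smoothing, gamma, and alpha interact) is an assumption here.

```python
import torch
import torch.nn.functional as F

class FocalLossSketch(torch.nn.Module):
    """Sketch: focal loss with label smoothing and per-class alpha weights."""
    def __init__(self, gamma=2.0, smoothing=0.1, alpha=None):
        super().__init__()
        self.gamma = gamma
        self.smoothing = smoothing
        self.alpha = alpha  # tensor of per-class weights, or None

    def forward(self, logits, targets):
        # Label-smoothed cross-entropy, kept per-sample for focal weighting
        ce = F.cross_entropy(logits, targets,
                             label_smoothing=self.smoothing, reduction='none')
        # p_t: the model's probability for the true class
        pt = logits.softmax(dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
        focal = (1 - pt) ** self.gamma * ce   # down-weight easy examples
        if self.alpha is not None:
            focal = self.alpha[targets] * focal
        return focal.mean()
```

With gamma=0, smoothing=0, and no alpha, this reduces to plain cross-entropy, which is a useful sanity check.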
3. Backward pass and gradient clipping

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
4. Scheduler step

CosineAnnealingWarmRestarts is stepped every batch (not every epoch).
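Per-batch stepping passes a fractional epoch to the scheduler, which is the pattern CosineAnnealingWarmRestarts is designed for. The T_0 and T_mult values below are illustrative assumptions, not the trainer's actual settings.

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2)  # T_0 / T_mult values are assumptions

batches_per_epoch = 50            # illustrative
for epoch in range(2):
    for batch_idx in range(batches_per_epoch):
        # ... forward / backward / optimizer.step() ...
        # Fractional epoch gives a smooth cosine curve within each epoch
        scheduler.step(epoch + batch_idx / batches_per_epoch)
```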
5. Validation

Full validation is run after each epoch. Accuracy, weighted F1, and per-class F1 are computed.
6. Early stopping check

If validation accuracy does not improve for 25 consecutive epochs, training halts.
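The early-stopping check amounts to a patience counter. A sketch (the function name and state dict are hypothetical; the trainer tracks the same quantities in its checkpoint fields):

```python
def should_stop(val_acc, state, patience=25):
    """Update best-accuracy state; return True once patience is exhausted.

    `state` holds 'best_val_acc' and 'patience_counter'. Sketch of the
    check described above, not the trainer's exact code.
    """
    if val_acc > state['best_val_acc']:
        state['best_val_acc'] = val_acc   # improvement: reset the counter
        state['patience_counter'] = 0
        return False
    state['patience_counter'] += 1
    return state['patience_counter'] >= patience
```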

Ensemble Training

Four independent models are trained with different random seeds to produce the final ensemble:
model, results = trainer.train_ensemble(
    num_models=4,          # default is 3; override to 4 for the production ensemble
    num_epochs=150,
    batch_size=48,
    learning_rate=1e-3,
    feature_file_suffix="_multiscale",
    results_dir="results"
)
Each run uses a different seed (42 + i) to encourage diversity:
for i in range(num_models):
    torch.manual_seed(42 + i)
    np.random.seed(42 + i)
    random.seed(42 + i)
    model, results = self.train_single_model(
        model_name=f'ensemble_model_{i}',
        ...
    )

Ensemble Validation Accuracy

| Checkpoint | Best Epoch | Best Val Acc |
| --- | --- | --- |
| best_ensemble_model_1.pt | 52 | 72.7% |
| best_ensemble_model_2.pt | 43 | 92.1% |
| best_ensemble_model_3.pt | 42 | 91.9% |
| best_ensemble_model_4.pt | 40 | 91.9% |
Model 1 converged to a lower accuracy than the others, likely due to the random seed placing it in a poor loss basin. The ensemble averages softmax probabilities across all four models, which smooths this effect.
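Softmax averaging across ensemble members can be sketched as follows (the function name is hypothetical; model loading is omitted):

```python
import torch

@torch.no_grad()
def ensemble_predict(models, features):
    """Average softmax probabilities across ensemble members (sketch)."""
    probs = torch.stack([m(features).softmax(dim=-1) for m in models])
    mean_probs = probs.mean(dim=0)        # (batch, num_classes)
    return mean_probs.argmax(dim=-1), mean_probs
```

Averaging probabilities (rather than hard votes) lets the three strong models outweigh a low-confidence outlier like model 1.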

Checkpoint Strategy

Best model checkpoints are saved whenever validation accuracy improves:
checkpoints/best_ensemble_model_{1-4}.pt
Each checkpoint stores the full training state needed for resumption or analysis:
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'metrics': val_metrics,
    'train_metrics': train_metrics,
    'history': model_history,
    'best_val_acc': best_val_acc,
    'best_epoch': best_epoch,
    'patience_counter': patience_counter,
    'model_config': { ... },
    'training_config': { ... }
}, output_dir / f'best_{model_name}.pt')
Periodic checkpoints are written every 20 epochs with the same structure, and only the 3 most recent are retained on disk:
checkpoints/ensemble_model_{n}_checkpoint_epoch_{epoch}.pt
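The keep-3-newest retention policy can be sketched as below (the function name is hypothetical; only the file-naming pattern above is taken from the source):

```python
from pathlib import Path

def prune_periodic_checkpoints(output_dir, model_name, keep=3):
    """Delete all but the `keep` newest periodic checkpoints (sketch)."""
    ckpts = sorted(
        Path(output_dir).glob(f'{model_name}_checkpoint_epoch_*.pt'),
        key=lambda p: int(p.stem.rsplit('_', 1)[-1]),  # sort by epoch number
    )
    for old in ckpts[:-keep]:   # everything except the `keep` highest epochs
        old.unlink()
```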
To resume training from the latest periodic checkpoint, pass resume_from to train_single_model():
trainer.train_single_model(
    model_name="ensemble_model_2",
    resume_from="checkpoints/ensemble_model_2_checkpoint_epoch_40.pt"
)
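Internally, resuming amounts to restoring the saved state. A sketch of standard checkpoint loading (the function name is hypothetical; the field names match the torch.save call shown above):

```python
import torch

def restore_from_checkpoint(path, model, optimizer, device='cpu'):
    """Restore model/optimizer state and bookkeeping from a checkpoint (sketch)."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    return {  # bookkeeping the trainer needs to continue where it left off
        'start_epoch': ckpt['epoch'] + 1,
        'best_val_acc': ckpt['best_val_acc'],
        'best_epoch': ckpt['best_epoch'],
        'patience_counter': ckpt['patience_counter'],
    }
```

Restoring the patience counter alongside the weights matters: without it, a resumed run would restart the 25-epoch early-stopping window from zero.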