Overview

The training pipeline evaluates three different classifier architectures using stratified cross-validation, then selects and retrains the best performer on the full dataset. The system handles class imbalance through weighted loss functions and uses early stopping to prevent overfitting.

Model Architectures

1. k-Nearest Neighbors (Baseline)

Hyperparameters:
  • n_neighbors: min(5, len(X_train) - 1)
  • metric: "cosine" - cosine distance (well-suited for normalized embeddings)
Why k-NN? k-NN serves as a strong baseline for this task because:
  • No training required (inference-only)
  • Works well with high-quality pre-trained embeddings (CLIP)
  • Cosine distance naturally handles normalized feature vectors
  • Simple and interpretable
From train.py:145-152:
k = min(5, len(X_train) - 1)
knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_val)

2. Logistic Regression

Hyperparameters:
  • max_iter: 1000
  • C: 1.0 (inverse regularization strength)
  • class_weight: "balanced" - Automatically adjusts weights inversely proportional to class frequencies
Logistic regression provides a linear decision boundary in the 1024-d embedding space. With class_weight="balanced", it automatically handles class imbalance by weighting the loss for each sample by the inverse of its class frequency.
From train.py:154-160:
lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_val)
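sklearn's class_weight="balanced" assigns each class the weight n_samples / (n_classes * count_c). As a sketch, here is that formula applied to the hypothetical class counts used later on this page (82/5/15/23):

```python
import numpy as np

# Hypothetical labels matching the class_distribution example below
y = np.array([0] * 82 + [1] * 5 + [2] * 15 + [3] * 23)
counts = np.bincount(y)
n_classes = len(counts)

# sklearn's "balanced" formula: n_samples / (n_classes * count_c)
weights = len(y) / (n_classes * counts.astype(float))

# The rare class (5 samples) gets 82/5 = 16.4x the weight of the dominant one
print(weights)   # [0.381..., 6.25, 2.083..., 1.358...]
```

The absolute scale differs from the manual normalization used for the MLP loss, but the relative weighting between classes is identical.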

3. Multi-Layer Perceptron (MLP)

Architecture:
Input (1024-d)
  → Linear(1024 → 256)
  → ReLU
  → Dropout(p=0.3)
  → Linear(256 → 128)
  → ReLU
  → Dropout(p=0.2)
  → Linear(128 → num_classes)
  → Logits
Hyperparameters:
  • hidden_dim: 256 (first hidden layer)
  • hidden_dim // 2: 128 (second hidden layer)
  • dropout: 0.3 (first layer), 0.2 (second layer)
  • learning_rate: 1e-3
  • weight_decay: 1e-4 (L2 regularization)
  • batch_size: 32
  • epochs: 100 (with early stopping)
From train.py:31-45:
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )
    
    def forward(self, x):
        return self.net(x)
Architecture Rationale:
  • 2 hidden layers: Sufficient capacity for non-linear decision boundaries without overfitting
  • 256 → 128 bottleneck: Progressively reduces dimensionality while learning hierarchical features
  • Dropout regularization: Prevents co-adaptation of neurons, improves generalization
  • ReLU activation: Fast, effective, and prevents vanishing gradients
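To make the dataflow concrete, here is a NumPy sketch of the inference-time forward pass (dropout layers are identity at eval time). The weights are random placeholders, not the trained model, and 4 classes are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Hypothetical random weights, used only to trace shapes through the network
W1, b1 = rng.standard_normal((1024, 256)), np.zeros(256)
W2, b2 = rng.standard_normal((256, 128)), np.zeros(128)
W3, b3 = rng.standard_normal((128, 4)), np.zeros(4)   # assume 4 classes

x = rng.standard_normal((32, 1024))   # a batch of 32 embeddings
h1 = relu(x @ W1 + b1)                # (32, 256); dropout is a no-op at eval
h2 = relu(h1 @ W2 + b2)               # (32, 128)
logits = h2 @ W3 + b3                 # (32, 4) raw class scores

n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(logits.shape, n_params)         # (32, 4) 295812
```

At ~296K parameters, the network is small enough to train in seconds on CPU for datasets of this size.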

Class Imbalance Handling

The Problem

User-organized collections often have severe class imbalance. For example:
class_distribution = {
    "soccer": 82,   # Dominant class
    "funny": 5,     # Underrepresented
    "cooking": 15,
    "travel": 23
}
Without special handling, the model would learn to always predict “soccer” for high accuracy, ignoring minority classes.

Solution: Class-Weighted Loss

From train.py:53-58:
# Class-weighted loss to handle imbalance (soccer=82 vs funny=5)
class_counts = np.bincount(y_train, minlength=num_classes).astype(float)
class_counts = np.maximum(class_counts, 1.0)  # avoid div by zero
weights = 1.0 / class_counts
weights = weights / weights.sum() * num_classes  # normalize
criterion = nn.CrossEntropyLoss(weight=torch.FloatTensor(weights).to(device))
How It Works:
  1. Count samples per class: [82, 5, 15, 23]
  2. Compute inverse frequencies: [1/82, 1/5, 1/15, 1/23]
  3. Normalize weights to sum to num_classes
  4. Apply weights to cross-entropy loss
Effect: The loss for a minority-class sample is amplified in proportion to its rarity, forcing the model to pay attention to underrepresented categories. For example, a misclassified “funny” sample (5 examples) contributes ~16x more loss than a misclassified “soccer” sample (82 examples).
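The weight computation above can be checked numerically with the example class counts:

```python
import numpy as np

counts = np.array([82.0, 5.0, 15.0, 23.0])       # soccer, funny, cooking, travel
weights = 1.0 / counts                           # inverse frequencies
weights = weights / weights.sum() * len(counts)  # normalize to sum to num_classes

print(weights.sum())            # 4.0
print(weights[1] / weights[0])  # 16.4 -> "funny" loss amplification vs "soccer"
```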

Training Procedure (MLP)

Optimizer and Regularization

From train.py:51:
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
  • Adam: Adaptive learning rate optimizer (fast convergence, robust to hyperparameters)
  • Learning Rate: 1e-3 (standard default)
  • Weight Decay: 1e-4 (L2 regularization to prevent overfitting)

Early Stopping

From train.py:66-93:
best_val_acc = 0
best_state = None
patience = 15
no_improve = 0

for epoch in range(epochs):
    model.train()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    
    # Validate
    model.eval()
    with torch.no_grad():
        val_logits = model(torch.FloatTensor(X_val).to(device))
        val_preds = val_logits.argmax(dim=1).cpu().numpy()
        val_acc = (val_preds == y_val).mean()
    
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
        no_improve = 0
    else:
        no_improve += 1
        if no_improve >= patience:
            break  # Stop if no improvement for 15 epochs
Early stopping with patience=15 prevents overfitting by monitoring validation accuracy. If accuracy doesn’t improve for 15 consecutive epochs, training halts and the best checkpoint is restored.

Cross-Validation Strategy

Stratified K-Fold

From train.py:131-136:
# Can't have more splits than smallest class
n_splits = min(5, min(np.bincount(y)))  
n_splits = max(2, n_splits)
print(f"Using {n_splits}-fold stratified cross-validation")

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
Stratified splitting ensures each fold maintains the same class distribution as the full dataset. This is crucial for getting reliable accuracy estimates on imbalanced data. For example, if “funny” is 4% of the dataset, it will be ~4% in each train/validation split.
Adaptive K Selection:
  • If the smallest class has only 5 examples, we can’t do 5-fold CV (would leave 1 example per fold)
  • The system automatically reduces n_splits to ensure at least 1 sample per class in each split
  • Minimum of 2 folds, maximum of 5 folds
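The adaptive selection can be sketched as a small helper (the function name is illustrative, not from train.py):

```python
import numpy as np

def adaptive_n_splits(y, max_splits=5, min_splits=2):
    # Mirrors the train.py logic: fold count is capped by the smallest class
    smallest = int(np.bincount(y).min())
    return max(min_splits, min(max_splits, smallest))

# Smallest class has 5 samples -> full 5-fold CV is possible
y = np.array([0] * 82 + [1] * 5 + [2] * 15 + [3] * 23)
print(adaptive_n_splits(y))        # 5

# Smallest class has only 3 samples -> reduced to 3 folds
y_small = np.array([0] * 40 + [1] * 3)
print(adaptive_n_splits(y_small))  # 3
```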

Cross-Validation Loop

From train.py:141-168:
results = {"knn": [], "logreg": [], "mlp": []}

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    
    # Train all three models
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(X_train, y_train)
    results["knn"].append((knn.predict(X_val) == y_val).mean())
    
    lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
    lr.fit(X_train, y_train)
    results["logreg"].append((lr.predict(X_val) == y_val).mean())
    
    mlp_model, mlp_acc = train_mlp(X_train, y_train, X_val, y_val, num_classes, device)
    results["mlp"].append(mlp_acc)
    
    print(f"Fold {fold+1}: kNN={...}  LogReg={...}  MLP={...}")
Output Example:
Fold 1: kNN=75.0%  LogReg=80.0%  MLP=85.0%
Fold 2: kNN=72.5%  LogReg=77.5%  MLP=82.5%
Fold 3: kNN=77.5%  LogReg=82.5%  MLP=87.5%
Fold 4: kNN=70.0%  LogReg=75.0%  MLP=80.0%
Fold 5: kNN=75.0%  LogReg=80.0%  MLP=85.0%

Cross-validation results (mean accuracy):
      kNN: 74.0% (+/- 2.8%)
   LogReg: 79.0% (+/- 3.0%)
      MLP: 84.0% (+/- 3.2%)

Best model: MLP (84.0%)

Model Selection

From train.py:176-182:
# Pick best model type
mean_accs = {name: np.mean(accs) for name, accs in results.items()}
best_name = max(mean_accs, key=mean_accs.get)
print(f"Best model: {best_name} ({mean_accs[best_name]:.1%})")

# Detailed report for best
evaluate(f"Best Model ({best_name}) - Full CV Predictions", y, all_preds[best_name], label_names)
The model with the highest mean cross-validation accuracy is selected for final retraining.
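Plugging in the per-fold accuracies from the example output above shows how the argmax over mean accuracies picks the winner:

```python
import numpy as np

# Hypothetical per-fold accuracies, matching the example CV output above
results = {
    "knn":    [0.750, 0.725, 0.775, 0.700, 0.750],
    "logreg": [0.800, 0.775, 0.825, 0.750, 0.800],
    "mlp":    [0.850, 0.825, 0.875, 0.800, 0.850],
}
mean_accs = {name: float(np.mean(accs)) for name, accs in results.items()}
best_name = max(mean_accs, key=mean_accs.get)
print(best_name, mean_accs[best_name])   # mlp 0.84
```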

Final Model Retraining

After selecting the best architecture via CV, the system retrains on the full dataset to maximize available training data for deployment.

For MLP (Most Common Winner)

From train.py:205-218:
if best_name == "mlp":
    # Train on 90% of data, hold out 10% for early stopping
    split = int(0.9 * len(X))
    perm = np.random.RandomState(42).permutation(len(X))
    X_t, X_v = X[perm[:split]], X[perm[split:]]
    y_t, y_v = y[perm[:split]], y[perm[split:]]
    
    final_model, _ = train_mlp(X_t, y_t, X_v, y_v, num_classes, device, epochs=200)
    
    torch.save(final_model.state_dict(), ARTIFACTS_DIR / "model.pt")
    config = {
        "model_type": "mlp",
        "input_dim": int(X.shape[1]),     # 1024
        "num_classes": num_classes,
        "hidden_dim": 256,
    }
Key Points:
  • Uses 90% for training, 10% holdout for early stopping validation
  • Trains for up to 200 epochs (more than CV’s 100, since this is final model)
  • Saves PyTorch state dict to model.pt

For sklearn Models (k-NN or LogReg)

From train.py:187-203:
if best_name == "knn":
    k = min(5, len(X) - 1)
    final_model = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    final_model.fit(X, y)
    
    import pickle
    with open(ARTIFACTS_DIR / "model.pkl", "wb") as f:
        pickle.dump(final_model, f)
    config = {"model_type": "knn", "k": k}

elif best_name == "logreg":
    final_model = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
    final_model.fit(X, y)
    
    import pickle
    with open(ARTIFACTS_DIR / "model.pkl", "wb") as f:
        pickle.dump(final_model, f)
    config = {"model_type": "logreg"}
  • sklearn models train on 100% of data (no need for validation during fit)
  • Saved as pickle files (model.pkl) instead of PyTorch

Model Configuration

From train.py:220-225:
config["label_names"] = label_names
config["feature_dim"] = int(X.shape[1])           # 1024
config["best_cv_accuracy"] = float(mean_accs[best_name])

with open(ARTIFACTS_DIR / "model_config.json", "w") as f:
    json.dump(config, f, indent=2)
Example model_config.json:
{
  "model_type": "mlp",
  "input_dim": 1024,
  "num_classes": 4,
  "hidden_dim": 256,
  "label_names": ["cooking", "funny", "soccer", "travel"],
  "feature_dim": 1024,
  "best_cv_accuracy": 0.84
}
This config is used by predict.py to reconstruct the model architecture and interpret predictions.
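A minimal sketch of that round trip, writing the example config to a temporary directory and reading it back the way a loader like predict.py would (the exact load logic in predict.py may differ):

```python
import json
import pathlib
import tempfile

# Hypothetical config matching the example above
config = {
    "model_type": "mlp",
    "input_dim": 1024,
    "num_classes": 4,
    "hidden_dim": 256,
    "label_names": ["cooking", "funny", "soccer", "travel"],
    "feature_dim": 1024,
    "best_cv_accuracy": 0.84,
}

artifacts = pathlib.Path(tempfile.mkdtemp())  # stand-in for ARTIFACTS_DIR
with open(artifacts / "model_config.json", "w") as f:
    json.dump(config, f, indent=2)

# At prediction time: reload the config and map class indices back to names
loaded = json.loads((artifacts / "model_config.json").read_text())
idx_to_label = dict(enumerate(loaded["label_names"]))
print(idx_to_label[2])   # soccer
```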

Evaluation Metrics

From train.py:99-114:
def evaluate(name, y_true, y_pred, label_names):
    print(f"\n{'='*60}")
    print(f"  {name}")
    print(f"{'='*60}")
    print(classification_report(y_true, y_pred, target_names=label_names, zero_division=0))
    cm = confusion_matrix(y_true, y_pred)
    print("Confusion Matrix:")
    # Print formatted confusion matrix
    ...
    acc = (y_true == y_pred).mean()
    print(f"\nOverall accuracy: {acc:.1%}")
    return acc
Metrics Reported:
  1. Per-class precision, recall, F1-score (from sklearn’s classification_report)
  2. Confusion matrix (which classes are confused with each other)
  3. Overall accuracy
Confusion Matrix is especially useful for debugging class imbalance issues. If “funny” videos are always misclassified as “soccer,” the confusion matrix will show a high off-diagonal value in that position.
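That failure mode is easy to see in a toy confusion matrix (built here with plain NumPy rather than sklearn; the label ordering is hypothetical):

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    # cm[i, j] = number of class-i samples predicted as class j
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Hypothetical labels: 0=cooking, 1=funny, 2=soccer, 3=travel.
# Every "funny" sample gets misclassified as "soccer".
y_true = np.array([1, 1, 1, 2, 2, 0, 3])
y_pred = np.array([2, 2, 2, 2, 2, 0, 3])
cm = confusion(y_true, y_pred, 4)
print(cm[1, 2])   # 3 -> all funny samples landed in the soccer column
```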

Training Summary Workflow

Performance Expectations

Typical Results (will vary by dataset):
  • k-NN: 70-80% accuracy (good baseline, fast inference)
  • Logistic Regression: 75-85% accuracy (linear boundary, class weighting helps)
  • MLP: 80-90% accuracy (best performer, learns non-linear patterns)
Performance depends heavily on:
  1. Dataset size (more data → better accuracy)
  2. Class balance (even distribution → easier learning)
  3. Category separability (visually distinct categories → higher accuracy)
