## Overview

The TikTok Auto Collection Sorter compares three model types during training and selects the best performer via cross-validation:

- **k-Nearest Neighbors (k-NN)**: non-parametric baseline
- **Logistic Regression**: linear classifier with L2 regularization
- **Multi-Layer Perceptron (MLP)**: two-layer neural network

This guide covers when to use each model, how to modify the MLP architecture, and how to add custom models.
## Model Comparison
### k-Nearest Neighbors

**How it works** (train.py:146-152):

```python
k = min(5, len(X_train) - 1)
knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_val)
```
**Characteristics:**

- No training phase (stores all training data)
- Uses cosine similarity between embeddings
- k=5 neighbors by default

**When to use:**

- Best for small datasets (<100 samples)
- When classes form tight, well-separated clusters
- When you want instant "training" (no optimization step)

**Limitations:**

- Slow inference on large datasets (compares against all training data)
- No learned decision boundaries
- Sensitive to noisy features
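The cosine-metric behavior is easy to sanity-check on toy data. A minimal sketch, where the 8-dim vectors are hypothetical stand-ins for the real embeddings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy "clusters" standing in for real embeddings.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=+1.0, scale=0.1, size=(20, 8)),   # class 0
    rng.normal(loc=-1.0, scale=0.1, size=(20, 8)),   # class 1
])
y_train = np.array([0] * 20 + [1] * 20)

k = min(5, len(X_train) - 1)
knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
knn.fit(X_train, y_train)

# A query pointing in the class-0 direction is assigned to class 0.
query = np.full((1, 8), 0.9)
print(knn.predict(query))          # [0]

# kneighbors() exposes the cosine distances to the k nearest samples.
dists, idx = knn.kneighbors(query)
print(dists.shape)                 # (1, 5)
```

Because cosine distance ignores vector magnitude, only the *direction* of the embedding matters, which suits normalized embedding spaces.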
### Logistic Regression

**How it works** (train.py:155-160):

```python
lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_val)
```
**Characteristics:**

- Linear decision boundaries
- L2 regularization (C=1.0; note that C is the *inverse* regularization strength, so smaller values regularize more)
- Built-in class balancing via class_weight="balanced"

**When to use:**

- When classes are linearly separable
- For interpretability (you can inspect the learned feature weights)
- When you need fast, reliable inference

**Limitations:**

- Cannot learn non-linear patterns
- May underfit complex relationships
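To illustrate the interpretability point, here is a small sketch inspecting the learned weights; the two-feature toy data stands in for real embedding dimensions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: feature 0 determines the label, feature 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

lr = LogisticRegression(max_iter=1000, C=1.0, class_weight="balanced")
lr.fit(X, y)

# coef_ has shape (1, n_features) for binary problems; a larger |weight|
# means that feature moves the decision boundary more.
weights = lr.coef_[0]
print("most influential feature:", int(np.argmax(np.abs(weights))))  # 0
```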
### Multi-Layer Perceptron (MLP)

**Architecture** (train.py:31-45):

```python
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),        # 1024 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),  # 256 → 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 128 → N
        )

    def forward(self, x):
        return self.net(x)
```
**Characteristics:**

- Two hidden layers (256 → 128 neurons)
- ReLU activations
- Dropout regularization (0.3 and 0.2)
- Adam optimizer with weight decay

**When to use:**

- When classes have non-linear decision boundaries
- With sufficient training data (>50 samples per class)
- When logistic regression underfits

**Limitations:**

- Requires more data than linear models
- Slower training than k-NN or logistic regression
- Risk of overfitting on very small datasets
## Modifying MLP Hyperparameters

### Hidden Layer Size

Increase capacity for complex datasets:

```python
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=512):  # Was 256
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),        # 1024 → 512
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),  # 512 → 256
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 256 → N
        )
```
Larger networks require more training data. If you have fewer than ~200 labeled samples, stick with `hidden_dim=256` or smaller to avoid overfitting.
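To see why capacity matters, you can count parameters by hand: each linear layer has `in × out` weights plus `out` biases. A quick arithmetic sketch for the two-hidden-layer architecture above:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Trainable parameters in one fully connected layer (weights + biases)."""
    return n_in * n_out + n_out

def mlp_param_count(input_dim: int, num_classes: int, hidden_dim: int) -> int:
    """Total trainable parameters for the two-hidden-layer MLP."""
    return (linear_params(input_dim, hidden_dim)
            + linear_params(hidden_dim, hidden_dim // 2)
            + linear_params(hidden_dim // 2, num_classes))

# Doubling hidden_dim more than doubles the parameter count:
print(mlp_param_count(1024, 10, 256))  # 296586
print(mlp_param_count(1024, 10, 512))  # 658698
```

With only a few hundred labeled samples, a ~660k-parameter model has far more capacity than the data can constrain, which is why the smaller default is safer.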
### Dropout Rates

Reduce overfitting by increasing dropout:

```python
nn.Dropout(0.5),  # Was 0.3 - more aggressive regularization
```

Or decrease it on small datasets where the model is underfitting:

```python
nn.Dropout(0.1),  # Was 0.3 - less regularization
```
### Learning Rate and Optimizer

Modify the train_mlp function (train.py:48-51):

```python
def train_mlp(X_train, y_train, X_val, y_val, num_classes, device,
              epochs=100, lr=5e-4):  # Was 1e-3
    model = MLP(input_dim, num_classes).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-3)  # Was 1e-4
```
**Guidelines:**

- Lower learning rate (5e-4) for more stable training
- Higher weight decay (1e-3) for stronger L2 regularization
- More epochs (e.g. 200) if validation accuracy is still improving when training ends
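Rather than guessing an epoch count, you can stop automatically once validation accuracy plateaus. A minimal, framework-agnostic patience tracker (a hypothetical helper, not part of train.py) that would slot into the epoch loop:

```python
class EarlyStopping:
    """Signal a stop when validation accuracy hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc: float) -> bool:
        """Record one epoch's validation accuracy; return True when training should stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Example: accuracy plateaus after epoch 2, so we stop at epoch 5.
stopper = EarlyStopping(patience=3)
history = [0.60, 0.72, 0.74, 0.74, 0.73, 0.74, 0.74]
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at)  # 5
```

Inside train_mlp this would replace the fixed `for epoch in range(epochs)` bound with a `break` once `stopper.step(val_acc)` returns True.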
### Batch Size

Change in train.py:64:

```python
loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # Was 32
```

- Larger batches (64) → more stable gradients, faster training
- Smaller batches (16) → noisier gradients, often better generalization (useful for small datasets)
### Adding a Third Hidden Layer

For very complex classification tasks:

```python
class DeepMLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),        # 1024 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim),       # 256 → 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),  # 256 → 128
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes), # 128 → N
        )

    def forward(self, x):
        return self.net(x)
```
Replace the MLP class in both train.py and predict.py with DeepMLP.
Deeper networks need significantly more data. Only use 3+ hidden layers if you have >500 labeled samples.
## Custom Model: Attention-Based MLP

Add an attention mechanism to weight feature importance:

```python
import torch
import torch.nn as nn

class AttentionMLP(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_dim=256):
        super().__init__()
        # Attention layer
        self.attention = nn.Sequential(
            nn.Linear(input_dim, input_dim),
            nn.Tanh(),
            nn.Linear(input_dim, input_dim),
            nn.Softmax(dim=1),
        )
        # Main network
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, x):
        # Compute attention weights
        attn_weights = self.attention(x)
        # Apply attention to input features
        x_attended = x * attn_weights
        # Pass through main network
        return self.net(x_attended)
```
This model learns which features (visual vs. audio) are most important for classification.
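Because the softmax makes each attention row sum to 1, the weights can be read directly as per-feature importance. A small sketch of the attention block in isolation; the 8-dim input and the visual/audio split point are hypothetical:

```python
import torch
import torch.nn as nn

# The attention sub-network alone, at a toy input size.
attention = nn.Sequential(
    nn.Linear(8, 8),
    nn.Tanh(),
    nn.Linear(8, 8),
    nn.Softmax(dim=1),
)

with torch.no_grad():
    x = torch.randn(4, 8)      # a small batch of fake embeddings
    attn = attention(x)        # shape (4, 8); each row sums to 1

# Hypothetical layout: first half of the embedding is visual, second audio.
visual_mass = attn[:, :4].sum(dim=1)
audio_mass = attn[:, 4:].sum(dim=1)
print(attn.sum(dim=1))         # each entry ≈ 1.0
```

On the real model you would run `model.attention(x)` on a validation batch and compare the attention mass over the visual vs. audio slices of the fused embedding.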
## Integrating Custom Models

1. Add the model class to train.py.
2. Update the training loop in the main() function:

   ```python
   # After line 166 in train.py, add:
   # 4. Custom Attention MLP
   attn_model, attn_acc = train_custom_mlp(
       X_train, y_train, X_val, y_val, num_classes, device
   )
   attn_preds = attn_model(torch.FloatTensor(X_val).to(device)).argmax(dim=1).cpu().numpy()
   results["attention_mlp"].append((attn_preds == y_val).mean())
   all_preds["attention_mlp"][val_idx] = attn_preds
   ```

3. Update the prediction script (predict.py) to handle the new model type.
4. Update the model config to save model-type metadata.
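For the last step, one possible shape for that metadata is a small JSON file saved next to the weights so predict.py knows which class to instantiate. The field names below are illustrative, not what predict.py currently reads:

```python
import json
from pathlib import Path

# Hypothetical config layout written at the end of training.
config = {
    "model_type": "attention_mlp",   # or "knn", "logistic_regression", "mlp"
    "input_dim": 1024,
    "num_classes": 5,
    "hidden_dim": 256,
}
Path("model_config.json").write_text(json.dumps(config, indent=2))

# predict.py side: read the type back and dispatch on it.
loaded = json.loads(Path("model_config.json").read_text())
print(loaded["model_type"])  # attention_mlp
```

Recording the architecture hyperparameters alongside the type means predict.py can rebuild the exact network before loading the state dict.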
## Cross-Validation Strategy

The system uses Stratified K-Fold to ensure balanced folds (train.py:136):

```python
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
```

This guarantees each fold has proportional class representation. Custom models get this for free, since the folds are built before any model trains.

**Key parameters:**

- n_splits: adjusted based on the smallest class size (min 2, max 5)
- shuffle=True: randomizes data before splitting
- random_state=42: ensures reproducibility
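The n_splits adjustment amounts to capping the fold count by the rarest class, since StratifiedKFold cannot use more folds than the smallest class has samples. A sketch of the rule (not the exact train.py code):

```python
import numpy as np

def choose_n_splits(y, lo: int = 2, hi: int = 5) -> int:
    """Clamp the fold count between `lo` and `hi` by the rarest class's size."""
    smallest = int(np.bincount(y).min())
    return max(lo, min(hi, smallest))

# Rarest class (label 1) has 3 samples, so use 3 folds:
print(choose_n_splits(np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])))  # 3

# With plenty of samples per class, use the maximum of 5 folds:
print(choose_n_splits(np.array([0] * 10 + [1] * 10)))                   # 5
```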
## Hyperparameter Tuning Example

Systematic grid search for the best MLP configuration:

```python
import itertools
import numpy as np

# Define the hyperparameter grid
hidden_dims = [128, 256, 512]
dropout_rates = [(0.2, 0.1), (0.3, 0.2), (0.4, 0.3)]
learning_rates = [1e-4, 5e-4, 1e-3]

best_acc = 0
best_config = None

for hidden_dim, (drop1, drop2), lr in itertools.product(
    hidden_dims, dropout_rates, learning_rates
):
    print(f"\nTesting: hidden={hidden_dim}, dropout=({drop1}, {drop2}), lr={lr}")

    # Modify the MLP class with the current hyperparameters
    # (you'd need to pass hidden_dim, drop1, drop2 as arguments to MLP.__init__)

    # Run cross-validation
    cv_results = []
    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        model, acc = train_mlp(X_train, y_train, X_val, y_val,
                               num_classes, device, lr=lr)
        cv_results.append(acc)

    mean_acc = np.mean(cv_results)
    if mean_acc > best_acc:
        best_acc = mean_acc
        best_config = (hidden_dim, (drop1, drop2), lr)
    print(f"Mean CV accuracy: {mean_acc:.1%}")

print(f"\nBest config: {best_config} with {best_acc:.1%} accuracy")
```

Hyperparameter tuning requires many training runs: this grid has 3 × 3 × 3 = 27 configurations, and each one, multiplied by K folds, can take 10-20 minutes on CPU. Consider using a GPU or reducing the search space.
## Model Selection Insights

From train.py:176-179, the system automatically picks the best model:

```python
mean_accs = {name: np.mean(accs) for name, accs in results.items()}
best_name = max(mean_accs, key=mean_accs.get)
print(f"\nBest model: {best_name} ({mean_accs[best_name]:.1%})")
```
**Typical outcomes:**

- **k-NN wins**: very small dataset (<50 samples) or highly clustered embeddings
- **Logistic Regression wins**: linearly separable classes, medium dataset (50-200 samples)
- **MLP wins**: complex boundaries, sufficient data (>200 samples), multimodal signals

If all models perform poorly (<70% accuracy):

1. **Check feature quality**: visualize embeddings with t-SNE/UMAP
2. **Verify labels**: ensure folder assignments are consistent
3. **Increase data**: collect more labeled samples per class
4. **Adjust class weights**: see Class Imbalance
5. **Try different architectures**: add/remove layers, change activations
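For the feature-quality check, a minimal t-SNE sketch; the random blobs stand in for the real (n_samples, 1024) embedding matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic clusters standing in for real video embeddings.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(+2.0, 0.2, size=(30, 16)),
    rng.normal(-2.0, 0.2, size=(30, 16)),
])

# perplexity must be smaller than n_samples; keep it low for tiny datasets.
coords = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(X)
print(coords.shape)  # (60, 2)

# To eyeball separation, scatter-plot the 2-D coords colored by label, e.g.:
#   plt.scatter(coords[:, 0], coords[:, 1], c=labels)
```

If classes do not form visible clusters even in this projection, no classifier choice will fix it; improve the features or labels first.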
- **Class Imbalance**: handle skewed class distributions
- **Active Learning**: efficiently collect training data